Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Failing with Dask 2024.12.0 #345

Open
CSSFrancis opened this issue Dec 12, 2024 · 10 comments
Open

CI Failing with Dask 2024.12.0 #345

CSSFrancis opened this issue Dec 12, 2024 · 10 comments
Labels
type: bug Something isn't working

Comments

@CSSFrancis
Copy link
Member

Describe the bug

Test passes with dask - 2024.11.2 and fails with dask 2024-12.0

To Reproduce

rsciio/tests/test_blockfile.py:438 (test_load_readonly)
def test_load_readonly():
        s = hs.load(FILE2, lazy=True)
        k = next(
            filter(
                # The or statement with both "array-original" and "original-array"
                # is due to dask changing the name of this key. After dask-2022.1.1
                # the key is "original-array", before it is "array-original"
                lambda x: isinstance(x, str)
                and (x.startswith("original-array") or x.startswith("array-original")),
                s.data.dask.keys(),
            )
        )
        mm = s.data.dask[k]
        assert isinstance(mm, np.memmap)
>       assert not mm.flags["WRITEABLE"]
E       assert not True

test_blockfile.py:453: AssertionError

Additional context

Will look upstream to determine what has changed.

@CSSFrancis CSSFrancis added the type: bug Something isn't working label Dec 12, 2024
@CSSFrancis
Copy link
Member Author

CSSFrancis commented Dec 12, 2024

Failure from Commit dask/dask@511b8af

dask/dask#11524

@CSSFrancis
Copy link
Member Author

CSSFrancis commented Dec 12, 2024

@ericpre do you have any idea if this loads things into memory? It seems like it doesn't but the change in the array flags is slightly concering.

import numpy as np
# Create a memmap array backed by a file
filename = 'my_array.dat'
dtype = np.float32
shape = (100, 100)

# Create the memmap
mmap_array = np.memmap(filename, dtype=dtype, mode='r', shape=shape)

m2 = mmap_array.copy() 

m2 # mmap array still

print(mmap_array.flags)

#   C_CONTIGUOUS : True
#  F_CONTIGUOUS : False
#  OWNDATA : False
#  WRITEABLE : False
#  ALIGNED : True
#  WRITEBACKIFCOPY : False

print(m2.flags)

#  C_CONTIGUOUS : True
#  F_CONTIGUOUS : False
#  OWNDATA : True
#  WRITEABLE : True
#  ALIGNED : True
#  WRITEBACKIFCOPY : False

It doesn't seem to from some tests, although copying does appear to change the array to being writeable....

@ericpre
Copy link
Member

ericpre commented Dec 12, 2024

Sorry no idea...

@CSSFrancis
Copy link
Member Author

Okay from my testing this is different but shouldn't be detrimental. You won't be able to write to the original array, but that is already the prefered behavior. There isn't a huge overhead for the copy so we can probably just remove the check.

@ericpre
Copy link
Member

ericpre commented Dec 13, 2024

Sounds good, is there actually a simple way to test the behaviour (not possible to write?), maybe check that it raises an error?

I am wondering, if this is still the case that the data can't be written? Could it be related to the fact that previously, dask array were not mutable but this has changed since the blockfile code has been written?

@CSSFrancis
Copy link
Member Author

Sounds good, is there actually a simple way to test the behaviour (not possible to write?), maybe check that it raises an error?

So if we take this little bit of code:

import numpy as np
# Create a memmap array backed by a file
filename = 'my_array.dat'
dtype = np.float32
shape = (100, 100)

# Create the memmap
mmap_array = np.memmap(filename, dtype=dtype, mode='w+', shape=shape)

# read the memmap
mmap_array2 = np.memmap(filename, dtype=dtype, mode='r', shape=shape)
#copy the read only memmap
m2 = mmap_array2.copy() 

m2[:]=1 # This is possible now (should it be???) Maybe??
m2.flush() # This does nothing from what I can tell.....
print(np.all(mmap_array2 != m2)) # Arrays are not equal (True)

I'm not sure what the intended behavior here should be. In any case this is something that should be determined in numpy. I've been meaning to make a issue but I'm a little unsure of what the behavior should be which is why I'm slightly hesistant.

@sem-geologist
Copy link
Contributor

sem-geologist commented Dec 18, 2024

I'm not sure what the intended behavior here should be. In any case this is something that should be determined in numpy. I've been meaning to make a issue but I'm a little unsure of what the behavior should be which is why I'm slightly hesistant.

help(np.copy)

documents the behaviour, scrolling a abit you will find this note:

Note that, np.copy clears previously set WRITEABLE=False flag.

    >>> a = np.array([1, 2, 3])
    >>> a.flags["WRITEABLE"] = False
    >>> b = np.copy(a)
    >>> b.flags["WRITEABLE"]
    True

Which is absolutely logical behaviour and sensible defaults. Please note that m2 is a copy (in memory) and is no more memory mapped to disk!
Changing values in m2 does not change values in mmap_array2, where changing vlaues in mmap_array can be seen in mmap_array2. thats what array copy is for: to not accidentally destroy raw data.

@sem-geologist
Copy link
Contributor

sem-geologist commented Dec 18, 2024

@ericpre do you have any idea if this loads things into memory? It seems like it doesn't but the change in the array flags is slightly concering.

import numpy as np
# Create a memmap array backed by a file
filename = 'my_array.dat'
dtype = np.float32
shape = (100, 100)

# Create the memmap
mmap_array = np.memmap(filename, dtype=dtype, mode='r', shape=shape)

m2 = mmap_array.copy() 

m2 # mmap array still                                            <<-- NO, it is not!

print(mmap_array.flags)

#   C_CONTIGUOUS : True
#  F_CONTIGUOUS : False
#  OWNDATA : False
#  WRITEABLE : False
#  ALIGNED : True
#  WRITEBACKIFCOPY : False

print(m2.flags)

#  C_CONTIGUOUS : True
#  F_CONTIGUOUS : False
#  OWNDATA : True
#  WRITEABLE : True
#  ALIGNED : True
#  WRITEBACKIFCOPY : False

It doesn't seem to from some tests, although copying does appear to change the array to being writeable....

not only WRITEABLE flag is changed on copy, OWNDATA flag is changed too (thus m2 is not mmap, as it has its own data and not points to previous). shallowness of copy works only on python objects (will point to same python objects) , numeric arrays, however, will be copied.

@CSSFrancis
Copy link
Member Author

@sem-geologist I guess I might have not done exactly the right test. It is still a np.memmap object I guess but (maybe?) shouldn't be.

For example:

import numpy as np
# Create a memmap array backed by a file
filename = 'my_array.dat'
dtype = np.float32
shape = (100, 100)
# Create the memmap
mmap_array = np.memmap(filename, dtype=dtype, mode='w+', shape=shape)

# Read the memmap
mmap_array2 = np.memmap(filename, dtype=dtype, mode='r', shape=shape)
arrs = [mmap_array2.copy() for i in range(1000000)]

Doesn't crash my computer but it does say that my memory useage is 47 GB (far more than the 16 GB it actually has) so something must be happening behind the scenes.

The thing that really matters here is if this results in a large change in the memory usage when pairing np.memmap with dask.

@ericpre
Copy link
Member

ericpre commented Dec 18, 2024

In the discussion on the dask PR, it is suggested that from_array should be avoided if the copy is not desirable. Maybe an alternative here is use the "distributed" approach implemented in some other formats?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants