Defining blocks w/o using dask #13
I guess the other extreme would be to make a fully lazy "schema" for an xarray.Dataset, which could include chunking as well. This is important for cases like writing a new Zarr file. So far I've resisted this in favor of using actual …
Formalizing a small set theory for index space would be a good foundation IMO. I have something like the following in mind
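The type definitions that followed this comment did not survive the page extraction. As a hedged sketch of what such index-space types might look like (names like `Interval` and `Box` are illustrative, not xpartition's API):

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical sketch: a half-open interval along one named dimension, and an
# axis-aligned "box" in index space built from one interval per dimension.
@dataclass(frozen=True)
class Interval:
    start: int
    stop: int  # half-open, like Python slices

    def __len__(self) -> int:
        return self.stop - self.start


@dataclass(frozen=True)
class Box:
    intervals: Dict[str, Interval]  # one interval per named dimension

    def indexers(self) -> Dict[str, slice]:
        # Convert to the dict-of-slices form accepted by Dataset.isel
        return {dim: slice(iv.start, iv.stop) for dim, iv in self.intervals.items()}


box = Box({"x": Interval(0, 10), "y": Interval(5, 15)})
print(box.indexers())  # {'x': slice(0, 10), 'y': slice(5, 15)}
```

A partition of a dataset would then just be a collection of such boxes that covers the full index space without overlap.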
It’s a different point, but I’m +1 on some kind of DatasetSchema. It can be awkward to make dask work for this purpose.
I think, like @shoyer, this dependence on dask-backed DataArrays was primarily out of convenience. It is convenient because dask gives us an easy way to lazily construct a Dataset, regardless of how the data is stored on disk (i.e. xpartition does not need to know that our source data is split across multiple netCDF files; xarray and dask can handle that). This worked well for the use-case I initially wrote things for: the dask graph wasn't too large to be inconvenient, but it did involve enough data, 30+ TB in some cases, to cause problems if you tried to execute things through a single dask client. That said, I totally get that it can be easy to create dask graphs that are themselves difficult to work with, so I would be very supportive of working towards a way to define blocks without dask. For the use-case of …
I suppose if we have the tuple version of …
For sure. My thoughts are that we should make an abstract interface that can be implemented with either chunks tuples or sizes or whatever. Also, "chunking" is just a special kind of partition, one that is orthogonal.
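One way to see that orthogonal chunking is a special case: an orthogonal partition is fully determined by independent per-dimension splits, so the blocks are just their Cartesian product. A minimal sketch (function names are mine, not from either library):

```python
from itertools import product

def chunk_bounds(chunks):
    """Turn a tuple of chunk sizes, e.g. (3, 3, 2), into a list of slices."""
    bounds, start = [], 0
    for size in chunks:
        bounds.append(slice(start, start + size))
        start += size
    return bounds

def orthogonal_blocks(chunks_by_dim):
    """All blocks of an orthogonal partition, as dicts of slices per dimension."""
    dims = list(chunks_by_dim)
    per_dim = [chunk_bounds(chunks_by_dim[d]) for d in dims]
    return [dict(zip(dims, combo)) for combo in product(*per_dim)]

blocks = orthogonal_blocks({"x": (3, 3, 2), "y": (4, 4)})
print(len(blocks))  # 3 chunks in x times 2 in y = 6 blocks
```

A non-orthogonal partition cannot be written this way, which is exactly why it needs a richer representation.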
So far I've mostly been sticking to "non-expanded" chunks in public APIs for xarray-beam, e.g., just … I do still need irregular chunks when splitting for rechunking, but they don't have any particular representation in the public API, other than the "intersection of source and target chunks". You can figure out things like the number of chunks in the intersection and how to split chunks for rechunking with a bit of math, e.g.: …
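The worked example here did not survive extraction, but the "bit of math" for one dimension can be sketched as follows: the intersection of source and target chunkings is the partition induced by the union of both sets of chunk boundaries (function names are illustrative, not xarray-beam's API):

```python
import itertools

def boundaries(chunks):
    """Cumulative offsets of a chunks tuple, e.g. (4, 4) -> [0, 4, 8]."""
    return list(itertools.accumulate((0,) + tuple(chunks)))

def intersect_chunks(source, target):
    """Chunk sizes of the common refinement of two chunkings of one dimension."""
    assert sum(source) == sum(target), "chunkings must cover the same extent"
    # Cut at every boundary that appears in either chunking.
    cuts = sorted(set(boundaries(source)) | set(boundaries(target)))
    return tuple(b - a for a, b in zip(cuts, cuts[1:]))

print(intersect_chunks((4, 4), (3, 5)))  # (3, 1, 4)
```

The number of chunks in the intersection is just the length of the result, and each source chunk can be split at the same cut points to route data to target chunks.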
Do you have plans for non-orthogonal partitions? 😱
I think spencer's (and my) point was that with "non-expanded" chunks you also need sizes to know how big the global index space is. Or you can explicitly list all the tuples.
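To make the two representations concrete: a single "non-expanded" chunk size plus the global dimension size is enough to reconstruct the explicit tuple, with a possibly-short final chunk. A minimal sketch (the function name is mine):

```python
def expand_chunks(chunk_size: int, dim_size: int) -> tuple:
    """Expand a uniform chunk size over a dimension into explicit chunk sizes."""
    n_full, remainder = divmod(dim_size, chunk_size)
    return (chunk_size,) * n_full + ((remainder,) if remainder else ())

print(expand_chunks(10, 25))  # (10, 10, 5)
print(expand_chunks(10, 30))  # (10, 10, 10)
```

Without `dim_size` (or the explicit tuple), the global index space is underdetermined, which is the point being made above.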
I may be wrong, but xpartition may already support this. It combines multiple chunks into larger rectangles based on the number of "ranks" the user selects, and I don't think we can assume that all the edges of these line up any longer... imagine a large cube embedded in a sea of smaller cubes. This is quite a nice way to allocate work when the individual chunks are too small.
I had a similar reaction to @shoyer regarding non-orthogonal partitions :). Just to make sure I am on the same page: by "orthogonal" partition do we mean a region that can be completely defined with a single …?

The algorithm for splitting arrays across N ranks is to iteratively go dimension by dimension in the "blocks" space, splitting each dimension into as many partitions as possible such that the total number of partitions remains less than or equal to the number of available ranks (here, with respect to the underlying data, the partitions are "chunks of chunks," or "meta-chunks" if you will). Sometimes not all ranks will be used.
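The dimension-by-dimension algorithm described above can be sketched roughly as follows; this is a hedged reconstruction from the prose, not xpartition's actual implementation:

```python
def plan_splits(n_blocks_per_dim, ranks):
    """Greedily choose how many partitions to create along each dimension.

    Walks dimension by dimension through "blocks" space, splitting each
    dimension as finely as possible while keeping the product of splits at
    or below the number of available ranks.
    """
    splits = {}
    total = 1
    for dim, n_blocks in n_blocks_per_dim.items():
        # Most pieces this dimension can be cut into without exceeding the
        # rank budget, given the partitions already committed; always at
        # least one piece so every dimension is covered.
        budget = ranks // total
        n_splits = max(1, min(n_blocks, budget))
        splits[dim] = n_splits
        total *= n_splits
    return splits

# A 4 x 6 grid of blocks shared across 8 ranks: split x into 4, then y into 2,
# using all 8 ranks. With 9 ranks the same plan is chosen and one rank is idle.
print(plan_splits({"x": 4, "y": 6}, ranks=8))  # {'x': 4, 'y': 2}
```

This also illustrates the "sometimes not all ranks will be used" caveat: the greedy per-dimension products cannot always hit the rank count exactly.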
Ah I see. It goes one dimension at a time. Just to straighten up terminology and make it line up with set theory:
This is described using the types in my comment above: #13 (comment).
For instance, in my picture above, there is no good way to coarsen the lattice into another lattice with 4 ranks. One worker will have to do C11, C21, C12, C22, another will do C33, and the remaining two will do C31/C32 and C13/C23.
Right, I can see how non-orthogonal partitioning could potentially be useful, but it's complex enough that I definitely would not bother to try to support it until there is a clear use-case :). I thought I would mention two other nuances around data structures that have come up for me in xarray-beam:
Interestingly, the current functionality of xpartition and xarray-beam is actually almost entirely orthogonal, aside from the ability to partition a dataset (although even this is done in fairly different ways). To me this is an argument for focusing on a handful of core data structures and defining a library of helper functions that act on those core data structures, something that could underlie the current functionality of both xpartition and xarray-beam.
The main reason I mention it is that xpartition provides a …
It's also a bit redundant with any coordinate info inside of the xarray objects. Presumably if all the dimensions of an xarray object had coordinate info, we wouldn't need to pass around any separate metadata. |
This is true, but I didn't want to make any assumptions about the existence or sorted nature of coordinate values.
Xarray offers limited lazy array functionality for indexing and merging that does not require dask. This can be achieved by using

xarray.open_zarr(..., chunks=None)

Currently xpartition requires dask arrays, but it seems like we could decouple the notion of chunks from xarray/dask by allowing the user to specify the chunks/sizes in
xpartition/xpartition.py (line 116 at commit 7a8bee5)
The user could then use this basic feature to plan jobs, e.g.:
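The example that followed did not survive extraction. As a hedged sketch of the proposed workflow, assuming the user supplies chunk sizes and dimension sizes directly (`plan_jobs` is a hypothetical name, not part of xpartition's API):

```python
import itertools

def plan_jobs(sizes, chunks):
    """Yield dicts of slices covering the dataset, one per block.

    `sizes` maps dimension name -> global length; `chunks` maps dimension
    name -> uniform chunk size. No dask graph is ever constructed.
    """
    dims = list(chunks)
    ranges = []
    for dim in dims:
        step, size = chunks[dim], sizes[dim]
        ranges.append([slice(start, min(start + step, size))
                       for start in range(0, size, step)])
    for combo in itertools.product(*ranges):
        yield dict(zip(dims, combo))

jobs = list(plan_jobs(sizes={"x": 25, "time": 8}, chunks={"x": 10, "time": 4}))
print(len(jobs))  # 3 x-blocks times 2 time-blocks = 6 jobs
# Each worker could then run ds.isel(**job) against a dataset opened lazily
# with xarray.open_zarr(..., chunks=None), without dask ever being involved.
```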