Store the dimension order when using Zarr #9924
Comments
Thanks for opening your first issue here at xarray! Be sure to follow the issue template! |
Can you give us a bit more context on why you'd need to keep the dataset dimension order? |
For consistency when converting to and from Pandas (and from there to and from CSV), and because a lack of consistency will affect adoption of xarray in my team. Converting to and from Pandas is where I originally noticed the inconsistency in the order.
Many of my team members are scientists who are not going to want to deal with xarray/Zarr directly. They have legacy code that works with an ad hoc text format (sorta CSV-like, but not quite) that they've dealt with previously. These members would be perfectly happy working with CSV data, though. However, it's often useful for some members of the team to have larger-than-memory and distributed processing (the ones who would be happy to work with xarray/Zarr directly). I'm looking to switch to xarray/Zarr to make this part easier for those involved in that component of the work, while still easily importing and exporting subsets of the data to and from the CSVs other members of the team will use. The loss of column order from/to the CSVs makes switching to xarray more difficult to justify.
One solution would be to add my own saving and loading of the order. But then any team members who work with the xarray/Zarr data directly would also need to perform the same steps. This is doable, but it is an extra hurdle to justify switching to xarray. If the ordering were automatically consistent through xarray, that would remove this issue. |
If you're using |
Thanks, but unfortunately, the columns/dimensions in our data are not fixed. For a given experiment, they are consistent across all the data from that experiment. But different experiments will have different columns/dimensions, so I'm not able to simply define a consistent order somewhere; it needs to be inferred from the data files. That is, one experiment might produce a collection of CSV(-like) files that all have the same columns. This is where xarray would come in: to store the larger results in Zarr, process the data, and at some point spit back out some CSVs with the same columns. A different experiment would then have different columns but would still need similar processing. |
The ordering of With that in mind, the actual problem here appears to be that We could change |
However, user code that does know that all dims are strings can just use ds.to_dataframe(dim_order=sorted(ds.dims)), which will give you a consistent (sorted) order even if you don't know the actual dimensions. |
Narrowing down to my more concrete situation (e.g., to and from Pandas), I just want to clarify that the issue only occurs after saving to and reloading from Zarr. Converting between Pandas and xarray alone does not trigger the issue I'm running into. For example:
import pandas as pd
import xarray as xr
from pathlib import Path
original_pandas_data_frame = pd.DataFrame({'b': [0, 1], 'a': [2, 3]})
original_xarray_dataset = original_pandas_data_frame.to_xarray()
pandas_data_frame_from_plain_xarray = original_xarray_dataset.to_pandas()
zarr_path = Path('example_dataset.zarr')
original_xarray_dataset.to_zarr(zarr_path, mode='w')
zarr_loaded_xarray_dataset = xr.open_zarr(zarr_path)
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.to_pandas()
print(original_pandas_data_frame.columns)
print(pandas_data_frame_from_plain_xarray.columns)
print(pandas_data_frame_from_zarr_loaded_xarray.columns)
outputs:
(This might have already been clear, but I just wanted to make sure) |
I got that, thanks. The dimension order of a dataset depends on the order in which the dimensions are encountered on the variables from which the dataset is constructed. Either way, if you rely on the order somehow, it is good practice to define that order explicitly somewhere. |
For my case then, I suppose I will try to store the dataset dimensions in the Zarr group |
What I was thinking of was something like this:
def to_ordered_dataframe(ds):
    # use `sorted(ds.dims, key=lambda k: ...)` for a sorting other than alphabetical
    return ds.to_dataframe(dim_order=sorted(ds.dims))
...
pandas_data_frame_from_plain_xarray = original_xarray_dataset.pipe(to_ordered_dataframe)
...
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.pipe(to_ordered_dataframe)
...
If you want to additionally store the expected order for exact roundtripping, I would define a wrapper:
def persist_dim_order(ds):
    # record the current dimension order in the dataset attributes
    return ds.assign_attrs(dim_order=list(ds.dims))
def to_ordered_dataframe(ds):
    # read the recorded order back and use it for the conversion
    dim_order = ds.attrs["dim_order"]
    return ds.to_dataframe(dim_order=dim_order)
...
original_xarray_dataset = original_pandas_data_frame.to_xarray().pipe(persist_dim_order)
...
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.pipe(to_ordered_dataframe)
...
You could even put that into an accessor:
@xr.register_dataset_accessor("ordered_df")
class OrderedDFAccessor:
    def __init__(self, ds):
        self._ds = ds

    def persist_dim_order(self):
        return self._ds.assign_attrs(dim_order=list(self._ds.dims))

    def to_dataframe(self):
        dim_order = self._ds.attrs["dim_order"]
        return self._ds.to_dataframe(dim_order=dim_order)
...
original_xarray_dataset = original_pandas_data_frame.to_xarray().ordered_df.persist_dim_order()
...
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.ordered_df.to_dataframe()
... |
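In the reproduction earlier in this thread, the DataFrame columns become data variables of the dataset, and the single dimension is the index. The same attribute-based idea can therefore be applied to the variable order instead of the dimension order. The following is only a minimal sketch under assumptions: the attribute name `column_order` and both helper functions are hypothetical and not part of xarray, and the Zarr round trip is simulated by sorting the variables alphabetically so that the example runs without a store on disk. It also assumes that the attribute survives the round trip, as dataset attributes normally do.

```python
import pandas as pd
import xarray as xr

ORDER_ATTR = "column_order"  # hypothetical attribute name, not an xarray convention


def persist_column_order(ds: xr.Dataset) -> xr.Dataset:
    """Record the current data variable order in the dataset attributes."""
    return ds.assign_attrs({ORDER_ATTR: list(ds.data_vars)})


def restore_column_order(ds: xr.Dataset) -> xr.Dataset:
    """Reorder the data variables to match the recorded order."""
    return ds[list(ds.attrs[ORDER_ATTR])]


df = pd.DataFrame({"b": [0, 1], "a": [2, 3]})
ds = df.to_xarray().pipe(persist_column_order)

# Stand-in for `to_zarr` followed by `open_zarr`: mimic the alphabetical
# reordering described in this issue without touching the filesystem.
reloaded = ds[sorted(ds.data_vars)]

restored_df = reloaded.pipe(restore_column_order).to_dataframe()
print(list(reloaded.to_dataframe().columns))
print(list(restored_df.columns))
```

Selecting with a list of names returns the variables in the order listed, so anyone who loads the dataset only needs to call the restore helper before exporting to CSV.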
Is your feature request related to a problem?
Currently, if a dataset is saved to Zarr and then reloaded, the dimensions are reloaded alphabetically.
Example:
outputs
Describe the solution you'd like
Store the dimension order when using Zarr, such that when the dataset is reloaded from the Zarr file, the original dimension order is maintained.
Describe alternatives you've considered
N/A
Additional context
N/A