Store the dimension order when using Zarr #9924

golmschenk · 2025-01-04T04:29:16Z

Is your feature request related to a problem?

Currently, if a dataset is saved to Zarr and then reloaded, the dimensions are reloaded alphabetically.

Example:

import xarray as xr
import numpy as np
from pathlib import Path

original_dataset = xr.Dataset(
    {
        'temperature': (('time', 'lat', 'lon'), np.random.rand(5, 4, 3)),
        'precipitation': (('time', 'lat', 'lon'), np.random.rand(5, 4, 3))
    },
    coords={
        'time': np.arange(5),
        'lat': np.linspace(-90, 90, 4),
        'lon': np.linspace(0, 360, 3)
    }
)

print('Original Dimensions:', list(original_dataset.dims))
zarr_path = Path('example_dataset.zarr')
original_dataset.to_zarr(zarr_path, mode='w')
reloaded_dataset = xr.open_zarr(zarr_path)
print('Reloaded Dimensions:', list(reloaded_dataset.dims))

outputs:

Original Dimensions: ['time', 'lat', 'lon']
Reloaded Dimensions: ['lat', 'lon', 'time']

Describe the solution you'd like

Store the dimension order when using Zarr, such that when the dataset is reloaded from the Zarr file, the original dimension order is maintained.

Describe alternatives you've considered

N/A

Additional context

N/A


welcome · 2025-01-04T04:29:20Z

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

keewis · 2025-01-04T16:23:24Z

For xarray (and zarr, I believe), the dimension order as listed in the dataset dimensions affects very little: as far as I can tell, it only shows up in the string and HTML reprs and in the default values for to_dataframe (see also #9921 for a recent discussion).

Can you give us a bit more context on why you'd need to keep the dataset dimension order?

golmschenk · 2025-01-04T17:02:55Z

> Can you give us a bit more context on why you'd need to keep the dataset dimension order?

For consistency when converting to and from Pandas (and from there to and from CSV), and because a lack of consistency will affect adoption of xarray in my team. Converting to and from Pandas is where I originally noticed the inconsistency in the order.

Many of my team members are scientists who are not going to want to deal with xarray/Zarr directly. They have legacy code that works with an ad hoc text format (sorta CSV-like, but not quite) they've dealt with previously. These members would be perfectly happy working with CSV data, though. However, it's often useful for some members of the team to have larger-than-memory and distributed processing (the ones who would be happy to work with xarray/Zarr directly). I'm looking to switch to xarray/Zarr to make this part easier for those who are involved in that component of the work, while still easily importing and exporting subsets of the data back to the CSV that other members of the team will use. The loss of column order from/to the CSVs makes switching to xarray more difficult to justify. One solution would be to add my own saving and loading of the order. But then any of the team members who work with the xarray/Zarr data directly will also need to make sure to perform the same steps. This is doable, but it's an extra hurdle when justifying the switch to xarray. If this ordering were automatically consistent through xarray, that would remove the issue.

keewis · 2025-01-04T17:16:50Z

If you're using to_dataframe (to_pandas also calls that but doesn't let you pass arguments) you can try using Dataset.to_dataframe's dim_order parameter to choose a fixed/constant dimension order (see also #9718 for a recent discussion on that topic).

golmschenk · 2025-01-04T17:32:16Z

Thanks, but unfortunately, the columns/dimensions in our data are not fixed. For a given experiment, they are consistent across all the data from that experiment. But different experiments will have different columns/dimensions, so I'm not able to simply define a consistent order somewhere. It needs to be inferred from the data files. That is, one experiment might produce a collection of CSV(-like) files that all have the same columns. This is where xarray would come in: to store the larger results in Zarr, process the data, and at some point spit back out some CSVs with the same columns. A different experiment would then have different columns, but would still need similar processing.

jhamman · 2025-01-04T17:33:35Z

The ordering of list(Dataset.dims) is a red herring here. Or at a minimum, it isn't the core part of the problem you are after. The dims property of the Dataset is a mapping including all the names and sizes of dimensions for all variables in your dataset. Unlike the DataArray.dims property, it may not be ordered.

With that in mind, the actual problem here appears to be that to_dataframe seems to be relying on the ordering of the Dataset.dims property (or something similar -- I haven't looked at the code in a while).

We could change Dataset.dims to always be sorted somehow. Or we could update how to_dataframe chooses its dimension order.

keewis · 2025-01-04T17:40:56Z

list(Dataset.dims) is the default value for the dim_order parameter. Sorting Dataset.dims won't work since we don't know (in general) how to compare dimension names, which can be any hashable.

However, user code that does know that all dims are strings can just do (use sorted's key parameter if you don't want to sort alphabetically):

ds.to_dataframe(dim_order=sorted(ds.dims))

which will give you a consistent (sorted) order, even if you don't know the actual dimensions.
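As a minimal sketch of what that call does (the small dataset and its values below are made up for illustration, not taken from the thread), the dimensions passed as dim_order become the index levels of the resulting DataFrame, in that order:

```python
import numpy as np
import xarray as xr

# Illustrative dataset: one variable on ('time', 'lat').
ds = xr.Dataset(
    {"temperature": (("time", "lat"), np.arange(6).reshape(3, 2))},
    coords={"time": [0, 1, 2], "lat": [10.0, 20.0]},
)

# Sorting the dimension names gives an order that does not depend on how
# the dataset was constructed or loaded.
df = ds.to_dataframe(dim_order=sorted(ds.dims))

index_levels = list(df.index.names)
print(index_levels)
print(df)
```

Note that dim_order controls the order of the index levels; the columns of the DataFrame are the data variables.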

golmschenk · 2025-01-04T17:52:00Z

Narrowing down to my more concrete situation (e.g., to and from Pandas), I just want to clarify that the issue only occurs after saving and reloading from Zarr. Just converting to and from Pandas and xarray alone does not encounter the issue I'm running into. For example:

import pandas as pd
import xarray as xr
from pathlib import Path

original_pandas_data_frame = pd.DataFrame({'b': [0, 1], 'a': [2, 3]})

original_xarray_dataset = original_pandas_data_frame.to_xarray()
pandas_data_frame_from_plain_xarray = original_xarray_dataset.to_pandas()

zarr_path = Path('example_dataset.zarr')
original_xarray_dataset.to_zarr(zarr_path, mode='w')
zarr_loaded_xarray_dataset = xr.open_zarr(zarr_path)
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.to_pandas()

print(original_pandas_data_frame.columns)
print(pandas_data_frame_from_plain_xarray.columns)
print(pandas_data_frame_from_zarr_loaded_xarray.columns)

outputs:

Index(['b', 'a'], dtype='object')
Index(['b', 'a'], dtype='object')
Index(['a', 'b'], dtype='object')

(This might have already been clear, but I just wanted to make sure)

keewis · 2025-01-04T21:16:06Z

I got that, thanks.

The dimension order of a dataset depends on the order in which it is seen on the variables from which the dataset is constructed. The zarr store returns variables in an alphabetical order (most likely because that's how it got them from the filesystem), which means xarray will see the dimensions in this order. For your first example that would be lat (from the lat coordinate), then lon (from the lon coordinate), then time (from precipitation).

Either way, if you rely on the order somehow it is good practice to explicitly define that order somewhere.
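The "first seen" ordering described above can be illustrated without xarray at all. The following is a simplified pure-Python model, not xarray's actual implementation; first_seen_dims is a hypothetical helper, and the mapping from variable names to dimension tuples mirrors the first example in this issue:

```python
def first_seen_dims(variables):
    """Collect dimension names in the order they are first encountered."""
    seen = {}  # dicts preserve insertion order, so this acts as an ordered set
    for dims in variables.values():
        for dim in dims:
            seen.setdefault(dim, None)
    return list(seen)


grid = ("time", "lat", "lon")

# Variables in the order the dataset was built in memory.
in_memory = {
    "temperature": grid,
    "precipitation": grid,
    "time": ("time",),
    "lat": ("lat",),
    "lon": ("lon",),
}

# The same variables, handed back in alphabetical order by name.
from_store = {name: in_memory[name] for name in sorted(in_memory)}

original_order = first_seen_dims(in_memory)
reloaded_order = first_seen_dims(from_store)
print(original_order)  # ['time', 'lat', 'lon']
print(reloaded_order)  # ['lat', 'lon', 'time']
```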

golmschenk · 2025-01-05T04:20:39Z

For my case then, I suppose I will try to store the dataset dimensions in the Zarr group attrs, similar to how xarray currently stores the array dimensions in the Zarr array attrs. The downside of this is that we'll need to consistently use wrapped versions of xarray's Zarr saving and loading. I guess this might also be a solution for this feature request, but I don't know what obstacles might prevent this, or whether it would fall outside the scope expected of xarray. Thank you very much!

keewis · 2025-01-05T11:33:01Z

What I was thinking of was something like this:

def to_ordered_dataframe(ds):
    # use `sorted(ds.dims, key=lambda k: ...)` for a sorting other than alphabetical
    return ds.to_dataframe(dim_order=sorted(ds.dims))

...
pandas_data_frame_from_plain_xarray = original_xarray_dataset.pipe(to_ordered_dataframe)
...
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.pipe(to_ordered_dataframe)
...

If you want to additionally store the expected order for exact roundtripping, I would define a wrapper around .to_dataframe:

def persist_dim_order(ds):
    return ds.assign_attrs(dim_order=list(ds.dims))

def to_ordered_dataframe(ds):
    dim_order = ds.attrs["dim_order"]
    return ds.to_dataframe(dim_order=dim_order)

...
original_xarray_dataset = original_pandas_data_frame.to_xarray().pipe(persist_dim_order)
...
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.pipe(to_ordered_dataframe)
...

you could even put that into an accessor:

@xr.register_dataset_accessor("ordered_df")
class OrderedDFAccessor:
    def __init__(self, ds):
        self._ds = ds

    def persist_dim_order(self):
        return self._ds.assign_attrs(dim_order=list(self._ds.dims))

    def to_dataframe(self):
        dim_order = self._ds.attrs["dim_order"]
        return self._ds.to_dataframe(dim_order=dim_order)

...
original_xarray_dataset = original_pandas_data_frame.to_xarray().ordered_df.persist_dim_order()
...
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.ordered_df.to_dataframe()
...
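In the DataFrame example earlier in the thread, the columns that change order ('b', 'a') are data variables, and the only dimension is index, so as far as I can tell dim_order alone would not restore them. The same attrs idea can be applied to the variable names instead. This is a minimal sketch under assumptions: column_order, persist_column_order and to_dataframe_in_stored_order are hypothetical names, and selecting the variables in sorted order only imitates a store that returns them alphabetically (no Zarr store is written here).

```python
import pandas as pd
import xarray as xr

ATTR_NAME = "column_order"  # hypothetical attribute name


def persist_column_order(ds):
    """Record the current data-variable order in the dataset attrs."""
    return ds.assign_attrs({ATTR_NAME: [str(name) for name in ds.data_vars]})


def to_dataframe_in_stored_order(ds):
    """Convert to a DataFrame, then restore the recorded column order."""
    columns = list(ds.attrs[ATTR_NAME])
    return ds.to_dataframe()[columns]


original = pd.DataFrame({"b": [0, 1], "a": [2, 3]})
ds = original.to_xarray().pipe(persist_column_order)

# Stand-in for reloading from a store that lists variables alphabetically.
reordered = ds[sorted(ds.data_vars)]

restored = to_dataframe_in_stored_order(reordered)
print(list(reordered.to_dataframe().columns))
print(list(restored.columns))
```

A list of strings in attrs is plain JSON-like metadata, so it should survive a Zarr round trip, but anyone reading the data would still need to go through the wrapper to get the stored order back.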

golmschenk added the enhancement label Jan 4, 2025
