Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transpose() does not appear to change dataset dimension order. #9921

Open
5 tasks
Tristanjmeyers opened this issue Jan 2, 2025 · 10 comments
Open
5 tasks

transpose() does not appear to change dataset dimension order. #9921

Tristanjmeyers opened this issue Jan 2, 2025 · 10 comments

Comments

@Tristanjmeyers
Copy link

What happened?

Hello,

I am using .transpose() on an xr.Dataset in order to change the dimension order to make it cf-compliant, e.g. by having the dimension order be TIME, X, Y. While it appears to work when I do a print on the dataset, the dimension order doesn't actually change.

What did you expect to happen?

I expect that the listed dimension order (ds.dims) updates, so that I can specify the exact order of the dimensions as required for CF-compliance.

Minimal Complete Verifiable Example

import xarray as xr
import pandas as pd
import numpy as np

# Dataset construction taken from xr.Dataset documentation
np.random.seed(0)
temperature = 15 + 8 * np.random.randn(2, 3, 4)
precipitation = 10 * np.random.rand(2, 3, 4)
lon = [-99.83, -99.32]
lat = [42.25, 42.21]
instruments = ["manufac1", "manufac2", "manufac3"]
time = pd.date_range("2014-09-06", periods=4)
reference_time = pd.Timestamp("2014-09-05")

ds = xr.Dataset(
    data_vars=dict(
        temperature=(["loc", "instrument", "time"], temperature),
        precipitation=(["loc", "instrument", "time"], precipitation),
    ),
    coords=dict(
        lon=("loc", lon),
        lat=("loc", lat),
        instrument=instruments,
        time=time,
        reference_time=reference_time,
    ),
    attrs=dict(description="Weather related data."),
)

ds_transposed = ds.transpose('time','loc','instrument')

# Order has not changed, returns True
ds_transposed.dims == ds.dims

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

<frozen _collections_abc>:834: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.
<frozen _collections_abc>:894: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.
True

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.8 | packaged by conda-forge | (main, Dec 5 2024, 14:24:40) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.2.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_NZ.UTF-8
LOCALE: ('en_NZ', 'UTF-8')
libhdf5: 1.14.4
libnetcdf: 4.9.2

xarray: 2024.11.0
pandas: 2.2.3
numpy: 2.2.1
scipy: 1.14.1
netCDF4: 1.7.2
pydap: None
h5netcdf: 1.4.1
h5py: 3.12.1
zarr: None
cftime: 1.6.4
nc_time_axis: None
iris: 3.11.0
bottleneck: 1.4.2
dask: 2024.12.1
distributed: 2024.12.1
matplotlib: 3.10.0
cartopy: 0.24.0
seaborn: None
numbagg: None
fsspec: 2024.12.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 75.6.0
pip: 24.3.1
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

@Tristanjmeyers Tristanjmeyers added bug needs triage Issue that has not been reviewed by xarray team member labels Jan 2, 2025
Copy link

welcome bot commented Jan 2, 2025

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@keewis keewis added usage question and removed bug needs triage Issue that has not been reviewed by xarray team member labels Jan 2, 2025
@keewis
Copy link
Collaborator

keewis commented Jan 2, 2025

transpose by design does not touch the order in ds.sizes / ds.dims; that is set by the constructor and remains constant afterwards. Instead, it changes the order of dimensions of the variables (both data variables and coordinates, for DataArray.transpose this can be controlled by a keyword argument):

transposed = ds.transpose('time','loc','instrument')

assert ds["temperature"].dims != transposed["temperature"].dims

@Tristanjmeyers
Copy link
Author

Thanks @keewis . How would one then achieve this re-ordering of dimensions in ds.dims?

@keewis
Copy link
Collaborator

keewis commented Jan 3, 2025

you'd have to recreate the dataset from the variables (if I remember correctly... there might be a better way to do this):

transposed = ds.transpose("time", "loc", "instrument")
recreated = xr.Dataset(
    data_vars={k: arr.variable for k, v in transposed.data_vars.items()},
    coords=transposed.coords,
    attrs=transposed.attrs,
)

However, you shouldn't need to do this, as the order in ds.dims changes very little of the behavior of xarray: as far as I remember, only the string repr / HTML repr and the default value for to_dataframe's dim_order parameter are affected.

@Tristanjmeyers
Copy link
Author

Thanks, appreciate it. I will try this.

As you have pointed out, this is pretty inconsequential in xarray, but I have to often save netcdf files with strict dimension ordering and metadata so these other programs can read them without errors, e.g. R, NCL, and THREDDS data servers.

Currently, I save the data with to_netcdf(), then use CDO to do a cdo copy on the netcdf to create another netcdf which can be "universally read" by these other programs. I'm trying to have an xarray-only solution to this, and I suspect dimension ordering is one of the issues, as CDO moves the order of them around to be congruent with cf-compliance.

@keewis
Copy link
Collaborator

keewis commented Jan 3, 2025

hmm... I wonder if we have an encoding setting for this? @dcherian, do you have any ideas?

@dcherian
Copy link
Contributor

dcherian commented Jan 3, 2025

I don't think so, but I am also very confused. The dimension order in the file will follow the order of appearance as you iterate through variables AFAICT

dims = {}
for v in unlimited_dims: # put unlimited_dims first
dims[v] = None
for v in variables.values():
dims.update(dict(zip(v.dims, v.shape, strict=True)))

So what is cdo copy changing about the file? Can you post a diff of ncdump -sh for the xarray and cdo files please?

@Tristanjmeyers
Copy link
Author

Okay sure. Here's a real world example. It's a post-processed EPS file from ECMWF's open data.

via to_netcdf:

> ncdump -h to_netcdf.nc 
netcdf to_netcdf {
dimensions:
	latitude = 4 ;
	longitude = 8 ;
	time = 215 ;
variables:
	double latitude(latitude) ;
		latitude:_FillValue = NaN ;
		latitude:units = "degrees_north" ;
		latitude:standard_name = "latitude" ;
		latitude:long_name = "latitude" ;
		latitude:stored_direction = "decreasing" ;
	double longitude(longitude) ;
		longitude:_FillValue = NaN ;
		longitude:units = "degrees_east" ;
		longitude:standard_name = "longitude" ;
		longitude:long_name = "longitude" ;
	int time(time) ;
		time:units = "days since 2024-12-02 00:00:00" ;
		time:calendar = "proleptic_gregorian" ;
	float tp(time, longitude, latitude) ;
		tp:_FillValue = NaNf ;
		tp:GRIB_paramId = 228 ;
		tp:GRIB_dataType = "fc" ;
		tp:GRIB_numberOfPoints = 32 ;
		tp:GRIB_typeOfLevel = "surface" ;
		tp:GRIB_stepUnits = 1 ;
		tp:GRIB_stepType = "instant" ;
		tp:GRIB_gridType = "regular_ll" ;
		tp:GRIB_uvRelativeToGrid = 0 ;
		tp:GRIB_NV = 0 ;
		tp:GRIB_Nx = 8 ;
		tp:GRIB_Ny = 4 ;
		tp:GRIB_cfName = "unknown" ;
		tp:GRIB_cfVarName = "tp" ;
		tp:GRIB_gridDefinitionDescription = "Latitude/Longitude Grid" ;
		tp:GRIB_iDirectionIncrementInDegrees = 1. ;
		tp:GRIB_iScansNegatively = 0 ;
		tp:GRIB_jDirectionIncrementInDegrees = 1. ;
		tp:GRIB_jPointsAreConsecutive = 0 ;
		tp:GRIB_jScansPositively = 0 ;
		tp:GRIB_latitudeOfFirstGridPointInDegrees = -12. ;
		tp:GRIB_latitudeOfLastGridPointInDegrees = -15. ;
		tp:GRIB_longitudeOfFirstGridPointInDegrees = -175. ;
		tp:GRIB_longitudeOfLastGridPointInDegrees = -168. ;
		tp:GRIB_missingValue = 3.40282346638529e+38 ;
		tp:GRIB_name = "Total precipitation" ;
		tp:GRIB_shortName = "tp" ;
		tp:GRIB_system = 51 ;
		tp:GRIB_totalNumber = 0 ;
		tp:GRIB_units = "m" ;
		tp:long_name = "Total precipitation" ;
		tp:units = "m" ;
		tp:standard_name = "unknown" ;
		tp:GRIB_surface = 0. ;
		tp:cell_methods = "number: mean" ;
		tp:coordinates = "time step latitude longitude valid_time" ;
}

Then I did a cdo copy of the same dataset:

> ncdump -h cdo-copy.nc 
netcdf cdo-copy {
dimensions:
	time = UNLIMITED ; // (215 currently)
	longitude = 8 ;
	latitude = 4 ;
variables:
	double time(time) ;
		time:standard_name = "time" ;
		time:units = "days since 2024-12-02 00:00:00" ;
		time:calendar = "proleptic_gregorian" ;
		time:axis = "T" ;
	double longitude(longitude) ;
		longitude:standard_name = "longitude" ;
		longitude:long_name = "longitude" ;
		longitude:units = "degrees_east" ;
		longitude:axis = "X" ;
	double latitude(latitude) ;
		latitude:standard_name = "latitude" ;
		latitude:long_name = "latitude" ;
		latitude:units = "degrees_north" ;
		latitude:axis = "Y" ;
	float tp(time, longitude, latitude) ;
		tp:standard_name = "unknown" ;
		tp:long_name = "Total precipitation" ;
		tp:units = "m" ;
		tp:_FillValue = NaNf ;
		tp:missing_value = NaNf ;
		tp:GRIB_paramId = 228 ;
		tp:GRIB_dataType = "fc" ;
		tp:GRIB_numberOfPoints = 32 ;
		tp:GRIB_typeOfLevel = "surface" ;
		tp:GRIB_stepUnits = 1 ;
		tp:GRIB_stepType = "instant" ;
		tp:GRIB_gridType = "regular_ll" ;
		tp:GRIB_uvRelativeToGrid = 0 ;
		tp:GRIB_NV = 0 ;
		tp:GRIB_Nx = 8 ;
		tp:GRIB_Ny = 4 ;
		tp:GRIB_cfName = "unknown" ;
		tp:GRIB_cfVarName = "tp" ;
		tp:GRIB_gridDefinitionDescription = "Latitude/Longitude Grid" ;
		tp:GRIB_iDirectionIncrementInDegrees = 1. ;
		tp:GRIB_iScansNegatively = 0 ;
		tp:GRIB_jDirectionIncrementInDegrees = 1. ;
		tp:GRIB_jPointsAreConsecutive = 0 ;
		tp:GRIB_jScansPositively = 0 ;
		tp:GRIB_latitudeOfFirstGridPointInDegrees = -12. ;
		tp:GRIB_latitudeOfLastGridPointInDegrees = -15. ;
		tp:GRIB_longitudeOfFirstGridPointInDegrees = -175. ;
		tp:GRIB_longitudeOfLastGridPointInDegrees = -168. ;
		tp:GRIB_missingValue = 3.40282346638529e+38 ;
		tp:GRIB_name = "Total precipitation" ;
		tp:GRIB_shortName = "tp" ;
		tp:GRIB_system = 51 ;
		tp:GRIB_totalNumber = 0 ;
		tp:GRIB_units = "m" ;
		tp:GRIB_surface = 0. ;
		tp:cell_methods = "number: mean" ;

// global attributes:
		:CDI = "Climate Data Interface version 1.9.5 (http://mpimet.mpg.de/cdi)" ;
		:Conventions = "CF-1.6" ;
		:history = "Fri Jan 03 00:49:49 2025: cdo copy test.nc test_cdo-ed.nc" ;
		:CDO = "Climate Data Operators version 1.9.5 (http://mpimet.mpg.de/cdo)" ;
}

Note the dimension ordering, which is T, ..., standard for cf-compliance, and the addition of "axis" in the metadata.

I think this is what is required for other programs to read the netcdf correctly. I'm still debugging this though, hence why I've been trying to use "transpose". However, as @keewis suggested, I can just create a new dataset with the correct dimension ordering and populate it with the data.

@dcherian
Copy link
Contributor

dcherian commented Jan 3, 2025

time = 215 ;
time = UNLIMITED ; // (215 currently)

I think this is the real difference. CDO is doing some interpretation and adding attributes which Xarray will not do. You can specify .attrs['axis']="T" and .to_netcdf(..., unlimited_dims=("time",)) to get to that output I think.

Presumably some of the software you are using expects an "unlimited dim"?

T, ..., standard for cf-compliance,

Note that CF "recommends" it and the ordering is not required. Also note that the data variables have the dimension ordering you specified: float tp(time, longitude, latitude).

EDIT: I missed this. It is still funny that "time" appears last and not first. Ah, it's presumably because the "indexed" coordinate variables appear first in the variables dict.

@Tristanjmeyers
Copy link
Author

Tristanjmeyers commented Jan 3, 2025

FWIW I've tried adding the metadata attributes itself and it doesn't seem to make a difference, the only thing that works is the CDO copy, so that's why I'm trying the transposing of the dimensions now.

Programs tested on:

Happy to share what I find :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants