Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix HDFVirtualBackend handling of non coordinate dimension HDF datasets. #410

Merged
merged 5 commits into from
Jan 31, 2025

Conversation

sharkinsspatial
Copy link
Collaborator

Copy link

codecov bot commented Jan 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.97%. Comparing base (81a76f0) to head (ce118fb).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #410      +/-   ##
==========================================
+ Coverage   75.61%   77.97%   +2.35%     
==========================================
  Files          33       31       -2     
  Lines        1993     1834     -159     
==========================================
- Hits         1507     1430      -77     
+ Misses        486      404      -82     
Files with missing lines Coverage Δ
virtualizarr/readers/hdf/hdf.py 95.08% <100.00%> (+0.40%) ⬆️

... and 3 files with indirect coverage changes

@sharkinsspatial
Copy link
Collaborator Author

For some reason I can't add @jsignell as a reviewer for this PR directly.

Copy link
Contributor

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I just took a little read through but it works great!


vds = open_virtual_dataset(simple_netcdf4, loadable_variables=["foo"])
assert vds.virtualize.nbytes == 104
assert vds.virtualize.nbytes == 48
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏🏻

Comment on lines 413 to 423
dimension_names = []
non_coordinate_dimesion_variables = []
for name, obj in group.items():
if "_Netcdf4Dimid" in obj.attrs:
dimension_names.append(name)
for name, obj in group.items():
if type(obj) is h5py.Dataset:
if obj.id.get_storage_size() == 0 and name in dimension_names:
non_coordinate_dimesion_variables.append(name)

return non_coordinate_dimesion_variables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be simplified to:

Suggested change
dimension_names = []
non_coordinate_dimesion_variables = []
for name, obj in group.items():
if "_Netcdf4Dimid" in obj.attrs:
dimension_names.append(name)
for name, obj in group.items():
if type(obj) is h5py.Dataset:
if obj.id.get_storage_size() == 0 and name in dimension_names:
non_coordinate_dimesion_variables.append(name)
return non_coordinate_dimesion_variables
return [
name
for name, obj in group.items()
if isinstance(obj, h5py.Dataset)
and "_Netcdf4Dimid" in obj.attrs
and obj.id.get_storage_size() == 0
]

Copy link
Member

@TomNicholas TomNicholas Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, but Sean's version it's melting my brain to look at

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this swap locally and it worked fine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually an indent error on my part 🤦 . We need to first scan all items to determine all dimensions and then subsequently check if any of the those dimensions are represented by 0 size HDF5 datasets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are iterating over group.items both times though right? So you could collapse that into one iteration. There is no way to get a name in dimension_names without the related obj having obj.attrs == "_Netcdf4Dimid".

But this is all just a nitpick. I think the only part that matters is using isinstance rather than type

non_coordinate_dimesion_vars = HDFVirtualBackend._find_non_coord_dimension_vars(
group=g
)
drop_variables = list(set(drop_variables + non_coordinate_dimesion_vars))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick but I think this is more legible and faster:

Suggested change
drop_variables = list(set(drop_variables + non_coordinate_dimesion_vars))
drop_variables = [*drop_variables, *non_coordinate_dimension_vars]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speed of this is not important, and I think the first one makes intent clearer (if the intent is to de-duplicate using set?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I was getting too nitpicky. Just the typo then :)

Comment on lines +374 to +376
non_coordinate_dimesion_vars = HDFVirtualBackend._find_non_coord_dimension_vars(
group=g
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: typo (dimesion -> dimension)

Suggested change
non_coordinate_dimesion_vars = HDFVirtualBackend._find_non_coord_dimension_vars(
group=g
)
non_coordinate_dimension_vars = (
HDFVirtualBackend._find_non_coord_dimension_vars(group=g)
)

@jsignell jsignell mentioned this pull request Jan 31, 2025
6 tasks
Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks @sharkinsspatial !

@TomNicholas
Copy link
Member

For some reason I can't add @jsignell as a reviewer for this PR directly.

Same reason as this #392 (comment), and I forgot to ask Josh to add you to the team that has merge permissions 😞

@sharkinsspatial sharkinsspatial merged commit 61847e9 into main Jan 31, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HDF reader creating spurious dimension coordinates
3 participants