-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix HDFVirtualBackend handling of non coordinate dimension HDF datasets. #410
Conversation
sharkinsspatial
commented
Jan 31, 2025
- Closes HDF reader creating spurious dimension coordinates #401
- Tests added
- Tests passing
- Full type hint coverage
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #410 +/- ##
==========================================
+ Coverage 75.61% 77.97% +2.35%
==========================================
Files 33 31 -2
Lines 1993 1834 -159
==========================================
- Hits 1507 1430 -77
+ Misses 486 404 -82
|
4002179
to
5d12003
Compare
For some reason I can't add @jsignell as a reviewer for this PR directly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I just took a little read through but it works great!
|
||
vds = open_virtual_dataset(simple_netcdf4, loadable_variables=["foo"]) | ||
assert vds.virtualize.nbytes == 104 | ||
assert vds.virtualize.nbytes == 48 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏🏻
virtualizarr/readers/hdf/hdf.py
Outdated
dimension_names = [] | ||
non_coordinate_dimesion_variables = [] | ||
for name, obj in group.items(): | ||
if "_Netcdf4Dimid" in obj.attrs: | ||
dimension_names.append(name) | ||
for name, obj in group.items(): | ||
if type(obj) is h5py.Dataset: | ||
if obj.id.get_storage_size() == 0 and name in dimension_names: | ||
non_coordinate_dimesion_variables.append(name) | ||
|
||
return non_coordinate_dimesion_variables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be simplified to:
dimension_names = [] | |
non_coordinate_dimesion_variables = [] | |
for name, obj in group.items(): | |
if "_Netcdf4Dimid" in obj.attrs: | |
dimension_names.append(name) | |
for name, obj in group.items(): | |
if type(obj) is h5py.Dataset: | |
if obj.id.get_storage_size() == 0 and name in dimension_names: | |
non_coordinate_dimesion_variables.append(name) | |
return non_coordinate_dimesion_variables | |
return [ | |
name | |
for name, obj in group.items() | |
if isinstance(obj, h5py.Dataset) | |
and "_Netcdf4Dimid" in obj.attrs | |
and obj.id.get_storage_size() == 0 | |
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, but Sean's version it's melting my brain to look at
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this swap locally and it worked fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was actually an indent error on my part 🤦 . We need to first scan all items to determine all dimensions and then subsequently check if any of the those dimensions are represented by 0 size HDF5 datasets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are iterating over group.items
both times though right? So you could collapse that into one iteration. There is no way to get a name in dimension_names
without the related obj
having obj.attrs == "_Netcdf4Dimid"
.
But this is all just a nitpick. I think the only part that matters is using isinstance
rather than type
non_coordinate_dimesion_vars = HDFVirtualBackend._find_non_coord_dimension_vars( | ||
group=g | ||
) | ||
drop_variables = list(set(drop_variables + non_coordinate_dimesion_vars)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick but I think this is more legible and faster:
drop_variables = list(set(drop_variables + non_coordinate_dimesion_vars)) | |
drop_variables = [*drop_variables, *non_coordinate_dimension_vars] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speed of this is not important, and I think the first one makes intent clearer (if the intent is to de-duplicate using set?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I was getting too nitpicky. Just the typo then :)
non_coordinate_dimesion_vars = HDFVirtualBackend._find_non_coord_dimension_vars( | ||
group=g | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: typo (dimesion
-> dimension
)
non_coordinate_dimesion_vars = HDFVirtualBackend._find_non_coord_dimension_vars( | |
group=g | |
) | |
non_coordinate_dimension_vars = ( | |
HDFVirtualBackend._find_non_coord_dimension_vars(group=g) | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks @sharkinsspatial !
Same reason as this #392 (comment), and I forgot to ask Josh to add you to the team that has merge permissions 😞 |