-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add function to combine H5 dat files #6298
Conversation
7c0a988
to
a7d1cef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please read all the comments first. I give a few different options that I think would all resolve my concerns and I don't have a preference for which to do :)
src/IO/H5/CombineH5.cpp
Outdated
// in the sequence of files since we are looping backward) is before | ||
// any of the times in this file. If so, don't include those times. | ||
std::optional<size_t> row = times.rows() - 1; | ||
while (times(row.value(), 0) >= earliest_time.value()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that we guarantee that our HDF5 Dat output is actually sorted in time. I think you need to first sort the matrix by the first column before combining. I think what you need to do is store the index of the sorted matrix, and then sort again in the next for loop below.
src/IO/H5/CombineH5.cpp
Outdated
// the number of times and the earliest time of this file | ||
if (not earliest_time.has_value()) { | ||
num_time_map.at(dat_filename)[index] = dimensions[0]; | ||
earliest_time = times(0, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not guarantee to be the earliest time. It's the first time written, but that doesn't have to be the earliest time, I think. I believe our observers do not enforce ordering on write.
@@ -149,4 +157,186 @@ void combine_h5_vol(const std::vector<std::string>& file_names, | |||
new_file.close_current_object(); | |||
} | |||
} | |||
|
|||
void combine_h5_dat(const std::vector<std::string>& h5_files_to_combine, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you assume that the files are in increasing order in time. It would be good to add a check for that since otherwise users may be very surprised. Another option would be to just sort the list in increasing order for the subfile(s).
src/IO/H5/CombineH5.cpp
Outdated
// Only append data if we include data from this file | ||
if (num_times.has_value()) { | ||
// Always start with row 0 | ||
const Matrix data_to_append = input_dat_file.get_data_subset( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to load the entire matrix, sort it, and then trim it before write.
src/IO/H5/CombineH5.hpp
Outdated
* will be ignored and will not appear in \p output_h5_filename. This function | ||
* also assumes that the times in each of the \p h5_files_to_combine are already | ||
* sorted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this second assumption is generally not valid for spectre, unfortunately. If you would like to keep it, I think you should add a check that it's true because someone will inevitably run this over a file where it's not true.
src/IO/H5/CombineH5.hpp
Outdated
* meaning if you have data in `File1.h5` and `File2.h5` and if the first time | ||
* in `File2.h5` is before some times in `File1.h5`, those times in `File1.h5` | ||
* will be discarded and won't appear in the combined H5 file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the constraint is quite strong and should be explicit: the files past must be in increasing time order.
if (file_system::check_if_file_exists(filename)) { | ||
file_system::rm(filename, true); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to add tests for catching unordered files passed in, and for unordered data in the subfiles. I worry that these will both be common mistakes.
5935c42
to
a67d874
Compare
I decided to simply assert that the H5 files were monotonically increasing in their earliest time in the files. I think something else should be responsible for actually putting the H5 files in order properly. This also allows for the times in each dat file to be unordered. One downside is that we have to read all data in from a dat file first, sort it, then trim the overlapping times we don't need. But that shouldn't be too bad. To facilitate this sorting (because a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, a few minor suggestions you can make when you squash. Thanks for doing this!
src/IO/H5/CombineH5.cpp
Outdated
|
||
// Makes things easier below. | ||
if (UNLIKELY(times.rows() == 0)) { | ||
if (UNLIKELY(times.size() == 0)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy: empty()
when you squash
src/IO/H5/CombineH5.cpp
Outdated
<< h5_files_to_combine | ||
<< " are not monotonically increasing in their first " | ||
"times for dat file " | ||
<< dat_filename << ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove << ""
?
src/IO/H5/CombineH5.cpp
Outdated
// the times in this file. If so, don't include those times. | ||
size_t row = times.size() - 1; | ||
while (times[row][0] >= earliest_time.value()) { | ||
// This should have been taken care of before so we should never get |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to think about this (terrible, I know), but maybe change to ~ Make sure we don't reach row 0 since that would mean the files are not ordered. We should've checked for this above, so this is an additional safety check.
Now they can return either a Matrix or a vector<vector<double>>
This function also handles overlapping times by taking the "latest" data always.
a67d874
to
48fe568
Compare
Proposed changes
Towards #6246.
This function is distinct from the existing CLI function
spectre combine-h5 dat
because this takes care of overlapping times in the dat subfiles (we could optionally replace the python function with this one if we bind this function). Also, this is a C++ function and not a python function because it will be used in theReduceCceWorldtube
executable which must be statically linked.Upgrade instructions
Code review checklist
make doc
to generate the documentation locally intoBUILD_DIR/docs/html
.Then open
index.html
.code review guide.
bugfix
ornew feature
if appropriate.Further comments