Add function to combine H5 dat files #6298

knelli2 · 2024-09-19T23:38:00Z

Proposed changes

Towards #6246.

This function differs from the existing CLI command spectre combine-h5 dat in that it handles overlapping times in the dat subfiles (if we add Python bindings for this function, it could optionally replace the Python one). It is written in C++ rather than Python because it will be used in the ReduceCceWorldtube executable, which must be statically linked.

Upgrade instructions

Code review checklist

The code is documented and the documentation renders correctly. Run make doc to generate the documentation locally into BUILD_DIR/docs/html. Then open index.html.
The code follows the stylistic and code quality guidelines listed in the code review guide.
The PR lists upgrade instructions and is labeled bugfix or new feature if appropriate.

Further comments

nilsdeppe

Please read all the comments first. I give a few different options that I think would all resolve my concerns and I don't have a preference for which to do :)

nilsdeppe · 2024-09-23T18:09:41Z

src/IO/H5/CombineH5.cpp

+      // in the sequence of files since we are looping backward) is before
+      // any of the times in this file. If so, don't include those times.
+      std::optional<size_t> row = times.rows() - 1;
+      while (times(row.value(), 0) >= earliest_time.value()) {


I'm not sure that we guarantee that our HDF5 Dat output is actually sorted in time. I think you need to first sort the matrix by the first column before combining. I think what you need to do is store the index of the sorted matrix, and then sort again in the next for loop below.
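
One way to picture the "store the index of the sorted matrix" suggestion is an argsort over the time column. The sketch below is only an illustration under stated assumptions: it uses `std::vector<std::vector<double>>` as a stand-in for the rows of a dat subfile instead of SpECTRE's `Matrix`, and `sorted_row_indices` is a hypothetical name, not a function from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper, not part of SpECTRE: returns the row indices of `rows`
// ordered by the value in column 0 (assumed to be the time column).
std::vector<std::size_t> sorted_row_indices(
    const std::vector<std::vector<double>>& rows) {
  // Precondition: every row has at least a time entry
  assert(std::all_of(rows.begin(), rows.end(),
                     [](const std::vector<double>& row) {
                       return !row.empty();
                     }));
  std::vector<std::size_t> indices(rows.size());
  std::iota(indices.begin(), indices.end(), std::size_t{0});
  // stable_sort keeps the write order for rows that share the same time
  std::stable_sort(indices.begin(), indices.end(),
                   [&rows](const std::size_t a, const std::size_t b) {
                     return rows[a][0] < rows[b][0];
                   });
  return indices;
}
```

The returned permutation can be reused later to read the rows in time order without sorting the data a second time.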

nilsdeppe · 2024-09-23T18:12:35Z

src/IO/H5/CombineH5.cpp

+      // the number of times and the earliest time of this file
+      if (not earliest_time.has_value()) {
+        num_time_map.at(dat_filename)[index] = dimensions[0];
+        earliest_time = times(0, 0);


This is not guaranteed to be the earliest time. It's the first time written, but that doesn't have to be the earliest time, I think. I believe our observers do not enforce ordering on write.

nilsdeppe · 2024-09-23T18:14:20Z

src/IO/H5/CombineH5.cpp

@@ -149,4 +157,186 @@ void combine_h5_vol(const std::vector<std::string>& file_names,
    new_file.close_current_object();
  }
 }
+
+void combine_h5_dat(const std::vector<std::string>& h5_files_to_combine,


I think you assume that the files are in increasing order in time. It would be good to add a check for that since otherwise users may be very surprised. Another option would be to just sort the list in increasing order for the subfile(s).
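
A check along these lines could look roughly like the sketch below. It is not the code from the PR: the rows of each file are modeled as `std::vector<std::vector<double>>` with the time in column 0, both function names are hypothetical, and treating "increasing" as strictly increasing is an assumption. The earliest time is found with `std::min_element` rather than by reading row 0, since the rows within a file may not be sorted.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

using Rows = std::vector<std::vector<double>>;

// Hypothetical helper: smallest value in column 0, without assuming the rows
// were written in time order.
double earliest_time(const Rows& rows) {
  assert(!rows.empty());
  return (*std::min_element(rows.begin(), rows.end(),
                            [](const std::vector<double>& a,
                               const std::vector<double>& b) {
                              return a[0] < b[0];
                            }))[0];
}

// Hypothetical check: true if each file starts strictly later than the
// previous one. Every file is assumed to hold at least one row.
bool files_are_in_increasing_time_order(const std::vector<Rows>& files) {
  std::vector<double> earliest(files.size());
  std::transform(files.begin(), files.end(), earliest.begin(), earliest_time);
  // adjacent_find locates the first neighboring pair (a, b) with a >= b
  return std::adjacent_find(earliest.begin(), earliest.end(),
                            std::greater_equal<>{}) == earliest.end();
}
```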

nilsdeppe · 2024-09-23T18:16:09Z

src/IO/H5/CombineH5.cpp

+      // Only append data if we include data from this file
+      if (num_times.has_value()) {
+        // Always start with row 0
+        const Matrix data_to_append = input_dat_file.get_data_subset(


This will need to load the entire matrix, sort it, and then trim it before write.
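
A minimal sketch of that load, sort, trim sequence for a single file, assuming the data has already been read into a `std::vector<std::vector<double>>` with the time in column 0. The name `sort_and_trim` and the `cutoff_time` parameter (the earliest time of the next file) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Rows = std::vector<std::vector<double>>;

// Hypothetical helper: sort `rows` by column 0, then drop every row whose
// time is >= `cutoff_time` so that only non-overlapping data remains.
Rows sort_and_trim(Rows rows, const double cutoff_time) {
  std::stable_sort(rows.begin(), rows.end(),
                   [](const std::vector<double>& a,
                      const std::vector<double>& b) { return a[0] < b[0]; });
  // After sorting, the rows to keep form a prefix; find where it ends
  const auto first_dropped = std::partition_point(
      rows.begin(), rows.end(), [cutoff_time](const std::vector<double>& row) {
        return row[0] < cutoff_time;
      });
  rows.erase(first_dropped, rows.end());
  // Everything that is left lies strictly before the cutoff
  assert(rows.empty() || rows.back()[0] < cutoff_time);
  return rows;
}
```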

nilsdeppe · 2024-09-23T18:17:18Z

src/IO/H5/CombineH5.hpp

+ * will be ignored and will not appear in \p output_h5_filename. This function
+ * also assumes that the times in each of the \p h5_files_to_combine are already
+ * sorted.


I think this second assumption is generally not valid for spectre, unfortunately. If you would like to keep it, I think you should add a check that it's true because someone will inevitably run this over a file where it's not true.

nilsdeppe · 2024-09-23T18:17:57Z

src/IO/H5/CombineH5.hpp

+ * meaning if you have data in `File1.h5` and `File2.h5` and if the first time
+ * in `File2.h5` is before some times in `File1.h5`, those times in `File1.h5`
+ * will be discarded and won't appear in the combined H5 file.


I think the constraint is quite strong and should be explicit: the files passed in must be in increasing time order.

nilsdeppe · 2024-09-23T18:20:43Z

tests/Unit/IO/H5/Test_CombineH5.cpp

+    if (file_system::check_if_file_exists(filename)) {
+      file_system::rm(filename, true);
+    }
+  }


It would be good to add tests for catching unordered files passed in, and for unordered data in the subfiles. I worry that these will both be common mistakes.

knelli2 · 2024-09-27T23:49:14Z

I decided to simply assert that the H5 files are monotonically increasing in their earliest times. I think something else should be responsible for actually putting the H5 files in order properly. This also allows for the times in each dat file to be unordered. One downside is that we have to read all data in from a dat file first, sort it, then trim the overlapping times we don't need. But that shouldn't be too bad.

To facilitate this sorting (because a Matrix can't be sorted easily) I just added overloads of dat.get_data() that return the data as std::vector<std::vector<double>>, which can be sorted very easily. These are the two new commits before the fixup.
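
The approach described above can be sketched on in-memory data as follows. This is an illustration, not the merged implementation: each "file" is modeled as a `std::vector<std::vector<double>>` with the time in column 0, `combine_latest_wins` is a hypothetical name, and the ordering requirement on the files is expressed with a plain `assert`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Rows = std::vector<std::vector<double>>;

// Hypothetical in-memory stand-in for the combine step. The files must be
// ordered by their earliest time; rows inside a file may be in any order.
Rows combine_latest_wins(std::vector<Rows> files) {
  const auto by_time = [](const std::vector<double>& a,
                          const std::vector<double>& b) {
    return a[0] < b[0];
  };
  // Earliest time among the files handled so far; infinity keeps everything
  double earliest_time = std::numeric_limits<double>::infinity();
  // Loop backward so that data from later files takes precedence
  for (std::size_t i = files.size(); i-- > 0;) {
    Rows& rows = files[i];
    std::stable_sort(rows.begin(), rows.end(), by_time);
    // The files themselves must be increasing in their earliest time
    assert(rows.empty() || rows.front()[0] < earliest_time);
    // Trim the times that overlap with later files
    while (!rows.empty() && rows.back()[0] >= earliest_time) {
      rows.pop_back();
    }
    if (!rows.empty()) {
      earliest_time = rows.front()[0];
    }
  }
  Rows combined;
  for (const Rows& rows : files) {
    combined.insert(combined.end(), rows.begin(), rows.end());
  }
  return combined;
}
```

Here the whole file is sorted before trimming, which matches the downside mentioned above: all of the data has to be in memory first.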

nilsdeppe

LGTM, a few minor suggestions you can make when you squash. Thanks for doing this!

nilsdeppe · 2024-09-30T22:28:06Z

src/IO/H5/CombineH5.cpp


      // Makes things easier below.
-      if (UNLIKELY(times.rows() == 0)) {
+      if (UNLIKELY(times.size() == 0)) {


clang-tidy: empty() when you squash

nilsdeppe · 2024-10-01T13:40:54Z

src/IO/H5/CombineH5.cpp

+                       << h5_files_to_combine
+                       << " are not monotonically increasing in their first "
+                          "times for dat file "
+                       << dat_filename << "");


remove << ""?

nilsdeppe · 2024-10-01T13:42:33Z

src/IO/H5/CombineH5.cpp

+      // the times in this file. If so, don't include those times.
+      size_t row = times.size() - 1;
+      while (times[row][0] >= earliest_time.value()) {
+        // This should have been taken care of before so we should never get


I had to think about this (terrible, I know), but maybe change to ~ Make sure we don't reach row 0 since that would mean the files are not ordered. We should've checked for this above, so this is an additional safety check.
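
For readers following along, the safety check under discussion can be sketched like this. It assumes rows already sorted by time in a `std::vector<std::vector<double>>`; the name `last_row_before` and the use of `std::runtime_error` are hypothetical stand-ins for whatever the actual code does on failure.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

using Rows = std::vector<std::vector<double>>;

// Hypothetical helper: index of the last row whose time is strictly before
// `earliest_time` (the earliest time of the next file). `sorted_rows` must be
// non-empty and sorted by column 0.
std::size_t last_row_before(const Rows& sorted_rows,
                            const double earliest_time) {
  assert(!sorted_rows.empty());
  std::size_t row = sorted_rows.size() - 1;
  while (sorted_rows[row][0] >= earliest_time) {
    // Make sure we don't step past row 0, since that would mean the files
    // are not ordered. Checking before the decrement also avoids wrapping
    // the unsigned index around.
    if (row == 0) {
      throw std::runtime_error("Files are not ordered by earliest time.");
    }
    --row;
  }
  return row;
}
```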

knelli2 requested a review from nilsdeppe September 19, 2024 23:38

knelli2 mentioned this pull request Sep 19, 2024

ICERM CCE Feedback #6246

knelli2 force-pushed the h5_dat_combine branch 2 times, most recently from 7c0a988 to a7d1cef Compare September 20, 2024 00:09

nilsdeppe requested changes Sep 23, 2024

knelli2 force-pushed the h5_dat_combine branch 3 times, most recently from 5935c42 to a67d874 Compare September 27, 2024 23:48

nilsdeppe reviewed Oct 1, 2024

knelli2 added 4 commits October 1, 2024 11:27

Rename combine_h5 to combine_h5_vol

5400a4d

Reorganize an H5 helper

e92cda8

Add overload to dat and cce get_data functions

800796d

Now they can return either a Matrix or a vector<vector<double>>

Add function to combine H5 dat files

48fe568

This function also handles overlapping times by taking the "latest" data always.

knelli2 force-pushed the h5_dat_combine branch from a67d874 to 48fe568 Compare October 1, 2024 18:29

nilsdeppe approved these changes Oct 2, 2024

nilsdeppe merged commit f42ec5d into sxs-collaboration:develop Oct 2, 2024
23 checks passed
