
WIP, ENH: parquet demo #929

Open — wants to merge 2 commits into base: main
Conversation

tylerjereddy (Collaborator) commented May 6, 2023

We've had some discussions about supporting a parquet or arrow-like format to avoid the various idiosyncrasies and performance issues related to the in-house binary format, possibly through a converter from the binary format to parquet. This is more of a demo than something meant for serious code review, for now...

  • a quick demo with POSIX-only support for a single summary report table for parquet input, and a converter that only supports POSIX and was only tested on a single log file

  • this does appear to allow the full test suite to pass while adding incredibly-crude summary report support for working with a parquet file that has POSIX counter/fcounter data

  • there are a few reasons to demo this:

  1. It may help spark some discussion about how this should work because I already made some potentially-controversial decisions like concatenating along the columns to fuse counters and fcounters
  2. The various TODO comments I added around try/except blocks should give a good indicator of the number of places in the code where changes would be needed to produce a more complete summary report from parquet input
  3. Sometimes it is easier to develop from a (crude) prototype if a summer student picks this up (vs. from scratch)

The example below shows what happens when producing the summary report with the one parquet file I tested. It correctly reproduces a single table in the report, since that is all I added support for so far. Perhaps the other notable observation is that the gzipped parquet file is about 7X larger than the native binary file, even though the native binary contains more raw data (we're currently excluding DXT_POSIX from the parquet format for now). I don't consider file size/compression a priority at this stage of development/consideration though.

python -m darshan summary /Users/treddy/rough_work/darshan/test_parquet/runtime_and_dxt_heatmaps_diagonal_write_only.parquet.gzip

[screenshot: the generated summary report, showing the single supported POSIX table]

try:
    mod_df = report.records[mod].to_df(attach=None)["counters"]
except AttributeError:
    # TODO: fix for parquet format
    mod_df = report.iloc[..., :71]
tylerjereddy (Collaborator, Author) commented May 6, 2023
This is the one table I added support for in the parquet-based summary report, since it was so easy. You just take the DataFrame loaded from the parquet file and slice it to get the counters sub-portion of the fused counters + fcounters DataFrame.

How we store metadata (job runtime and stuff...), and avoid having to hard-code magic numbers like that for the number of columns for different modules/counters is probably a design question.
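The slicing idea can be sketched with plain pandas. The frame below is a hypothetical four-column stand-in (real POSIX records have 71 counter columns, hence the hard-coded `:71`); the name-based variant assumes the Darshan convention that fcounter names carry an `_F_` infix, which would sidestep the magic number:

```python
import pandas as pd

# Hypothetical fused frame: integer counters followed by float fcounters
fused = pd.DataFrame({
    "POSIX_OPENS": [3], "POSIX_READS": [7],
    "POSIX_F_OPEN_START_TIMESTAMP": [0.1], "POSIX_F_READ_TIME": [0.2],
})

# Positional slice, as in the demo (would be :71 for real POSIX records)
counters_df = fused.iloc[:, :2]

# Name-based alternative that avoids the magic number
by_name = fused.loc[:, [c for c in fused.columns if "_F_" not in c]]
print(counters_df.columns.equals(by_name.columns))
```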

* minor mypy shims to get type checking
passing again
recs = report.data["records"]
df_posix_counters = recs["POSIX"].to_df()["counters"]
df_posix_fcounters = recs["POSIX"].to_df()["fcounters"]
# NOTE: is it always true that counters and counters will
tylerjereddy (Collaborator, Author):

Suggested change
# NOTE: is it always true that counters and counters will
# NOTE: is it always true that counters and fcounters will
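The column-wise fusion this hunk performs can be sketched with plain pandas (the frames below are hypothetical stand-ins for the `to_df()` output, which carries shared record-identifier columns on both sides):

```python
import pandas as pd

# Hypothetical per-record frames; real ones come from to_df()
counters = pd.DataFrame({"rank": [0], "id": [42], "POSIX_OPENS": [3]})
fcounters = pd.DataFrame({"rank": [0], "id": [42], "POSIX_F_READ_TIME": [0.2]})

# Fuse along the columns; drop one side's duplicated id columns first
fused = pd.concat(
    [counters, fcounters.drop(columns=["rank", "id"])],
    axis=1,
)
print(fused.shape)
```

Note that `pd.concat(axis=1)` aligns on the index, so this only fuses correctly if both frames list records in the same row order.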

carns (Contributor) commented May 9, 2023

Neat! Can you explain how to run the converter and/or share the example parquet file?

tylerjereddy (Collaborator, Author):

> Neat! Can you explain how to run the converter and/or share the example parquet file?

A Python script like this should do the trick locally, if you're all set up on this feature branch with the logs repo installed as well, etc.

from darshan.log_utils import get_log_path, convert_to_parquet

log_path = get_log_path("runtime_and_dxt_heatmaps_diagonal_write_only.darshan")
convert_to_parquet(log_path, "output.parquet.gzip")

Then produce the HTML report as usual with the parquet file. Obviously only POSIX is handled at the moment.
