WIP, ENH: parquet demo #929
base: main
Conversation
* a quick demo with POSIX-only support for a single summary report table for parquet input, and a converter that only supports POSIX and was only tested on a single log file
* this does appear to allow the full test suite to pass while adding incredibly-crude summary report support for working with a parquet file that has POSIX counter/fcounter data
* there are a few reasons to demo this:
  1) It may help spark some discussion about how this should work, because I already made some potentially-controversial decisions like concatenating along the columns to fuse counters and fcounters
  2) The various `TODO` comments I added around try/except blocks should give a good indicator of the number of places in the code where changes would be needed to produce a more complete summary report from parquet input
  3) Sometimes it is easier to develop from a (crude) prototype if a summer student picks this up (vs. from scratch)
    mod_df = report.records[mod].to_df(attach=None)["counters"]
except AttributeError:
    # TODO: fix for parquet format
    mod_df = report.iloc[..., :71]
This is the one table I added support for in the parquet-based summary report, since it was so easy. You just take the dataframe you loaded in from the parquet file, and slice it to get the counters sub-portion of the fused counters + fcounters DataFrame.
How we store metadata (job runtime and so on), and how we avoid hard-coding magic numbers like that for the number of columns for different modules/counters, are probably design questions.
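As a rough sketch of that design question: the counters sub-frame could be selected by column name rather than a positional magic number like `:71`. The frame and column names below are hypothetical stand-ins for illustration, not the real Darshan schema:

```python
import pandas as pd

# toy fused counters + fcounters frame standing in for the
# parquet-loaded DataFrame (real POSIX data has ~71 integer
# counters followed by the floating-point fcounters)
df = pd.DataFrame({"POSIX_OPENS": [3],
                   "POSIX_WRITES": [10],
                   "POSIX_F_OPEN_START_TIMESTAMP": [0.1]})

# positional slice (what the demo currently hard-codes)
counters_by_pos = df.iloc[:, :2]

# name-based selection avoids a per-module magic number,
# assuming fcounter columns share a recognizable prefix
counters_by_name = df.loc[:, [c for c in df.columns
                              if not c.startswith("POSIX_F_")]]
assert counters_by_name.equals(counters_by_pos)
```

The name-based approach would only work if the parquet schema preserves a predictable naming convention per module, which is itself part of the design question.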
* minor mypy shims to get type checking passing again
recs = report.data["records"]
df_posix_counters = recs["POSIX"].to_df()["counters"]
df_posix_fcounters = recs["POSIX"].to_df()["fcounters"]
# NOTE: is it always true that counters and counters will
Suggested change:
- # NOTE: is it always true that counters and counters will
+ # NOTE: is it always true that counters and fcounters will
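On the alignment question raised in that NOTE, a defensive check before the column-wise concat might look like the sketch below. The record indices and column names here are made up for illustration, not taken from a real log:

```python
import pandas as pd

# hypothetical POSIX counter/fcounter frames keyed on a shared
# record index; pd.concat(axis=1) aligns on the index and will
# silently introduce NaN rows if the indices differ, so checking
# alignment explicitly before fusing is prudent
df_c = pd.DataFrame({"POSIX_OPENS": [3, 1]},
                    index=[101, 102])
df_f = pd.DataFrame({"POSIX_F_OPEN_START_TIMESTAMP": [0.1, 0.2]},
                    index=[101, 102])

assert df_c.index.equals(df_f.index)  # rows must line up before fusing
df_fused = pd.concat([df_c, df_f], axis=1)
print(df_fused.shape)  # (2, 2)
```

Whether counters and fcounters always share identical record ordering in practice is exactly what the NOTE asks, so a check like this could turn a silent misalignment into a loud failure.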
Neat! Can you explain how to run the converter and/or share the example parquet file?
A Python script like this should do the trick locally, if you're all set up on this feature branch with the logs repo installed as well, etc.:

from darshan.log_utils import get_log_path, convert_to_parquet

log_path = get_log_path("runtime_and_dxt_heatmaps_diagonal_write_only.darshan")
convert_to_parquet(log_path, "output.parquet.gzip")

Then produce the HTML report as usual with the parquet file. Obviously only
We've had some discussions about supporting a parquet or arrow-like format to avoid the various idiosyncrasies and performance issues related to the in-house binary format, possibly through a converter of the binary format to parquet format. This is more of a demo than something that is meant for serious code review, for now...
The example below shows what happens when producing the summary report with the 1 parquet file I tested with. It correctly reproduces a single table in the report, since that is all I added support for, for now. Perhaps the other notable observation is that the gzipped parquet file is about 7X larger than the native binary file, and the native binary also contains more raw data, because we're currently excluding `DXT_POSIX` for the parquet format, for now. I don't consider file size/compression a priority at this stage of development/consideration, though.

python -m darshan summary /Users/treddy/rough_work/darshan/test_parquet/runtime_and_dxt_heatmaps_diagonal_write_only.parquet.gzip