
WIP, ENH: parquet demo #929

Open — wants to merge 2 commits into base: main
Conversation

tylerjereddy (Collaborator) commented May 6, 2023

We've had some discussions about supporting a parquet or arrow-like format to avoid the various idiosyncrasies and performance issues related to the in-house binary format, possibly through a converter from the binary format to parquet. This is more of a demo than something meant for serious code review, for now...

  • a quick demo with POSIX-only support for a single summary report table for parquet input, and a converter that only supports POSIX and was only tested on a single log file

  • this does appear to allow the full test suite to pass while adding incredibly-crude summary report support for working with a parquet file that has POSIX counter/fcounter data

  • there are a few reasons to demo this:

  1. It may help spark some discussion about how this should work because I already made some potentially-controversial decisions like concatenating along the columns to fuse counters and fcounters
  2. The various TODO comments I added around try/except blocks should give a good indicator of the number of places in the code where changes would be needed to produce a more complete summary report from parquet input
  3. Sometimes it is easier to develop from a (crude) prototype if a summer student picks this up (vs. from scratch)

The example below shows what happens when producing the summary report with the one parquet file I tested. It correctly reproduces a single table in the report, since that is all I added support for so far. Perhaps the other notable observation is that the gzipped parquet file is about 7X larger than the native binary file, even though the native binary contains more raw data (we're currently excluding DXT_POSIX from the parquet format for now). I don't consider file size/compression a priority at this stage of development/consideration though.

python -m darshan summary /Users/treddy/rough_work/darshan/test_parquet/runtime_and_dxt_heatmaps_diagonal_write_only.parquet.gzip

[screenshot: the generated summary report, showing the single supported POSIX table]

try:
    mod_df = report.records[mod].to_df(attach=None)["counters"]
except AttributeError:
    # TODO: fix for parquet format
    mod_df = report.iloc[..., :71]
tylerjereddy (Collaborator, Author) commented May 6, 2023
This is the one table I added support for in the parquet-based summary report, since it was so easy. You just take the DataFrame loaded from the parquet file and slice it to get the counters sub-portion of the fused counters + fcounters DataFrame.

How we store metadata (job runtime and stuff...), and avoid having to hard-code magic numbers like that for the number of columns for different modules/counters is probably a design question.
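The slicing idea can be sketched with plain pandas. The frame below is a hypothetical four-column stand-in (real POSIX records have 71 counter columns, hence the hard-coded `:71`); the name-based variant assumes the Darshan convention that fcounter names carry an `_F_` infix, which would sidestep the magic number:

```python
import pandas as pd

# Hypothetical fused frame: integer counters followed by float fcounters
fused = pd.DataFrame({
    "POSIX_OPENS": [3], "POSIX_READS": [7],
    "POSIX_F_OPEN_START_TIMESTAMP": [0.1], "POSIX_F_READ_TIME": [0.2],
})

# Positional slice, as in the demo (would be :71 for real POSIX records)
counters_df = fused.iloc[:, :2]

# Name-based alternative that avoids the magic number
by_name = fused.loc[:, [c for c in fused.columns if "_F_" not in c]]
print(counters_df.columns.equals(by_name.columns))
```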

* minor mypy shims to get type checking
passing again
recs = report.data["records"]
df_posix_counters = recs["POSIX"].to_df()["counters"]
df_posix_fcounters = recs["POSIX"].to_df()["fcounters"]
# NOTE: is it always true that counters and counters will
tylerjereddy (Collaborator, Author):

Suggested change
# NOTE: is it always true that counters and counters will
# NOTE: is it always true that counters and fcounters will
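The column-wise fusion this hunk performs can be sketched with plain pandas (the frames below are hypothetical stand-ins for the `to_df()` output, which carries shared record-identifier columns on both sides):

```python
import pandas as pd

# Hypothetical per-record frames; real ones come from to_df()
counters = pd.DataFrame({"rank": [0], "id": [42], "POSIX_OPENS": [3]})
fcounters = pd.DataFrame({"rank": [0], "id": [42], "POSIX_F_READ_TIME": [0.2]})

# Fuse along the columns; drop one side's duplicated id columns first
fused = pd.concat(
    [counters, fcounters.drop(columns=["rank", "id"])],
    axis=1,
)
print(fused.shape)
```

Note that `pd.concat(axis=1)` aligns on the index, so this only fuses correctly if both frames list records in the same row order.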

carns (Contributor) commented May 9, 2023

Neat! Can you explain how to run the converter and/or share the example parquet file?

tylerjereddy (Collaborator, Author):

> Neat! Can you explain how to run the converter and/or share the example parquet file?

A Python script like this should do the trick locally, if you're all set up on this feature branch with the logs repo installed as well, etc.

from darshan.log_utils import get_log_path, convert_to_parquet

log_path = get_log_path("runtime_and_dxt_heatmaps_diagonal_write_only.darshan")
convert_to_parquet(log_path, "output.parquet.gzip")

Then produce the HTML report as usual with the parquet file. Obviously only POSIX is handled at the moment.
