
ENH: file count summary #868

Merged
merged 28 commits into darshan-hpc:main on Feb 16, 2023

Conversation

tylerjereddy (Collaborator)

  • this builds on ENH: Python derived/accum interface #839, and that PR should
    be reviewed/merged first

  • add the "File Count Summary" to the Python
    summary reports based on the table with the same
    name in the Perl reports; note considerable
    discussion in this branch/code comments about differences
    with the old reported values, and see also
    file category count mismatch between darshan-parser and darshan-job-summary.pl #867

  • an additional contrast with the old Perl reports
    is that the above table is reported for each module
    rather than "just POSIX"

  • some other considerations:

    • unlike the parent PR/branch, this branch aims to start
      moving new accumulator infrastructure/testing to a more
      appropriate location than the CFFI backend modules (since
      much of the work is higher-level than that, etc.)
    • similar to parent PR, we may want to do some checking for
      memory leaks because they are so easy to introduce given the
      poor static analysis/tooling around the CFFI barrier
    • we can probably bikeshed on the table formatting a bit; I'll paste
      a few samples from Python summary reports below
    • we have a new dependency on humanize here--not sure about the
      tradeoffs there, though we've discussed using this lib for file
      size conversions before
    • like the parent PR, type hints and docstrings for new function(s)
      probably could use some work

Samples from imbalanced-io.darshan:

[Screenshots: three "File Count Summary" tables from the Python summary report]
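For a rough sense of how such a table can be put together, here is a minimal sketch using `pandas` and the new `humanize` dependency mentioned above; the category names and numbers are made-up placeholders, not values from the actual report code:

```python
import humanize
import pandas as pd

# Hypothetical (file count, max size in bytes) per category for one module;
# in the real report these come from the accumulator's category counters.
categories = {
    "total files": (1026, 1073741824),
    "read-only files": (4, 1048576),
    "write-only files": (1014, 1073741824),
    "read/write files": (0, 0),
}

df = pd.DataFrame(
    [(name, count, humanize.naturalsize(max_size, binary=True))
     for name, (count, max_size) in categories.items()],
    columns=["type", "number of files", "max size"],
)
print(df.to_string(index=False))
```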

tylerjereddy and others added 17 commits November 30, 2022 15:52
* early draft of Python/CFFI interface to
derived metrics/accumulators described in:
- darshan-hpcgh-642
- darshan-hpcgh-677

* for now this definitely doesn't work, and feels like I'm basically
reconstituting a C control flow in CFFI/Python instead of using a sensible
exposure point between C and Python to pull out populated structs from
a single entry point

* perhaps folks can just help me sort out the current issues noted
in the source changes rather than providing a convenient API, though
once thorough regression tests are in place that might be something
to consider in the future... (or even just maintaining it in
`pandas`/Python someday if the perf is ~similar)
* needed to expose `darshan_file_category_counters` so that its
size is available for usage in `darshan_derived_metrics` struct

* some improvements to the renamed `log_get_derived_metrics()` function,
including first draft of error handling; the renamed
`test_derived_metrics_basic` test was also adjusted to expect an
error for `LUSTRE` module
- for record buffer in accumulator test
- for error path in another test
* `log_get_derived_metrics()` was adjusted to inject
all the records for a given module, because this is our
initial target to replicate the stats in the old
perl summary report

* a new `log_get_bytes_bandwidth()` function was drafted
in as a convenience wrapper to get MiB (total bytes) and
bandwidth (MiB/s) values printed out in the old perl report

* renamed the regression test for this PR and adjusted
it to compare against the bytes/bandwidth strings
present in the perl reports; so far, only a small
subset of the STDIO results are working properly
(see the xfails in this test...)
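A hedged sketch of how this convenience wrapper is meant to be called, assuming the `(log_path, mod_name)` signature and the `darshan/lib/accum.py` home it ends up with in later commits; the log path is illustrative:

```python
from darshan.lib.accum import log_get_bytes_bandwidth

# Returns a perl-report-style string with the total MiB moved and the
# MiB/s bandwidth estimate for the requested module, e.g.
# "I/O performance estimate (at the STDIO layer): ..."
print(log_get_bytes_bandwidth("example.darshan", "STDIO"))
```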
* `log_get_bytes_bandwidth()` has been simplified
substantially as it no longer needs to generate a
new `report` object, and can instead use the
`agg_perf_by_slowest` structure member

* the `xfail` marks have all been removed from
`test_derived_metrics_bytes_and_bandwidth()`--those
cases all pass now thanks to the above changes

* remove some debug prints and unused code from
`log_get_derived_metrics()`
* a number of additional `MPI-IO` and `STDIO` test cases
were added from the logs repo to `test_derived_metrics_bytes_and_bandwidth()`

* for the `MPI-IO` cases to pass, special casing was added
to `log_get_bytes_bandwidth()` such that `total_bytes` is
actually extracted from `POSIX`
* removed an invalid `darshan_free()` from `log_get_derived_metrics()`--
the `buf` object didn't even exist at that point in the control flow

* add a `LUSTRE` test case, which raises a `RuntimeError` as expected

* add a tentative `POSIX` test case, which reports a bandwidth string
at the Python level, but is not included in the Perl summary reports...
* when `log_get_derived_metrics()` receives a
module name that doesn't exist in the log
file it received, it will now raise a `ValueError`
for clarity of feedback

* update `test_derived_metrics_bytes_and_bandwidth()`
accordingly, and also start regex matching on expected
error messages in this test
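The regex-matched error check described here looks roughly like this (a sketch; the import location, log path, and matched message fragment are assumptions):

```python
import pytest
from darshan.backend.cffi_backend import log_get_derived_metrics

def test_missing_module_raises():
    # Asking for a module that isn't present in the log should raise
    # ValueError, and match= lets us regex-check the message text.
    with pytest.raises(ValueError, match="MPI-IO"):
        log_get_derived_metrics("example.darshan", "MPI-IO")
```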
* add the bandwidth summary string to the Python
report proper, and include a test for the presence
of this string in logs repo-based summary reports
* add one of the tricky `APMPI` cases I discovered
to `test_derived_metrics_bytes_and_bandwidth()`, pending
discussion with team re: how I should handle this
* adjust tests to more closely match `darshan-parser` instead
of the old Perl report in cases where MPI-IO and POSIX are
both involved; this allows me to remove the weird MPI-IO
shim in `log_get_bytes_bandwidth()`
* the bandwidth text in the Python summary report is now
colored "blue," along with a regression test, based on
reviewer feedback

* added `skew-app.darshan` log to
`test_derived_metrics_bytes_and_bandwidth()`--we get the same
results as `darshan-parser`

* replaced the `xfail` for `e3sm_io_heatmap_only.darshan` with
an expected `KeyError` when handling `APMPI` (this should already
be handled gracefully/ignored by the Python summary report)
* the testsuite now always uses `DarshanReport` with a context
manager to avoid shenanigans with `__del__` and garbage
collection/`pytest`/multiple threads

* this appears to fix the problem with testsuite hangs
described in darshan-hpcgh-839 and darshan-hpcgh-851
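The context-manager usage referred to above looks like this (log path illustrative):

```python
import darshan

# The with-block guarantees the underlying log handle is closed
# deterministically, rather than whenever __del__ happens to run
# under pytest's garbage collection/threading.
with darshan.DarshanReport("example.darshan", read_all=True) as report:
    print(list(report.modules.keys()))
```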
@tylerjereddy added the enhancement and pydarshan labels Dec 6, 2022
@github-actions bot added the CI label Dec 6, 2022
@tylerjereddy (Collaborator, Author)

The CI error is `Version 3.6 was not found in the local cache`--I think it really is time to dump 3.6 support if they're starting to do brownouts, etc. That version has been completely EOL for a full year--it really is time for HPC centers to upgrade or be left behind on that one, I'd say.

@carns (Contributor) commented Dec 6, 2022

I'm on board with dropping Python 3.6 support now.

@carns (Contributor) commented Dec 6, 2022

See #867; the top row of the tables should be labeled "total files" rather than "total opened" to be more technically accurate about what the accumulator API is reporting. There was a subtle difference in what the old perl code was calculating for this table.

@carns (Contributor) left a comment

LGTM

carns previously requested changes Dec 8, 2022
Review thread: darshan-util/pydarshan/darshan/lib/accum.py (outdated, resolved)
tylerjereddy added a commit to tylerjereddy/darshan that referenced this pull request Dec 16, 2022
* `cffi_backend` module changes requested from PR review
  - remove a spurious `darshan_free` from `_log_get_heatmap_record()`
  - fix the scoping of the `darshan_free` of `buf` object used with
    `darshan_accumulator_inject` in `log_get_derived_metrics`
  - add a missing `log_close()` to `log_get_derived_metrics` (maybe
    we can wrap in Python contexts in the future though)
  - use a separate buffer for `darshan_accumulator_emit()` inside
    `log_get_derived_metrics`

* note that making the above CFFI/free-related changes caused
a segfault in the testsuite, so in the end I adjusted the location
of the memory freeing as I saw fit to avoid segfaults--I'd say at this
point please provide concrete evidence with a memory leak plot or
failing test for additional adjustments there, or just push the change
in

* in the end, there is a slightly more concise usage of `darshan_free()`
but no meaningful change in the free operations

* I also reverted the suggested changes to `darshan_accumulator_emit()`
usage--there was no testable evidence of an issue, and it was also
causing segfaults...

* address many of the discussion points that came up in darshan-hpcgh-868:
  - `log_get_derived_metrics()` now uses an LRU cache, which effectively
    means that we use memoization to return derived metrics data
    rather than doing another pass over the log file if the same
    log path and module name have already been accumulated from; we
    still need to pass over a given log twice in most cases--once at
    initial read-in and once for using `log_get_derived_metrics`; how
    we decide to add filtering of records prior to accumulation
    interface in Python is probably a deeper discussion/for later (see
    the memoization sketch after this commit message)
  - `log_get_bytes_bandwidth()` and its associated testing have been
     migrated to modules not named after "CFFI", like in the above
     PR, because I think we should only use the "CFFI"-named modules
     for direct CFFI interaction/testing, and for other analyses we
     should probably use more distinct names. Also, to some extent
     everything depends on the CFFI layer, so trying to restrict
     "CFFI" modules to direct rather than indirect interaction will
     help keep them manageably sized, especially given the proclivity
     for surprising memory issues/segfaults in those parts of the code.
  - add a proper docstring with examples for `log_get_bytes_bandwidth()`
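A minimal sketch of the memoization idea from the first bullet above, assuming plain `functools.lru_cache` keyed on the hashable (log path, module name) pair; the function body here is a stand-in, not the real implementation:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def log_get_derived_metrics(log_path: str, mod_name: str):
    # The expensive pass over the log happens only on a cache miss;
    # a repeat call with the same (log_path, mod_name) pair returns
    # the cached result without re-reading the log.
    print(f"accumulating {mod_name} records from {log_path}")
    return object()  # stand-in for the real derived metrics struct

m1 = log_get_derived_metrics("example.darshan", "STDIO")
m2 = log_get_derived_metrics("example.darshan", "STDIO")  # cache hit
assert m1 is m2
```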
* move `log_get_bytes_bandwidth()` out of CFFI module to
  darshan/lib/accum.py
* adopt LRU cache and other cleanup tweaks for
  log_get_derived_metrics()
@shanedsnyder (Contributor)

I think I have this all synced with changes from #839. I need to review the lib/accum code and the tests more closely, but everything else looks good so far.

Review threads: darshan-util/pydarshan/darshan/lib/accum.py (resolved), darshan-util/pydarshan/darshan/tests/test_lib_accum.py (outdated, resolved)
@shanedsnyder (Contributor)

I just merged main back here, including some edits to use new derived metrics interfaces appropriately in the lib code and in the tests.

For me, the lib tests specifically are notably slower using this new approach (1.62 seconds for old approach, 25.33 seconds for new approach), which from some quick profiling appears to be mostly related to the need to convert record data into a dataframe. That is known to be pretty expensive, especially for logs with lots of records. We should try to improve the overheads of that separately, but at least the new code paths related to converting dataframes to raw buffers and passing into the C library for accumulation don't appear to be any bottleneck.
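A quick way to reproduce that kind of measurement (a sketch; the log path is illustrative, and this assumes the usual `to_df()` record-to-dataframe conversion path):

```python
import time
import darshan

with darshan.DarshanReport("example.darshan", read_all=True) as report:
    start = time.perf_counter()
    posix_df = report.records["POSIX"].to_df()  # record -> dataframe conversion
    print(f"to_df() took {time.perf_counter() - start:.2f} s")
```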

@shanedsnyder (Contributor)

Any objections if I make the rounding change I mentioned (and update affected tests), and then merge this?

@tylerjereddy (Collaborator, Author)

> Any objections if I make the rounding change I mentioned (and update affected tests), and then merge this?

No objections from me--the diff looks fine/CI is currently passing, and I don't have an opinion on how you should round floats in the summary report, so I'd say go for it.

The performance stuff is going to be harder--we keep bumping up against the log format only releasing one record at a time, which basically runs straight into the standard "don't iterate in Python" performance guidance.

shanedsnyder previously approved these changes Feb 16, 2023
@shanedsnyder merged commit 1f5f91b into darshan-hpc:main Feb 16, 2023