ENH: file count summary #868
Conversation
* early draft of Python/CFFI interface to derived metrics/accumulators described in:
  - darshan-hpcgh-642
  - darshan-hpcgh-677
* for now this definitely doesn't work, and feels like I'm basically reconstituting a C control flow in CFFI/Python instead of using a sensible exposure point between C and Python to pull out populated structs from a single entry point
* perhaps folks can just help me sort out the current issues noted in the source changes rather than providing a convenient API, though once thorough regression tests are in place that might be something to consider in the future... (or even just maintaining it in `pandas`/Python someday if the perf is ~similar)
* needed to expose `darshan_file_category_counters` so that its size is available for usage in the `darshan_derived_metrics` struct
* some improvements to the renamed `log_get_derived_metrics()` function, including a first draft of error handling; the renamed `test_derived_metrics_basic` test was also adjusted to expect an error for the `LUSTRE` module
- for record buffer in accumulator test
- for error path in another test
* `log_get_derived_metrics()` was adjusted to inject all the records for a given module, because this is our initial target to replicate the stats in the old Perl summary report
* a new `log_get_bytes_bandwidth()` function was drafted in as a convenience wrapper to get MiB (total bytes) and bandwidth (MiB/s) values printed out in the old Perl report
* renamed the regression test for this PR and adjusted it to compare against the bytes/bandwidth strings present in the Perl reports; so far, only a small subset of the STDIO results are working properly (see the xfails in this test..)
* `log_get_bytes_bandwidth()` has been simplified substantially as it no longer needs to generate a new `report` object, and can instead use the `agg_perf_by_slowest` structure member
* the `xfail` marks have all been removed from `test_derived_metrics_bytes_and_bandwidth()`--those cases all pass now thanks to the above changes
* remove some debug prints and unused code from `log_get_derived_metrics()`
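As a rough illustration of the value that `log_get_bytes_bandwidth()` reports, here is a minimal sketch of the MiB/bandwidth arithmetic. The function name, parameters, and string format below are assumptions for illustration; the real code pulls the totals from the `agg_perf_by_slowest` struct member rather than taking them as arguments.

```python
def format_bytes_bandwidth(total_bytes: int, slowest_rank_time_s: float) -> str:
    """Hypothetical stand-in: format a Perl-report-style bytes/bandwidth string.

    The real log_get_bytes_bandwidth() reads total bytes and the slowest
    rank's elapsed time from the derived-metrics struct filled in by the
    C accumulator API.
    """
    total_mib = total_bytes / 2 ** 20              # bytes -> MiB
    bandwidth_mibps = total_mib / slowest_rank_time_s
    return f"{total_mib:.2f} MiB transferred at {bandwidth_mibps:.2f} MiB/s"

# e.g. 3 MiB moved in 1.5 s by the slowest rank:
print(format_bytes_bandwidth(3 * 2 ** 20, 1.5))  # -> 3.00 MiB transferred at 2.00 MiB/s
```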
* a number of additional `MPI-IO` and `STDIO` test cases were added from the logs repo to `test_derived_metrics_bytes_and_bandwidth()`
* for the `MPI-IO` cases to pass, special casing was added to `log_get_bytes_bandwidth()` such that `total_bytes` is actually extracted from `POSIX`
* removed an invalid `darshan_free()` from `log_get_derived_metrics()`--the `buf` object didn't even exist at that point in the control flow
* add a `LUSTRE` test case, which raises a `RuntimeError` as expected
* add a tentative `POSIX` test case, which reports a bandwidth string at the Python level, but is not included in the Perl summary reports...
* when `log_get_derived_metrics()` receives a module name that doesn't exist in the log file it received, it will now raise a `ValueError` for clarity of feedback
* update `test_derived_metrics_bytes_and_bandwidth()` accordingly, and also start regex matching on expected error messages in this test
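A minimal sketch of the guard and the regex-matched test described above; the helper name and message wording are assumptions (the real check lives inside `log_get_derived_metrics()`):

```python
import re


def check_module(available_modules: list, mod_name: str) -> None:
    """Hypothetical guard: raise a clear ValueError when a module
    is absent from the log file's module list."""
    if mod_name not in available_modules:
        raise ValueError(
            f"{mod_name} is not in the available log file modules: "
            f"{available_modules}"
        )


# the tests now match on the expected message with a regex, roughly:
try:
    check_module(["POSIX", "STDIO"], "MPI-IO")
except ValueError as err:
    assert re.match(r"MPI-IO is not in the available log file modules", str(err))
```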
* add the bandwidth summary string to the Python report proper, and include a test for the presence of this string in logs repo-based summary reports
* add one of the tricky `APMPI` cases I discovered to `test_derived_metrics_bytes_and_bandwidth()`, pending discussion with team re: how I should handle this
* adjust tests to more closely match `darshan-parser` instead of the old Perl report in cases where MPI-IO and POSIX are both involved; this allows me to remove the weird MPI-IO shim in `log_get_bytes_bandwidth()`
* the bandwidth text in the Python summary report is now colored "blue," along with a regression test, based on reviewer feedback
* added `skew-app.darshan` log to `test_derived_metrics_bytes_and_bandwidth()`--we get the same results as `darshan-parser`
* replaced the `xfail` for `e3sm_io_heatmap_only.darshan` with an expected `KeyError` when handling `APMPI` (this should already be handled gracefully/ignored by the Python summary report)
* the testsuite now always uses `DarshanReport` with a context manager to avoid shenanigans with `__del__` and garbage collection/`pytest`/multiple threads
* this appears to fix the problem with testsuite hangs described in darshan-hpcgh-839 and darshan-hpcgh-851
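The context-manager pattern avoids relying on `__del__`, whose timing depends on garbage collection and can interact badly with pytest and multiple threads. A toy stand-in (not the real `DarshanReport` class) showing the deterministic-cleanup idea:

```python
class LogHandle:
    """Toy illustration only: cleanup runs exactly when the block exits,
    independent of when the garbage collector happens to run."""

    def __init__(self, path: str):
        self.path = path
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
        return False  # never swallow exceptions

    def close(self):
        self.closed = True


with LogHandle("example.darshan") as handle:
    assert not handle.closed   # still open inside the block
assert handle.closed           # closed immediately on exit, no GC needed
```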
* this builds on darshan-hpcgh-839, and that PR should be reviewed/merged first
* add the "File Count Summary" to the Python summary reports based on the table with the same name in the Perl reports; note considerable discussion in this branch/code comments about differences with the old reported values, and see also darshan-hpcgh-867
* an additional contrast with the old Perl reports is that the above table is reported for each module rather than "just POSIX"
* some other considerations:
  - unlike the parent PR/branch, this branch aims to start moving new accumulator infrastructure/testing to a more appropriate location than the CFFI backend modules (since much of the work is higher-level than that, etc.)
  - similar to the parent PR, we may want to do some checking for memory leaks because they are just so easy to introduce with the poor static analysis/tooling around the CFFI barrier
  - we can probably bikeshed on the table formatting a bit; I'll paste a few samples from Python summary reports below
  - we have a new dependency on `humanize` here--not sure on tradeoffs there, we've discussed using this lib for file size conversions before though
  - like the parent PR, type hints and docstrings for new function(s) probably could use some work
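For reference on the `humanize` dependency mentioned above: it is used for human-readable file sizes in the table. A stdlib-only stand-in for roughly what `humanize.naturalsize(n, binary=True)` produces (the exact formatting details of the library are not reproduced here):

```python
def natural_size_binary(num_bytes: float) -> str:
    """Approximate stand-in for humanize.naturalsize(..., binary=True):
    scale a byte count through binary (1024-based) units."""
    for unit in ("KiB", "MiB", "GiB", "TiB"):
        num_bytes /= 1024.0
        if abs(num_bytes) < 1024.0 or unit == "TiB":
            return f"{num_bytes:.1f} {unit}"


print(natural_size_binary(16 * 2 ** 20))  # -> 16.0 MiB
```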
CI error is
I'm on board with dropping Python 3.6 support now.
See #867; the top row of the tables should be labeled "total files" rather than "total opened" to be more technically accurate about what the accumulator API is reporting. There was a subtle difference in what the old Perl code was calculating for this table.
LGTM
* `cffi_backend` module changes requested from PR review:
  - remove a spurious `darshan_free` from `_log_get_heatmap_record()`
  - fix the scoping of the `darshan_free` of the `buf` object used with `darshan_accumulator_inject` in `log_get_derived_metrics`
  - add a missing `log_close()` to `log_get_derived_metrics` (maybe we can wrap in Python contexts in the future though)
  - use a separate buffer for `darshan_accumulator_emit()` inside `log_get_derived_metrics`
* note that making the above CFFI/free-related changes caused a segfault in the testsuite, so in the end I adjusted the location of the memory freeing as I saw fit to avoid segfaults--I'd say at this point please provide concrete evidence with a memory leak plot or failing test for additional adjustments there, or just push the change in
* in the end, there is a slightly more concise usage of `darshan_free()` but no meaningful change in the free operations
* I also reverted the suggested changes to `darshan_accumulator_emit()` usage--there was no testable evidence of an issue, and it was also causing segfaults..
* address many of the discussion points that came up in darshan-hpcgh-868:
  - `log_get_derived_metrics()` now uses an LRU cache, which effectively means that we use memoization to return derived metrics data rather than doing another pass over the log file if the same log path and module name have already been accumulated from; we still need to pass over a given log twice in most cases--once at initial read-in and once for using `log_get_derived_metrics`; how we decide to add filtering of records prior to the accumulation interface in Python is probably a deeper discussion/for later
  - `log_get_bytes_bandwidth()` and its associated testing have been migrated to modules not named after "CFFI", like in the above PR, because I think we should only use the "CFFI"-named modules for direct CFFI interaction/testing, and for other analyses we should probably use more distinct names. Also, to some extent everything depends on the CFFI layer, so trying to restrict "CFFI" modules to direct rather than indirect interaction will help keep them manageably sized, especially given the proclivity for surprising memory issues/segfaults in those parts of the code.
  - add a proper docstring with examples for `log_get_bytes_bandwidth()`
* move `log_get_bytes_bandwidth()` out of the CFFI module to `darshan/lib/accum.py`
* adopt LRU cache and other cleanup tweaks for `log_get_derived_metrics()`
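The LRU-cache memoization adopted above can be sketched with `functools.lru_cache` keyed on the log path and module name. The function and variable names here are illustrative stand-ins, not the actual pydarshan internals:

```python
from functools import lru_cache

expensive_passes = []  # records each time we'd actually re-scan a log file


@lru_cache(maxsize=None)
def get_derived_metrics(log_path: str, mod_name: str) -> dict:
    """Stand-in for the accumulator pass over a log file; results are
    cached per (log_path, mod_name), so repeat calls for the same log
    and module skip a second pass over the file."""
    expensive_passes.append((log_path, mod_name))
    return {"log": log_path, "module": mod_name}


get_derived_metrics("a.darshan", "POSIX")
get_derived_metrics("a.darshan", "POSIX")  # cache hit: no extra pass
get_derived_metrics("a.darshan", "STDIO")  # different key: new pass
print(len(expensive_passes))  # -> 2
```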
I think I have this all synced with changes from #839. I need to review the lib/accum code and the tests more closely, but everything else looks good so far.
I just merged For me, the |
Any objections if I make the rounding change I mentioned (and update affected tests), and then merge this?
No objections from me--the diff looks fine/CI is currently passing, and I don't have an opinion on how you should round floats in the summary report, so I'd say go for it. The performance stuff is going to be harder--we keep bumping up against the log format only releasing one record at a time, which is basically hitting the "don't iterate in Python" performance guidance.
Samples from `imbalanced-io.darshan`: