
ENH: file count summary #868

Merged
merged 28 commits into darshan-hpc:main on Feb 16, 2023

Conversation

tylerjereddy (Collaborator)

  • this builds on ENH: Python derived/accum interface #839, and that PR should
    be reviewed/merged first

  • add the "File Count Summary" to the Python
    summary reports based on the table with the same
    name in the Perl reports; note considerable
    discussion in this branch/code comments about differences
    with the old reported values, and see also
    file category count mismatch between darshan-parser and darshan-job-summary.pl #867

  • an additional contrast with the old Perl reports
    is that the above table is reported for each module
    rather than "just POSIX"

  • some other considerations:

    • unlike the parent PR/branch, this branch aims to start
      moving new accumulator infrastructure/testing to a more
      appropriate location than the CFFI backend modules (since
      much of the work is higher-level than that, etc.)
    • similar to parent PR, we may want to do some checking for
      memory leaks because they are so easy to introduce given the
      poor static analysis/tooling around the CFFI barrier
    • we can probably bikeshed on the table formatting a bit; I'll paste
      a few samples from Python summary reports below
    • we have a new dependency on humanize here--not sure about the
      tradeoffs there, though we've discussed using this lib for file
      size conversions before
    • like the parent PR, type hints and docstrings for new function(s)
      probably could use some work

Samples from imbalanced-io.darshan:

[Screenshots: three "File Count Summary" tables from the Python summary report]
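For a rough sense of how such a table can be put together, here is a minimal sketch using `pandas` and the new `humanize` dependency mentioned above; the category names and numbers are made-up placeholders, not values from the actual report code:

```python
import humanize
import pandas as pd

# Hypothetical (file count, max size in bytes) per category for one module;
# in the real report these come from the accumulator's category counters.
categories = {
    "total files": (1026, 1073741824),
    "read-only files": (4, 1048576),
    "write-only files": (1014, 1073741824),
    "read/write files": (0, 0),
}

df = pd.DataFrame(
    [(name, count, humanize.naturalsize(max_size, binary=True))
     for name, (count, max_size) in categories.items()],
    columns=["type", "number of files", "max size"],
)
print(df.to_string(index=False))
```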

tylerjereddy and others added 17 commits November 30, 2022 15:52
* early draft of Python/CFFI interface to
derived metrics/accumulators described in:
- darshan-hpcgh-642
- darshan-hpcgh-677

* for now this definitely doesn't work, and feels like I'm basically
reconstituting a C control flow in CFFI/Python instead of using a sensible
exposure point between C and Python to pull out populated structs from
a single entry point

* perhaps folks can just help me sort out the current issues noted
in the source changes rather than providing a convenient API, though
once thorough regression tests are in place that might be something
to consider in the future... (or even just maintaining it in
`pandas`/Python someday if the perf is ~similar)
* needed to expose `darshan_file_category_counters` so that its
size is available for usage in `darshan_derived_metrics` struct

* some improvements to the renamed `log_get_derived_metrics()` function,
including first draft of error handling; the renamed
`test_derived_metrics_basic` test was also adjusted to expect an
error for `LUSTRE` module
- for record buffer in accumulator test
- for error path in another test
* `log_get_derived_metrics()` was adjusted to inject
all the records for a given module, because this is our
initial target to replicate the stats in the old
perl summary report

* a new `log_get_bytes_bandwidth()` function was drafted
in as a convenience wrapper to get MiB (total bytes) and
bandwidth (MiB/s) values printed out in the old perl report

* renamed the regression test for this PR and adjusted
it to compare against the bytes/bandwidth strings
present in the perl reports; so far, only a small
subset of the STDIO results are working properly
(see the xfails in this test...)
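A hedged sketch of how this convenience wrapper is meant to be called, assuming the `(log_path, mod_name)` signature and the `darshan/lib/accum.py` home it ends up with in later commits; the log path is illustrative:

```python
from darshan.lib.accum import log_get_bytes_bandwidth

# Returns a perl-report-style string with the total MiB moved and the
# MiB/s bandwidth estimate for the requested module, e.g.
# "I/O performance estimate (at the STDIO layer): ..."
print(log_get_bytes_bandwidth("example.darshan", "STDIO"))
```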
* `log_get_bytes_bandwidth()` has been simplified
substantially as it no longer needs to generate a
new `report` object, and can instead use the
`agg_perf_by_slowest` structure member

* the `xfail` marks have all been removed from
`test_derived_metrics_bytes_and_bandwidth()`--those
cases all pass now thanks to the above changes

* remove some debug prints and unused code from
`log_get_derived_metrics()`
* a number of additional `MPI-IO` and `STDIO` test cases
were added from the logs repo to `test_derived_metrics_bytes_and_bandwidth()`

* for the `MPI-IO` cases to pass, special casing was added
to `log_get_bytes_bandwidth()` such that `total_bytes` is
actually extracted from `POSIX`
* removed an invalid `darshan_free()` from `log_get_derived_metrics()`--
the `buf` object didn't even exist at that point in the control flow

* add a `LUSTRE` test case, which raises a `RuntimeError` as expected

* add a tentative `POSIX` test case, which reports a bandwidth string
at the Python level, but is not included in the Perl summary reports...
* when `log_get_derived_metrics()` receives a
module name that doesn't exist in the log
file it received, it will now raise a `ValueError`
for clarity of feedback

* update `test_derived_metrics_bytes_and_bandwidth()`
accordingly, and also start regex matching on expected
error messages in this test
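The regex-matched error check described here looks roughly like this (a sketch; the import location, log path, and matched message fragment are assumptions):

```python
import pytest
from darshan.backend.cffi_backend import log_get_derived_metrics

def test_missing_module_raises():
    # Asking for a module that isn't present in the log should raise
    # ValueError, and match= lets us regex-check the message text.
    with pytest.raises(ValueError, match="MPI-IO"):
        log_get_derived_metrics("example.darshan", "MPI-IO")
```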
* add the bandwidth summary string to the Python
report proper, and include a test for the presence
of this string in logs repo-based summary reports
* add one of the tricky `APMPI` cases I discovered
to `test_derived_metrics_bytes_and_bandwidth()`, pending
discussion with team re: how I should handle this
* adjust tests to more closely match `darshan-parser` instead
of the old Perl report in cases where MPI-IO and POSIX are
both involved; this allows me to remove the weird MPI-IO
shim in `log_get_bytes_bandwidth()`
* the bandwidth text in the Python summary report is now
colored "blue," along with a regression test, based on
reviewer feedback

* added `skew-app.darshan` log to
`test_derived_metrics_bytes_and_bandwidth()`--we get the same
results as `darshan-parser`

* replaced the `xfail` for `e3sm_io_heatmap_only.darshan` with
an expected `KeyError` when handling `APMPI` (this should already
be handled gracefully/ignored by the Python summary report)
* the testsuite now always uses `DarshanReport` with a context
manager to avoid shenanigans with `__del__` and garbage
collection/`pytest`/multiple threads

* this appears to fix the problem with testsuite hangs
described in darshan-hpcgh-839 and darshan-hpcgh-851
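The context-manager usage referred to above looks like this (log path illustrative):

```python
import darshan

# The with-block guarantees the underlying log handle is closed
# deterministically, rather than whenever __del__ happens to run
# under pytest's garbage collection/threading.
with darshan.DarshanReport("example.darshan", read_all=True) as report:
    print(list(report.modules.keys()))
```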
@tylerjereddy added the enhancement and pydarshan labels Dec 6, 2022
@github-actions bot added the CI label Dec 6, 2022
@tylerjereddy (Collaborator, Author)

The CI error is `Version 3.6 was not found in the local cache`--I think it really is time to dump 3.6 support if they're starting to do brownouts, etc. That version has been completely EOL for a full year--it really is time for HPC centers to upgrade or be left behind on that one, I'd say.

@carns (Contributor) commented Dec 6, 2022

I'm on board with dropping Python 3.6 support now.

@carns (Contributor) commented Dec 6, 2022

See #867; the top row of the tables should be labeled "total files" rather than "total opened" to be more technically accurate about what the accumulator API is reporting. There was a subtle difference in what the old perl code was calculating for this table.

@carns (Contributor) left a comment

LGTM

carns previously requested changes Dec 8, 2022
Review thread: darshan-util/pydarshan/darshan/lib/accum.py (outdated, resolved)
tylerjereddy added a commit to tylerjereddy/darshan that referenced this pull request Dec 16, 2022
* `cffi_backend` module changes requested from PR review
  - remove a spurious `darshan_free` from `_log_get_heatmap_record()`
  - fix the scoping of the `darshan_free` of `buf` object used with
    `darshan_accumulator_inject` in `log_get_derived_metrics`
  - add a missing `log_close()` to `log_get_derived_metrics` (maybe
    we can wrap in Python contexts in the future though)
  - use a separate buffer for `darshan_accumulator_emit()` inside
    `log_get_derived_metrics`

* note that making the above CFFI/free-related changes caused
a segfault in the testsuite, so in the end I adjusted the location
of the memory freeing as I saw fit to avoid segfaults--I'd say at this
point please provide concrete evidence with a memory leak plot or
failing test for additional adjustments there, or just push the change
in

* in the end, there is a slightly more concise usage of `darshan_free()`
but no meaningful change in the free operations

* I also reverted the suggested changes to `darshan_accumulator_emit()`
usage--there was no testable evidence of an issue, and it was also
causing segfaults...

* address many of the discussion points that came up in darshan-hpcgh-868:
  - `log_get_derived_metrics()` now uses an LRU cache, which effectively
    means that we use memoization to return derived metrics data
    rather than doing another pass over the log file if the same
    log path and module name have already been accumulated from; we
    still need to pass over a given log twice in most cases--once at
    initial read-in and once for using `log_get_derived_metrics`; how
    we decide to add filtering of records prior to accumulation
    interface in Python is probably a deeper discussion/for later (see
    the memoization sketch after this commit message)
  - `log_get_bytes_bandwidth()` and its associated testing have been
     migrated to modules not named after "CFFI", like in the above
     PR, because I think we should only use the "CFFI"-named modules
     for direct CFFI interaction/testing, and for other analyses we
     should probably use more distinct names. Also, to some extent
     everything depends on the CFFI layer, so trying to restrict
     "CFFI" modules to direct rather than indirect interaction will
     help keep them manageably sized, especially given the proclivity
     for surprising memory issues/segfaults in those parts of the code.
  - add a proper docstring with examples for `log_get_bytes_bandwidth()`
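A minimal sketch of the memoization idea from the first bullet above, assuming plain `functools.lru_cache` keyed on the hashable (log path, module name) pair; the function body here is a stand-in, not the real implementation:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def log_get_derived_metrics(log_path: str, mod_name: str):
    # The expensive pass over the log happens only on a cache miss;
    # a repeat call with the same (log_path, mod_name) pair returns
    # the cached result without re-reading the log.
    print(f"accumulating {mod_name} records from {log_path}")
    return object()  # stand-in for the real derived metrics struct

m1 = log_get_derived_metrics("example.darshan", "STDIO")
m2 = log_get_derived_metrics("example.darshan", "STDIO")  # cache hit
assert m1 is m2
```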
* move `log_get_bytes_bandwidth()` out of CFFI module to
  darshan/lib/accum.py
* adopt LRU cache and other cleanup tweaks for
  log_get_derived_metrics()
@shanedsnyder (Contributor)

I think I have this all synced with changes from #839. I need to review the lib/accum code and the tests more closely, but everything else looks good so far.

Review threads: darshan-util/pydarshan/darshan/lib/accum.py (resolved), darshan-util/pydarshan/darshan/tests/test_lib_accum.py (outdated, resolved)
@shanedsnyder (Contributor)

I just merged main back here, including some edits to use new derived metrics interfaces appropriately in the lib code and in the tests.

For me, the lib tests specifically are notably slower using this new approach (1.62 seconds for old approach, 25.33 seconds for new approach), which from some quick profiling appears to be mostly related to the need to convert record data into a dataframe. That is known to be pretty expensive, especially for logs with lots of records. We should try to improve the overheads of that separately, but at least the new code paths related to converting dataframes to raw buffers and passing into the C library for accumulation don't appear to be any bottleneck.
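A quick way to reproduce that kind of measurement (a sketch; the log path is illustrative, and this assumes the usual `to_df()` record-to-dataframe conversion path):

```python
import time
import darshan

with darshan.DarshanReport("example.darshan", read_all=True) as report:
    start = time.perf_counter()
    posix_df = report.records["POSIX"].to_df()  # record -> dataframe conversion
    print(f"to_df() took {time.perf_counter() - start:.2f} s")
```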

@shanedsnyder (Contributor)

Any objections if I make the rounding change I mentioned (and update affected tests), and then merge this?

@tylerjereddy (Collaborator, Author)

> Any objections if I make the rounding change I mentioned (and update affected tests), and then merge this?

No objections from me--the diff looks fine/CI is currently passing, and I don't have an opinion on how you should round floats in the summary report, so I'd say go for it.

The performance stuff is going to be harder--we keep bumping up against the log format only releasing one record at a time, which basically runs straight into the standard "don't iterate in Python" performance guidance.

shanedsnyder previously approved these changes Feb 16, 2023
@shanedsnyder merged commit 1f5f91b into darshan-hpc:main Feb 16, 2023