Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor how we interface with performance instrumentation #969

Merged
merged 19 commits into from
Jan 12, 2024

Conversation

jdolence
Copy link
Collaborator

@jdolence jdolence commented Oct 31, 2023

PR Summary

This PR does primarily two things:

  1. It introduces a couple macros that replace our direct calls to Kokkos::Profiling::pushRegion and Kokkos::Profiling::popRegion. Right now, the PARTHENON_INSTRUMENT macro instantiates a (new) KokkosTimer object that pushes a region on construction, with an auto-generated name of the form name_of_file.cpp::line_number::function_name, and then pops the region when the object is destroyed when it goes out of scope. The second macro, PARTHENON_INSTRUMENT_REGION(name), does the same thing but takes a name argument so you can call the region whatever you like (this version is also required for the raw for loop overloads of our par_for functions).
  2. It introduces a PARTHENON_AUTO_LABEL macro that results in a string as above -- name_of_file.cpp::line_number::function_name. This macro is now used to label all kernels in parthenon.

Together, we are guaranteed that results from profiling with, e.g., the kokkos-tools simple kernel timer, unambiguously identify the region of code being profiled. It also enables one to write convenient post-processing tools that can manipulate the output of the profiling tools. For example, here's a snippet of the output from a post-processed (with the new badly written python script included in this PR) profile with the simple kernel timer on a downstream code:

driver.cpp::Execute time: 212.289274 %wall: 91.390765
   Kernel summary: Total kernel time: 0.0   % Time in kernels:  0.0

hydro.cpp::CalculateFluxes time: 82.697677 %wall: 35.601441
   type: ParFor line: 859 selftime: 53.822169 %func: 65.0830482215359
   type: ParFor line: 1040 selftime: 28.390177 %func: 34.330077977885644
   Kernel summary: Total kernel time: 82.212346   % Time in kernels:  99.41312619942154

boundary_communication.cpp::SetBounds time: 33.09992 %wall: 14.249552
   type: ParFor line: 234 selftime: 24.448197 %func: 73.8618008744432
   Kernel summary: Total kernel time: 24.448197   % Time in kernels:  73.8618008744432

boundary_communication.cpp::SendBoundBufs time: 22.321809 %wall: 9.609563
   type: ParFor line: 85 selftime: 17.433569 %func: 78.10105802804782
   Kernel summary: Total kernel time: 17.433569   % Time in kernels:  78.10105802804782

With the automatic naming conventions, the script has enough info to pull together all the regions/kernels in a function and give an easily digestible breakdown of which functions are important and where time is being spent within them. To make all that work, the downstream code has to adopt usage of the same macros and automatic naming.

I haven't thought through this, but I believe this model will also make it easy to instrument our code(s) in other ways, beyond the Kokkos::Profiling hooks.

There are a lot of changed files in this PR, but the changes are almost entirely trivial. The only nontrivial changes relate to the fact that the objects these macros create need to be properly scoped, which meant some variable declarations had to be moved around so they were at function scope.

I haven't documented this stuff, but mostly because I don't think we have any documentation right now on the profiling stuff and I just don't have the time at the moment to create that from scratch. I'll get to it in another PR.

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

Copy link
Collaborator

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this

src/utils/instrument.hpp Outdated Show resolved Hide resolved
@jdolence jdolence requested a review from pgrete November 1, 2023 15:10
@jdolence jdolence changed the title WIP: Refactor how we interface with performance instrumentation Refactor how we interface with performance instrumentation Nov 1, 2023
@jdolence
Copy link
Collaborator Author

jdolence commented Nov 1, 2023

@pgrete want to make sure this doesn't interfere with something you're doing downstream. should be really quick to look over, despite the 40 modified files...

Copy link
Collaborator

@pgrete pgrete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quickly double checking: Given that the "standard" push and pop regions are still called in the KokkosTimer all existing instrumentation (and their interfaces to other tools) should work as is (minus some changed names), correct?

I like this approach and would like to briefly try in AthenaPK before pushing approve.
I should be able to test prior to the sync on Thursday.

The key difference to the previous approach (when not using plain Kokkos regions, which is still supported) is that PARTHENON_INSTRUMENT_REGION would need to be scoped as there's no explicit "end" for the macros, right?

Finally, I noticed that you replaced all(?) existing names with the auto based scheme.
How did this translate to your profiling experience in practice?
I imagine a couple of places where the explicit names introduce some value add, e.g., for the Step() function in the driver to differentiate between the MultiStage_Step and the MultiStageBlockTask_Step, which in the current version would differ only in their line number if I'm not mistaken.

src/utils/instrument.hpp Outdated Show resolved Hide resolved
src/utils/instrument.hpp Outdated Show resolved Hide resolved
@Yurlungur Yurlungur mentioned this pull request Nov 7, 2023
8 tasks
@jdolence
Copy link
Collaborator Author

jdolence commented Nov 7, 2023

Quickly double checking: Given that the "standard" push and pop regions are still called in the KokkosTimer all existing instrumentation (and their interfaces to other tools) should work as is (minus some changed names), correct?

Yes, that's correct.

I like this approach and would like to briefly try in AthenaPK before pushing approve. I should be able to test prior to the sync on Thursday.

Cool, sounds good.

The key difference to the previous approach (when not using plain Kokkos regions, which is still supported) is that PARTHENON_INSTRUMENT_REGION would need to be scoped as there's no explicit "end" for the macros, right?

Yes, that's right. We could instead make the calls to push/popRegion with PARTHENON_AUTO_LABEL to avoid the scoping issue (which can be mildly annoying). The reason I replaced everything with the macro is that it should make it easier to change the behavior the profiling everywhere in the code, should we decide to try something else later.

Finally, I noticed that you replaced all(?) existing names with the auto based scheme. How did this translate to your profiling experience in practice? I imagine a couple of places where the explicit names introduce some value add, e.g., for the Step() function in the driver to differentiate between the MultiStage_Step and the MultiStageBlockTask_Step, which in the current version would differ only in their line number if I'm not mistaken.

Yes, I replaced all the names (or at least intended to). The nice thing about this is that it makes it straightforward to post-process the kokkos profiling data (I've really only played with the simple kernel timer output) to give really useful views of performance that are otherwise annoying to get at. For example, how does performance compare across kernels in CalculateFluxes? How much overhead is involved in building our packs? This kind of stuff is easy to see in, for example, the snippet I shared in the PR description.

I think your example is correct. I think they would only differ by the line number. Two comments, though. Presumably you know which driver you're using so I'm not sure there's much ambiguity here in practice. Also, the line number does guarantee an unambiguous pointer to the right piece of code. If a developer were lazy, not thinking carefully, or just otherwise unaware of other things in the code, I think it's possible to name two regions with the same string, which might lead to some confusion.

There are a couple places where I'd agree the line numbers are perhaps slightly more annoying. The two that come to mind in our current code are in the RedistributeAndRefineMeshBlocks and WriteOutputImpl functions where we have multiple regions that profile pieces of those functions. It's not immediately obvious from the line numbers what those regions are doing, whereas a well chosen name might make that more clear. On the other hand, because each region is tagged with the line number and function name, the post processing can pull them all together so it's more clear they are all just parts of the same function. Then if you care about digging in deeper, it's just a line number away.

@pgrete
Copy link
Collaborator

pgrete commented Dec 5, 2023

I now tested this in AthenaPK and it works like a charm.
In addition to the proposed changes, it'd be great to add sth like PARTHENON_INSTRUMENT_REGION_PUSH/POP to the instrumentation header for completeness (as this would allow the explicit use of Kokkos push/pop without calling it directly, i.e., it'd be in line with our general intermediate abstraction layer).
Also I'd be really nice to have a minimal doc point to the high level interface.

@Yurlungur
Copy link
Collaborator

@jdolence can you respond to @pgrete 's comments so we can pull this through?

@jdolence
Copy link
Collaborator Author

@pgrete I think I've at least responded to all your comments. Added PUSH/POP and a quick doc. Did not adopt your suggested change, but I'm betting that's not blocking. Ready to merge?

Copy link
Collaborator

@pgrete pgrete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pgrete pgrete enabled auto-merge (squash) January 12, 2024 15:03
@pgrete pgrete merged commit 4899b6c into develop Jan 12, 2024
49 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants