Refactor how we interface with performance instrumentation #969

jdolence · 2023-10-31T17:01:17Z

PR Summary

This PR does primarily two things:

It introduces a couple macros that replace our direct calls to Kokkos::Profiling::pushRegion and Kokkos::Profiling::popRegion. Right now, the PARTHENON_INSTRUMENT macro instantiates a (new) KokkosTimer object that pushes a region on construction, with an auto-generated name of the form name_of_file.cpp::line_number::function_name, and then pops the region when the object is destroyed when it goes out of scope. The second macro, PARTHENON_INSTRUMENT_REGION(name), does the same thing but takes a name argument so you can call the region whatever you like (this version is also required for the raw for loop overloads of our par_for functions).
It introduces a PARTHENON_AUTO_LABEL macro that results in a string as above -- name_of_file.cpp::line_number::function_name. This macro is now used to label all kernels in parthenon.

Together, we are guaranteed that results from profiling with, e.g., the kokkos-tools simple kernel timer, unambiguously identify the region of code being profiled. It also enables one to write convenient post-processing tools that can manipulate the output of the profiling tools. For example, here's a snippet of the output from a post-processed (with the new badly written python script included in this PR) profile with the simple kernel timer on a downstream code:

driver.cpp::Execute time: 212.289274 %wall: 91.390765
   Kernel summary: Total kernel time: 0.0   % Time in kernels:  0.0

hydro.cpp::CalculateFluxes time: 82.697677 %wall: 35.601441
   type: ParFor line: 859 selftime: 53.822169 %func: 65.0830482215359
   type: ParFor line: 1040 selftime: 28.390177 %func: 34.330077977885644
   Kernel summary: Total kernel time: 82.212346   % Time in kernels:  99.41312619942154

boundary_communication.cpp::SetBounds time: 33.09992 %wall: 14.249552
   type: ParFor line: 234 selftime: 24.448197 %func: 73.8618008744432
   Kernel summary: Total kernel time: 24.448197   % Time in kernels:  73.8618008744432

boundary_communication.cpp::SendBoundBufs time: 22.321809 %wall: 9.609563
   type: ParFor line: 85 selftime: 17.433569 %func: 78.10105802804782
   Kernel summary: Total kernel time: 17.433569   % Time in kernels:  78.10105802804782

With the automatic naming conventions, the script has enough info to pull together all the regions/kernels in a function and give an easily digestible breakdown of which functions are important and where time is being spent within them. To make all that work, the downstream code has to adopt usage of the same macros and automatic naming.

I haven't thought through this, but I believe this model will also make it easy to instrument our code(s) in other ways, beyond the Kokkos::Profiling hooks.

There are a lot of changed files in this PR, but the changes are almost entirely trivial. The only nontrivial changes relate to the fact that the objects these macros create need to be properly scoped, which meant some variable declarations had to be moved around so they were at function scope.

I haven't documented this stuff, but mostly because I don't think we have any documentation right now on the profiling stuff and I just don't have the time at the moment to create that from scratch. I'll get to it in another PR.

PR Checklist

Code passes cpplint
New features are documented.
Adds a test for any bugs fixed. Adds tests for new features.
Code is formatted
Changes are summarized in CHANGELOG.md
CI has been triggered on Darwin for performance regression tests.
Docs build
(@lanl.gov employees) Update copyright on changed files

Yurlungur

I like this

src/utils/instrument.hpp

… post process output from kokkos simple kernel timer.

jdolence · 2023-11-01T20:49:11Z

@pgrete want to make sure this doesn't interfere with something you're doing downstream. should be really quick to look over, despite the 40 modified files...

pgrete

Quickly double checking: Given that the "standard" push and pop regions are still called in the KokkosTimer all existing instrumentation (and their interfaces to other tools) should work as is (minus some changed names), correct?

I like this approach and would like to briefly try in AthenaPK before pushing approve.
I should be able to test prior to the sync on Thursday.

The key difference to the previous approach (when not using plain Kokkos regions, which is still supported) is that PARTHENON_INSTRUMENT_REGION would need to be scoped as there's no explicit "end" for the macros, right?

Finally, I noticed that you replaced all(?) existing names with the auto based scheme.
How did this translate to your profiling experience in practice?
I imagine a couple of places where the explicit names introduce some value add, e.g., for the Step() function in the driver to differentiate between the MultiStage_Step and the MultiStageBlockTask_Step, which in the current version would differ only in their line number if I'm not mistaken.

src/utils/instrument.hpp

jdolence · 2023-11-07T16:36:21Z

Quickly double checking: Given that the "standard" push and pop regions are still called in the KokkosTimer all existing instrumentation (and their interfaces to other tools) should work as is (minus some changed names), correct?

Yes, that's correct.

I like this approach and would like to briefly try in AthenaPK before pushing approve. I should be able to test prior to the sync on Thursday.

Cool, sounds good.

The key difference to the previous approach (when not using plain Kokkos regions, which is still supported) is that PARTHENON_INSTRUMENT_REGION would need to be scoped as there's no explicit "end" for the macros, right?

Yes, that's right. We could instead make the calls to push/popRegion with PARTHENON_AUTO_LABEL to avoid the scoping issue (which can be mildly annoying). The reason I replaced everything with the macro is that it should make it easier to change the behavior the profiling everywhere in the code, should we decide to try something else later.

Finally, I noticed that you replaced all(?) existing names with the auto based scheme. How did this translate to your profiling experience in practice? I imagine a couple of places where the explicit names introduce some value add, e.g., for the Step() function in the driver to differentiate between the MultiStage_Step and the MultiStageBlockTask_Step, which in the current version would differ only in their line number if I'm not mistaken.

Yes, I replaced all the names (or at least intended to). The nice thing about this is that it makes it straightforward to post-process the kokkos profiling data (I've really only played with the simple kernel timer output) to give really useful views of performance that are otherwise annoying to get at. For example, how does performance compare across kernels in CalculateFluxes? How much overhead is involved in building our packs? This kind of stuff is easy to see in, for example, the snippet I shared in the PR description.

I think your example is correct. I think they would only differ by the line number. Two comments, though. Presumably you know which driver you're using so I'm not sure there's much ambiguity here in practice. Also, the line number does guarantee an unambiguous pointer to the right piece of code. If a developer were lazy, not thinking carefully, or just otherwise unaware of other things in the code, I think it's possible to name two regions with the same string, which might lead to some confusion.

There are a couple places where I'd agree the line numbers are perhaps slightly more annoying. The two that come to mind in our current code are in the RedistributeAndRefineMeshBlocks and WriteOutputImpl functions where we have multiple regions that profile pieces of those functions. It's not immediately obvious from the line numbers what those regions are doing, whereas a well chosen name might make that more clear. On the other hand, because each region is tagged with the line number and function name, the post processing can pull them all together so it's more clear they are all just parts of the same function. Then if you care about digging in deeper, it's just a line number away.

pgrete · 2023-12-05T15:22:08Z

I now tested this in AthenaPK and it works like a charm.
In addition to the proposed changes, it'd be great to add sth like PARTHENON_INSTRUMENT_REGION_PUSH/POP to the instrumentation header for completeness (as this would allow the explicit use of Kokkos push/pop without calling it directly, i.e., it'd be in line with our general intermediate abstraction layer).
Also I'd be really nice to have a minimal doc point to the high level interface.

Yurlungur · 2023-12-14T18:13:38Z

@jdolence can you respond to @pgrete 's comments so we can pull this through?

jdolence · 2024-01-10T17:09:01Z

@pgrete I think I've at least responded to all your comments. Added PUSH/POP and a quick doc. Did not adopt your suggested change, but I'm betting that's not blocking. Ready to merge?

pgrete

LGTM

jdolence added 3 commits October 31, 2023 10:59

wrap profiling calls in a macro/struct

863ec24

actually add the new file

140ca4d

and add it to the cmake

4856d8b

Yurlungur approved these changes Oct 31, 2023

View reviewed changes

src/utils/instrument.hpp Outdated Show resolved Hide resolved

jdolence added 3 commits October 31, 2023 16:40

add some more macros for auto profiling

27dffb4

missed one

410ceb5

fix style

14a1060

jdolence requested a review from pgrete November 1, 2023 15:10

jdolence added 4 commits November 1, 2023 09:33

make auto named profiling variables unique

e9f755e

missed a few

3b70bbb

add missing function profiling calls. add new badly written script to…

220a053

… post process output from kokkos simple kernel timer.

formatting

883323c

jdolence changed the title ~~WIP: Refactor how we interface with performance instrumentation~~ Refactor how we interface with performance instrumentation Nov 1, 2023

jdolence and others added 4 commits November 1, 2023 14:20

just use PARTHENON_INSTRUMENT where possible

243dd32

Merge branch 'develop' into jdolence/instrument

f2dfc17

changelog

57ffd1d

oops, tried to instrument a host-device function

e1c960b

pgrete reviewed Nov 7, 2023

View reviewed changes

src/utils/instrument.hpp Outdated Show resolved Hide resolved

src/utils/instrument.hpp Outdated Show resolved Hide resolved

Yurlungur mentioned this pull request Nov 7, 2023

Add hook for UserWorkBeforeLoop #971

Merged

8 tasks

Merge branch 'develop' into jdolence/instrument

32a616f

jdolence added 3 commits January 9, 2024 17:58

merge develop

cf8fec1

formatting

78e612c

address review comments

20a5007

fix formatting of doc

decffd5

pgrete approved these changes Jan 12, 2024

View reviewed changes

pgrete enabled auto-merge (squash) January 12, 2024 15:03

pgrete merged commit 4899b6c into develop Jan 12, 2024
49 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor how we interface with performance instrumentation #969

Refactor how we interface with performance instrumentation #969

jdolence commented Oct 31, 2023 •

edited

Loading

Yurlungur left a comment

jdolence commented Nov 1, 2023

pgrete left a comment

jdolence commented Nov 7, 2023

pgrete commented Dec 5, 2023

Yurlungur commented Dec 14, 2023

jdolence commented Jan 10, 2024

pgrete left a comment

Refactor how we interface with performance instrumentation #969

Refactor how we interface with performance instrumentation #969

Conversation

jdolence commented Oct 31, 2023 • edited Loading

PR Summary

PR Checklist

Yurlungur left a comment

Choose a reason for hiding this comment

jdolence commented Nov 1, 2023

pgrete left a comment

Choose a reason for hiding this comment

jdolence commented Nov 7, 2023

pgrete commented Dec 5, 2023

Yurlungur commented Dec 14, 2023

jdolence commented Jan 10, 2024

pgrete left a comment

Choose a reason for hiding this comment

jdolence commented Oct 31, 2023 •

edited

Loading