Add support for "analysis" restart #963

pgrete · 2023-10-19T15:21:53Z

PR Summary

This is a quick and dirty implementation of "analysis" restarts, i.e., restarts that do not enter a driver and only dump a specified subset of outputs.
Restarting works with -a FILE and output are enabled to be dumped for an analysis restart by adding analysis_output=true to the output block.
No other output is being written or data modified and the approach is also memory friendly as no extra memory is allocated (e.g., for different stages in a driver).
Still there's more room for improvement (e.g., by skipping flux allocations but that resulted in some weird errors when I briefly tested it).

I opened this PR in order to get feedback as this really is just a quick and dirty solution (one can consider this PEP2 ;) ).
It somehow feels wrong to start an EvolutionDriver and then skip "evolution" but this approach allows this to be used by downstream codes without any modification.

What do people think?

PR Checklist

Code passes cpplint
New features are documented.
Adds a test for any bugs fixed. Adds tests for new features.
Code is formatted
Changes are summarized in CHANGELOG.md
CI has been triggered on Darwin for performance regression tests.
Docs build
(@lanl.gov employees) Update copyright on changed files

Yurlungur

Overall this kind of capability is a long time coming! Nice to see it's getting some attention. I am very enthusiastic about the concept. A few (perhaps dumb) questions about this design:

What are the intended hooks as far as doing analysis without launching into the driver? For example, could I not get the same capability implemented by simply writing two drivers in my downstream code, one for time evolution and one for post-processing? Should I think of this as hooks for the downstream code to do just that?
Why does the mesh need to know anything about whether we're running analysis or not? This seems like something that should be handled only at the parthenon manager level?

pgrete · 2023-11-06T17:08:49Z

Overall this kind of capability is a long time coming! Nice to see it's getting some attention. I am very enthusiastic about the concept. A few (perhaps dumb) questions about this design:

This is why I consider this PEP2, so let's first discuss what the solution should look like ;)

1. What are the intended hooks as far as _doing_ analysis without launching into the driver? For example, could I not get the same capability implemented by simply writing two drivers in my downstream code, one for time evolution and one for post-processing? Should I think of this as hooks for the downstream code to do just that?

My main intention with the current approach was to make sth available through Parthenon without any modification of downstream codes.
This current version should work as is in any downstream code based on the EvolutionDriver.
An alternative might be to provide a simple AnalysisDriver class/template in Parthenon that can be easily adapted in downstream codes.
As far as I can tell we need at least some modifications in Parthenon, e.g., to handle command line arguments (and potentially additional input).

2. Why does the mesh need to know anything about whether we're running analysis or not? This seems like something that should be handled only at the parthenon manager level?

You're referring to the pmesh->analysis_flag variable, aren't you?
I put it there because it required the fewest changes in LOC.
As you say it logically doesn't really belong there so I could be moved up to the parthenon manager (and then passed down to the/a driver).

Yurlungur · 2023-11-06T18:10:59Z

My main intention with the current approach was to make sth available through Parthenon without any modification of downstream codes.
This current version should work as is in any downstream code based on the EvolutionDriver.
An alternative might be to provide a simple AnalysisDriver class/template in Parthenon that can be easily adapted in downstream codes.

Ah I see. I don't think I understood that. I had the AnalysisDriver model in my head, as opposed to the EvolutionDriver model. Your way makes sense, I think, since I think a downstream code can implement their own analysis pipelines with a new driver if they want to already.

You're referring to the pmesh->analysis_flag variable, aren't you?
I put it there because it required the fewest changes in LOC.
As you say it logically doesn't really belong there so I could be moved up to the parthenon manager (and then passed down to the/a driver).

I think I would prefer if this lived in the manager or the driver than in the mesh.

As far as I can tell we need at least some modifications in Parthenon, e.g., to handle command line arguments (and potentially additional input).

Yes with your model I think that makes sense.

BenWibking · 2023-12-01T17:13:57Z

I tried turning on/off individual output blocks using the command line, e.g.:

./build/bin/athenaPK -a parthenon.restart.final.rhdf parthenon/output2/analysis_output=true parthenon/output1/analysis_output=false parthenon/output3/analysis_output=false

but it seems to always output all of the output blocks:

➜  athenapk git:(forrestglines/parthenon-pr930) ✗ ls -lt | head
total 748596
-rw-r--r--  1 benwibking 19239961 Dec  1 12:12 parthenon.restart.00005.rhdf
-rw-r--r--  1 benwibking    21094 Dec  1 12:12 parthenon.restart.00005.rhdf.xdmf
-rw-r--r--  1 benwibking  9887769 Dec  1 12:12 parthenon.prim.00050.phdf
-rw-r--r--  1 benwibking    25954 Dec  1 12:12 parthenon.prim.00050.phdf.xdmf
-rw-r--r--  1 benwibking      170 Dec  1 12:12 parthenon.hst
-rw-r--r--  1 benwibking 19245585 Dec  1 12:01 parthenon.restart.final.rhdf
-rw-r--r--  1 benwibking    21094 Dec  1 12:01 parthenon.restart.final.rhdf.xdmf
-rw-r--r--  1 benwibking 12791828 Dec  1 12:01 parthenon.prim.final.phdf
-rw-r--r--  1 benwibking    25954 Dec  1 12:01 parthenon.prim.final.phdf.xdmf

This was run with the AthenaPK inputs/turbulence.in input file. Is there something wrong with the above?

pgrete · 2023-12-01T19:17:05Z

Does this also happen for a non "final" restart dump?
I might be that the "standard" dumps are being triggered by a different condition.

BenWibking · 2023-12-01T19:20:12Z

Does this also happen for a non "final" restart dump? I might be that the "standard" dumps are being triggered by a different condition.

For a non-"final" restart, it doesn't output at all, no matter what options I give it.

pgrete · 2023-12-01T19:44:12Z

I just tried with a fresh build and it works for here:

[20:39:12][pgrete@haerke: ~/ana-test-delafter20231202]
$ mpirun -np 4 ../src/athenapk/build-analysis/bin/athenaPK -a parthenon.restart.00003.rhdf parthenon/output2/analysis_output=true 
Blocks assigned to rank 2: 2:2
Blocks assigned to rank 1: 1:1
Blocks assigned to rank 0: 0:0
Blocks assigned to rank 3: 3:3
Var: cons:9
Starting up hydro driver
# Variables in use:
# Package: parthenon::resolved_state
# ---------------------------------------------------
# Variables:
# Name	Metadata flags
# ---------------------------------------------------
turbulence_phases_k       None,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
turbulence_phases_j       None,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
turbulence_phases_i       None,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
acc                       Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
prim                      Cell,Provides,Real,Derived,Hydro,parthenon::resolved_state
cons                      Cell,Provides,Real,Independent,FillGhost,WithFluxes,Hydro,parthenon::resolved_state
# ---------------------------------------------------
# Sparse Variables:
# Name	sparse id	Metadata flags
# ---------------------------------------------------
# ---------------------------------------------------
# Swarms:
# Swarm	Value	metadata
# ---------------------------------------------------


Setup complete, executing driver...

cycle=1125 time=3.0003661614362418e+00 dt=2.2035771144258798e-03 zone-cycles/wsec_step=0.00e+00 wsec_total=1.70e-01 wsec_step=7.27e-01
-------------- New Mesh structure after (de)refinement -------------
Root grid = 1 x 2 x 2 MeshBlocks
Total number of MeshBlocks = 4
Number of physical refinement levels = 0
Number of logical  refinement levels = 1
  Physical level = 0 (logical level = 1): 4 MeshBlocks, cost = 4
--------------------------------------------------------------------

Driver completed.
time=3.00e+00 cycle=1125
tlim=5.00e+00 nlim=100000

walltime used = 1.70e-01
zone-cycles/wallsecond = 0.00e+00
[20:40:39][pgrete@haerke: ~/ana-test-delafter20231202]
$ ls -lahtr |tail -n 10
-rw-r--r--   1 pgrete users  26K Dec  1 20:37 parthenon.prim.00036.phdf.xdmf
-rw-r--r--   1 pgrete users  13M Dec  1 20:37 parthenon.prim.00036.phdf
-rw-r--r--   1 pgrete users  26K Dec  1 20:38 parthenon.prim.00037.phdf.xdmf
-rw-r--r--   1 pgrete users  13M Dec  1 20:38 parthenon.prim.00037.phdf
-rw-r--r--   1 pgrete users  13K Dec  1 20:38 parthenon.hst
-rw-r--r--   1 pgrete users  26K Dec  1 20:38 parthenon.prim.00038.phdf.xdmf
-rw-r--r--   1 pgrete users  13M Dec  1 20:38 parthenon.prim.00038.phdf
NEW -rw-r--r--   1 pgrete users  26K Dec  1 20:40 parthenon.prim.00031.phdf.xdmf
NEW -rw-r--r--   1 pgrete users 9.5M Dec  1 20:40 parthenon.prim.00031.phdf
drwxr-xr-x   2 pgrete users  12K Dec  1 20:40 .

you can also add parthenon/output2/id=analysis to get a distinct filename:

-rw-r--r--   1 pgrete users  26K Dec  1 20:43 parthenon.analysis.00031.phdf.xdmf
-rw-r--r--   1 pgrete users 9.5M Dec  1 20:43 parthenon.analysis.00031.phdf

Can you double check that your Parthenon submodule is up to date?

BenWibking · 2023-12-01T19:54:09Z

Ah, yeah, it was out of date. It works now.

bprather · 2023-12-01T22:09:12Z

This PR is the source of our hanging woes, @pgrete. The constructor isn't setting the Mesh::analysis flag to 0 automatically, so you need to set it in all cases, not just when it should be true.

This was causing some MPI ranks to take the analysis output codepath, and some not, which had 3 possible effects:

Maybe some compilers zero the new memory? No hangs on those, but I don't think any of my tested compilers really do that
Some ranks took the analysis path but some didn't, hanging the process when they don't respond to boundary exchanges
Rank 0 takes the analysis path, prematurely ending execution and printing the final stats.

EDIT proposing changes wouldn't let me add the new line so I committed the change, feel free to revert/etc. The formatting issue is in an unrelated area...

Apply a proposed fix for a hang introduced by analysis outputs

pgrete · 2023-12-05T10:30:28Z

I'm not questioning that this solves the issue, I'm just trying to understand "why" because I'm concerned that the issue may be somewhere deeper.
Both mesh constructors have nbnew(), nbdel(), analysis_flag(false), step_since_lb(), gflag(),, so how/why does the fix work? What am I missing?

bprather · 2023-12-05T16:26:13Z

Hmm, yeah it looks like my downstream didn't have those initializers. They were lost when I merged the MG stuff, which also modifies both Mesh() constructors, and choosing Luke's changes overwrote the new constructors without me noticing (thought it was just formatting). I'm not sure if that says more about my lazy merging habits, or a code structure that encourages modifying a single central object for everything from output types to meshblock interaction changes...

Either way, my change probably only fixes my case, then. Feel free to revert/merge how you'd like, I'll just pull the results again when everything hits dev.

I will say though that I like the structure of initializing the different options for analysis_flag in the same place, rather than relying on the constructors. There's not a style reason to prefer one or the other from what I can tell, I just find it easier to read personally since the initializations at the top of Mesh are so long and complex.

bprather

After pulling this and using it a couple times, this works nicely with either .rhdf or .phdf files. I think KHARMA needs some more features before we'll use this much, but it definitely works fine and (down to my bad merge) doesn't break anything.

I don't mind at all the implementation being in EvolutionDriver, though I think we should just rename it DefaultDriver or BaseDriver at this point. So, approving.

However, I think there's a way of doing this that avoids adding another member to Mesh and instead uses ParameterInput to store the flag, potentially to the extent of being able to do analysis runs or restarts entirely with input files, e.g. ./downstream -i inputfile.par parthenon/job/do_analysis_only=true parthenon/job/restartfile=prob.outX.XXXXX.rhdf. Beyond being nice for changing input parameters upon analysis/restart in a systematic way (and leaving a record of all parameters, not just whatever's in the file), this might be a starting pattern toward detangling the current Mesh and Outputs code so that not as many features rely on it directly.

pgrete · 2024-04-10T15:33:01Z

I now updated the changes following your comments, specifically

there's no new Mesh variable any more (the info is now stored in pinput)
added a short documentation also highlighted the combination with passing an input file for modifications (and fixed some rst warnings along the way)

I also decided to keep the -a flag rather than fully relying on the input modifications as I'm personally in favor of a clear separation between restarting and doing postprocessing that is guaranteed to not enter the main loop (though I'm happy to be convinced otherwise).

Would be great if you @Yurlungur @bprather could double check if you're happy with those changes.

Yurlungur

This is a lot cleaner and clearer now. I like it. 👍

bprather · 2024-04-10T15:42:09Z

Great! Yeah I wasn't (or, I don't think I was) advocating for removing the -a flag. Just, I think the current code really nicely separates modes from meshes, so this is exactly what I could have hoped for. Thanks!

pgrete added 4 commits October 16, 2023 16:51

Add trigger for analysis restarts

0931c72

no final output for analysis output

32fd19a

Merge branch 'pgrete/histogram-outputs' into pgrete/analysis-outputs

8af38e4

Merge branch 'pgrete/histogram-outputs' into pgrete/analysis-outputs

725deed

pgrete requested review from Yurlungur, brryan, bprather, jdolence and lroberts36 October 19, 2023 15:21

Yurlungur reviewed Oct 26, 2023

View reviewed changes

Update parthenon_manager.cpp

8958ee4

Apply a proposed fix for a hang introduced by analysis outputs

bprather mentioned this pull request Dec 4, 2023

MPI Hang at first file output under SMR AFD-Illinois/kharma#43

Closed

bprather approved these changes Dec 5, 2023

View reviewed changes

bprather mentioned this pull request Dec 8, 2023

WIP: Attempt to fix hanging at first output #979

Closed

12 tasks

pgrete added 4 commits April 10, 2024 16:07

Merge branch 'develop' into pgrete/analysis-outputs

e60683b

Remove Mesh analysis_flag in favor of input one

3fcb7e6

Add doc

9e8cdcc

Fix rst warnings

d040ea0

pgrete changed the base branch from pgrete/histogram-outputs to develop April 10, 2024 15:26

Yurlungur approved these changes Apr 10, 2024

View reviewed changes

Merge branch 'develop' into pgrete/analysis-outputs

f8e8b2b

pgrete enabled auto-merge (squash) April 11, 2024 07:08

pgrete merged commit 2858a1a into develop Apr 11, 2024
49 checks passed

pgrete deleted the pgrete/analysis-outputs branch April 11, 2024 08:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for "analysis" restart #963

Add support for "analysis" restart #963

pgrete commented Oct 19, 2023

Yurlungur left a comment

pgrete commented Nov 6, 2023

Yurlungur commented Nov 6, 2023

BenWibking commented Dec 1, 2023 •

edited

Loading

pgrete commented Dec 1, 2023

BenWibking commented Dec 1, 2023

pgrete commented Dec 1, 2023

BenWibking commented Dec 1, 2023

bprather commented Dec 1, 2023 •

edited

Loading

pgrete commented Dec 5, 2023

bprather commented Dec 5, 2023

bprather left a comment •

edited

Loading

pgrete commented Apr 10, 2024

Yurlungur left a comment

bprather commented Apr 10, 2024

Add support for "analysis" restart #963

Add support for "analysis" restart #963

Conversation

pgrete commented Oct 19, 2023

PR Summary

PR Checklist

Yurlungur left a comment

Choose a reason for hiding this comment

pgrete commented Nov 6, 2023

Yurlungur commented Nov 6, 2023

BenWibking commented Dec 1, 2023 • edited Loading

pgrete commented Dec 1, 2023

BenWibking commented Dec 1, 2023

pgrete commented Dec 1, 2023

BenWibking commented Dec 1, 2023

bprather commented Dec 1, 2023 • edited Loading

pgrete commented Dec 5, 2023

bprather commented Dec 5, 2023

bprather left a comment • edited Loading

Choose a reason for hiding this comment

pgrete commented Apr 10, 2024

Yurlungur left a comment

Choose a reason for hiding this comment

bprather commented Apr 10, 2024

BenWibking commented Dec 1, 2023 •

edited

Loading

bprather commented Dec 1, 2023 •

edited

Loading

bprather left a comment •

edited

Loading