Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for "analysis" restart #963

Merged
merged 10 commits into from
Apr 11, 2024
Merged

Add support for "analysis" restart #963

merged 10 commits into from
Apr 11, 2024

Conversation

pgrete
Copy link
Collaborator

@pgrete pgrete commented Oct 19, 2023

PR Summary

This is a quick and dirty implementation of "analysis" restarts, i.e., restarts that do not enter a driver and only dump a specified subset of outputs.
Restarting works with -a FILE and output are enabled to be dumped for an analysis restart by adding analysis_output=true to the output block.
No other output is being written or data modified and the approach is also memory friendly as no extra memory is allocated (e.g., for different stages in a driver).
Still there's more room for improvement (e.g., by skipping flux allocations but that resulted in some weird errors when I briefly tested it).

I opened this PR in order to get feedback as this really is just a quick and dirty solution (one can consider this PEP2 ;) ).
It somehow feels wrong to start an EvolutionDriver and then skip "evolution" but this approach allows this to be used by downstream codes without any modification.

What do people think?

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

Copy link
Collaborator

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this kind of capability is a long time coming! Nice to see it's getting some attention. I am very enthusiastic about the concept. A few (perhaps dumb) questions about this design:

  1. What are the intended hooks as far as doing analysis without launching into the driver? For example, could I not get the same capability implemented by simply writing two drivers in my downstream code, one for time evolution and one for post-processing? Should I think of this as hooks for the downstream code to do just that?
  2. Why does the mesh need to know anything about whether we're running analysis or not? This seems like something that should be handled only at the parthenon manager level?

@pgrete
Copy link
Collaborator Author

pgrete commented Nov 6, 2023

Overall this kind of capability is a long time coming! Nice to see it's getting some attention. I am very enthusiastic about the concept. A few (perhaps dumb) questions about this design:

This is why I consider this PEP2, so let's first discuss what the solution should look like ;)

1. What are the intended hooks as far as _doing_ analysis without launching into the driver? For example, could I not get the same capability implemented by simply writing two drivers in my downstream code, one for time evolution and one for post-processing? Should I think of this as hooks for the downstream code to do just that?

My main intention with the current approach was to make sth available through Parthenon without any modification of downstream codes.
This current version should work as is in any downstream code based on the EvolutionDriver.
An alternative might be to provide a simple AnalysisDriver class/template in Parthenon that can be easily adapted in downstream codes.
As far as I can tell we need at least some modifications in Parthenon, e.g., to handle command line arguments (and potentially additional input).

2. Why does the mesh need to know anything about whether we're running analysis or not? This seems like something that should be handled only at the parthenon manager level?

You're referring to the pmesh->analysis_flag variable, aren't you?
I put it there because it required the fewest changes in LOC.
As you say it logically doesn't really belong there so I could be moved up to the parthenon manager (and then passed down to the/a driver).

@Yurlungur
Copy link
Collaborator

My main intention with the current approach was to make sth available through Parthenon without any modification of downstream codes.
This current version should work as is in any downstream code based on the EvolutionDriver.
An alternative might be to provide a simple AnalysisDriver class/template in Parthenon that can be easily adapted in downstream codes.

Ah I see. I don't think I understood that. I had the AnalysisDriver model in my head, as opposed to the EvolutionDriver model. Your way makes sense, I think, since I think a downstream code can implement their own analysis pipelines with a new driver if they want to already.

You're referring to the pmesh->analysis_flag variable, aren't you?
I put it there because it required the fewest changes in LOC.
As you say it logically doesn't really belong there so I could be moved up to the parthenon manager (and then passed down to the/a driver).

I think I would prefer if this lived in the manager or the driver than in the mesh.

As far as I can tell we need at least some modifications in Parthenon, e.g., to handle command line arguments (and potentially additional input).

Yes with your model I think that makes sense.

@BenWibking
Copy link
Collaborator

BenWibking commented Dec 1, 2023

I tried turning on/off individual output blocks using the command line, e.g.:

./build/bin/athenaPK -a parthenon.restart.final.rhdf parthenon/output2/analysis_output=true parthenon/output1/analysis_output=false parthenon/output3/analysis_output=false

but it seems to always output all of the output blocks:

athenapk git:(forrestglines/parthenon-pr930) ✗ ls -lt | head
total 748596
-rw-r--r--  1 benwibking 19239961 Dec  1 12:12 parthenon.restart.00005.rhdf
-rw-r--r--  1 benwibking    21094 Dec  1 12:12 parthenon.restart.00005.rhdf.xdmf
-rw-r--r--  1 benwibking  9887769 Dec  1 12:12 parthenon.prim.00050.phdf
-rw-r--r--  1 benwibking    25954 Dec  1 12:12 parthenon.prim.00050.phdf.xdmf
-rw-r--r--  1 benwibking      170 Dec  1 12:12 parthenon.hst
-rw-r--r--  1 benwibking 19245585 Dec  1 12:01 parthenon.restart.final.rhdf
-rw-r--r--  1 benwibking    21094 Dec  1 12:01 parthenon.restart.final.rhdf.xdmf
-rw-r--r--  1 benwibking 12791828 Dec  1 12:01 parthenon.prim.final.phdf
-rw-r--r--  1 benwibking    25954 Dec  1 12:01 parthenon.prim.final.phdf.xdmf

This was run with the AthenaPK inputs/turbulence.in input file. Is there something wrong with the above?

@pgrete
Copy link
Collaborator Author

pgrete commented Dec 1, 2023

Does this also happen for a non "final" restart dump?
I might be that the "standard" dumps are being triggered by a different condition.

@BenWibking
Copy link
Collaborator

Does this also happen for a non "final" restart dump? I might be that the "standard" dumps are being triggered by a different condition.

For a non-"final" restart, it doesn't output at all, no matter what options I give it.

@pgrete
Copy link
Collaborator Author

pgrete commented Dec 1, 2023

I just tried with a fresh build and it works for here:

[20:39:12][pgrete@haerke: ~/ana-test-delafter20231202]
$ mpirun -np 4 ../src/athenapk/build-analysis/bin/athenaPK -a parthenon.restart.00003.rhdf parthenon/output2/analysis_output=true 
Blocks assigned to rank 2: 2:2
Blocks assigned to rank 1: 1:1
Blocks assigned to rank 0: 0:0
Blocks assigned to rank 3: 3:3
Var: cons:9
Starting up hydro driver
# Variables in use:
# Package: parthenon::resolved_state
# ---------------------------------------------------
# Variables:
# Name	Metadata flags
# ---------------------------------------------------
turbulence_phases_k       None,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
turbulence_phases_j       None,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
turbulence_phases_i       None,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
acc                       Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
prim                      Cell,Provides,Real,Derived,Hydro,parthenon::resolved_state
cons                      Cell,Provides,Real,Independent,FillGhost,WithFluxes,Hydro,parthenon::resolved_state
# ---------------------------------------------------
# Sparse Variables:
# Name	sparse id	Metadata flags
# ---------------------------------------------------
# ---------------------------------------------------
# Swarms:
# Swarm	Value	metadata
# ---------------------------------------------------


Setup complete, executing driver...

cycle=1125 time=3.0003661614362418e+00 dt=2.2035771144258798e-03 zone-cycles/wsec_step=0.00e+00 wsec_total=1.70e-01 wsec_step=7.27e-01
-------------- New Mesh structure after (de)refinement -------------
Root grid = 1 x 2 x 2 MeshBlocks
Total number of MeshBlocks = 4
Number of physical refinement levels = 0
Number of logical  refinement levels = 1
  Physical level = 0 (logical level = 1): 4 MeshBlocks, cost = 4
--------------------------------------------------------------------

Driver completed.
time=3.00e+00 cycle=1125
tlim=5.00e+00 nlim=100000

walltime used = 1.70e-01
zone-cycles/wallsecond = 0.00e+00
[20:40:39][pgrete@haerke: ~/ana-test-delafter20231202]
$ ls -lahtr |tail -n 10
-rw-r--r--   1 pgrete users  26K Dec  1 20:37 parthenon.prim.00036.phdf.xdmf
-rw-r--r--   1 pgrete users  13M Dec  1 20:37 parthenon.prim.00036.phdf
-rw-r--r--   1 pgrete users  26K Dec  1 20:38 parthenon.prim.00037.phdf.xdmf
-rw-r--r--   1 pgrete users  13M Dec  1 20:38 parthenon.prim.00037.phdf
-rw-r--r--   1 pgrete users  13K Dec  1 20:38 parthenon.hst
-rw-r--r--   1 pgrete users  26K Dec  1 20:38 parthenon.prim.00038.phdf.xdmf
-rw-r--r--   1 pgrete users  13M Dec  1 20:38 parthenon.prim.00038.phdf
NEW -rw-r--r--   1 pgrete users  26K Dec  1 20:40 parthenon.prim.00031.phdf.xdmf
NEW -rw-r--r--   1 pgrete users 9.5M Dec  1 20:40 parthenon.prim.00031.phdf
drwxr-xr-x   2 pgrete users  12K Dec  1 20:40 .

you can also add parthenon/output2/id=analysis to get a distinct filename:

-rw-r--r--   1 pgrete users  26K Dec  1 20:43 parthenon.analysis.00031.phdf.xdmf
-rw-r--r--   1 pgrete users 9.5M Dec  1 20:43 parthenon.analysis.00031.phdf

Can you double check that your Parthenon submodule is up to date?

@BenWibking
Copy link
Collaborator

Ah, yeah, it was out of date. It works now.

@bprather
Copy link
Collaborator

bprather commented Dec 1, 2023

This PR is the source of our hanging woes, @pgrete. The constructor isn't setting the Mesh::analysis flag to 0 automatically, so you need to set it in all cases, not just when it should be true.

This was causing some MPI ranks to take the analysis output codepath, and some not, which had 3 possible effects:

  1. Maybe some compilers zero the new memory? No hangs on those, but I don't think any of my tested compilers really do that
  2. Some ranks took the analysis path but some didn't, hanging the process when they don't respond to boundary exchanges
  3. Rank 0 takes the analysis path, prematurely ending execution and printing the final stats.

EDIT proposing changes wouldn't let me add the new line so I committed the change, feel free to revert/etc. The formatting issue is in an unrelated area...

Apply a proposed fix for a hang introduced by analysis outputs
@pgrete
Copy link
Collaborator Author

pgrete commented Dec 5, 2023

I'm not questioning that this solves the issue, I'm just trying to understand "why" because I'm concerned that the issue may be somewhere deeper.
Both mesh constructors have nbnew(), nbdel(), analysis_flag(false), step_since_lb(), gflag(),, so how/why does the fix work? What am I missing?

@bprather
Copy link
Collaborator

bprather commented Dec 5, 2023

Hmm, yeah it looks like my downstream didn't have those initializers. They were lost when I merged the MG stuff, which also modifies both Mesh() constructors, and choosing Luke's changes overwrote the new constructors without me noticing (thought it was just formatting). I'm not sure if that says more about my lazy merging habits, or a code structure that encourages modifying a single central object for everything from output types to meshblock interaction changes...

Either way, my change probably only fixes my case, then. Feel free to revert/merge how you'd like, I'll just pull the results again when everything hits dev.

I will say though that I like the structure of initializing the different options for analysis_flag in the same place, rather than relying on the constructors. There's not a style reason to prefer one or the other from what I can tell, I just find it easier to read personally since the initializations at the top of Mesh are so long and complex.

Copy link
Collaborator

@bprather bprather left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After pulling this and using it a couple times, this works nicely with either .rhdf or .phdf files. I think KHARMA needs some more features before we'll use this much, but it definitely works fine and (down to my bad merge) doesn't break anything.

I don't mind at all the implementation being in EvolutionDriver, though I think we should just rename it DefaultDriver or BaseDriver at this point. So, approving.

However, I think there's a way of doing this that avoids adding another member to Mesh and instead uses ParameterInput to store the flag, potentially to the extent of being able to do analysis runs or restarts entirely with input files, e.g. ./downstream -i inputfile.par parthenon/job/do_analysis_only=true parthenon/job/restartfile=prob.outX.XXXXX.rhdf. Beyond being nice for changing input parameters upon analysis/restart in a systematic way (and leaving a record of all parameters, not just whatever's in the file), this might be a starting pattern toward detangling the current Mesh and Outputs code so that not as many features rely on it directly.

@pgrete pgrete changed the base branch from pgrete/histogram-outputs to develop April 10, 2024 15:26
@pgrete
Copy link
Collaborator Author

pgrete commented Apr 10, 2024

I now updated the changes following your comments, specifically

  • there's no new Mesh variable any more (the info is now stored in pinput)
  • added a short documentation also highlighted the combination with passing an input file for modifications (and fixed some rst warnings along the way)

I also decided to keep the -a flag rather than fully relying on the input modifications as I'm personally in favor of a clear separation between restarting and doing postprocessing that is guaranteed to not enter the main loop (though I'm happy to be convinced otherwise).

Would be great if you @Yurlungur @bprather could double check if you're happy with those changes.

Copy link
Collaborator

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot cleaner and clearer now. I like it. 👍

@bprather
Copy link
Collaborator

Great! Yeah I wasn't (or, I don't think I was) advocating for removing the -a flag. Just, I think the current code really nicely separates modes from meshes, so this is exactly what I could have hoped for. Thanks!

@pgrete pgrete enabled auto-merge (squash) April 11, 2024 07:08
@pgrete pgrete merged commit 2858a1a into develop Apr 11, 2024
49 checks passed
@pgrete pgrete deleted the pgrete/analysis-outputs branch April 11, 2024 08:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants