MPI Hang at first file output under SMR #43

Closed
bprather opened this issue Nov 28, 2023 · 1 comment
Labels
bug Something isn't working

Comments

Contributor

bprather commented Nov 28, 2023

When any HDF5 output is enabled in KHARMA (i.e. dumps and/or restarts) and the code is run with multiple MPI ranks, it can hang at the first file write, just after printing Running X driver...

This appears to be an issue with Parthenon, possibly introduced by recent MPI changes for new features. A Parthenon PR with a possible fix is here: parthenon-hpc-lab/parthenon#979

Disabling all HDF5 outputs avoids the hang. Since it "only" happens ~80% of the time, it can sometimes be circumvented just by restarting the same run until it gets through. Disabling SMR (either not refining at all or using AMR instead) generally also fixes the issue, or at least makes the race condition much less frequent. It might also depend on the number of blocks or blocks per rank, though.
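
For anyone stuck on the restart workaround in the meantime, here is a rough sketch of a watchdog that automates it. It is not part of KHARMA or Parthenon: `run_until_progress`, `chance_of_success`, and the `made_progress` callback are hypothetical names made up for this example, and you have to supply your own launch command and your own check for "the run got past the first write". The probability helper assumes each attempt hangs independently at the same rate, which is only a guess.

```python
import subprocess
import sys
import time


def run_until_progress(cmd, made_progress, startup_timeout_s, max_attempts, poll_s=1.0):
    """Launch cmd and restart it if it shows no progress within startup_timeout_s.

    made_progress is a caller-supplied function returning True once the run
    is known to be past the point where it would hang (an assumption of this
    sketch; how to detect that is up to the caller).
    Returns (attempt_number, exit_code), or None if every attempt stalled.
    """
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.Popen(cmd)
        deadline = time.monotonic() + startup_timeout_s
        while time.monotonic() < deadline:
            if made_progress() or proc.poll() is not None:
                # Either the run is past the risky point or it exited by
                # itself, so wait for it normally and report its exit code.
                return attempt, proc.wait()
            time.sleep(poll_s)
        # No progress before the deadline: treat it as the hang and restart.
        print(f"attempt {attempt} stalled; restarting", file=sys.stderr)
        proc.terminate()  # polite stop first so a launcher can clean up
        try:
            proc.wait(timeout=30)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return None


def chance_of_success(p_hang, attempts):
    """Probability that at least one of `attempts` independent tries does not hang."""
    return 1.0 - p_hang ** attempts
```

With the ~80% hang rate quoted above, and if attempts really are independent, `chance_of_success(0.8, 5)` is about 0.67, so a handful of restarts is often but not always enough.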

@bprather bprather added the bug Something isn't working label Nov 28, 2023
@bprather bprather changed the title MPI Hang at first file output MPI Hang at first file output under SMR Nov 28, 2023
Contributor Author

bprather commented Dec 4, 2023

This is fixed by the updated form of a Parthenon PR we merged: parthenon-hpc-lab/parthenon#963

Remaining CI failures of feature/parthenon-bump are actual KHARMA bugs, quick stuff related to boundary conditions.

@bprather bprather closed this as completed Dec 4, 2023