
WIP: Attempt to fix hanging at first output #979

Closed
wants to merge 1 commit

Conversation

bprather
Collaborator

@bprather bprather commented Nov 27, 2023

PR Summary

This is a ham-fisted attempt at fixing an MPI hang in KHARMA on slow filesystems when both SMR and HDF5 outputs are enabled. It appears not to happen for AMR (at least, early on) or single-mesh runs.

The patch as-is doesn't seem to fully fix the issue for me, but it does decrease the frequency a little. If I get a solid fix working I'll post it here and remove WIP.
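For reference, a minimal sketch of the kind of change described here, assuming the patch amounts to adding an MPI barrier before the collective HDF5 file open so no rank enters the open while others are still finishing computation; the function and names below are illustrative, not the actual Parthenon output code:

```c++
// Illustrative sketch only: synchronize all ranks before the collective HDF5
// open, so no rank calls H5Fcreate while others are still busy with
// computation or outstanding MPI communication.
#include <hdf5.h>
#include <mpi.h>

void WriteOutputFile(MPI_Comm comm, const char *fname) {
  MPI_Barrier(comm);  // proposed extra synchronization point before output

  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);  // collective parallel I/O
  hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

  // ... collective dataset writes go here ...

  H5Fclose(file);
  H5Pclose(fapl);
}
```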

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
  • Short summary of API changes at the top of the PR (plus optionally an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@pgrete
Collaborator

pgrete commented Nov 27, 2023

Interesting. The proposed change should never result in a performance degradation as far as I can tell.
Out of curiosity: is this related to the long-standing issue you've been fighting?

I'm asking because I've discovered hangs/timeouts when starting up a large sim.
All the tests I've done so far point to the multigrid PR, but it's not clear if it's machine/MPI-library specific or more general (or output related, or a combination thereof, or something else...)

@bprather
Collaborator Author

bprather commented Nov 27, 2023

Oh that's great! I assumed we were losing a bit of HDF5 spin-up/file open time by blocking on everybody finishing computation first. Hopefully that's not a lot of time compared to the full output write though, and maybe HDF5 is doing that internally anyway?

Yeah, this is related to hangs when starting any large-ish sim on Frontier (it becomes a big problem on >8-ish ranks), though I also see it on CPUs in my CI. Anecdotally, it started around the time I bumped Parthenon to include the MG stuff, but I pulled in a bunch of other changes at the same time, so while it's consistent with your description I wouldn't call it additional evidence.

My other long-standing issue with SMR was (hopefully) solved by a much more mundane fix to a memory overrun when applying domain boundaries. As for the issue with reductions, I don't know. I want to experiment with rolling back my other, custom fix to that issue and see whether I can get simulations to both start and finish without hanging forever.

@BenWibking
Collaborator

Are you running with this var set?

export FI_CXI_RX_MATCH_MODE=software

@bprather
Collaborator Author

It looks like this doesn't solve the hang in all cases for me -- my initial couple of tests seemed to work but I still often see hangs with many blocks. I'll keep messing with it today.

I wasn't using that variable, but testing with/without it doesn't seem to fix the hangs. What does it do?

@bprather bprather changed the title from "WIP: Fix a hang at first output in KHARMA" to "WIP: Attempt to fix hanging at first output" Nov 27, 2023
@BenWibking
Collaborator

BenWibking commented Nov 27, 2023

I wasn't using that variable, but testing with/without it doesn't seem to fix the hangs. What does it do?

It disables hardware MPI tag matching on Slingshot. For large-scale jobs on Frontier, it's often necessary to prevent the "undeliverable" MPI errors (which seem to be related to the number of in-flight messages on the network). On another system, this also seemed to help prevent hangs I saw during I/O with another code, but I only did limited testing.

@bprather
Collaborator Author

bprather commented Dec 8, 2023

Closing this. As described, this was only ever present in my downstream due to a merge bug; see the discussion in #963.

@bprather bprather closed this Dec 8, 2023
@bprather bprather deleted the bprather/fix-init-output-hang branch July 23, 2024 16:43