WIP: Attempt to fix hanging at first output #979
Conversation
Interesting. The proposed change should never result in a performance degradation as far as I can tell. I'm asking because I've run into hangs/timeouts when starting up a large sim.
Oh, that's great! I assumed we were losing a bit of HDF5 spin-up/file-open time by blocking on everybody finishing computation first. Hopefully that's not much time compared to the full output write, and maybe HDF5 is doing that internally anyway?

Yeah, this is related to hangs when starting any large-ish sim on Frontier (it becomes a big problem above roughly 8 ranks), though I also see it on CPUs in my CI. Anecdotally, it started around the time I bumped Parthenon to include the MG stuff, but I pulled in a bunch of other changes when I did that, so while it's consistent with your description I wouldn't call it more evidence.

My other longstanding issue with SMR was (hopefully) solved by a much more mundane fix for overstepping memory when applying domain boundaries. As for the issue with reductions, I don't know. I want to experiment with rolling back my other, custom fix for that issue and see if I can get simulations to both start and finish without hanging forever.
Are you running with this var set?
It looks like this doesn't solve the hang in all cases for me: my initial couple of tests seemed to work, but I still often see hangs with many blocks. I'll keep messing with it today.

I wasn't using that variable, but testing with and without it doesn't seem to fix the hangs. What does it do?
It disables hardware MPI tag matching on Slingshot. For large-scale jobs on Frontier, it's often necessary to prevent the "undeliverable" MPI errors (which seem to be related to the number of in-flight messages on the network). On another system, this also seemed to help prevent hangs I saw during I/O with another code, but I only did limited testing.
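For reference, a minimal sketch of pinning a libfabric matching mode before MPI initializes. The specific variable asked about above isn't preserved in this thread, so the name `FI_CXI_RX_MATCH_MODE` and the value `"software"` below are assumptions about the Slingshot/CXI setting commonly used for this; exporting it in the job script is the more usual route.

```cpp
// Hedged sketch: the variable name and value are assumptions, not taken
// from this thread. Setting it before MPI_Init lets libfabric pick it up
// when the CXI provider initializes.
#include <cstdlib>
#include <mpi.h>

int main(int argc, char **argv) {
  // Don't clobber a value the user already exported in the job script.
  setenv("FI_CXI_RX_MATCH_MODE", "software", /*overwrite=*/0);

  MPI_Init(&argc, &argv);
  // ... run the simulation ...
  MPI_Finalize();
  return 0;
}
```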
Closing this. As described, this was only ever present in my downstream due to a merge bug; see the discussion in #963.
PR Summary
This is a ham-fisted attempt at fixing an MPI hang in KHARMA on slow filesystems when both SMR and HDF5 outputs are enabled. The hang appears not to occur for AMR (at least, early on) or single-mesh runs.
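As a rough illustration (not the actual patch), the change described above amounts to synchronizing all ranks before the collective HDF5 file open, so no rank enters the parallel open while others are still finishing computation. The helper name below is hypothetical.

```cpp
// Minimal sketch, assuming the fix boils down to a barrier before the
// collective open. Not the real KHARMA/Parthenon output path.
#include <string>
#include <mpi.h>
#include <hdf5.h>

hid_t OpenOutputFile(const std::string &fname, MPI_Comm comm) {
  // Wait for every rank to reach the output stage before touching the file,
  // so compute-phase messages aren't still in flight during collective I/O.
  MPI_Barrier(comm);

  // Standard parallel-HDF5 open through the MPI-IO driver.
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
  hid_t file = H5Fcreate(fname.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
  H5Pclose(fapl);
  return file;
}
```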
The patch as-is doesn't seem to fully fix the issue for me, but it does make the hang somewhat less frequent. If I get a solid fix working, I'll post it here and remove the WIP label.
PR Checklist