WIP: Attempt to fix hanging at first output #979
Conversation
Interesting. The proposed change should never result in a performance degradation as far as I can tell. I'm asking because I've run into hangs/timeouts when starting up a large sim.
Oh, that's great! I assumed we were losing a bit of HDF5 spin-up/file-open time by blocking on everybody finishing computation first. Hopefully that's not much time compared to the full output write, and maybe HDF5 is doing that internally anyway?

Yeah, this is related to hangs when starting any large-ish sim on Frontier (it becomes a big problem above roughly 8 ranks), though I also see it on CPUs in my CI. Anecdotally, it started around the time I bumped Parthenon to include the MG stuff, but I pulled in a bunch of other changes when I did that, so while it's consistent with your description I wouldn't call it more evidence.

My other longstanding issue with SMR was (hopefully) solved by a much more mundane fix for overstepping memory when applying domain boundaries. As for the issue with reductions, I don't know. I want to experiment with rolling back my other, custom fix for that issue and see if I can get simulations to both start and finish without hanging forever.
Are you running with this var set?
It looks like this doesn't solve the hang in all cases for me: my initial couple of tests seemed to work, but I still often see hangs with many blocks. I'll keep messing with it today.

I wasn't using that variable, but testing with and without it doesn't seem to fix the hangs. What does it do?
It disables hardware MPI tag matching on Slingshot. For large-scale jobs on Frontier, it's often necessary to prevent the "undeliverable" MPI errors (which seem to be related to the number of in-flight messages on the network). On another system, this also seemed to help prevent hangs I saw during I/O with another code, but I only did limited testing.
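For reference, a minimal sketch of pinning a libfabric matching mode before MPI initializes. The specific variable asked about above isn't preserved in this thread, so the name `FI_CXI_RX_MATCH_MODE` and the value `"software"` below are assumptions about the Slingshot/CXI setting commonly used for this; exporting it in the job script is the more usual route.

```cpp
// Hedged sketch: the variable name and value are assumptions, not taken
// from this thread. Setting it before MPI_Init lets libfabric pick it up
// when the CXI provider initializes.
#include <cstdlib>
#include <mpi.h>

int main(int argc, char **argv) {
  // Don't clobber a value the user already exported in the job script.
  setenv("FI_CXI_RX_MATCH_MODE", "software", /*overwrite=*/0);

  MPI_Init(&argc, &argv);
  // ... run the simulation ...
  MPI_Finalize();
  return 0;
}
```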
Closing this. As described, this was only ever present in my downstream due to a merge bug; see the discussion in #963.
PR Summary
This is a ham-fisted attempt at fixing an MPI hang in KHARMA on slow filesystems when both SMR and HDF5 outputs are enabled. The hang appears not to occur for AMR (at least, early on) or single-mesh runs.
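As a rough illustration (not the actual patch), the change described above amounts to synchronizing all ranks before the collective HDF5 file open, so no rank enters the parallel open while others are still finishing computation. The helper name below is hypothetical.

```cpp
// Minimal sketch, assuming the fix boils down to a barrier before the
// collective open. Not the real KHARMA/Parthenon output path.
#include <string>
#include <mpi.h>
#include <hdf5.h>

hid_t OpenOutputFile(const std::string &fname, MPI_Comm comm) {
  // Wait for every rank to reach the output stage before touching the file,
  // so compute-phase messages aren't still in flight during collective I/O.
  MPI_Barrier(comm);

  // Standard parallel-HDF5 open through the MPI-IO driver.
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
  hid_t file = H5Fcreate(fname.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
  H5Pclose(fapl);
  return file;
}
```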
The patch as-is doesn't seem to fully fix the issue for me, but it does make the hang somewhat less frequent. If I get a solid fix working, I'll post it here and remove the WIP label.
PR Checklist