pack_size != -1
"Memory access fault" on Frontier
#115
Comments
This works as expected on GH200, so the "Memory access fault" seems to be one of the standard Frontier/LUMI/MI250X/Cray errors.
It's probably an LLVM AMDGPU compiler bug. It's been known for years, but AMD has not been able to fix it: https://discourse.llvm.org/t/how-to-verify-correct-regalloc-for-a-kernel/80811

The cause: when register pressure is high and there is conditional execution (which covers virtually all of our kernels), the compiler can produce incorrect machine code for restoring registers that were spilled to memory after running out of hardware registers. That restore code trashes the registers that hold memory addresses. Then, boom, memory error and crash.

So far we've only seen it with reaction networks (which use ~1000s of registers). But as the AMD engineer says in the thread, it's not predictable when it happens, it cannot be verified that any given kernel is compiled correctly, and the bug is difficult to see even when manually inspecting the generated machine code.
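Since the bug sits in the spill/restore path, one way to triage is to list which kernels spill registers at all. The sketch below parses the remarks that clang emits for AMDGPU targets with `-Rpass-analysis=kernel-resource-usage` and flags kernels with a nonzero spill count. The remark line format is an assumption based on the LLVM AMDGPU documentation and may differ between compiler versions. The sample text, file name, kernel names, and numbers are made up for illustration and are not output from AthenaPK. A nonzero spill count does not mean a kernel is miscompiled; it only marks kernels where spill/restore code exists.

```python
import re

# Hypothetical sample of kernel-resource-usage remarks (illustrative values only).
SAMPLE_REMARKS = """\
remark: net.cpp:10:0: Function Name: small_kernel
remark: net.cpp:10:0:     SGPRs: 24
remark: net.cpp:10:0:     VGPRs: 9
remark: net.cpp:10:0:     ScratchSize [bytes/lane]: 0
remark: net.cpp:10:0:     SGPRs Spill: 0
remark: net.cpp:10:0:     VGPRs Spill: 0
remark: net.cpp:42:0: Function Name: reaction_network_kernel
remark: net.cpp:42:0:     SGPRs: 102
remark: net.cpp:42:0:     VGPRs: 256
remark: net.cpp:42:0:     ScratchSize [bytes/lane]: 4096
remark: net.cpp:42:0:     SGPRs Spill: 12
remark: net.cpp:42:0:     VGPRs Spill: 873
"""

# Assumed line shape: "remark: <file>:<line>:<col>: <key>: <value>"
FIELD_RE = re.compile(
    r"^remark: .*?:\d+:\d+:\s+(?P<key>[^:]+?):\s+(?P<value>\S+)\s*$"
)


def parse_resource_usage(text):
    """Return {kernel_name: {field: int}} from resource-usage remarks."""
    kernels = {}
    current = None
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if not match:
            continue  # skip anything that is not a resource-usage remark
        key, value = match.group("key"), match.group("value")
        if key == "Function Name":
            current = kernels.setdefault(value, {})
        elif current is not None and value.isdigit():
            current[key] = int(value)
    return kernels


def kernels_with_spills(kernels):
    """Names of kernels that report any SGPR or VGPR spills, sorted."""
    return sorted(
        name
        for name, usage in kernels.items()
        if usage.get("SGPRs Spill", 0) > 0 or usage.get("VGPRs Spill", 0) > 0
    )


if __name__ == "__main__":
    for name in kernels_with_spills(parse_resource_usage(SAMPLE_REMARKS)):
        print(name)
```

In practice the input would be the compiler's stderr captured during the build rather than the embedded sample.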
Here's another example of this kind of compiler bug: llvm/llvm-project#96353
yikes... I guess we'll wait and see then.
The PR that was expected to fix (all?) of these kinds of bugs was just merged into LLVM: llvm/llvm-project#93526. It may be possible to build a working compiler using Spack with
While running some tests on Frontier I noticed the following issue:
It should be confirmed whether this is Frontier-specific or a more general AthenaPK or Parthenon issue.