Skip to content

Commit

Permalink
assert wavefront size (AMReX-Codes#3777)
Browse files Browse the repository at this point in the history
## Summary

Add assert checking wavefront size for HIP build. It can hopefully alert
to problems and fail loudly for users trying to run on Navi cards,
rather than produce incorrect simulation results (until the underlying
issue is fixed). I opened an internal ticket that I hope will get the
underlying issue fixed soon.


## Additional background

AMReX users reported to AMD incorrect results with Navi cards.
I think there is a bug with the compiler's definition of
`__AMDGCN_WAVEFRONT_SIZE` when compiling for target gfx10XX, and find it
disagrees with `warpSize` that one can check at runtime.
For example, compiling for target gfx1030 and running on Navi card (with
an up-to-date ROCm):
```
#include <hip/hip_runtime.h>

int main () {
  hipInit(0);
  int device_id;
  hipGetDevice(&device_id);
  hipDeviceProp_t device_prop;
  hipGetDeviceProperties(&device_prop, device_id);
  printf("warpSize = %d --- __AMDGCN_WAVEFRONT_SIZE = %d \n", device_prop.warpSize, __AMDGCN_WAVEFRONT_SIZE);
  return 0;
}
```
For Navi card, I got output `warpSize = 32 --- __AMDGCN_WAVEFRONT_SIZE =
64`, but of course expected 32 for both.

Strangely, in an actual application compiled for target gfx1030, all the
kernel metadata says the wavefront size is 32 (even with
`__AMDGCN_WAVEFRONT_SIZE` apparently 64), so when manually overriding
the wavefront size to 32 in AMReX built for target gfx1030, the
simulation results seemed convergent to gfx9XX results. So maybe that
wrong value is not polluting the kernel compilation.

## Checklist

The proposed changes:
- [ ] fix a bug or incorrect behavior in AMReX
- [ ] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX
users
- [ ] include documentation in the code and/or rst files, if appropriate
  • Loading branch information
mrowan137 authored Feb 23, 2024
1 parent f924a1d commit 2ecafcf
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions Src/Base/AMReX_GpuDevice.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -354,6 +354,8 @@ Device::initialize_gpu ()

AMREX_HIP_SAFE_CALL(hipGetDeviceProperties(&device_prop, device_id));

AMREX_ALWAYS_ASSERT_WITH_MESSAGE(warp_size == device_prop.warpSize, "Incorrect warp size");

// check compute capability

// AMD devices do not support shared cache banking.
Expand Down

0 comments on commit 2ecafcf

Please sign in to comment.