sharpneli edited this page Oct 29, 2014 · 7 revisions

Initial considerations

The current algorithm is designed so that 32 threads run in sync. With a wavefront size of 64, up to half of the potential performance can be wasted.

Pinned memory

In OpenCL, pinned memory can be obtained on the AMD platform by creating a device buffer with CL_MEM_ALLOC_HOST_PTR and then mapping it.
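A minimal host-side sketch of that create-then-map pattern, using only standard OpenCL 1.x calls. The helper name `create_pinned_buffer` and its parameters are hypothetical and not taken from the GROMACS source; whether the allocation actually ends up pinned is decided by the runtime.

```c
#include <stddef.h>
#include <string.h>
#include <CL/cl.h>

/* Hypothetical helper (not from GROMACS): creates a buffer with
 * CL_MEM_ALLOC_HOST_PTR, maps it, copies `size` bytes from `src`
 * through the mapping, and unmaps it again.
 * Returns NULL and sets *err on failure. */
cl_mem create_pinned_buffer(cl_context ctx, cl_command_queue queue,
                            const void *src, size_t size, cl_int *err)
{
    /* Ask the runtime to allocate host-accessible memory for the buffer. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, err);
    if (*err != CL_SUCCESS) {
        return NULL;
    }

    /* Blocking map: the returned pointer is valid as soon as the call returns. */
    void *host = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                    0, size, 0, NULL, NULL, err);
    if (*err != CL_SUCCESS) {
        clReleaseMemObject(buf);
        return NULL;
    }

    memcpy(host, src, size);

    /* Unmap before any kernel uses the buffer. */
    *err = clEnqueueUnmapMemObject(queue, buf, host, 0, NULL, NULL);
    if (*err != CL_SUCCESS) {
        clReleaseMemObject(buf);
        return NULL;
    }
    return buf;
}
```

The caller is assumed to own a valid context and command queue and to release the buffer with clReleaseMemObject when done.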

A total improvement of around 7% was measured on the ADH problem using an Iceland-based GPU.

Increasing occupancy

Currently the GROMACS main kernel runs with only 2 wavefronts per SIMD because it uses 86 registers. With four SIMDs per compute unit, that means 8 wavefronts in total per compute unit. Many attempts at reducing register usage were made, all in vain. The AMD compiler does not provide any way of limiting register usage and insists on keeping intermediate results in registers, wasting register space on them. Compiling with -cl-opt-disable is of no help because the compiler then emits what is likely pure SSA form and uses almost 500 registers (spilling everything to global memory).

The only approach that has managed to reduce register usage is making some variables non-compile-time constants. This is essentially a random process: making some loops non-constant can increase register usage by 50%, while with others it can lower it enough to run 12 wavefronts per compute unit, for up to 15% actual speedup on Iceland-based hardware.
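The wavefront counts above follow from simple register arithmetic. The sketch below reproduces them under assumed GCN limits that are not stated on this page: 256 vector registers available per SIMD lane, 4 SIMDs per compute unit, and at most 10 wavefronts per SIMD. The function names are illustrative only.

```c
/* Assumed GCN limits (not stated on this page). */
#define VGPRS_PER_LANE      256
#define SIMDS_PER_CU        4
#define MAX_WAVES_PER_SIMD  10

/* Wavefronts that fit on one SIMD for a given per-thread register count. */
int waves_per_simd(int vgprs_per_thread)
{
    if (vgprs_per_thread <= 0) {
        return MAX_WAVES_PER_SIMD;
    }
    int waves = VGPRS_PER_LANE / vgprs_per_thread;  /* integer division rounds down */
    return waves < MAX_WAVES_PER_SIMD ? waves : MAX_WAVES_PER_SIMD;
}

/* Wavefronts that fit on one compute unit. */
int waves_per_cu(int vgprs_per_thread)
{
    return waves_per_simd(vgprs_per_thread) * SIMDS_PER_CU;
}
```

Under these assumptions, 86 registers give 2 wavefronts per SIMD and 8 per compute unit, while 85 or fewer give 3 per SIMD and 12 per compute unit, so the kernel sits just above a threshold. The model ignores register allocation granularity, so the real threshold may be slightly lower.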

Wavefront divergence

The current GPU code assumes 32-wide warps, and the workgroup is 64 threads. The kernel is optimized so that each group of 32 threads can independently load and process data. With a 64-wide wavefront on GCN, up to 50% of performance is wasted in the worst case, when only one of the two 32-thread groups has work. In the best case there is no penalty, as both 32-wide warps would have been active anyway.
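A rough way to see the 50% figure is to model one 64-wide wavefront as two 32-thread halves that must step through a loop together. The function below is a simplified illustration, not a measurement; `lane_utilization` and its parameters are hypothetical names.

```c
/* Fraction of lane-iterations doing useful work when two 32-thread halves
 * share one 64-wide wavefront. Each half needs its own number of loop
 * iterations, but the wavefront must run for the longer of the two. */
double lane_utilization(int iters_low, int iters_high)
{
    int longest = iters_low > iters_high ? iters_low : iters_high;
    if (longest == 0) {
        return 1.0;  /* nothing executes, so nothing is wasted */
    }
    /* useful half-iterations / issued half-iterations */
    return (double)(iters_low + iters_high) / (2.0 * longest);
}
```

With one idle half, for example `lane_utilization(10, 0)`, the result is 0.5, which is the worst case described above. With equal work in both halves it is 1.0.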
