-
Essentially the question in the title. I understand how to issue a regular (non-multicast) TMA load with CuTe; what I'm unsure about is the multicast version. Specifically: when creating the multicast TMA atom, should the SMEM layout describe a single CTA or the entire cluster, and what is the actual benefit of multicast over each CTA loading its data independently?
Thank you!
-
Have you had a chance to read the CUDA documentation for TMA and multicast over at https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk ? It should answer most of your questions. Copying the relevant part here:

The optional modifier .multicast::cluster allows copying of data from global memory to shared memory of multiple CTAs in the cluster. Operand ctaMask specifies the destination CTAs in the cluster such that each bit position in the 16-bit ctaMask operand corresponds to the %ctaid of the destination CTA. The source data is multicast to the same CTA-relative offset as dstMem in the shared memory of each destination CTA. The mb…

For our CuTe-specific abstractions here, the SMEM layout passed to the TMA creator is that of a single CTA, not that of the entire cluster. As for the benefit of using multicast, it's exactly as it says on the tin: instead of all the CTAs loading the data independently, they cooperatively load smaller chunks and share them with each other. This gets you higher bandwidth from the L2 cache than would otherwise be possible.
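To make the per-CTA point concrete, here is a minimal sketch (not code from this thread) of how a multicast TMA load can be built with CuTe's make_tma_copy using SM90_TMA_LOAD_MULTICAST, and how the 16-bit ctaMask from the PTX quote is typically assembled on the device side. The tensor shape, the 128x64 tile, and the 1x2 cluster shape are illustrative assumptions; barrier setup and the actual copy call are omitted.

```cpp
// Minimal sketch, not code from this thread: tile sizes, tensor shape, and the
// 1x2 cluster shape are illustrative assumptions.
#include <cstdint>
#include <cute/tensor.hpp>
#include <cute/arch/cluster_sm90.hpp>
#include <cute/atom/copy_traits_sm90_tma.hpp>

using namespace cute;

// Host side: build the multicast TMA atom. The SMEM layout describes ONE CTA's
// tile; the multicast degree (how many CTAs share each load) is a separate,
// final argument.
template <class T>
auto make_multicast_tma_load(T const* A_ptr, int M, int K)
{
  // Global tensor, assumed column-major.
  Tensor gA = make_tensor(make_gmem_ptr(A_ptr),
                          make_shape(M, K),
                          make_stride(Int<1>{}, M));

  auto sA_layout = make_layout(make_shape(Int<128>{}, Int<64>{}));  // per-CTA SMEM tile
  auto cta_tile  = make_shape(Int<128>{}, Int<64>{});               // per-CTA tile of gA

  // Multicast across 2 CTAs (e.g. the cluster extent along the GEMM N mode).
  return make_tma_copy(SM90_TMA_LOAD_MULTICAST{},
                       gA, sA_layout, cta_tile, Int<2>{});
}

// Device side: assemble the 16-bit ctaMask from the PTX quote above. Each set
// bit selects one destination CTA rank in the cluster; here a CTA multicasts
// to every CTA in its own row of an assumed 1x2 cluster.
using ClusterShape = Shape<_1, _2, _1>;

__device__ uint16_t make_mcast_mask()
{
  auto block_layout = Layout<ClusterShape>{};          // (m,n,1) -> CTA rank
  dim3 cta = block_id_in_cluster();
  uint16_t mask = 0;
  for (int n = 0; n < size<1>(block_layout); ++n) {
    mask |= uint16_t(1) << block_layout(cta.x, n, Int<0>{});
  }
  // Later passed when issuing the load, e.g. copy(tma.with(barrier, mask), tAgA, tAsA);
  return mask;
}
```

Note how the multicast degree enters only through the last argument of make_tma_copy; the SMEM layout itself stays per-CTA, and the set of CTAs that actually receive each load is chosen at issue time through the mask.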