[QST] What are the differences between SM90_TMA_LOAD and SM90_TMA_LOAD_MULTICAST? #1315

Closed · Answered by thakkarV
hyhieu asked this question in Q&A

Have you had a chance to read the CUDA documentation for TMA and multicast over at https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk ?

It should answer most of your questions. Copying the relevant part here:

The optional modifier .multicast::cluster allows copying of data from global memory to shared memory of multiple CTAs in the cluster. Operand ctaMask specifies the destination CTAs in the cluster such that each bit position in the 16-bit ctaMask operand corresponds to the %ctaid of the destination CTA. The source data is multicast to the same CTA-relative offset as dstMem in the shared memory of each destination CTA. The mb…

Replies: 2 comments 1 reply

Answer selected by hyhieu