-
Essentially the question in the title. I understand how to issue a regular (non-multicast) TMA load with CuTe; what I'm unsure about is the multicast version. Specifically: when creating the multicast TMA atom, should the SMEM layout describe a single CTA or the entire cluster, and what is the actual benefit of multicast over each CTA loading its data independently?
Thank you!
-
Have you had a chance to read the CUDA documentation for TMA and multicast over at https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk ? It should answer most of your questions. Copying the relevant part here:

The optional modifier .multicast::cluster allows copying of data from global memory to shared memory of multiple CTAs in the cluster. Operand ctaMask specifies the destination CTAs in the cluster such that each bit position in the 16-bit ctaMask operand corresponds to the %ctaid of the destination CTA. The source data is multicast to the same CTA-relative offset as dstMem in the shared memory of each destination CTA. The mb…

For our CuTe-specific abstractions here, the SMEM layout passed to the TMA creator is that of a single CTA, not that of the entire cluster. As for the benefit of using multicast, it's exactly as it says on the tin: instead of all the CTAs loading the data independently, they cooperatively load smaller chunks and share them with each other. This gets you higher bandwidth from the L2 cache than would otherwise be possible.
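To make the per-CTA point concrete, here is a minimal sketch (not code from this thread) of how a multicast TMA load can be built with CuTe's make_tma_copy using SM90_TMA_LOAD_MULTICAST, and how the 16-bit ctaMask from the PTX quote is typically assembled on the device side. The tensor shape, the 128x64 tile, and the 1x2 cluster shape are illustrative assumptions; barrier setup and the actual copy call are omitted.

```cpp
// Minimal sketch, not code from this thread: tile sizes, tensor shape, and the
// 1x2 cluster shape are illustrative assumptions.
#include <cstdint>
#include <cute/tensor.hpp>
#include <cute/arch/cluster_sm90.hpp>
#include <cute/atom/copy_traits_sm90_tma.hpp>

using namespace cute;

// Host side: build the multicast TMA atom. The SMEM layout describes ONE CTA's
// tile; the multicast degree (how many CTAs share each load) is a separate,
// final argument.
template <class T>
auto make_multicast_tma_load(T const* A_ptr, int M, int K)
{
  // Global tensor, assumed column-major.
  Tensor gA = make_tensor(make_gmem_ptr(A_ptr),
                          make_shape(M, K),
                          make_stride(Int<1>{}, M));

  auto sA_layout = make_layout(make_shape(Int<128>{}, Int<64>{}));  // per-CTA SMEM tile
  auto cta_tile  = make_shape(Int<128>{}, Int<64>{});               // per-CTA tile of gA

  // Multicast across 2 CTAs (e.g. the cluster extent along the GEMM N mode).
  return make_tma_copy(SM90_TMA_LOAD_MULTICAST{},
                       gA, sA_layout, cta_tile, Int<2>{});
}

// Device side: assemble the 16-bit ctaMask from the PTX quote above. Each set
// bit selects one destination CTA rank in the cluster; here a CTA multicasts
// to every CTA in its own row of an assumed 1x2 cluster.
using ClusterShape = Shape<_1, _2, _1>;

__device__ uint16_t make_mcast_mask()
{
  auto block_layout = Layout<ClusterShape>{};          // (m,n,1) -> CTA rank
  dim3 cta = block_id_in_cluster();
  uint16_t mask = 0;
  for (int n = 0; n < size<1>(block_layout); ++n) {
    mask |= uint16_t(1) << block_layout(cta.x, n, Int<0>{});
  }
  // Later passed when issuing the load, e.g. copy(tma.with(barrier, mask), tAgA, tAsA);
  return mask;
}
```

Note how the multicast degree enters only through the last argument of make_tma_copy; the SMEM layout itself stays per-CTA, and the set of CTAs that actually receive each load is chosen at issue time through the mask.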