Topic/cuda aware communications #671

Open

wants to merge 12 commits from topic/cuda_aware_communications into base: master
Conversation

@bosilca (Contributor) commented Sep 10, 2024

Add support for sending and receiving data directly from and to devices. There are a few caveats (noted in the commit log).

  1. The first question is: how is such a device selected?

The allocation of such a copy happens well before the scheduler is invoked
for a task, in fact before the task is even ready. Thus, we need to
decide on the location of this copy based only on static
information, such as the task affinity. Therefore, this approach only
works for owner-compute types of tasks, where the task will be executed
on the device that owns the data used for the task affinity (see the
sketch after this list).

  2. Pass the correct data copy across the entire system, instead of
     falling back to the data copy on device 0 (CPU memory).
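
To illustrate the owner-compute selection in item 1, here is a minimal sketch of the idea. The types and the `preferred_device_of` helper are hypothetical simplifications, not the structures or API used by this PR.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the runtime's task and data
 * descriptors; the real PaRSEC structures are far more involved. */
typedef struct data_s {
    int owner_device;            /* device that currently owns this data */
} data_t;

typedef struct task_s {
    const data_t *affinity_data; /* data used to compute the task affinity */
} task_t;

/* Decide where to pre-allocate the incoming data copy. This runs before
 * the task is ready (and before any scheduler decision), so the only
 * usable information is static: the task affinity. That is why the
 * scheme only works for owner-compute tasks. */
static int preferred_device_of(const task_t *task)
{
    if (NULL == task->affinity_data)
        return 0;                /* fall back to device 0 (CPU memory) */
    return task->affinity_data->owner_device;
}
```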

TODOs

  • Rebase on the C11 atomics fix
  • Add a configure option to enable GPU-aware communications (a possible shape is sketched after this list).
  • Add a runtime configuration to turn the GPU-aware comms on/off?
  • Pass the -g 2 tests
  • Failure with ctest get_best_device: scheduling.c:157: int __parsec_execute(parsec_execution_stream_t *, parsec_task_t *): Assertion `NULL != copy->original && NULL != copy->original->device_copies[0]'
  • Failure with ctest nvlink, stress (segfault); details of why (it is because data is created with NEW): Topic/cuda aware communications #671 (comment)
  • Failure with ctest stage (presumably identical to the intermittent failure in gemm/potrf): device_gpu.c:2470: int parsec_device_kernel_epilog(parsec_device_gpu_module_t *, parsec_gpu_task_t *): Assertion `PARSEC_DATA_STATUS_UNDER_TRANSFER == cpu_copy->data_transfer_status' failed.
  • RO data shared between tasks may hit an assert when doing D2D transfers between devices that do not have peer access to each other
  • The readers counts are miscounted when 2 or more GPUs are used per rank: Topic/cuda aware communications #671 (comment)
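
For the configure-time and runtime toggles in the first TODO items, one possible shape is sketched below. The PARSEC_HAVE_GPU_AWARE_COMM guard and the PARSEC_GPU_AWARE_COMM environment variable are illustrative assumptions, not names defined by this PR or by PaRSEC.

```c
#include <stdlib.h>

/* Hypothetical compile-time default, which a configure/CMake option
 * would define when GPU-aware communications are enabled at build time. */
#if defined(PARSEC_HAVE_GPU_AWARE_COMM)
static int gpu_aware_comm_enabled = 1;
#else
static int gpu_aware_comm_enabled = 0;
#endif

/* Hypothetical runtime override, read once at initialization, so the
 * feature can also be turned on or off without rebuilding. */
static void gpu_aware_comm_init(void)
{
    const char *env = getenv("PARSEC_GPU_AWARE_COMM");
    if (NULL != env)
        gpu_aware_comm_enabled = atoi(env);
}
```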

@bosilca bosilca requested a review from a team as a code owner September 10, 2024 04:35
@bosilca bosilca force-pushed the topic/cuda_aware_communications branch from 968bf7e to 6f2e034 on September 10, 2024 04:38
@bosilca bosilca force-pushed the topic/cuda_aware_communications branch 2 times, most recently from b3dfcdc to 0838a95 on September 10, 2024 05:03
@abouteiller abouteiller force-pushed the topic/cuda_aware_communications branch from efa8386 to ab1a74a on October 11, 2024 18:07
@abouteiller (Contributor) commented:
Now passing 1 GPU/node, 8 ranks, PTG POTRF.
Sorry, I had to force-push; there were issues with rebasing on master.

@abouteiller abouteiller force-pushed the topic/cuda_aware_communications branch from cd7c475 to 3bab2d5 on October 16, 2024 20:43
bosilca and others added 10 commits October 30, 2024 09:59
This allows checking whether the data can be sent and received directly
to and from GPU buffers.

Signed-off-by: George Bosilca <[email protected]>
This is a multi-part patch that allows the CPU to prepare a data copy
mapped onto a device.

1. The first question is: how is such a device selected?

The allocation of such a copy happens well before the scheduler is invoked
for a task, in fact before the task is even ready. Thus, we need to
decide on the location of this copy based only on static
information, such as the task affinity. Therefore, this approach only
works for owner-compute types of tasks, where the task will be executed
on the device that owns the data used for the task affinity.

2. Pass the correct data copy across the entire system, instead of
   falling back to the data copy on device 0 (CPU memory).

Add a configure option to enable GPU-aware communications.

Signed-off-by: George Bosilca <[email protected]>
Name the data_t allocated for temporaries, allowing developers to track
them through the execution. Add the keys to all outputs (tasks and
copies).

Signed-off-by: George Bosilca <[email protected]>
copy if we are passed-in a GPU copy, and we need to retain/release the
copies that we are swapping
@abouteiller abouteiller force-pushed the topic/cuda_aware_communications branch from eb5c782 to 3e0cb38 on October 31, 2024 14:53
…ut-only flows, for which checking if they are control flows segfaults
@G-Ragghianti (Contributor) commented:
I think we need to create a CI test that targets gpu_nvidia and issues the job to that runner, correct?

@abouteiller (Contributor) commented:
The failure in stress (and the similar one in nvlink) is due to the code generating a pushback event when transferring the last tile on the GEMM -> DISCARD_C flow (m >= mt+1). This tile has no original->device_copies[0] because it was created directly, without a backing DC, from a NEW in MAKE_C.

d@00000 GPU[1:cuda(0)]: Retrieve data (if any) for GEMM(79, 0, 0)[79, 0, 0]<0> keys = {4f, f000000000000001, 4f} {tp: 2} @parsec_device_kernel_scheduler:2719
d@00000 GPU[1:cuda(0)]: Try to Pop GEMM(79, 0, 0)[79, 0, 0]<0> keys = {4f, f000000000000001, 4f} {tp: 2} @parsec_device_kernel_pop:2264
d@00000 GPU[1:cuda(0)]: read copy 0x7ff06462f970 [ref_count 1] on flow A has readers (1) @parsec_device_kernel_pop:2323
d@00000 GPU[1:cuda(0)]: read copy 0x7ff064002c10 [ref_count 2] on flow C has readers (0) @parsec_device_kernel_pop:2323
d@00000 GPU[1:cuda(0)]: OUT Data copy 0x7ff064002c10 [ref_count 2] for flow C @parsec_device_kernel_pop:2330
Process 2891337 stopped
* thread #11, name = 'stress', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x70)
    frame #0: 0x00007ffff7eaa89b libparsec.so.4`parsec_device_kernel_pop(gpu_device=0x0000555555f7b7b0, gpu_task=0x00007ff06462a8c0, gpu_stream=0x0000555555f7bc68) at device_gpu.c:2341:17
   2338             if( gpu_task->pushout & (1 << i) ) {
   2339                 /* TODO: make sure no readers are working on the CPU version */
   2340                 original = gpu_copy->original;
-> 2341                 PARSEC_DEBUG_VERBOSE(10, parsec_gpu_output_stream,
   2342                                     "GPU[%d:%s]:\tMove D2H data <%s:%x> copy %p [ref_count %d] -- D:%p -> H:%p requested",
   2343                                     gpu_device->super.device_index, gpu_device->super.name, flow->name, original->key, gpu_copy, gpu_copy->super.super.obj_reference_count,
   2344                                      (void*)gpu_copy->device_private, original->device_copies[0]->device_private);

A potential fix is to allocate a dev0copy, as is done for the tiles received from the network; it is not clear why that does not already happen here. A minimal sketch of the idea follows.
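
The sketch below illustrates that fix under simplified assumptions; the structures are hypothetical stand-ins and the helper is not PaRSEC's actual API.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the runtime's data descriptors. */
typedef struct copy_s {
    void *device_private;        /* host- or device-side buffer */
} copy_t;

typedef struct original_s {
    copy_t *device_copies[8];    /* one slot per device, slot 0 == CPU */
} original_t;

/* Before issuing the D2H push-out, make sure the data has a CPU copy
 * (device 0) to land in.  A tile created with NEW directly on the device
 * never received one, which is what the SIGSEGV above trips over. */
static void ensure_dev0_copy(original_t *original, size_t size)
{
    if (NULL == original->device_copies[0]) {
        copy_t *cpu_copy = malloc(sizeof(*cpu_copy));
        cpu_copy->device_private = malloc(size);  /* host backing buffer */
        original->device_copies[0] = cpu_copy;
    }
}
```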
