Topic/cuda aware communications #671

Open

wants to merge 12 commits from topic/cuda_aware_communications into base: master
Conversation

@bosilca (Contributor) commented Sep 10, 2024

Add support for sending and receiving data directly from and to devices. There are a few caveats (noted in the commit log).

  1. The first question is: how is such a device selected?

The allocation of such a copy happens well before the scheduler is invoked
for a task, in fact before the task is even ready. Thus, we need to
decide on the location of this copy based only on static
information, such as the task affinity. Therefore, this approach only
works for owner-compute types of tasks, where the task will be executed
on the device that owns the data used for the task affinity (see the
sketch after this list).

  2. Pass the correct data copy across the entire system, instead of
     falling back to the data copy on device 0 (CPU memory).
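
To illustrate the owner-compute selection in item 1, here is a minimal sketch of the idea. The types and the `preferred_device_of` helper are hypothetical simplifications, not the structures or API used by this PR.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the runtime's task and data
 * descriptors; the real PaRSEC structures are far more involved. */
typedef struct data_s {
    int owner_device;            /* device that currently owns this data */
} data_t;

typedef struct task_s {
    const data_t *affinity_data; /* data used to compute the task affinity */
} task_t;

/* Decide where to pre-allocate the incoming data copy. This runs before
 * the task is ready (and before any scheduler decision), so the only
 * usable information is static: the task affinity. That is why the
 * scheme only works for owner-compute tasks. */
static int preferred_device_of(const task_t *task)
{
    if (NULL == task->affinity_data)
        return 0;                /* fall back to device 0 (CPU memory) */
    return task->affinity_data->owner_device;
}
```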

TODOs

  • Rebase on the C11 atomics fix
  • Add a configure option to enable GPU-aware communications (a possible shape is sketched after this list).
  • Add a runtime configuration to turn the GPU-aware comms on/off?
  • Pass the -g 2 tests
  • Failure with ctest get_best_device: scheduling.c:157: int __parsec_execute(parsec_execution_stream_t *, parsec_task_t *): Assertion `NULL != copy->original && NULL != copy->original->device_copies[0]'
  • Failure with ctest nvlink, stress (segfault); details of why (it is because data is created with NEW): Topic/cuda aware communications #671 (comment)
  • Failure with ctest stage (presumably identical to the intermittent failure in gemm/potrf): device_gpu.c:2470: int parsec_device_kernel_epilog(parsec_device_gpu_module_t *, parsec_gpu_task_t *): Assertion `PARSEC_DATA_STATUS_UNDER_TRANSFER == cpu_copy->data_transfer_status' failed.
  • RO data shared between tasks may hit an assert when doing D2D transfers between devices that do not have peer access to each other
  • The readers counts are miscounted when 2 or more GPUs are used per rank: Topic/cuda aware communications #671 (comment)
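
For the configure-time and runtime toggles in the first TODO items, one possible shape is sketched below. The PARSEC_HAVE_GPU_AWARE_COMM guard and the PARSEC_GPU_AWARE_COMM environment variable are illustrative assumptions, not names defined by this PR or by PaRSEC.

```c
#include <stdlib.h>

/* Hypothetical compile-time default, which a configure/CMake option
 * would define when GPU-aware communications are enabled at build time. */
#if defined(PARSEC_HAVE_GPU_AWARE_COMM)
static int gpu_aware_comm_enabled = 1;
#else
static int gpu_aware_comm_enabled = 0;
#endif

/* Hypothetical runtime override, read once at initialization, so the
 * feature can also be turned on or off without rebuilding. */
static void gpu_aware_comm_init(void)
{
    const char *env = getenv("PARSEC_GPU_AWARE_COMM");
    if (NULL != env)
        gpu_aware_comm_enabled = atoi(env);
}
```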

@bosilca bosilca requested a review from a team as a code owner September 10, 2024 04:35
@bosilca bosilca force-pushed the topic/cuda_aware_communications branch from 968bf7e to 6f2e034 on September 10, 2024 04:38
@bosilca bosilca force-pushed the topic/cuda_aware_communications branch 2 times, most recently from b3dfcdc to 0838a95 on September 10, 2024 05:03
@abouteiller abouteiller force-pushed the topic/cuda_aware_communications branch from efa8386 to ab1a74a on October 11, 2024 18:07
@abouteiller (Contributor) commented:
Now passing 1 GPU/node, 8 ranks, PTG POTRF.
Sorry, I had to force-push; there were issues with rebasing on master.

@abouteiller abouteiller force-pushed the topic/cuda_aware_communications branch from cd7c475 to 3bab2d5 on October 16, 2024 20:43
bosilca and others added 10 commits October 30, 2024 09:59
This allows checking whether the data can be sent and received directly
to and from GPU buffers.

Signed-off-by: George Bosilca <[email protected]>
This is a multi-part patch that allows the CPU to prepare a data copy
mapped onto a device.

1. The first question is: how is such a device selected?

The allocation of such a copy happens well before the scheduler is invoked
for a task, in fact before the task is even ready. Thus, we need to
decide on the location of this copy based only on static
information, such as the task affinity. Therefore, this approach only
works for owner-compute types of tasks, where the task will be executed
on the device that owns the data used for the task affinity.

2. Pass the correct data copy across the entire system, instead of
   falling back to the data copy on device 0 (CPU memory).

Add a configure option to enable GPU-aware communications.

Signed-off-by: George Bosilca <[email protected]>
Name the data_t allocated for temporaries, allowing developers to track
them through the execution. Add the keys to all outputs (tasks and
copies).

Signed-off-by: George Bosilca <[email protected]>
copy if we are passed-in a GPU copy, and we need to retain/release the
copies that we are swapping
@abouteiller abouteiller force-pushed the topic/cuda_aware_communications branch from eb5c782 to 3e0cb38 on October 31, 2024 14:53
…ut-only flows, for which checking if they are control flows segfaults
@G-Ragghianti (Contributor) commented:
I think we need to create a CI test that targets gpu_nvidia and issues the job to that runner, correct?

@abouteiller (Contributor) commented:
The failure in stress (and the similar one in nvlink) is due to the code generating a pushback event when transferring the last tile on the GEMM -> DISCARD_C flow (m >= mt+1). This tile has no original->device_copies[0] because it was created directly, without a backing DC, from a NEW in MAKE_C.

d@00000 GPU[1:cuda(0)]: Retrieve data (if any) for GEMM(79, 0, 0)[79, 0, 0]<0> keys = {4f, f000000000000001, 4f} {tp: 2} @parsec_device_kernel_scheduler:2719
d@00000 GPU[1:cuda(0)]: Try to Pop GEMM(79, 0, 0)[79, 0, 0]<0> keys = {4f, f000000000000001, 4f} {tp: 2} @parsec_device_kernel_pop:2264
d@00000 GPU[1:cuda(0)]: read copy 0x7ff06462f970 [ref_count 1] on flow A has readers (1) @parsec_device_kernel_pop:2323
d@00000 GPU[1:cuda(0)]: read copy 0x7ff064002c10 [ref_count 2] on flow C has readers (0) @parsec_device_kernel_pop:2323
d@00000 GPU[1:cuda(0)]: OUT Data copy 0x7ff064002c10 [ref_count 2] for flow C @parsec_device_kernel_pop:2330
Process 2891337 stopped
* thread #11, name = 'stress', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x70)
    frame #0: 0x00007ffff7eaa89b libparsec.so.4`parsec_device_kernel_pop(gpu_device=0x0000555555f7b7b0, gpu_task=0x00007ff06462a8c0, gpu_stream=0x0000555555f7bc68) at device_gpu.c:2341:17
   2338             if( gpu_task->pushout & (1 << i) ) {
   2339                 /* TODO: make sure no readers are working on the CPU version */
   2340                 original = gpu_copy->original;
-> 2341                 PARSEC_DEBUG_VERBOSE(10, parsec_gpu_output_stream,
   2342                                     "GPU[%d:%s]:\tMove D2H data <%s:%x> copy %p [ref_count %d] -- D:%p -> H:%p requested",
   2343                                     gpu_device->super.device_index, gpu_device->super.name, flow->name, original->key, gpu_copy, gpu_copy->super.super.obj_reference_count,
   2344                                      (void*)gpu_copy->device_private, original->device_copies[0]->device_private);

A potential fix is to allocate a dev0copy, as is done for the tiles received from the network; it is not clear why that does not already happen here. A minimal sketch of the idea follows.
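
The sketch below illustrates that fix under simplified assumptions; the structures are hypothetical stand-ins and the helper is not PaRSEC's actual API.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the runtime's data descriptors. */
typedef struct copy_s {
    void *device_private;        /* host- or device-side buffer */
} copy_t;

typedef struct original_s {
    copy_t *device_copies[8];    /* one slot per device, slot 0 == CPU */
} original_t;

/* Before issuing the D2H push-out, make sure the data has a CPU copy
 * (device 0) to land in.  A tile created with NEW directly on the device
 * never received one, which is what the SIGSEGV above trips over. */
static void ensure_dev0_copy(original_t *original, size_t size)
{
    if (NULL == original->device_copies[0]) {
        copy_t *cpu_copy = malloc(sizeof(*cpu_copy));
        cpu_copy->device_private = malloc(size);  /* host backing buffer */
        original->device_copies[0] = cpu_copy;
    }
}
```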
