[FEA] Improvements of NCCL clique helper functions #2459

Open
viclafargue opened this issue Sep 27, 2024 · 0 comments
Labels
feature request New feature or request

Is your feature request related to a problem? Please describe.
The unification of the API layers and the removal of the mg namespace, as described in rapidsai/cuvs#357, require some changes on the RAFT side. Namely, the NCCL clique should become a core type, and its presence in the resource handle should indicate which algorithm implementation to run. The PR resolving this issue on the RAFT side should:

  • Set the NCCL clique as a core type
  • Separate NCCL clique initialization from its access and improve the initialization process
  • Leave a separate access function to be used internally by the cuVS library

Describe the solution you'd like

  • The nccl_clique.hpp file should be placed in the raft/core directory and the nccl_clique struct should be placed in the raft::core namespace.
  • A raft::resource::initialize_nccl_clique() function to initialize an NCCL clique and add it to a resource handle. This function would be called before calling an algorithm implementation. The presence of the NCCL clique resource on the resource handle would signal that the algorithms should run in multi-GPU mode. The function could also allow configuring which GPUs to include during clique initialization and the percentage of device memory to pre-allocate as a memory pool on each.
  • A raft::resource::get_nccl_clique() function to access the NCCL clique from inside the implementations.
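
To make the proposed split between initialization and access concrete, here is a minimal, self-contained sketch of the pattern. It does not use RAFT or NCCL: the `sketch` namespace, the `resources` class, the `build_index` function, the parameter names, and the default of 50 percent are all hypothetical stand-ins chosen for illustration, and only the two function names mirror the ones proposed above.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Stand-in for the proposed raft::core::nccl_clique; holds no real NCCL state.
struct nccl_clique {
  std::vector<int> device_ids;  // GPUs taking part in the clique
  int pool_percent;             // share of device memory to pre-allocate per GPU
};

// Stand-in for a resource handle; it only models the optional clique slot.
class resources {
 public:
  bool has_nccl_clique() const { return clique_ != nullptr; }
  void set_nccl_clique(std::shared_ptr<nccl_clique> c) { clique_ = std::move(c); }
  const nccl_clique& clique() const { return *clique_; }

 private:
  std::shared_ptr<nccl_clique> clique_;
};

// Explicit initialization: the caller opts in to multi-GPU mode.
// The default pool size is an assumption made for this sketch.
inline void initialize_nccl_clique(resources& res,
                                   std::vector<int> device_ids,
                                   int pool_percent = 50)
{
  if (device_ids.empty()) { throw std::invalid_argument("no devices given"); }
  if (pool_percent < 0 || pool_percent > 100) {
    throw std::invalid_argument("pool_percent must be in [0, 100]");
  }
  res.set_nccl_clique(
    std::make_shared<nccl_clique>(nccl_clique{std::move(device_ids), pool_percent}));
}

// Access only: never initializes, so a missing clique is reported as an error.
inline const nccl_clique& get_nccl_clique(const resources& res)
{
  if (!res.has_nccl_clique()) { throw std::logic_error("NCCL clique not initialized"); }
  return res.clique();
}

// Hypothetical algorithm entry point: the presence of the clique on the
// handle selects the multi-GPU path, otherwise the single-GPU path runs.
inline std::string build_index(const resources& res)
{
  if (res.has_nccl_clique()) {
    return "multi-gpu:" + std::to_string(get_nccl_clique(res).device_ids.size());
  }
  return "single-gpu";
}

}  // namespace sketch
```

With this shape, a caller that never calls `initialize_nccl_clique` gets the single-GPU path, while a caller that runs `sketch::initialize_nccl_clique(res, {0, 1}, 80)` first gets the multi-GPU path from the same `build_index(res)` call.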
@viclafargue viclafargue added the feature request New feature or request label Sep 27, 2024
@viclafargue viclafargue changed the title from "[FEA] Improve NCCL clique initialization and" to "[FEA] Improvements of NCCL clique helper functions" Sep 27, 2024
rapids-bot bot pushed a commit that referenced this issue Jan 16, 2025
Introduces the `raft::device_resources_snmg` type to hold all resources required for the NCCL clique.

~Answers #2459
Removed call to `raft::comms::build_comms_nccl_only` (#2465)

Authors:
  - Victor Lafargue (https://github.com/viclafargue)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2487