Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup usage of NCCL error codes #357

Merged
merged 2 commits into from
Feb 29, 2024
Merged

Conversation

bwbarrett
Copy link
Contributor

The majority of this PR is removing the use of NCCL error codes in the core of the library, since we only use Libfabric error codes / -errno error codes to simplify reasoning about what error code is what. This is a follow-on to last year's commit that made the change, since it looks like we missed a couple hundred places.

While cleaning that up, I noticed a place where we double initialized a mutex. While harmless, it's also wrong, so clean that up as well.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

topo_file_lock is static initialized, so there is no need to call
pthread_mutex_init() during the rdma_init() call.

Signed-off-by: Brian Barrett <[email protected]>
@bwbarrett bwbarrett requested a review from a team as a code owner February 27, 2024 22:10
src/nccl_ofi_sendrecv.c Show resolved Hide resolved
src/nccl_ofi_topo.c Outdated Show resolved Hide resolved
src/nccl_ofi_topo.c Outdated Show resolved Hide resolved
src/nccl_ofi_topo.c Outdated Show resolved Hide resolved
src/nccl_ofi_topo.c Outdated Show resolved Hide resolved
src/nccl_ofi_topo.c Outdated Show resolved Hide resolved
src/nccl_ofi_topo.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
Last year, we chose to simplify error code management and only use
Libfabric error codes or -errno codes (which should be equivalent)
in the core of the code, translating at the API layer.  This patch
finishes the previously incomplete patches to make that transition.

Signed-off-by: Brian Barrett <[email protected]>
@bwbarrett bwbarrett added BuildTriggerRequest CI build will be triggered when this label is set and removed BuildTriggerRequest CI build will be triggered when this label is set labels Feb 28, 2024
@AvivBenchorin AvivBenchorin added the BuildTriggerRequest CI build will be triggered when this label is set label Feb 29, 2024
@rajachan rajachan merged commit 3d30cc2 into aws:master Feb 29, 2024
13 checks passed
@bwbarrett bwbarrett deleted the cleanup/errorcodes branch April 12, 2024 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BuildTriggerRequest CI build will be triggered when this label is set
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants