Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: unload: Cleanup active domains and endpoints #768

Merged
merged 1 commit into from
Jan 30, 2025

Conversation

mozarhua
Copy link
Contributor

@mozarhua mozarhua commented Jan 21, 2025

When aborting with no connection established, there were QPs leaked if any domain and endpoint left open.

This patch adds domain and endpoint cleanup logic at the beginning of rdma and sendrecv device release (which will be called by nccl_net_ofi_plugin_fini() for each device) to prevent the QP leak.

Note the logic can be triggered without #772.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@mozarhua mozarhua requested a review from a team as a code owner January 21, 2025 21:17
src/nccl_ofi_net.c Outdated Show resolved Hide resolved
src/nccl_ofi_net.c Outdated Show resolved Hide resolved
@mozarhua mozarhua force-pushed the cleanup-domain-ep-at-unload branch from ac07303 to a12f6ff Compare January 21, 2025 23:26
@Zhenye-Na
Copy link

bot:aws:retest

2 similar comments
@Zhenye-Na
Copy link

bot:aws:retest

@Zhenye-Na
Copy link

bot:aws:retest

@mozarhua mozarhua force-pushed the cleanup-domain-ep-at-unload branch from a12f6ff to e7161a0 Compare January 23, 2025 01:01
include/nccl_ofi.h Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Show resolved Hide resolved
src/nccl_ofi_net.c Outdated Show resolved Hide resolved
src/nccl_ofi_net.c Outdated Show resolved Hide resolved
src/nccl_ofi_net.c Outdated Show resolved Hide resolved
@mozarhua mozarhua force-pushed the cleanup-domain-ep-at-unload branch 2 times, most recently from 245d2fe to 88e21bc Compare January 28, 2025 17:32
src/nccl_ofi_net.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
@mozarhua mozarhua force-pushed the cleanup-domain-ep-at-unload branch from 88e21bc to 02b3309 Compare January 29, 2025 19:50
bwbarrett
bwbarrett previously approved these changes Jan 29, 2025
bwbarrett
bwbarrett previously approved these changes Jan 29, 2025
rauteric
rauteric previously approved these changes Jan 30, 2025
include/nccl_ofi.h Show resolved Hide resolved
When aborting with no connection established, there were QPs leaked if
any domain and endpoint left open. This patch adds domain and endpoint
cleanup logic at the beginning of rdma and sendrecv device release to
prevent the QP leak.

Note the logic can be triggered without aws#772.

Signed-off-by: Mozar Huang <[email protected]>
@mozarhua mozarhua dismissed stale reviews from rauteric and bwbarrett via c0c6df9 January 30, 2025 01:34
@mozarhua mozarhua force-pushed the cleanup-domain-ep-at-unload branch from 65aa4d7 to c0c6df9 Compare January 30, 2025 01:34
@bwbarrett bwbarrett merged commit 0ca2148 into aws:master Jan 30, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants