Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rdma: simplify init / delay posting rx buffers #772

Merged
merged 1 commit into from
Jan 28, 2025

Conversation

bwbarrett
Copy link
Contributor

Since NCCL generally uses a different thread for initializing the plugin and creating communicators, the quick hack of leaving a mostly configured endpoint from init just resulted in us leaking resources; that endpoint was never actually going to be used. The whole reason for the refcount hack in init was a bug in EFA on P5en at launch where destroying and rapidly creating a new QP when the old QP had rx buffers attached could cause an error.

Rather than keep the whole endpoint around and leaking resources, just delay the parts of the operation that were causing races until after init, when the first communicator is created. Now, endpoint creation during init() doesn't post buffers and is immediately destroyed, avoiding the whole leak. And because either listen or connect will be called before a process does any communication, this doesn't impact correctness.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@bwbarrett bwbarrett requested a review from a team as a code owner January 28, 2025 00:10
Since NCCL generally uses a different thread for initializing the
plugin and creating communicators, the quick hack of leaving a mostly
configured endpoint from init just resulted in us leaking resources;
that endpoint was never actually going to be used.  The whole reason
for the refcount hack in init was a bug in EFA on P5en at launch where
destroying and rapidly creating a new QP when the old QP had rx buffers
attached could cause an error.

Rather than keep the whole endpoint around and leaking resources,
just delay the parts of the operation that were causing races until
after init, when the first communicator is created.  Now, endpoint
creation during init() doesn't post buffers and is immediately
destroyed, avoiding the whole leak.  And because either listen
or connect will be called before a process does any communication,
this doesn't impact correctness.

Signed-off-by: Brian Barrett <[email protected]>
@bwbarrett bwbarrett force-pushed the cleanup/better-init branch from a7e8436 to 84d19d7 Compare January 28, 2025 00:11
@bwbarrett
Copy link
Contributor Author

@mozarhua I still think we should finish #768 as a safe guard, but this seems like a simpler way to solve the leak.

@aws-nslick
Copy link
Contributor

I find it really hard to approve this with the confidence that this isn't going to break some user somewhere, but given that the record shows I also approved the original PR that introduced this refcnt hack, I have no right to recuse myself.

Copy link
Contributor

@mozarhua mozarhua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mozarhua I still think we should finish #768 as a safe guard, but this seems like a simpler way to solve the leak.

Yes it is. With this change, #768 is not triggered and no libfabric errors observed (besides the discussed Closing ep with unreleased rxe warnings).

@bwbarrett bwbarrett merged commit 57f424b into aws:master Jan 28, 2025
23 checks passed
@bwbarrett bwbarrett deleted the cleanup/better-init branch January 28, 2025 02:58
mozarhua added a commit to mozarhua/aws-ofi-nccl that referenced this pull request Jan 29, 2025
When aborting with no connection established, there were QPs leaked if
any domain and endpoint left open. This patch adds domain and endpoint
cleanup logic at the beginning of rdma and sendrecv device release to
prevent the QP leak.

Note the logic can be triggered without aws#772.

Signed-off-by: Mozar Huang <[email protected]>
mozarhua added a commit to mozarhua/aws-ofi-nccl that referenced this pull request Jan 29, 2025
When aborting with no connection established, there were QPs leaked if
any domain and endpoint left open. This patch adds domain and endpoint
cleanup logic at the beginning of rdma and sendrecv device release to
prevent the QP leak.

Note the logic can be triggered without aws#772.

Signed-off-by: Mozar Huang <[email protected]>
mozarhua added a commit to mozarhua/aws-ofi-nccl that referenced this pull request Jan 30, 2025
When aborting with no connection established, there were QPs leaked if
any domain and endpoint left open. This patch adds domain and endpoint
cleanup logic at the beginning of rdma and sendrecv device release to
prevent the QP leak.

Note the logic can be triggered without aws#772.

Signed-off-by: Mozar Huang <[email protected]>
bwbarrett pushed a commit that referenced this pull request Jan 30, 2025
When aborting with no connection established, there were QPs leaked if
any domain and endpoint left open. This patch adds domain and endpoint
cleanup logic at the beginning of rdma and sendrecv device release to
prevent the QP leak.

Note the logic can be triggered without #772.

Signed-off-by: Mozar Huang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants