-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rdma: defer connect completion after sending connect message #353
Conversation
4889515
to
6faa202
Compare
6faa202
to
e87194a
Compare
0b7b8a0
to
2b5729b
Compare
In the current implementation of connect/accept, it is possible for `accept` to complete (i.e., return a non-NULL communicator) after the corresponding `connect` returned a NULL communicator (while waiting for a completion for the connection message). This is a strange semantic, and evidently causes NCCL to be unhappy, particularly in the multi-recv case (which is being added in a future commit). So, after sending the connect message, defer waiting for completion; block when closing the send comm if necessary. Signed-off-by: Eric Raut <[email protected]>
2b5729b
to
ae106d2
Compare
I am seeing very occasional hangs in A2A during connection establishment in the multi-recv branch (not in this PR's branch, which does not have multi-recv). Yesterday it only happened a couple times, this morning it's been every 3-4 runs. So I'm not convinced this PR completely solves the multi-recv connection establishment issues. The lack of easy reproducibility makes it hard to debug. |
I happened to notice that for some reason, with LTTNG tracing enabled, the aforementioned hang happens every time, even in this PR's branch that doesn't have multi-recv. Which makes the problem easier to debug. If I make connect blocking (block until we get a completion for the send connect message, as PR #348 originally did), the hang does not occur. Working theory (needs verification): by returning a communicator to NCCL before the send connect message is sent, we no longer have a way to process the sender side's completion queue to ensure the message gets sent, leading to occasional deadlock. But this needs more investigation, and can't be merged in its current state, so marking as Draft. |
The approach in this PR won't work (see explanation in previous comment), so closing. We'll need a different solution. |
In the current implementation of connect/accept, it is possible for
accept
to complete (i.e., return a non-NULL communicator) after the correspondingconnect
returned a NULL communicator (while waiting for a completion for the connection message). This is a strange semantic, and evidently causes NCCL to be unhappy, particularly in the multi-recv case (which is being added in a future commit).So, after sending the connect message, defer waiting for completion; block when closing the send comm if necessary.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.