Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to untagged send/recv and remove SAS ordering requirement in the RDMA protocol #361

Merged
merged 6 commits into from
Mar 30, 2024

Conversation

AmedeoSapio
Copy link

@AmedeoSapio AmedeoSapio commented Mar 14, 2024

Switching to using untagged send/recv operations in the RDMA protocol. Tags were used for connection request and response, while control messages and eager messages used tag 0. The data previously present in the tag has been moved to the message itself.
In addition, the control message is now using send() instead of senddata(), and the data is carried directly in the message.

Switching to untagged ops means that libfabric won't pre-select the message provided to the recv, so all SENDs are received into bounce buffers now.

This PR also removes the Send-After-Send ordering requirement from libfabric.
The only case where out-of-order sends can be a problem is if a receiver sends a connect response followed by a control message, and they arrive in the opposite order on the sender side. To deal with this rare case, we add a list of control messages in the send communicator to store those early control messages, and process them after the connect response is received.

This also fixes a bug in the accept() error path, that caused accept() to return "success" even in case of errors.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@AmedeoSapio AmedeoSapio added the BuildTriggerRequest CI build will be triggered when this label is set label Mar 14, 2024
@AmedeoSapio AmedeoSapio requested a review from a team as a code owner March 14, 2024 01:46
@rauteric
Copy link
Contributor

Thinking about this some more, I don't think the early ctrl special case is needed.

In function handle_ctrl_recv, the case where the ctrl message comes before isend() is really simple: link. We just add the ctrl message to the message buffer and return. Even if the send comm hasn't yet received the connection response, the message buffer will still be initialized (we do this in create_send_comm before we even send the connect message: link).

So I think we should just handle the early ctrl the same way we handle all cases where the receiver starts before sender, and not have this special case.

src/nccl_ofi_rdma.c Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
include/nccl_ofi_rdma.h Outdated Show resolved Hide resolved
@AmedeoSapio AmedeoSapio force-pushed the remove_tagged_ops_and_SAS branch from 12c0be4 to 1e1ffc9 Compare March 17, 2024 23:07
@AmedeoSapio AmedeoSapio requested a review from rauteric March 17, 2024 23:08
rauteric
rauteric previously approved these changes Mar 18, 2024
Copy link
Contributor

@rauteric rauteric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level nit: there are still a few places in rdma.c that refer to "tag" (besides the arguments to send and recv, which refers to the multi-recv tag rather than the Libfabric tag).

src/nccl_ofi_rdma.c Show resolved Hide resolved
src/nccl_ofi_rdma.c Show resolved Hide resolved
src/nccl_ofi_rdma.c Show resolved Hide resolved
src/nccl_ofi_ofiutils.c Outdated Show resolved Hide resolved
include/nccl_ofi_rdma.h Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Show resolved Hide resolved
Removed unused lcomm parameter from prepare_recv_comm, and fixed a typo.

Signed-off-by: Amedeo Sapio <[email protected]>
rajachan
rajachan previously approved these changes Mar 21, 2024
src/nccl_ofi_rdma.c Show resolved Hide resolved
@AmedeoSapio AmedeoSapio added BuildTriggerRequest CI build will be triggered when this label is set and removed BuildTriggerRequest CI build will be triggered when this label is set labels Mar 21, 2024
include/nccl_ofi_rdma.h Outdated Show resolved Hide resolved
@AmedeoSapio AmedeoSapio force-pushed the remove_tagged_ops_and_SAS branch from e3f9fd4 to 15d0adf Compare March 21, 2024 23:45
@AmedeoSapio AmedeoSapio added BuildTriggerRequest CI build will be triggered when this label is set and removed BuildTriggerRequest CI build will be triggered when this label is set labels Mar 22, 2024
@AmedeoSapio
Copy link
Author

@rauteric I tested it and confirmed that you are right, if the control message is received before connect response, the control message is added to the msgbuf and retrieved when the isend is called. So there is really no need to do anything new to deal with out-of-order packets. So I will add the commit that removes SAS here as well.

rauteric
rauteric previously approved these changes Mar 22, 2024
include/nccl_ofi_rdma.h Outdated Show resolved Hide resolved
include/nccl_ofi_rdma.h Show resolved Hide resolved
include/nccl_ofi_rdma.h Outdated Show resolved Hide resolved
include/nccl_ofi_rdma.h Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
Reduced size of communicator ID (both for SENDRECV and RDMA) to 32 bits, since
the communicator ID is transmitted on the wire and 32 bits are more than enough.

Signed-off-by: Amedeo Sapio <[email protected]>
Switching to using untagged send/recv operations. Tags were used for connection
request and response, while control messages and eager messages used tag 0.
The data previously present in the tag has been moved to the message itself.
In addition, the control message is now using send() instead of senddata(), and
the data is carried directly in the message.
Switching to untagged ops means that libfabric won't pre-select the message
provided to the recv, so all SENDs are received into bounce buffers.

Signed-off-by: Amedeo Sapio <[email protected]>
In case of error, accept() tries to close the communicator, and that operation
was overwriting the return value (i.e., the original error code).
This means that in case of error, if the close is successful, accept didn't
return an error code, leading to infinite retries.
This also fixes a typo.

Signed-off-by: Amedeo Sapio <[email protected]>
Added static asserts to check packing of messages sent on the wire.
For SENDRECV protocol the only message is connect.
For RDMA protocol the messages are connect, connect response and control.

Signed-off-by: Amedeo Sapio <[email protected]>
There is no case in which not having Send-After-Send ordering causes issues,
so it is safe to remove it.
The only possible reordering that can happen is if a ctrl message is received
before a connect response message, but we already deal with this case correctly,
since the control message is added to the msgbuf and retrieved when isend is called.

Signed-off-by: Amedeo Sapio <[email protected]>
@AmedeoSapio AmedeoSapio force-pushed the remove_tagged_ops_and_SAS branch from a6ce048 to 1330ae5 Compare March 27, 2024 17:00
@rajachan rajachan added BuildTriggerRequest CI build will be triggered when this label is set and removed BuildTriggerRequest CI build will be triggered when this label is set labels Mar 28, 2024
@bwbarrett bwbarrett added BuildTriggerRequest CI build will be triggered when this label is set and removed BuildTriggerRequest CI build will be triggered when this label is set labels Mar 29, 2024
@AmedeoSapio AmedeoSapio added BuildTriggerRequest CI build will be triggered when this label is set and removed BuildTriggerRequest CI build will be triggered when this label is set labels Mar 29, 2024
@rauteric rauteric added BuildTriggerRequest CI build will be triggered when this label is set and removed BuildTriggerRequest CI build will be triggered when this label is set labels Mar 30, 2024
@bwbarrett bwbarrett merged commit bff5831 into aws:master Mar 30, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BuildTriggerRequest CI build will be triggered when this label is set
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants