Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support NCCL plugin v9 interface #767

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

yexiang-aws
Copy link

Description of changes:

This PR is to support the v9 interface of the NCCL plugin interface:

  1. imports new headers from NCCL v2.24.3 which contains the definition of the new net-plugin v9 API . The primary changes for NCCL are:
    1. NIC fusion / vitual NIC support
    2. API supports indication when completion is not needed ( e.g. LL128)
    3. DeviceProps supports max transfer size for p2p/coll operations
  2. adds plugin support for the new net-plugin v9 API. This involves updating the unit-tests to use the v9 API, exposing a ncclNetPlugin_v9 symbol and implementing the getProperties() API for v9.

Test

make passed
NCCL test all_reduce_perf completed without error

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

tests/functional/ring.c Outdated Show resolved Hide resolved
@aws-nslick
Copy link
Contributor

aws-nslick commented Jan 18, 2025

actions are not running on this commit automatically -- this is on your manager. You need to do the aws github training and then you need to be in https://github.com/orgs/aws/teams/aws-ofi

@yexiang-aws yexiang-aws force-pushed the feature/adopt-v9-ext-net-api/NCCLOFI-1066 branch 5 times, most recently from 72736a0 to a6d3909 Compare January 22, 2025 17:03
@yexiang-aws yexiang-aws marked this pull request as ready for review January 22, 2025 17:42
@yexiang-aws yexiang-aws requested review from bwbarrett and a team as code owners January 22, 2025 17:42
Copy link
Member

@rajachan rajachan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Congratulations on your first PR!

Comment on lines 59 to 60
props->maxP2pBytes = NCCL_MAX_NET_SIZE_BYTES;
props->maxCollBytes = NCCL_MAX_NET_SIZE_BYTES;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we'd talked about not changing behavior compared to v8 with the first set of changes... but this property is straightforward enough that we might as well set it to the correct max size now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have to set it to the correct max size right now. This code would open us to truncation errors in both send and recv. NCCL_MAX_NET_SIZE_BYTES is 1 TiB, but we really need to set these to2^31-1` until we change the interface to have sizes be size_t instead of int, which also will have implications in the rdma protocol, as I believe there are some places that we're assuming we only need 31 bits for the length today in control headers.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack, will update both to using INT32_MAX , and leave the NCCL_MAX_NET_SIZE_BYTES definition in net.h to be in-sync with NCCL

Copy link
Author

@yexiang-aws yexiang-aws Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a retro: no unit testing enforcement in the dev process will make catching issues rely on PEs and a best effort. Having mentioned earlier today that existing UT coverage is not a good story yet. Can we at least advocate that new code requires UT coverage?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, should have caught this earlier. It should be INT_MAX, not INT32_MAX, as the previous interface used an unsized int for the variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does a unit test for this specific comment thread look like?

The right answer for us is whatever gets filled in here, which is ultimately defined in card firmware and derived from the hardware engines.

Sometimes the right answer is just the right answer and the only way to know it is to know it.

Copy link
Contributor

@aws-nslick aws-nslick Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, what we we doing here? this should be present in the info, why are we filling constants at all?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I made a similar comment to Nick where we define this constant. This shouldn't be a constant we're putting int he properties. It needs to be a property in the plugin prop struct which is filled in by the two protocols. A constant is pointless here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, I didn't fully think through the limits implied from protocol and libfabric. updated in the next revision using ep_attr->max_msg_size to get the limit from libfabric. let me know how that looks

.gitignore Outdated
@@ -78,3 +78,6 @@ m4/lt~obsolete.m4
.idea/
.devenv/
.direnv

.vscode
install/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's this directory?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: also, this should probably be in its own commit.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vscode IDE configuration directory
install/ is the build artifact folder after make install

adding to a separate commit in next revision

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #773 for a conversation among the team yesterday about editor droppings and .gitignore. I mostly agree with Nick on this one, so would suggest we just drop this commit.

3rd-party/Makefile.am Show resolved Hide resolved
src/nccl_ofi_api.c Outdated Show resolved Hide resolved
Comment on lines 59 to 60
props->maxP2pBytes = NCCL_MAX_NET_SIZE_BYTES;
props->maxCollBytes = NCCL_MAX_NET_SIZE_BYTES;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have to set it to the correct max size right now. This code would open us to truncation errors in both send and recv. NCCL_MAX_NET_SIZE_BYTES is 1 TiB, but we really need to set these to2^31-1` until we change the interface to have sizes be size_t instead of int, which also will have implications in the rdma protocol, as I believe there are some places that we're assuming we only need 31 bits for the length today in control headers.

tests/functional/nccl_message_transfer.c Outdated Show resolved Hide resolved
tests/functional/ring.c Show resolved Hide resolved
@Zhenye-Na
Copy link

bot:aws:retest

@yexiang-aws yexiang-aws force-pushed the feature/adopt-v9-ext-net-api/NCCLOFI-1066 branch 3 times, most recently from be5956b to 1b63f2c Compare January 23, 2025 16:27
include/nccl_ofi.h Outdated Show resolved Hide resolved
src/nccl_ofi_api.c Outdated Show resolved Hide resolved
include/nccl_ofi.h Outdated Show resolved Hide resolved
src/nccl_ofi_api.c Outdated Show resolved Hide resolved
src/nccl_ofi_api.c Outdated Show resolved Hide resolved
src/nccl_ofi_api.c Outdated Show resolved Hide resolved
include/nccl_ofi.h Outdated Show resolved Hide resolved
tests/functional/nccl_message_transfer.c Outdated Show resolved Hide resolved
tests/functional/ring.c Outdated Show resolved Hide resolved
tests/functional/ring.c Outdated Show resolved Hide resolved
@yexiang-aws yexiang-aws force-pushed the feature/adopt-v9-ext-net-api/NCCLOFI-1066 branch 2 times, most recently from a45c02d to 6aba1e2 Compare January 23, 2025 22:13
include/nccl_ofi.h Outdated Show resolved Hide resolved
include/nccl_ofi.h Outdated Show resolved Hide resolved
include/nccl_ofi.h Outdated Show resolved Hide resolved
src/nccl_ofi_api.c Outdated Show resolved Hide resolved
src/nccl_ofi_api.c Show resolved Hide resolved
src/nccl_ofi_api.c Show resolved Hide resolved
src/nccl_ofi_api.c Outdated Show resolved Hide resolved
Comment on lines 59 to 60
props->maxP2pBytes = NCCL_MAX_NET_SIZE_BYTES;
props->maxCollBytes = NCCL_MAX_NET_SIZE_BYTES;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I made a similar comment to Nick where we define this constant. This shouldn't be a constant we're putting int he properties. It needs to be a property in the plugin prop struct which is filled in by the two protocols. A constant is pointless here.

@yexiang-aws yexiang-aws force-pushed the feature/adopt-v9-ext-net-api/NCCLOFI-1066 branch 2 times, most recently from f0e839d to 2bc2990 Compare January 25, 2025 02:06
src/nccl_ofi_api.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
include/nccl_ofi.h Outdated Show resolved Hide resolved
@@ -424,6 +424,9 @@ static int set_nic_props_default(int dev_id, struct fi_info *nic_prov,

props->dmabuf_support = false;

props->max_p2p_bytes = nic_prov->ep_attr->max_msg_size;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure this indenting is right? GitHub doesn't think it matches the code around it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't think this code does what you want. Set the max size to 0 here, and make the transports fill it in. With the EFA provider in zero-copy mode, nic_prov->ep_attr->max_msg_size is going to be 8192, which is the max send size. We'll have to do some work to get the max rdma size (which is one of the inputs to max send size). Let's talk about this internally with Shi. This is right for the sendrecv protocol, however.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't pay attention that here indentation type is tab instead of whitespaces. Will fix it.

I did sync with @shijin-aws on this and confirmed that even plugin sets msg with fi_setopt that enables zero-copy mode, ep_attr->max_msg_size is still max of all supported operations, instead of 8192. Let me start a 3-way sync on Slack

Ye Xiang
  16:50
sorry, one more question,  if plugin sets msg with fi_setopt that enables zero-copy mode,  ep_attr->max_msg_size is still max of all supported operations, is it?

Shi Jin
  [16:56]
yes
ep_attr is not going to be changed

src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
@yexiang-aws yexiang-aws marked this pull request as draft January 27, 2025 22:12
@yexiang-aws yexiang-aws force-pushed the feature/adopt-v9-ext-net-api/NCCLOFI-1066 branch 2 times, most recently from b89abe2 to 39818b2 Compare January 27, 2025 22:56
@yexiang-aws yexiang-aws marked this pull request as ready for review January 27, 2025 22:58
Copy link
Contributor

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed one little (but important) issue in the last code review. Sorry!

Other than the abort check stuff, this looks good to me.

.gitignore Outdated
@@ -78,3 +78,6 @@ m4/lt~obsolete.m4
.idea/
.devenv/
.direnv

.vscode
install/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #773 for a conversation among the team yesterday about editor droppings and .gitignore. I mostly agree with Nick on this one, so would suggest we just drop this commit.

{
ncclResult_t validation_result = msg_length_verify_max_size(&size, 1);
if (validation_result != ncclSuccess) {
return validation_result;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't catch this earlier, but this should be return check_return(validation_result) so that the abort check happens correctly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

sizes == NULL || tags == NULL ||
mhandles == NULL || request == NULL)) {
NCCL_OFI_WARN("Invalid argument: NULL pointer detected");
return ncclInvalidArgument;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above about using return_check().

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

ncclResult_t nccl_net_ofi_irecv_v9(void* recvComm, int n, void** data,
size_t* sizes, int* tags, void** mhandles, void** request)
{
int sizesInt[NCCL_OFI_MAX_RECVS];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you initialize it please? There's no need to memset or anything crazy like that, just = {};

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

*request = NULL;
}

ncclResult_t validation_result = msg_length_verify_max_size(sizes, n);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: const

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, can you elaborate it ? Do you mean explicitly cast to const when passing arguments to a function that expects const parameters? I think that's not mandatory, Implicit conversion is ok

aws-nslick
aws-nslick previously approved these changes Jan 28, 2025
Copy link
Contributor

@aws-nslick aws-nslick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm besides nits

Comment on lines 120 to 132
static inline ncclResult_t msg_length_verify_max_size(const size_t *sizes, const size_t len) {
if (OFI_UNLIKELY(sizes == NULL)) {
NCCL_OFI_WARN("Invalid argument: NULL pointer provided for sizes array");
return ncclInvalidArgument;
}

for (size_t i = 0; i < len; i++) {
if (OFI_UNLIKELY(sizes[i] > INT_MAX)) {
NCCL_OFI_WARN("Message size %zu exceeds maximum allowed size %d at index %zu", sizes[i], INT_MAX, i);
return ncclInternalError;
}
}
return ncclSuccess;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I dint get in on this PR fun earlier. I'm not a fan of this validation function only because the validation can fail only if NCCL doesn't honor the limit, which would be a programming error not runtime error. Why take the check on each send path instead of a debug assert?

For the function as-is, compare against the max size we reported to nccl, not INT_MAX?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's completely undebuggable if it does happen. It'll be a silent truncation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compare against the max size we reported to nccl, not INT_MAX

That'll require to call get_properties and get from properties. This size check is in the send path, which requires API change. Since currently we know it's INT_MAX for both protos, can we use as it? When we work on changing the max to be more accurate for each proto, then we can look into how to use the max size we report to NCCL?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Raghu's just saying "don't check at all", not to check against the actual properties. But all we're doing here is checking for truncation, and this should be a transient thing. The next set of patches should move the internal APIs from int to size_t (and then the copy will be from int -> size_t for the old functions), and that will remove this check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's what I was saying. And I was suggesting leaning on debug mode runs in testing to catch what this check would have caught at runtime.... but leaving it in till we update the internal API sounds like a plan.

@yexiang-aws yexiang-aws force-pushed the feature/adopt-v9-ext-net-api/NCCLOFI-1066 branch from 39818b2 to 3e88463 Compare January 29, 2025 18:25
@yexiang-aws yexiang-aws marked this pull request as draft January 29, 2025 18:27
@yexiang-aws yexiang-aws marked this pull request as ready for review January 29, 2025 18:30
Update the Nvidia interface to support the v9 interface introduced
in NCCL 2.24.  The primary changes for NCCL are:
1. NIC fusion / vitual NIC support
2. API supports indication when completion is not needed (e.g. LL128)
3. DeviceProps supports max transfer size for p2p/coll operations
@yexiang-aws yexiang-aws force-pushed the feature/adopt-v9-ext-net-api/NCCLOFI-1066 branch from 3e88463 to 3f7c1b3 Compare January 29, 2025 20:39
@yexiang-aws yexiang-aws marked this pull request as draft January 29, 2025 20:41
@yexiang-aws yexiang-aws marked this pull request as ready for review January 29, 2025 20:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants