-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support NCCL plugin v9 interface #767
base: master
Are you sure you want to change the base?
Support NCCL plugin v9 interface #767
Conversation
actions are not running on this commit automatically -- this is on your manager. You need to do the aws github training and then you need to be in https://github.com/orgs/aws/teams/aws-ofi |
72736a0
to
a6d3909
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Congratulations on your first PR!
src/nccl_ofi_interface_nvidia.c
Outdated
props->maxP2pBytes = NCCL_MAX_NET_SIZE_BYTES; | ||
props->maxCollBytes = NCCL_MAX_NET_SIZE_BYTES; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we'd talked about not changing behavior compared to v8 with the first set of changes... but this property is straightforward enough that we might as well set it to the correct max size now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have to set it to the correct max size right now. This code would open us to truncation errors in both send and recv. NCCL_MAX_NET_SIZE_BYTES
is 1 TiB, but we really need to set these to
2^31-1` until we change the interface to have sizes be size_t instead of int, which also will have implications in the rdma protocol, as I believe there are some places that we're assuming we only need 31 bits for the length today in control headers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack, will update both to using INT32_MAX
, and leave the NCCL_MAX_NET_SIZE_BYTES
definition in net.h to be in-sync with NCCL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a retro: no unit testing enforcement in the dev process will make catching issues rely on PEs and a best effort. Having mentioned earlier today that existing UT coverage is not a good story yet. Can we at least advocate that new code requires UT coverage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, should have caught this earlier. It should be INT_MAX
, not INT32_MAX
, as the previous interface used an unsized int
for the variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does a unit test for this specific comment thread look like?
The right answer for us is whatever gets filled in here, which is ultimately defined in card firmware and derived from the hardware engines.
Sometimes the right answer is just the right answer and the only way to know it is to know it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, what we we doing here? this should be present in the info, why are we filling constants at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I made a similar comment to Nick where we define this constant. This shouldn't be a constant we're putting int he properties. It needs to be a property in the plugin prop struct which is filled in by the two protocols. A constant is pointless here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, I didn't fully think through the limits implied from protocol and libfabric. updated in the next revision using ep_attr->max_msg_size
to get the limit from libfabric. let me know how that looks
.gitignore
Outdated
@@ -78,3 +78,6 @@ m4/lt~obsolete.m4 | |||
.idea/ | |||
.devenv/ | |||
.direnv | |||
|
|||
.vscode | |||
install/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's this directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: also, this should probably be in its own commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vscode IDE configuration directory
install/ is the build artifact folder after make install
adding to a separate commit in next revision
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #773 for a conversation among the team yesterday about editor droppings and .gitignore. I mostly agree with Nick on this one, so would suggest we just drop this commit.
src/nccl_ofi_interface_nvidia.c
Outdated
props->maxP2pBytes = NCCL_MAX_NET_SIZE_BYTES; | ||
props->maxCollBytes = NCCL_MAX_NET_SIZE_BYTES; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have to set it to the correct max size right now. This code would open us to truncation errors in both send and recv. NCCL_MAX_NET_SIZE_BYTES
is 1 TiB, but we really need to set these to
2^31-1` until we change the interface to have sizes be size_t instead of int, which also will have implications in the rdma protocol, as I believe there are some places that we're assuming we only need 31 bits for the length today in control headers.
bot:aws:retest |
be5956b
to
1b63f2c
Compare
a45c02d
to
6aba1e2
Compare
src/nccl_ofi_interface_nvidia.c
Outdated
props->maxP2pBytes = NCCL_MAX_NET_SIZE_BYTES; | ||
props->maxCollBytes = NCCL_MAX_NET_SIZE_BYTES; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I made a similar comment to Nick where we define this constant. This shouldn't be a constant we're putting int he properties. It needs to be a property in the plugin prop struct which is filled in by the two protocols. A constant is pointless here.
f0e839d
to
2bc2990
Compare
src/nccl_ofi_net.c
Outdated
@@ -424,6 +424,9 @@ static int set_nic_props_default(int dev_id, struct fi_info *nic_prov, | |||
|
|||
props->dmabuf_support = false; | |||
|
|||
props->max_p2p_bytes = nic_prov->ep_attr->max_msg_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you sure this indenting is right? GitHub doesn't think it matches the code around it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't think this code does what you want. Set the max size to 0 here, and make the transports fill it in. With the EFA provider in zero-copy mode, nic_prov->ep_attr->max_msg_size is going to be 8192, which is the max send size. We'll have to do some work to get the max rdma size (which is one of the inputs to max send size). Let's talk about this internally with Shi. This is right for the sendrecv protocol, however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't pay attention that here indentation type is tab instead of whitespaces. Will fix it.
I did sync with @shijin-aws on this and confirmed that even plugin sets msg with fi_setopt
that enables zero-copy mode, ep_attr->max_msg_size
is still max of all supported operations, instead of 8192. Let me start a 3-way sync on Slack
Ye Xiang
16:50
sorry, one more question, if plugin sets msg with fi_setopt that enables zero-copy mode, ep_attr->max_msg_size is still max of all supported operations, is it?
Shi Jin
[16:56]
yes
ep_attr is not going to be changed
b89abe2
to
39818b2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed one little (but important) issue in the last code review. Sorry!
Other than the abort check stuff, this looks good to me.
.gitignore
Outdated
@@ -78,3 +78,6 @@ m4/lt~obsolete.m4 | |||
.idea/ | |||
.devenv/ | |||
.direnv | |||
|
|||
.vscode | |||
install/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #773 for a conversation among the team yesterday about editor droppings and .gitignore. I mostly agree with Nick on this one, so would suggest we just drop this commit.
src/nccl_ofi_api.c
Outdated
{ | ||
ncclResult_t validation_result = msg_length_verify_max_size(&size, 1); | ||
if (validation_result != ncclSuccess) { | ||
return validation_result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I didn't catch this earlier, but this should be return check_return(validation_result)
so that the abort check happens correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/nccl_ofi_api.c
Outdated
sizes == NULL || tags == NULL || | ||
mhandles == NULL || request == NULL)) { | ||
NCCL_OFI_WARN("Invalid argument: NULL pointer detected"); | ||
return ncclInvalidArgument; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above about using return_check()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/nccl_ofi_api.c
Outdated
ncclResult_t nccl_net_ofi_irecv_v9(void* recvComm, int n, void** data, | ||
size_t* sizes, int* tags, void** mhandles, void** request) | ||
{ | ||
int sizesInt[NCCL_OFI_MAX_RECVS]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you initialize it please? There's no need to memset or anything crazy like that, just = {};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/nccl_ofi_api.c
Outdated
*request = NULL; | ||
} | ||
|
||
ncclResult_t validation_result = msg_length_verify_max_size(sizes, n); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: const
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, can you elaborate it ? Do you mean explicitly cast to const when passing arguments to a function that expects const parameters? I think that's not mandatory, Implicit conversion is ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm besides nits
src/nccl_ofi_api.c
Outdated
static inline ncclResult_t msg_length_verify_max_size(const size_t *sizes, const size_t len) { | ||
if (OFI_UNLIKELY(sizes == NULL)) { | ||
NCCL_OFI_WARN("Invalid argument: NULL pointer provided for sizes array"); | ||
return ncclInvalidArgument; | ||
} | ||
|
||
for (size_t i = 0; i < len; i++) { | ||
if (OFI_UNLIKELY(sizes[i] > INT_MAX)) { | ||
NCCL_OFI_WARN("Message size %zu exceeds maximum allowed size %d at index %zu", sizes[i], INT_MAX, i); | ||
return ncclInternalError; | ||
} | ||
} | ||
return ncclSuccess; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I dint get in on this PR fun earlier. I'm not a fan of this validation function only because the validation can fail only if NCCL doesn't honor the limit, which would be a programming error not runtime error. Why take the check on each send path instead of a debug assert?
For the function as-is, compare against the max size we reported to nccl, not INT_MAX?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it's completely undebuggable if it does happen. It'll be a silent truncation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compare against the max size we reported to nccl, not INT_MAX
That'll require to call get_properties
and get from properties. This size check is in the send path, which requires API change. Since currently we know it's INT_MAX for both protos, can we use as it? When we work on changing the max to be more accurate for each proto, then we can look into how to use the max size we report to NCCL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Raghu's just saying "don't check at all", not to check against the actual properties. But all we're doing here is checking for truncation, and this should be a transient thing. The next set of patches should move the internal APIs from int to size_t (and then the copy will be from int -> size_t for the old functions), and that will remove this check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's what I was saying. And I was suggesting leaning on debug mode runs in testing to catch what this check would have caught at runtime.... but leaving it in till we update the internal API sounds like a plan.
39818b2
to
3e88463
Compare
Update the Nvidia interface to support the v9 interface introduced in NCCL 2.24. The primary changes for NCCL are: 1. NIC fusion / vitual NIC support 2. API supports indication when completion is not needed (e.g. LL128) 3. DeviceProps supports max transfer size for p2p/coll operations
3e88463
to
3f7c1b3
Compare
Description of changes:
This PR is to support the v9 interface of the NCCL plugin interface:
Test
make passed
NCCL test all_reduce_perf completed without error
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.