Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cherry-picks for 1.9.0-aws release #375

Closed
wants to merge 9 commits into from

Conversation

rajachan
Copy link
Member

@rajachan rajachan commented Apr 4, 2024

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

bwbarrett and others added 9 commits April 2, 2024 15:23
Signed-off-by: Brian Barrett <[email protected]>
Signed-off-by: Raghu Raja <[email protected]>
nccl_message_transfer and ring tests are overwriting in the cleanup the error value
returned from previous calls. This is making the tests quite worthless,
because if a function fails the returned error will be overwritten by the
deallocate_buffer call, so if the deallocation is successful, the tests
succeeds. This commit fixes that by using a different variable for the
return values in the cleanup and makes the test fail if either a function
call fails, or the cleanup fails, or both.

Signed-off-by: Amedeo Sapio <[email protected]>
The exposed closeSend function points to blocked_send_close() which,
in turn calls send_close(). The value returned by send_close was ignored,
this commit is passing that value back to blocked_send_close.

Signed-off-by: Amedeo Sapio <[email protected]>
Add shims for v1 compat, add context arguments where needed. Also
refactor model code such that it can accept the context as a argument
instead of by global reference.

cr: https://code.amazon.com/reviews/CR-118885749
Added some branch prediction hints when we check for unlikely error
cases in the critical path. All but one are related to msgbuff calls.

Signed-off-by: Amedeo Sapio <[email protected]>
This would prevent failures when NVLS size is smaller than the NVLSTree
chunksize or silent fallback to a lower value, if NCCL were to change
defaults.

Signed-off-by: Raghu Raja <[email protected]>
This is a useful print to debug issues with algorithm selection at scale
without having to rebuild the plugin. While the actual costs calculated
are more developer-focused and can stay under TRACE, the final choice
should be available to INFO when the TUNING debug subsystem is enabled.

NCCL 2.20 moved their algo/proto selection print from a TRACE to INFO
for this reason as well.

Signed-off-by: Raghu Raja <[email protected]>
Signed-off-by: Raghu Raja <[email protected]>
@rajachan rajachan closed this Apr 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants