-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
load data before comms and nccl abort to solve issues with empty partitions #464
load data before comms and nccl abort to solve issues with empty partitions #464
Conversation
build |
@@ -326,7 +326,7 @@ def test_nearest_neighbors( | |||
random_state=0, | |||
) # make_blobs creates a random dataset of isotropic gaussian blobs. | |||
|
|||
# set average norm to be 1 to allow comparisons with default error thresholds | |||
# set average norm sq to be 1 to allow comparisons with default error thresholds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was still wrong so correcting in this PR.
Did some benchmarking with 2 gpus and the change (additional barrier) seems to add 1sec of latency to fit. |
Signed-off-by: Erik Ordentlich <[email protected]>
Signed-off-by: Erik Ordentlich <[email protected]>
Signed-off-by: Erik Ordentlich <[email protected]>
2e609ac
to
5642383
Compare
build |
Moving data loading before nccl comms setup seems to also solve the issues without the extra barrier in previous version. Tested with uvm on databricks and seems to be ok. |
In further testing of handling of empty partitions added previously on workstation and dgx (local mode) found that even though we now raise an exception in this case, this can still result in hangs in python workers. Sometimes the job hangs (dgx) and other times exits but the python worker isn't killed (docker on local workstation).
Loading data before setting up comms (to prevent a worker with data from racing ahead and starting e.g an allreduce) and the nccl abort seems to clean things up in these cases.
I don't think this is the root cause of #453 but should be addressed.