-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core][compiled graphs] Adjust overlap algorithm #49005
Conversation
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
2e394c1
to
ab6b61f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't really understand the upstream/downstream NCCL actors change. Can you either clarify with some examples/comments or see if the suggested change could work instead?
Check if this is a NCCL write operation that writes to an actor | ||
that is also one of its immediate upstream NCCL actors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this logic hard to understand. Reading the issue, would it work if we instead modified this to is_nccl_write_to_peer(self, upstream_actor)
, and make sure the caller passes in the right actor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(There may be more than one upstream_actors, so we need to pass them all in.)
is_nccl_write_to_upstream_actor
is a short hand and makes the caller code more concise. We can change it to the following at caller side (dag_node_operation.py:792) if you find it easier to understand:
Original:
overlapped_schedule[j].is_nccl_write_to_upstream_actor
New:
overlapped_schedule[j].is_nccl_write_to_peer(overlapped_schedule[j].upstream_actors)
@@ -1682,11 +1682,17 @@ def _generate_dag_operation_graph_node( | |||
method_name = exec_task.method_name | |||
actor_handle = dag_node._get_actor_handle() | |||
requires_nccl = dag_node.type_hint.requires_nccl() | |||
downstream_nccl_actors = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does downstream_nccl_actors
check whether the current node requires NCCL while uspstream_nccl_actors
checks whether the upstream node requires NCCL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I get the underlying question, for a dag_node
with NCCL type hint, downstream_nccl_actors
means the actors reading from this NCCL channel, and upstream_nccl_actors
means the actors writing to NCCL channels which are input for this dag_node
. "downstream" and "upstream" are relative to the dag_node
.
Signed-off-by: Rui Qiao <[email protected]>
3c53357
to
5d35b71
Compare
The relax algorithm would deadlock for 5d35b71 |
Why are these changes needed?
Adjust gpu communication overlap algorithm to be less conservative, see details in #48953
Related issue number
Closes #48953
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.