Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration) #16026
Merged
Conversation
SeanNijjar force-pushed the snijjar/issue-15006 branch 6 times, most recently from 54c5386 to 2a6d19a on December 16, 2024 14:04
SeanNijjar requested review from jvegaTT, cfjchu, ayerofieiev-tt, dmakoviichuk-tt and TT-BrianLiu as code owners on December 16, 2024 19:04
jvegaTT reviewed Dec 17, 2024, leaving comments on the following files:
tests/ttnn/unit_tests/gtests/ccl/kernels/fabric_erisc_datamover_sender_worker_sender.cpp
tests/ttnn/unit_tests/gtests/ccl/kernels/fabric_worker_sender_multi_input.cpp
tests/ttnn/unit_tests/gtests/ccl/test_ccl_reduce_scatter_host_helpers.cpp
tests/ttnn/unit_tests/gtests/ccl/test_fabric_erisc_data_mover_loopback_with_workers.cpp
ttnn/cpp/ttnn/operations/ccl/common/host/ccl_command_stream_builders.cpp
ttnn/cpp/ttnn/operations/ccl/common/host/ccl_worker_builder.cpp
ttnn/cpp/ttnn/operations/ccl/common/host/ccl_worker_builder.hpp
jvegaTT reviewed Dec 17, 2024, on the following kernel code:
return addrgen_type{
    .bank_base_address = tensor_address, .page_size = page_size, .data_format = get_dataformat(cb_id_in0)};
}
} else if constexpr (
If you change this to #ifdef and #else format, you don't need to initialize the sharded parameters with dummy variables and values in the interleaved case.
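For illustration, here is a minimal standalone sketch of that suggestion; the macro and type names (TENSOR_IS_SHARDED, InterleavedAddrGen, ShardedAddrGen, build_addrgen) are invented for this example and are not the kernel's real identifiers. With if constexpr both branches must compile, so the interleaved build still has to pass dummy sharded compile-time arguments; with #ifdef/#else the unused branch is removed before compilation, so no dummies are needed.

#include <cstdint>

// Standalone illustration only; the real kernel's addrgen types differ.
struct InterleavedAddrGen { uint32_t bank_base_address; uint32_t page_size; };
struct ShardedAddrGen { uint32_t bank_base_address; uint32_t page_size; uint32_t shard_grid_w; uint32_t shard_grid_h; };

#ifdef TENSOR_IS_SHARDED
// Sharded build: the shard-specific compile-time args are only referenced here.
ShardedAddrGen build_addrgen(uint32_t tensor_address, uint32_t page_size, uint32_t shard_grid_w, uint32_t shard_grid_h) {
    return ShardedAddrGen{tensor_address, page_size, shard_grid_w, shard_grid_h};
}
#else
// Interleaved build: no sharded parameters, and no dummy values, are needed.
InterleavedAddrGen build_addrgen(uint32_t tensor_address, uint32_t page_size) {
    return InterleavedAddrGen{tensor_address, page_size};
}
#endif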
jvegaTT reviewed Dec 17, 2024: ttnn/cpp/ttnn/operations/ccl/common/kernels/ccl_send_reader_two_input.cpp
(unfortunately C++ all-gather test with persistent fabric is now regressed)
…gather through python API
Just some minor bugs with host side tensor slice work splitter. Following commit will add unit tests and fix it. Also need to fix override runtime args
…le chips in line reduce scatter
Release mode seems to expose some issue with fabric launch on subdevice sometimes failing
… when running back to back. The second test case always fails, regardless of which case. So we enable only one case for now so we can regress on it.
SeanNijjar force-pushed the snijjar/issue-15006 branch from 61d0b55 to 54b8590 on December 21, 2024 06:47
SeanNijjar requested review from rfurko-tt, razorback3, dongjin-na and bbradelTT as code owners on December 21, 2024 06:47
Ticket
Link to Github Issue
Problem description
Without going too deep into the weeds, there were numerous reasons why CCLs needed to be fundamentally rewritten; to summarize some of them:
This PR is in no way a "feature complete" version of the required changes. Primarily, we are looking to merge the majority of the baseline functionality of the new CCL command interpreter infrastructure to unblock Llama TG work (and to avoid a continual rebase-and-regression-fixing loop), along with a few extras:
Initial test coverage
Future work will expand test coverage
What's changed
Lots to discuss here:
The bulk of this information is, or will be, included in a much larger doc that will be circulated more widely in the coming weeks, so a summary is provided below (if you want more details before the doc is available, ask and I will point you to what's in progress):
A new "command interpreter" kernel is provided which executes various different command types. Some commands map nearly directly to the low level noc API but others map to higher level operations.
High Level Operation Example:
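A hedged, hypothetical sketch of what such a high-level command might look like (type and field names are invented for illustration, not the real encoding): a single entry describes an entire tensor-slice transfer and leaves the per-page address generation to the interpreter.

#include <cstdint>

// Hypothetical high-level command: "stream this tensor slice into a circular buffer".
// One command covers many pages; the interpreter issues the per-page noc reads.
struct TensorSlice {
    uint32_t offset_pages;  // start of the slice within the tensor, in pages
    uint32_t num_pages;     // extent of the slice, in pages
};
struct ReadTensorSliceCommand {
    TensorSlice slice;
    uint32_t cb_id;         // destination circular buffer on the worker core
};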
Low Level Command:
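Continuing the same illustrative sketch, a low-level command maps almost one-to-one onto a single noc call (again, the names are hypothetical):

// Hypothetical low-level command: essentially one noc write, fully resolved up front.
struct NocWriteCommand {
    uint64_t dst_noc_addr;  // destination noc address
    uint32_t src_l1_addr;   // source address in local L1
    uint32_t size_bytes;    // transfer size in bytes
};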
These commands are specifiable on host, and while there is a whole optimization story for performance, to give the general idea, here's the primary functional code needed for all-gather as an example (code reorganized for the purposes of this PR description; not 1:1 with all_gather_async_program.cpp):
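As a rough, hypothetical stand-in for that code (the Command and TensorSlice types and the builder below are invented for this description and do not match the real ttnn sources), a per-worker all-gather command stream might look like:

#include <cstdint>
#include <vector>

// Invented types for illustration only; not the real command encoding.
struct TensorSlice { uint32_t offset_pages; uint32_t num_pages; };

struct Command {
    enum class Type { ReadSlice, FabricWriteSlice, SemaphoreInc, WaitValue } type;
    TensorSlice slice{};      // used by ReadSlice / FabricWriteSlice
    uint32_t semaphore_id{};  // used by SemaphoreInc / WaitValue
    uint32_t value{};         // used by WaitValue
};

// Per worker: stream the local shard out over the fabric to the other chips in
// the ring, signal completion, then wait for every remote shard to arrive.
std::vector<Command> build_all_gather_commands(
    TensorSlice local_shard, uint32_t done_semaphore, uint32_t ring_size) {
    std::vector<Command> cmds;
    cmds.push_back({Command::Type::ReadSlice, local_shard});
    cmds.push_back({Command::Type::FabricWriteSlice, local_shard});  // multicast to the rest of the ring
    cmds.push_back({Command::Type::SemaphoreInc, {}, done_semaphore});
    cmds.push_back({Command::Type::WaitValue, {}, done_semaphore, ring_size - 1});
    return cmds;
}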
With the above, operations such as fusion become far simpler (in some cases, trivial).
For example, when fusing an all-reduce with a split-qkv-heads operation (note that the output side of all-reduce is basically an all-gather in an optimized ring implementation), the basic fusion step is to first identify the split/slice boundaries of split-qkv (these could potentially be obtained from the op directly), propagate those cut lines to all of the producer's tensor slices (like the tensor slices in the commands shown above), and then simply split those slices and set the correct output tensor for each accordingly.
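As a toy illustration of that cut-line propagation (not the actual builder code, and assuming the cut positions are sorted and expressed in the same page units as the slice), splitting a producer slice at consumer-defined boundaries might look like this:

#include <cstdint>
#include <vector>

struct Slice { uint32_t start; uint32_t end; };  // half-open [start, end), in pages

// Split one producer slice at the consumer's cut lines (e.g. the Q/K/V boundaries
// of split-qkv); each resulting sub-slice can then be routed to its own output tensor.
std::vector<Slice> split_at_cut_lines(Slice s, const std::vector<uint32_t>& sorted_cuts) {
    std::vector<Slice> out;
    uint32_t start = s.start;
    for (uint32_t cut : sorted_cuts) {
        if (cut > start && cut < s.end) {
            out.push_back({start, cut});
            start = cut;
        }
    }
    out.push_back({start, s.end});
    return out;
}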
Note that many commands can be added to any given command stream; all-gather is just a very simple case. Reduce scatter is an example of one that is more complicated.
Expanding to other operations:
Here are some simple examples.
Send/receive
You would unicast it to the desired destination (replace mcast_dest_args).
If running in synchronous tensor mode, add a command interpreter kernel at the destination chip with a wait_val command to wait on a sem inc. Append a sem inc to the sender command stream, as sketched below.
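Building on the same invented Command/TensorSlice sketch from the all-gather example above (still hypothetical, not the real API), the synchronous send/receive case might be expressed as:

// Sender side: read the local slice, write it over the fabric as a unicast to one
// destination (this is where the mcast_dest_args would be replaced), then notify
// the receiver's semaphore.
std::vector<Command> build_send_commands(TensorSlice slice, uint32_t dst_ready_sem) {
    return {
        {Command::Type::ReadSlice, slice},
        {Command::Type::FabricWriteSlice, slice},
        {Command::Type::SemaphoreInc, {}, dst_ready_sem},
    };
}

// Receiver side (synchronous tensor mode): a command interpreter kernel on the
// destination chip simply waits for the sender's semaphore increment.
std::vector<Command> build_receive_commands(uint32_t dst_ready_sem) {
    return {{Command::Type::WaitValue, {}, dst_ready_sem, 1}};
}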
Broadcast
Invoke all-gather above but just from one chip.
If running in synchronous tensor mode, add a command interpreter kernel at all the destination chips with a wait_val command to wait on a sem inc. Append a fabric multicast sem inc to the sender command stream.
Reduce
We do something similar to the above for reduce scatter.
Snapshot
Here's a snapshot of what some command streams look like in the currently-in-progress reduce scatter.
Happy to provide more details if requested.
Note on APIs
These APIs are expected to be refined over time. In the meantime, I have introduced the named "micro-ops" as commands to grant us some flexibility in changing the underlying command encodings (both on host and device). This will let us optimize and improve the "IR" over time without requiring constant op implementation updates.
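As a sketch of that intent (names hypothetical, not the actual ttnn code), op implementations would be written against the named micro-ops, and only the lowering/encoding layer changes when the underlying encoding does:

#include <cstdint>
#include <vector>

// Hypothetical "named micro-op" layer: op implementations build against this
// enum, not against the raw encoding.
enum class MicroOp : uint32_t { ReadSlice, WriteSlice, SemInc, WaitVal };

struct MicroOpArgs { uint32_t arg0; uint32_t arg1; };

// Only this function knows the device-side wire format; changing the encoding
// is a local edit here rather than an update to every op implementation.
std::vector<uint32_t> encode_micro_op(MicroOp op, const MicroOpArgs& args) {
    return {static_cast<uint32_t>(op), args.arg0, args.arg1};
}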
This PR is still in draft because we need to move the new all-gather to experimental.
Checklist
Newest pipelines after rebase and many additional updates: