
Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration) #16026

Merged
merged 40 commits into main from snijjar/issue-15006 on Dec 21, 2024

Conversation

@SeanNijjar (Contributor) commented Dec 13, 2024

Ticket

Link to Github Issue

Problem description

Without going too deep into the weeds, there were numerous reasons why CCLs needed to be fundamentally rewritten; to summarize a few of them:

  • Writing CCLs was not scalable from a development-effort standpoint
    • Even within a single op (e.g. all-gather) we need to be able to support many topologies (ring, line, mesh, tree, tree of mesh, etc.) and use cases (BW bound, latency bound, high reliability vs. lower reliability with potentially better perf)
  • CCLs need to be fusable with just about any op without it being a Herculean effort
  • New concepts like "async tensor" need to be supported to account for performance artifacts like (dispatch) skew between chips and to effectively hide the latency of various operations
  • (minor) support the new fabric projects with CCLs

This PR is in no way a "feature complete" version of the required changes. Primarily, we are looking to merge the majority of the baseline functionality of the new CCL command interpreter infrastructure to unblock Llama TG work (and to avoid a continual rebase-and-regression-fixing loop), along with a few extras:

  • a new (experimental) all-gather implementation (functional and tested on linear topology; other topologies will be added in the future)
    • tested with pytests
  • a new (experimental) reduce scatter (line topology) (NOTE: since the time this PR was originally created, this has had to be disabled due to a commit in runtime which regressed behaviour, causing the op not to launch properly - an issue is here)
    • NOTE: This is not yet fully tested for numerical correctness at the Python TTNN level, but it is tested in C++ with subdevice and persistent fabric to ensure it runs to completion
    • (NEAR) FUTURE WORK: numerical correctness tests at the Python level with persistent fabric

Initial test coverage

  • Gtests that provide basic coverage for the CCL Command interpreter running on the transitionary EDM fabric (both in persistent and non-persistent modes)
    • Gtests for reduce scatter and all-gather also added
  • Basic all gather pytests

Future work will expand test coverage.

What's changed

Lots to discuss here:

  • What is the command interpreter?
  • How does it work?
  • How do we build ops with it?
  • What's new with these CCLs?

The bulk of this information is, or will be, included in a much larger doc that will be circulated more widely in the coming weeks, so a summary is provided below (if you want more details before the doc is available, ask and I will point you to what's in progress):

A new "command interpreter" kernel is provided which executes various different command types. Some commands map nearly directly to the low level noc API but others map to higher level operations.
High Level Operation Example:

  • Stream Tensor Slice (from: CB/addrgen) (to:raw addr, CB, (fabric) addrgen)

Low Level Command:

  • Wait for semaphore value
  • Send semaphore update
  • Raw Read/Write
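To make the command-stream idea concrete before the full all-gather example below, here is a minimal sketch of a host-side stream that mixes low-level and high-level commands. The semaphore uop factory names are assumptions for illustration; only read_tensor_slice_to_cb and the CclHostLowLevelWorkerCommand vector pattern appear verbatim in the code later in this PR:

// Minimal sketch (not verbatim from this PR): wait on a semaphore, stream a
// tensor slice into a CB, then notify a downstream worker. The semaphore uop
// names are illustrative assumptions; sem_id, downstream_sem_id,
// input_worker_slice_v2 and src0_cb_index stand in for values computed by the
// op's host code.
std::vector<ttnn::ccl::cmd::CclHostLowLevelWorkerCommand> example_cmd_stream;
example_cmd_stream.push_back(ttnn::ccl::cmd::uops::local_semaphore_wait(sem_id, 1));                                // low level: wait for semaphore value
example_cmd_stream.push_back(ttnn::ccl::cmd::uops::read_tensor_slice_to_cb(input_worker_slice_v2, src0_cb_index));  // high level: stream tensor slice to CB
example_cmd_stream.push_back(ttnn::ccl::cmd::uops::local_core_semaphore_inc(downstream_sem_id, 1));                 // low level: send semaphore update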

These commands are specifiable on host, and there is a whole optimization story for performance, but to provide the general idea, here is the primary functional code needed for all-gather as an example (code reorganized for the purpose of this PR example - not 1:1 with all_gather_async_program.cpp):

// Create a "reader kernel" command stream
std::vector<ttnn::ccl::cmd::CclHostLowLevelWorkerCommand> reader_cmd_stream;
reader_cmd_stream.push_back(ttnn::ccl::cmd::uops::read_tensor_slice_to_cb(input_worker_slice_v2, src0_cb_index));


// Create a "writer kernel" command stream
std::vector<ttnn::ccl::cmd::CclHostLowLevelWorkerCommand> writer_cmd_stream;
// 1. do mcast of the tensor slice to all the destinations
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(
        output_worker_slice_v2, src0_cb_index, mcast_dest_args));

// Really, for all-gather, that's basically it - the rest of the code is to choose core placement
// and get info, like which core(s) are fabric endpoints to connect to the fabric, etc.

// Now pass the commands to the kernel(s)
ttnn::ccl::worker_detail::generate_multi_input_command_stream_kernel_rt_args(
            program,
            worker_sender_reader_kernel_id,
            ...,
            reader_cmd_stream,
            std::nullopt,
            std::nullopt,
            std::nullopt);
ttnn::ccl::worker_detail::generate_multi_input_command_stream_kernel_rt_args(
            program,
            worker_sender_writer_kernel_id,
            ...,
            writer_cmd_stream,
            std::nullopt,
            {forward_fabric_connection},
            {backward_fabric_connection});

With the above, operations such as fusion become far simpler (in some cases, trivial).

For example, consider fusing an all-reduce with a split-qkv-heads operation (note that the output side of all-reduce is basically an all-gather in an optimized ring implementation). The basic fusion operation is to first identify the split/slice boundaries of split-qkv (these could potentially be obtained from the op directly), propagate those cut lines to all of the tensor slices of the producer (like the tensor slices in the commands shown above), and then simply split those slices and set the correct output tensor for each accordingly, as sketched below.
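A hedged sketch of that fusion, assuming a hypothetical split_tensor_slice_at helper, hypothetical qkv_cut_lines, and q/k/v slices that each reference their own fused output tensor (only fabric_write_cb_to_tensor_slice and the push_back pattern come from this PR):

// Sketch only: propagate the split-qkv cut lines into the producer's output
// slice and emit one write command per fused output tensor.
// split_tensor_slice_at and qkv_cut_lines are hypothetical helpers/values.
auto [q_slice, k_slice, v_slice] = split_tensor_slice_at(output_worker_slice_v2, qkv_cut_lines);
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(q_slice, src0_cb_index, mcast_dest_args));
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(k_slice, src0_cb_index, mcast_dest_args));
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(v_slice, src0_cb_index, mcast_dest_args));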

Note that many commands can be added to each command stream - all-gather is just a very simple case. Reduce scatter is an example of a more complicated one.

Expanding to other operations:

Here are some simple examples

Send/receive

  • Take the all-gather as an example, and rather than specifying an mcast on the tensor write command:
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(
        output_worker_slice_v2, src0_cb_index, mcast_dest_args));

you would unicast it to the desired destination (replace mcast_dest_args).

If running in synchronous tensor mode, add a command interpreter kernel at the destination chip with a wait_val command to wait on a semaphore increment, and append a seminc to the sender command stream, as sketched below.
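A minimal sketch of the sender and receiver command streams under those assumptions (unicast_dest_args, fabric_unicast_semaphore_inc, and local_semaphore_wait are illustrative names, not confirmed APIs; only fabric_write_cb_to_tensor_slice and the command-stream pattern come from this PR):

// Sender side: unicast the slice, then bump the receiver's semaphore.
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(
        output_worker_slice_v2, src0_cb_index, unicast_dest_args));   // unicast instead of mcast
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_unicast_semaphore_inc(receiver_sem, 1, unicast_dest_args));

// Receiver chip (synchronous tensor mode): a command interpreter kernel that just waits.
std::vector<ttnn::ccl::cmd::CclHostLowLevelWorkerCommand> receiver_cmd_stream;
receiver_cmd_stream.push_back(ttnn::ccl::cmd::uops::local_semaphore_wait(receiver_sem, 1));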

Broadcast

Invoke all-gather above but just from one chip.

If running in synchronous tensor mode, add a command interpreter kernel at all the destination chips with a wait_val command to wait on a sem inc. Append a fabric multicast seminc to the sender command stream.
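A short hedged sketch of the sender-side addition for synchronous tensor mode (the multicast seminc uop name is an assumption for illustration):

// Sender (the single broadcasting chip): after the mcast tensor write, notify all destinations.
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_multicast_semaphore_inc(dest_sem, 1, mcast_dest_args));
// Each destination chip runs a command interpreter kernel that waits on dest_sem
// before consuming the broadcast tensor.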

Reduce

  • Build a tree on the cluster
  • Each producer chip unicast-sends to the next node toward the root of the tree and sends a sync signal downstream
    • if not a leaf, perform a partial reduction with your received data and your local data and forward the result to the next node toward the root
      • Add a wait_val before accepting your input data
  • The root node can do any number of reductions to reduce the incoming data streams (ensuring it first syncs on each input stream before consuming it)

We do something similar to the above for reduce scatter. A sketch of what a non-leaf node's command streams could look like is shown below.
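None of the following is in this PR; the partial reduction itself would be a separate compute kernel consuming the CBs, and the semaphore/unicast uop names, CB indices, and slice variables are illustrative assumptions:

// Reader stream on a non-leaf node: wait for the child's data, then stage local
// and received slices into CBs for the partial-reduce compute kernel.
reader_cmd_stream.push_back(ttnn::ccl::cmd::uops::local_semaphore_wait(from_child_sem, 1));
reader_cmd_stream.push_back(ttnn::ccl::cmd::uops::read_tensor_slice_to_cb(local_input_slice, src0_cb_index));
reader_cmd_stream.push_back(ttnn::ccl::cmd::uops::read_tensor_slice_to_cb(received_slice, src1_cb_index));

// Writer stream: forward the partially reduced slice toward the root and signal the parent.
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(
        toward_root_slice, reduced_cb_index, toward_root_dest_args));
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_unicast_semaphore_inc(parent_sem, 1, toward_root_dest_args));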

Snapshot

Here's a snapshot of what some command streams look like in the currently in-progress reduce scatter:
[image: reduce scatter command stream snapshot]

Happy to provide more details if requested.

Note on APIs

These APIs are expected to be refined over time. In the meantime, I have introduced the named "micro-ops" as commands to grant us some flexibility in changing the underlying command encodings (both on host and device). This will let us optimize and improve the "IR" over time without requiring constant op implementation updates.

This PR is still in draft because we need to move the new all-gather to experimental.

Checklist

Newest pipelines after rebase and many additional updates:

@SeanNijjar SeanNijjar force-pushed the snijjar/issue-15006 branch 6 times, most recently from 54c5386 to 2a6d19a Compare December 16, 2024 14:04
@SeanNijjar SeanNijjar marked this pull request as ready for review December 16, 2024 19:04
return addrgen_type{
.bank_base_address = tensor_address, .page_size = page_size, .data_format = get_dataformat(cb_id_in0)};
}
} else if constexpr (
Contributor

If you change this to #ifdef and #else format, you don't need to initialize the sharded parameters with dummy variables and values in the interleaved case.
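A minimal sketch of the structure the reviewer is suggesting; SHARDED_MEM_LAYOUT and the sharded-parameter list are hypothetical placeholders:

// With preprocessor selection, the interleaved build never sees the sharded
// parameters, so no dummy variables/values are needed to satisfy the other branch.
#ifdef SHARDED_MEM_LAYOUT
    return addrgen_type{ /* sharded parameters, all real values */ };
#else
    return addrgen_type{
        .bank_base_address = tensor_address, .page_size = page_size, .data_format = get_dataformat(cb_id_in0)};
#endif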

SeanNijjar and others added 23 commits December 21, 2024 03:12
(unfortunately C++ all-gather test with persistent fabric is now regressed)
Just some minor bugs with host side tensor slice work splitter. Following commit will add unit tests and fix it.

Also need to fix override runtime args
Release mode seems to expose some issue with fabric launch on subdevice sometimes failing when running back to back.

The second test case always fails, regardless of which case. So we enable only one case for now so we can regress on it.
@SeanNijjar SeanNijjar merged commit 4f5f417 into main Dec 21, 2024
183 of 184 checks passed
@SeanNijjar SeanNijjar deleted the snijjar/issue-15006 branch December 21, 2024 18:58