
Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration) #16026

Merged
merged 40 commits into main from snijjar/issue-15006 on Dec 21, 2024

Conversation

@SeanNijjar (Contributor) commented Dec 13, 2024

Ticket

Link to Github Issue

Problem description

Without going too deep into the weeds, there were numerous reasons why CCLs needed to be fundamentally rewritten; to summarize a few of them:

  • Writing CCLs was not scalable from a development-effort standpoint
    • Even within a single op (e.g. all-gather) we need to be able to support many topologies (ring, line, mesh, tree, tree of mesh, etc.) and use cases (BW bound, latency bound, high reliability vs. lower reliability with potentially better perf)
  • CCLs need to be fusable with just about any op without it being a Herculean effort
  • New concepts like "async tensor" need to be supported to account for performance artifacts like (dispatch) skew between chips and to effectively hide the latency of various operations
  • (minor) support the new fabric projects with CCLs

This PR is in no way a "feature complete" version of the required changes. Primarily, we are looking to merge the majority of the baseline functionality of the new CCL command interpreter infrastructure to unblock Llama TG work (and to avoid a continual rebase-and-regression-fixing loop), along with a few extras:

  • a new (experimental) all-gather implementation (functional and tested on linear topology; other topologies will be added in the future)
    • tested with pytests
  • a new (experimental) reduce scatter (line topology) (NOTE: since the time this PR was originally created, this has had to be disabled due to a commit in runtime which regressed behaviour, causing the op not to launch properly - an issue is here)
    • NOTE: This is not yet fully tested for numerical correctness at the Python TTNN level, but it is tested in C++ with subdevice and persistent fabric to ensure it runs to completion
    • (NEAR) FUTURE WORK: numerical correctness tests at the Python level with persistent fabric

Initial test coverage

  • Gtests that provide basic coverage for the CCL Command interpreter running on the transitionary EDM fabric (both in persistent and non-persistent modes)
    • Gtests for reduce scatter and all-gather also added
  • Basic all gather pytests

Future work will expand test coverage.

What's changed

Lots to discuss here:

  • What is the command interpreter?
  • How does it work?
  • How do we build ops with it?
  • What's new with these CCLs?

The bulk of this information is, or will be, included in a much larger doc that will be circulated more widely in the coming weeks, so a summary is provided below (if you want more details before the doc is available, ask and I will point you to what's in progress):

A new "command interpreter" kernel is provided which executes various different command types. Some commands map nearly directly to the low level noc API but others map to higher level operations.
High Level Operation Example:

  • Stream Tensor Slice (from: CB/addrgen) (to:raw addr, CB, (fabric) addrgen)

Low Level Command:

  • Wait for semaphore value
  • Send semaphore update
  • Raw Read/Write
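To make the command-stream idea concrete before the full all-gather example below, here is a minimal sketch of a host-side stream that mixes low-level and high-level commands. The semaphore uop factory names are assumptions for illustration; only read_tensor_slice_to_cb and the CclHostLowLevelWorkerCommand vector pattern appear verbatim in the code later in this PR:

// Minimal sketch (not verbatim from this PR): wait on a semaphore, stream a
// tensor slice into a CB, then notify a downstream worker. The semaphore uop
// names are illustrative assumptions; sem_id, downstream_sem_id,
// input_worker_slice_v2 and src0_cb_index stand in for values computed by the
// op's host code.
std::vector<ttnn::ccl::cmd::CclHostLowLevelWorkerCommand> example_cmd_stream;
example_cmd_stream.push_back(ttnn::ccl::cmd::uops::local_semaphore_wait(sem_id, 1));                                // low level: wait for semaphore value
example_cmd_stream.push_back(ttnn::ccl::cmd::uops::read_tensor_slice_to_cb(input_worker_slice_v2, src0_cb_index));  // high level: stream tensor slice to CB
example_cmd_stream.push_back(ttnn::ccl::cmd::uops::local_core_semaphore_inc(downstream_sem_id, 1));                 // low level: send semaphore update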

These commands are specifiable on host, and there is a whole optimization story for performance, but to provide the general idea, here is the primary functional code needed for all-gather as an example (code reorganized for the purpose of this PR example - not 1:1 with all_gather_async_program.cpp):

// Create a "reader kernel" command stream
std::vector<ttnn::ccl::cmd::CclHostLowLevelWorkerCommand> reader_cmd_stream;
reader_cmd_stream.push_back(ttnn::ccl::cmd::uops::read_tensor_slice_to_cb(input_worker_slice_v2, src0_cb_index));


// Create a "writer kernel" command stream
std::vector<ttnn::ccl::cmd::CclHostLowLevelWorkerCommand> writer_cmd_stream;
// 1. do mcast of the tensor slice to all the destinations
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(
        output_worker_slice_v2, src0_cb_index, mcast_dest_args));

// Really, for all-gather, that's basically it - the rest of the code is to choose core placement
// and get info, like which core(s) are fabric endpoints to connect to the fabric, etc.

// Now pass the commands to the kernel(s)
ttnn::ccl::worker_detail::generate_multi_input_command_stream_kernel_rt_args(
            program,
            worker_sender_reader_kernel_id,
            ...,
            reader_cmd_stream,
            std::nullopt,
            std::nullopt,
            std::nullopt);
ttnn::ccl::worker_detail::generate_multi_input_command_stream_kernel_rt_args(
            program,
            worker_sender_writer_kernel_id,
            ...,
            writer_cmd_stream,
            std::nullopt,
            {forward_fabric_connection},
            {backward_fabric_connection});

With the above, operations such as fusion become far simpler (in some cases, trivial).

For example, consider fusing an all-reduce with a split-qkv-heads operation (note that the output side of all-reduce is basically an all-gather in an optimized ring implementation). The basic fusion operation is to first identify the split/slice boundaries of split-qkv (these could potentially be obtained from the op directly), propagate those cut lines to all of the tensor slices of the producer (like the tensor slices in the commands shown above), and then simply split those slices and set the correct output tensor for each accordingly, as sketched below.
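A hedged sketch of that fusion, assuming a hypothetical split_tensor_slice_at helper, hypothetical qkv_cut_lines, and q/k/v slices that each reference their own fused output tensor (only fabric_write_cb_to_tensor_slice and the push_back pattern come from this PR):

// Sketch only: propagate the split-qkv cut lines into the producer's output
// slice and emit one write command per fused output tensor.
// split_tensor_slice_at and qkv_cut_lines are hypothetical helpers/values.
auto [q_slice, k_slice, v_slice] = split_tensor_slice_at(output_worker_slice_v2, qkv_cut_lines);
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(q_slice, src0_cb_index, mcast_dest_args));
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(k_slice, src0_cb_index, mcast_dest_args));
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(v_slice, src0_cb_index, mcast_dest_args));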

Note that many commands can be added to each command stream - all-gather is just a very simple case. Reduce scatter is an example of a more complicated one.

Expanding to other operations:

Here are some simple examples

Send/receive

  • Take the all-gather as an example, and rather than specifying an mcast on the tensor write command:
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(
        output_worker_slice_v2, src0_cb_index, mcast_dest_args));

you would unicast it to the desired destination (replace mcast_dest_args).

If running in synchronous tensor mode, add a command interpreter kernel at the destination chip with a wait_val command to wait on a semaphore increment, and append a seminc to the sender command stream, as sketched below.
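A minimal sketch of the sender and receiver command streams under those assumptions (unicast_dest_args, fabric_unicast_semaphore_inc, and local_semaphore_wait are illustrative names, not confirmed APIs; only fabric_write_cb_to_tensor_slice and the command-stream pattern come from this PR):

// Sender side: unicast the slice, then bump the receiver's semaphore.
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(
        output_worker_slice_v2, src0_cb_index, unicast_dest_args));   // unicast instead of mcast
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_unicast_semaphore_inc(receiver_sem, 1, unicast_dest_args));

// Receiver chip (synchronous tensor mode): a command interpreter kernel that just waits.
std::vector<ttnn::ccl::cmd::CclHostLowLevelWorkerCommand> receiver_cmd_stream;
receiver_cmd_stream.push_back(ttnn::ccl::cmd::uops::local_semaphore_wait(receiver_sem, 1));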

Broadcast

Invoke all-gather above but just from one chip.

If running in synchronous tensor mode, add a command interpreter kernel at all the destination chips with a wait_val command to wait on a sem inc. Append a fabric multicast seminc to the sender command stream.
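A short hedged sketch of the sender-side addition for synchronous tensor mode (the multicast seminc uop name is an assumption for illustration):

// Sender (the single broadcasting chip): after the mcast tensor write, notify all destinations.
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_multicast_semaphore_inc(dest_sem, 1, mcast_dest_args));
// Each destination chip runs a command interpreter kernel that waits on dest_sem
// before consuming the broadcast tensor.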

Reduce

  • Build a tree on the cluster
  • Each producer chip unicast-sends to the next node toward the root of the tree and sends a sync signal downstream
    • if not a leaf, perform a partial reduction with your received data and your local data and forward the result to the next node toward the root
      • Add a wait_val before accepting your input data
  • The root node can do any number of reductions to reduce the incoming data streams (ensuring it first syncs on each input stream before consuming it)

We do something similar to the above for reduce scatter. A sketch of what a non-leaf node's command streams could look like is shown below.
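None of the following is in this PR; the partial reduction itself would be a separate compute kernel consuming the CBs, and the semaphore/unicast uop names, CB indices, and slice variables are illustrative assumptions:

// Reader stream on a non-leaf node: wait for the child's data, then stage local
// and received slices into CBs for the partial-reduce compute kernel.
reader_cmd_stream.push_back(ttnn::ccl::cmd::uops::local_semaphore_wait(from_child_sem, 1));
reader_cmd_stream.push_back(ttnn::ccl::cmd::uops::read_tensor_slice_to_cb(local_input_slice, src0_cb_index));
reader_cmd_stream.push_back(ttnn::ccl::cmd::uops::read_tensor_slice_to_cb(received_slice, src1_cb_index));

// Writer stream: forward the partially reduced slice toward the root and signal the parent.
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(
        toward_root_slice, reduced_cb_index, toward_root_dest_args));
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_unicast_semaphore_inc(parent_sem, 1, toward_root_dest_args));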

Snapshot

Here's a snapshot of what some command streams look like in the currently in-progress reduce scatter:
[image: reduce scatter command stream snapshot]

Happy to provide more details if requested.

Note on APIs

These APIs are expected to be refined over time. In the meantime, I have introduced the named "micro-ops" as commands to grant us some flexibility in changing the underlying command encodings (both on host and device). This will let us optimize and improve the "IR" over time without requiring constant op implementation updates.

This PR is still in draft because we need to move the new all-gather to experimental.

Checklist

Newest pipelines after rebase and many additional updates:

@SeanNijjar SeanNijjar force-pushed the snijjar/issue-15006 branch 6 times, most recently from 54c5386 to 2a6d19a Compare December 16, 2024 14:04
@SeanNijjar SeanNijjar marked this pull request as ready for review December 16, 2024 19:04
return addrgen_type{
.bank_base_address = tensor_address, .page_size = page_size, .data_format = get_dataformat(cb_id_in0)};
}
} else if constexpr (
Contributor

If you change this to #ifdef and #else format, you don't need to initialize the sharded parameters with dummy variables and values in the interleaved case.
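A minimal sketch of the structure the reviewer is suggesting; SHARDED_MEM_LAYOUT and the sharded-parameter list are hypothetical placeholders:

// With preprocessor selection, the interleaved build never sees the sharded
// parameters, so no dummy variables/values are needed to satisfy the other branch.
#ifdef SHARDED_MEM_LAYOUT
    return addrgen_type{ /* sharded parameters, all real values */ };
#else
    return addrgen_type{
        .bank_base_address = tensor_address, .page_size = page_size, .data_format = get_dataformat(cb_id_in0)};
#endif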

SeanNijjar and others added 23 commits December 21, 2024 03:12
(unfortunately C++ all-gather test with persistent fabric is now regressed)
Just some minor bugs with host side tensor slice work splitter. Following commit will add unit tests and fix it.

Also need to fix override runtime args
Release mode seems to expose some issue with fabric launch on subdevice sometimes failing when running back to back.

The second test case always fails, regardless of which case. So we enable only one case for now so we can regress on it.
@SeanNijjar SeanNijjar merged commit 4f5f417 into main Dec 21, 2024
183 of 184 checks passed
@SeanNijjar SeanNijjar deleted the snijjar/issue-15006 branch December 21, 2024 18:58