Runtime Stitching Progress #1743

jnie-TT · 2025-01-10T04:14:06Z

This issue tracks the overall progress of runtime stitching.

Generality Features

These features reduce overhead and are generally applicable to most tasks, requiring minimal user intervention.

- Runtime Stitching APIs [Runtime stitching APIs and sanity tests, ttnn runtime submit refactor #1301]
  - Add multi-chip support for some APIs like memcpy [TODO]
- Update front ends to new submit API and remove legacy API [Runtime cleanup: Remove legacy submit API, update includes #1620]
- Update compiler default input/output layout to dram-interleaved tilized [In Progress: Use tilized dram-interleaved as default input-output layout #1744]
- Add device attribute to ttnn layout attr, so that runtime APIs know which device a tensor belongs to [TODO]

These features provide fine-grained control and aggressive performance optimization, but require task-specific user configuration.

- Add compile hints in the compiler such that the user can toggle input/output layout (dram/l1, interleaved/sharded, row_major/tiled) [TODO]
- Add compile hints in the compiler such that the user can toggle input persistency (volatile vs persistent). Persistent inputs will not be deallocated within the graph, whereas volatile inputs will be deallocated to free up memory once they have no more users [TODO]
- Add compile hints in the compiler such that the user can specify the input/output device mesh grid and offset. This eliminates the need to redistribute tensors across multiple devices for every program [TODO]
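
To make the volatile vs persistent distinction concrete, here is a rough Python sketch of how deallocation points could be derived from last-use information. It is only an illustration of the idea described above, not the compiler's or runtime's actual implementation: `Op`, `plan_deallocations`, and the `persistent` set are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """Hypothetical stand-in for one op in a program (not a real tt-mlir type)."""
    name: str
    inputs: list[str]
    output: str


def plan_deallocations(ops, graph_inputs, persistent):
    """Map op index -> graph inputs that may be freed right after that op.

    Assumption for this sketch: a volatile input is freed after its last
    user, and a persistent input is never freed inside the graph.
    """
    # Record the index of the last op that reads each tensor.
    last_use = {}
    for index, op in enumerate(ops):
        for tensor in op.inputs:
            last_use[tensor] = index

    frees = {}
    for tensor in graph_inputs:
        if tensor in persistent:
            continue  # persistent inputs stay allocated across the graph
        if tensor in last_use:
            frees.setdefault(last_use[tensor], []).append(tensor)
    return frees


ops = [
    Op("matmul", ["x", "w"], "t0"),
    Op("add", ["t0", "b"], "t1"),
    Op("mul", ["t1", "x"], "t2"),
]

# Weights and bias marked persistent: only the activation "x" is freed.
plan = plan_deallocations(ops, ["x", "w", "b"], persistent={"w", "b"})
print(plan)
```

With `w` and `b` marked persistent, only `x` is scheduled for deallocation, after its last user (the `mul` op). With an empty `persistent` set, every input would be freed after its own last user, which saves memory but means the caller has to supply those tensors again for the next program.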

jnie-TT self-assigned this Jan 10, 2025

jnie-TT mentioned this issue Jan 10, 2025: Use tilized dram-interleaved as default input-output layout #1744 (Open)