Use tilized dram-interleaved as default input-output layout #1744

Open
wants to merge 1 commit into
base: main
from

Conversation

jnie-TT
Contributor

@jnie-TT jnie-TT commented Jan 10, 2025

Description

Part of the runtime stitching effort #1743.

This PR changes the default input/output layout from system-memory row-major to tiled DRAM-interleaved.

Combined with the runtime stitching APIs, this lets the user pre-tilize and interleave tensors (such as weights) once and reuse them across multiple programs, eliminating ping-ponging between host and DRAM and between row-major and tile layouts.

IR Example

TTNN IR of simple_matmul test on main:

#system_memory = #ttnn.buffer_type<system_memory>
#ttnn_layout = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<64x128xbf16, #system_memory>>
#ttnn_layout1 = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<128x96xbf16, #system_memory>>
#ttnn_layout2 = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<64x96xbf16, #system_memory>>
#ttnn_layout3 = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<2x4x!tt.tile<32x32, bf16>, #dram>, <interleaved>>
#ttnn_layout4 = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<4x3x!tt.tile<32x32, bf16>, #dram>, <interleaved>>
#ttnn_layout5 = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<2x3x!tt.tile<32x32, bf16>, #dram>, <interleaved>>
module attributes {tt.device = #device, tt.system_desc = #system_desc} {
  func.func @forward(%arg0: tensor<64x128xbf16, #ttnn_layout>, %arg1: tensor<128x96xbf16, #ttnn_layout1>) -> tensor<64x96xbf16, #ttnn_layout2> {
    %0 = "ttnn.get_device"() <{mesh_shape = #ttnn<mesh_shape 1x1>}> : () -> !tt.device<#device>
    %1 = "ttnn.to_device"(%arg0, %0) <{memory_config = #ttnn.memory_config<#dram, <<2x4>>, <interleaved>>}> : (tensor<64x128xbf16, #ttnn_layout>, !tt.device<#device>) -> tensor<64x128xbf16, #ttnn_layout3>
    %2 = "ttnn.to_layout"(%1) <{layout = #ttnn.layout<tile>}> : (tensor<64x128xbf16, #ttnn_layout3>) -> tensor<64x128xbf16, #ttnn_layout3>
    "ttnn.deallocate"(%1) <{force = false}> : (tensor<64x128xbf16, #ttnn_layout3>) -> ()
    %3 = "ttnn.to_device"(%arg1, %0) <{memory_config = #ttnn.memory_config<#dram, <<4x3>>, <interleaved>>}> : (tensor<128x96xbf16, #ttnn_layout1>, !tt.device<#device>) -> tensor<128x96xbf16, #ttnn_layout4>
    %4 = "ttnn.to_layout"(%3) <{layout = #ttnn.layout<tile>}> : (tensor<128x96xbf16, #ttnn_layout4>) -> tensor<128x96xbf16, #ttnn_layout4>
    "ttnn.deallocate"(%3) <{force = false}> : (tensor<128x96xbf16, #ttnn_layout4>) -> ()
    %5 = "ttnn.empty"(%0) <{dtype = #tt.supportedDataTypes<bf16>, layout = #ttnn.layout<tile>, memory_config = #ttnn.memory_config<#dram, <<2x3>>, <interleaved>>, shape = #ttnn.shape<64x96>}> : (!tt.device<#device>) -> tensor<64x96xbf16, #ttnn_layout5>
    %6 = "ttnn.matmul"(%2, %4, %5) : (tensor<64x128xbf16, #ttnn_layout3>, tensor<128x96xbf16, #ttnn_layout4>, tensor<64x96xbf16, #ttnn_layout5>) -> tensor<64x96xbf16, #ttnn_layout5>
    "ttnn.deallocate"(%4) <{force = false}> : (tensor<128x96xbf16, #ttnn_layout4>) -> ()
    "ttnn.deallocate"(%2) <{force = false}> : (tensor<64x128xbf16, #ttnn_layout3>) -> ()
    %7 = "ttnn.from_device"(%6) : (tensor<64x96xbf16, #ttnn_layout5>) -> tensor<64x96xbf16, #ttnn_layout2>
    "ttnn.deallocate"(%5) <{force = false}> : (tensor<64x96xbf16, #ttnn_layout5>) -> ()
    %8 = "ttnn.to_layout"(%7) <{layout = #ttnn.layout<row_major>}> : (tensor<64x96xbf16, #ttnn_layout2>) -> tensor<64x96xbf16, #ttnn_layout2>
    "ttnn.deallocate"(%7) <{force = false}> : (tensor<64x96xbf16, #ttnn_layout2>) -> ()
    return %8 : tensor<64x96xbf16, #ttnn_layout2>
  }
}

TTNN IR of simple_matmul test after this change:

#ttnn_layout = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<2x4x!tt.tile<32x32, bf16>, #dram>, <interleaved>>
#ttnn_layout1 = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<4x3x!tt.tile<32x32, bf16>, #dram>, <interleaved>>
#ttnn_layout2 = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<2x3x!tt.tile<32x32, bf16>, #dram>, <interleaved>>
module attributes {tt.device = #device, tt.system_desc = #system_desc} {
  func.func @forward(%arg0: tensor<64x128xbf16, #ttnn_layout>, %arg1: tensor<128x96xbf16, #ttnn_layout1>) -> tensor<64x96xbf16, #ttnn_layout2> {
    %0 = "ttnn.get_device"() <{mesh_shape = #ttnn<mesh_shape 1x1>}> : () -> !tt.device<#device>
    %1 = "ttnn.empty"(%0) <{dtype = #tt.supportedDataTypes<bf16>, layout = #ttnn.layout<tile>, memory_config = #ttnn.memory_config<#dram, <<2x3>>, <interleaved>>, shape = #ttnn.shape<64x96>}> : (!tt.device<#device>) -> tensor<64x96xbf16, #ttnn_layout2>
    %2 = "ttnn.matmul"(%arg0, %arg1, %1) : (tensor<64x128xbf16, #ttnn_layout>, tensor<128x96xbf16, #ttnn_layout1>, tensor<64x96xbf16, #ttnn_layout2>) -> tensor<64x96xbf16, #ttnn_layout2>
    return %2 : tensor<64x96xbf16, #ttnn_layout2>
  }
}
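
As a quick check on the layouts above, the tile-grid shapes in the memrefs (2x4, 4x3, 2x3) follow from dividing each tensor dimension by the 32x32 tile size. A minimal sketch, assuming dimensions that are not a multiple of 32 round up to a whole tile; `tileGridShape` is a hypothetical helper, not a tt-mlir function:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Tile dimensions taken from the !tt.tile<32x32, bf16> type in the IR above.
constexpr std::int64_t kTileHeight = 32;
constexpr std::int64_t kTileWidth = 32;

// Hypothetical helper: number of tiles along each dimension of a 2D tensor.
// Assumption: a partially filled tile still occupies one whole tile (round up).
std::array<std::int64_t, 2> tileGridShape(std::int64_t rows, std::int64_t cols) {
  assert(rows > 0 && cols > 0);
  return {(rows + kTileHeight - 1) / kTileHeight,
          (cols + kTileWidth - 1) / kTileWidth};
}
```

For the simple_matmul shapes, 64x128 gives 2x4, 128x96 gives 4x3, and 64x96 gives 2x3, matching the three layouts in the IR.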

Changes

TTNNLayout

  • Updated the default memory space to dram, tensor memory layout to interleaved, and layout to tiled.
  • Moved force row major logic from the TTIRtoTTNN pass to this pass. This will determine whether or not to untilize the tensor. The issue with having the force row major logic in a downstream pass was that a toLayoutOp may not even be created in the first place, since the input is already defaulted to tile (thus no tilization would be needed).

TTIRToTTNN

  • Uplifted force row major logic to TTNNLayout Pass.

Optimizer

  • Added a workaround that moves GetDeviceOps to the front of the op schedule.

    • Hit an issue where GetDeviceOps were non-deterministically moved to the end of the schedule when running mnist_sharded test
    • I'll create a follow up issue for this to be properly fixed
  • Added a workaround that checks for ReturnOps in L1 usage calculation

    • Return ops were not considered when calculating L1 usage. This was fine before because we would always have a to_layout op at the end before returning, but now we could very likely return directly without any layout conversion.
    • I'll create a follow up issue for this to be properly fixed
  • Marked layout-forcing tests as XFail.

    • With this change it seems like the layout-forcing tests return incorrect results.
    • Thus marking these tests as XFail for now, I'll create a follow up issue for this to be properly fixed

Runtime

  • Added a workaround for runtime APIs to assume first device in the device mesh when sending tensors to device.
    • Currently there's no device attribute in TTNNLayoutAttr, and therefore runtime can't know which device the tensor belongs to. This workaround configures runtime to always assume the tensor belongs to the first device (device id 0) in the mesh.
    • The next task in line is to add the device attribute to TTNNLayoutAttr. Once that's done we can remove the workaround.

MLIR Tests

  • Updated file-checks to adapt to new IR (e.g. removed anything that checked ttnn.to_device, redundant ttnn.to_layout etc.)
  • Expanded simple_eltwise to individual files.
    • Using a large file made it hard to isolate errors. This also matches what we're doing in Dialect and Perf.
    • Allows more complex/diverse testing per-op.

TODOs Before Merging

  • Frontends need to add a runtime::toHost call before memcpying tensors.

    • This is because tensors are now returned in tile layout - runtime::toHost accepts an untilize flag that will untilize the tensor.
  • Update TODO comments once proper issues are created (optimizer, runtime workaround).
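
To illustrate why frontends can no longer memcpy the returned buffer directly, here is a simplified sketch of what untilizing means for a 2D tensor. It does not call runtime::toHost and makes no claim about its signature. It assumes tiles are stored one after another in row-major tile order with row-major elements inside each tile; the real device format may subdivide tiles further, which this sketch ignores. `untilize` is a hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Converts a tiled buffer back to row-major order.
// Assumptions (illustrative only): rows and cols are multiples of `tile`,
// tiles are laid out in row-major tile order, and each tile is row-major.
std::vector<float> untilize(const std::vector<float>& tiled, std::size_t rows,
                            std::size_t cols, std::size_t tile = 32) {
  assert(tile > 0 && rows % tile == 0 && cols % tile == 0);
  assert(tiled.size() == rows * cols);
  const std::size_t tilesPerRow = cols / tile;
  std::vector<float> rowMajor(rows * cols);
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      // Which tile the element lives in, and where inside that tile.
      const std::size_t tileIndex = (r / tile) * tilesPerRow + (c / tile);
      const std::size_t inTile = (r % tile) * tile + (c % tile);
      rowMajor[r * cols + c] = tiled[tileIndex * tile * tile + inTile];
    }
  }
  return rowMajor;
}
```

The element order in the tiled buffer differs from row-major order whenever the tensor spans more than one tile per row, so a plain memcpy would hand the frontend scrambled data.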

@jnie-TT jnie-TT force-pushed the jnie/dram_interleaved_tiled_default_rebased branch 3 times, most recently from 1a31acb to d4c5383 January 14, 2025 03:57
Contributor

@nsmithtt nsmithtt left a comment

Reviewed the compiler portion, will take a look at the runtime part later today!

lib/Conversion/TTIRToTTNN/TTIRToTTNN.cpp
currentL1Usage -= currentL1UsagePerOp[op].l1MemUsagePerUser;
currentL1UsagePerOp.erase(op);
}

Contributor

@fbajraktariTT, can you review this file?

Contributor

FYI @odjuricicTT, as @fbajraktariTT recently completed their internship.

Contributor

@jnie-TT I'm not sure that this extra logic is needed. Was a test failing without this temp fix? If so, can you provide more details?

Contributor Author

@odjuricicTT there's an assert below that checks if the currentL1Usage is 0. This error only surfaces with my changes - it's fine without my changes because we always untilize (to_layout) before returning. However it's possible now with my changes that we will return directly without any intermediate ops between the current op and the return op, and this causes issues because we wouldn't have zeroed out currentL1Usage.

Since this function doesn't decrement L1 usage on a return op, the assert fires and reports that the L1 usage is nonzero. My change adds a check: if the consumer op is a return op, we decrement the L1 usage.
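
The accounting problem described above can be reproduced with a toy model. This is only a sketch under simplifying assumptions: `ToyOp` and `l1UsageAfterSchedule` are hypothetical, and a buffer is released as a whole when its last user is scheduled, which is simpler than the per-user bookkeeping in the real optimizer code.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical toy op: a name, the L1 bytes of its output, its producers,
// and whether it is the function's return op.
struct ToyOp {
  std::string name;
  std::uint64_t outputL1Bytes;
  std::vector<std::string> operands;
  bool isReturn;
};

// Walks the schedule and returns the L1 usage left over at the end.
// With releaseOnReturn == false, a return op does not release its operands,
// which models the behavior before the workaround.
std::uint64_t l1UsageAfterSchedule(const std::vector<ToyOp>& schedule,
                                   bool releaseOnReturn) {
  std::unordered_map<std::string, int> remainingUsers;
  for (const ToyOp& op : schedule)
    for (const std::string& operand : op.operands)
      ++remainingUsers[operand];

  std::unordered_map<std::string, std::uint64_t> liveBytes;
  std::uint64_t currentL1Usage = 0;
  for (const ToyOp& op : schedule) {
    if (!op.isReturn || releaseOnReturn) {
      for (const std::string& operand : op.operands) {
        auto it = liveBytes.find(operand);
        if (it == liveBytes.end()) continue;  // function argument, not tracked
        if (--remainingUsers[operand] == 0) {
          currentL1Usage -= it->second;
          liveBytes.erase(it);
        }
      }
    }
    // Only track outputs that someone consumes later in the schedule.
    if (!op.isReturn && remainingUsers.count(op.name) > 0) {
      liveBytes[op.name] = op.outputL1Bytes;
      currentL1Usage += op.outputL1Bytes;
    }
  }
  return currentL1Usage;
}
```

With a schedule of `matmul` followed directly by `return`, the leftover usage is nonzero unless the return op releases its operand. Inserting an intermediate op between them (as the old to_layout did) brings the usage back to zero even without the check.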

Contributor

@jnie-TT Thanks! Your solution is fine for now. Just please file the followup issue and reference it in the comment.

opSchedule[func].erase(it);
opSchedule[func].insert(opSchedule[func].begin(), deviceOp);
}
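
For readers without the full diff, the effect of this workaround can be sketched on a plain vector. The excerpt above uses erase plus insert at the front; the sketch below swaps in `std::stable_partition`, which gives the same "device ops first, everything else in original order" result. `ScheduledOp` and `moveGetDeviceOpsToFront` are hypothetical names, not tt-mlir types:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a scheduled operation.
struct ScheduledOp {
  std::string kind;  // e.g. "ttnn.get_device", "ttnn.matmul"
  std::string name;  // SSA-like label, for illustration only
};

// Moves every get_device op to the front of the schedule while keeping the
// relative order of all other ops unchanged.
void moveGetDeviceOpsToFront(std::vector<ScheduledOp>& schedule) {
  std::stable_partition(
      schedule.begin(), schedule.end(),
      [](const ScheduledOp& op) { return op.kind == "ttnn.get_device"; });
}
```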

Contributor

@odjuricicTT, can you review this file?

Contributor

@jnie-TT The proper fix for this would be to add it here:
https://github.com/tenstorrent/tt-mlir/blob/main/lib/Dialect/TTNN/Analysis/DFShardingPolicy.cpp#L37

Try changing the if to check for GetDeviceOp as well as ToLayoutOp.

lib/Dialect/TTNN/Transforms/TTNNLayout.cpp

// TTNN Reshape does not support implicit tilization/untilization
// Therefore input output layouts should be the same
if (mlir::isa<ttir::ReshapeOp>(operation) && operandNumber == 1) {
Contributor

I feel like we should have attributes on the op that denote these kinds of capabilities instead of special-casing this code for a specific op. @sdjordjevicTT thoughts?

Contributor

Perhaps we should add an interface to all TTNN ops called shouldTilize that defaults to true and that ops can specialize.
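
A rough sketch of what such a per-op hook could look like, using a plain virtual base class as a stand-in. In MLIR this would more likely be an op interface or trait, and every name here (`TilizeCapability`, `MatmulLike`, `SliceLike`, `EmbeddingBackwardLike`, `chooseLayout`) is hypothetical. The operand rule for the embedding-backward case mirrors the `operandNumber < 2` condition in the special-case code under review.

```cpp
#include <string>

// Hypothetical capability hook: defaults to "tilize every operand".
class TilizeCapability {
 public:
  virtual ~TilizeCapability() = default;
  virtual std::string name() const = 0;
  virtual bool shouldTilize(unsigned /*operandNumber*/) const { return true; }
};

// Uses the default: all operands may be tiled.
class MatmulLike final : public TilizeCapability {
 public:
  std::string name() const override { return "matmul"; }
};

// Specializes the hook: no operand should be tilized.
class SliceLike final : public TilizeCapability {
 public:
  std::string name() const override { return "slice"; }
  bool shouldTilize(unsigned /*operandNumber*/) const override { return false; }
};

// Specializes per operand: only the first two operands stay row-major.
class EmbeddingBackwardLike final : public TilizeCapability {
 public:
  std::string name() const override { return "embedding_backward"; }
  bool shouldTilize(unsigned operandNumber) const override {
    return operandNumber >= 2;
  }
};

// The layout pass would query the op instead of checking its concrete type.
std::string chooseLayout(const TilizeCapability& op, unsigned operandNumber) {
  return op.shouldTilize(operandNumber) ? "tile" : "row_major";
}
```

The point of the design is that the layout pass no longer needs an `isa<...>` chain; adding a new row-major-only op means overriding one method on that op.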

Contributor Author

Yeah that would be awesome to have, I know a lot of eltwise ops are facing a similar issue regarding data type, where some ops can typecast implicitly whereas some ops cannot. This results in the IR being misaligned with the actual runtime output.

Contributor

I am thinking about these scenarios. Do we have some examples?

Contributor

Aren't these examples? i.e. reshape, conv2d, slice & embedding, or do you mean something else?

Contributor Author

@jnie-TT jnie-TT Jan 15, 2025

@sdjordjevicTT if you mean the implicit typecast ops, an example would be relational binary ops vs. unary ops.
Relational operations take an output_dtype that we set to typecast implicitly within the op:

template <BinaryOpType binary_op_type>
struct RelationalBinary {
    static Tensor invoke(
        uint8_t queue_id,
        const Tensor &input_tensor_a_arg,
        const Tensor &input_tensor_b_arg,
        const std::optional<const DataType> &output_dtype = std::nullopt,
        const std::optional<MemoryConfig> &memory_config = std::nullopt,
        std::optional<Tensor> optional_output_tensor = std::nullopt,
        std::optional<unary::FusedActivations> activations = std::nullopt,
        std::optional<unary::UnaryWithParam> input_tensor_a_activation = std::nullopt);

However unary ops do not:

template <UnaryOpType... unary_op_types>
Tensor ExecuteUnary<unary_op_types...>::invoke(
    const Tensor& input_tensor,
    const std::optional<MemoryConfig>& memory_config,
    const std::optional<Tensor>& optional_output_tensor) {

And our compiler doesn't distinguish between them, i.e. for unary ops it still assumes the output tensor of the unary op has been typecast to the desired data type.

As for ops that don't support implicit tilization/untilization, some examples include reshape, concat, transpose.

Contributor

I believe there was a misunderstanding between us. :)

Regarding Conv, Slice, and Embedding, I'm aware that they require some inputs to be in a row-major layout. I'll address this by implementing the necessary layout workarounds. If the Metal developers decide not to support tile layout for them, then we can introduce a trait/interface to accommodate them.

Regarding the implicit conversions, I get it for the data_type, but how are we specifying whether the output is in tile or row-major layout? By defining the optional_output_tensor? I see what the issue can be: if you have a row-major input, you want to keep a row-major output for such ops. We can think about adding an interface at the op level to support this.

I created issues on myself to follow up on this:

if (mlir::isa<ttir::Conv2dOp>(operation) ||
mlir::isa<ttir::SliceOp>(operation) ||
(mlir::isa<ttir::EmbeddingBackwardOp>(operation) &&
operandNumber < 2)) {
Contributor

Same as above.

Contributor

This will be cleaned up with the workarounds. We have tasks to clean up each of these.

runtime/include/tt/runtime/detail/workarounds.h
@jnie-TT jnie-TT force-pushed the jnie/dram_interleaved_tiled_default_rebased branch from d4c5383 to 676c714 January 14, 2025 19:59

@jnie-TT jnie-TT force-pushed the jnie/dram_interleaved_tiled_default_rebased branch from 676c714 to a9a8eff January 15, 2025 22:09
Contributor

@odjuricicTT odjuricicTT left a comment

A few comments on Optimizer related changes, but looks good overall.

Requesting changes until optimizer layout overrides are fixed. I'll help with this.

@@ -4,12 +4,11 @@
// CHECK-DAG: #[[LOC_MATMUL_IN1:.*]] = loc("matmul_1_in_1_layout"(#loc3))
// CHECK-DAG: #[[LOC_MATMUL:.*]] = loc("matmul_1"(#loc3))
// CHECK-DAG: #[[IN_1_LAYOUT:.*]] = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<1x12x!tt.tile<32x32, bf16>, #l1_>, <interleaved>>

// XFAIL: *
Contributor

Unfortunately, tt-explorer and some forge-fe tests depend on layout overrides working so this cannot be tackled in a followup PR.

@jnie-TT Do you have more context on why this is not working? I will also take a deeper look later today.

Contributor Author

@odjuricicTT I haven't looked deep into it. I suspect there may be some assumptions about the initial tensor location/layout in the optimizer that aren't valid anymore? With this change basically all initial tensors will be in dram in tile layout.

Contributor

@jnie-TT Here is the fix for one of the tests marked as XFAIL: 0bf4366

The other one can stay XFAIL for now, just file a followup issue.

// API can determine the correct devices. Enabling this workaround will assume
// that a device tensor will reside in the L1/Dram of the first device (device
// id 0) of the device grid. This should be removed once we add the device
// grid information to the tensorDesc.
Contributor

So there is a strategy field on LayoutDesc that is set to ::tt::target::DistributedTensorConfig::NONE for a single-chip setup, or to some kind of multi-device distribution otherwise. LMK if this doesn't resolve the issue.

table LayoutDesc {
  stride: [int];
  oob_val: OOBVal;
  core_range_set: [Dim2dRange];
  memory_desc: MemoryDesc;
  strategy: DistributionStrategy;
}

Contributor

Reach out to @wooseokTT if you need help interpreting its programming.

Contributor Author

@jnie-TT jnie-TT Jan 16, 2025

@nsmithtt the strategy doesn't tell us which submesh a tensor belongs to though right? I remember that when I added it, it was used to specify the tensor distribution method across multi device (replicate, shard etc.).

I can use it to distinguish between single/multichip, but I don't know the mesh shape or mesh offset that the tensor is mapped to if it's multidevice. And I need this info if I want to move a tensor to a multidevice mesh in the toLayout API.

Contributor

It depends on the strategy, but ShardTensor2D.shard_mesh, for example, does tell you the mesh shape. I think the offset is always implicitly [0, 0] (@wooseokTT feel free to correct me if I'm wrong); this reflects the TTNN API, which doesn't support arbitrary mesh offsets.

Contributor Author

@jnie-TT jnie-TT Jan 16, 2025

Yeah you're right ShardTensor2D has the shard_mesh. But seems like the other ones don't... If all we're using is ShardTensor2D and offset 0, 0 then I guess I can just derive it from that. And maybe add an assert that checks the strategy must be ShardTensor2D. Does doing it this way make sense with how we're performing multichip operations?
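
A minimal sketch of that derivation, to make the proposal concrete. The types below are hypothetical stand-ins modeled with `std::variant`; they are not the flatbuffer schema or the runtime API. The only facts taken from this thread are that a single-chip setup has no distribution, that ShardTensor2D carries a mesh shape, and that the offset is assumed to be [0, 0].

```cpp
#include <cstdint>
#include <stdexcept>
#include <variant>

// Hypothetical stand-ins for the distribution strategies discussed above.
struct NoDistribution {};
struct ReplicateTensor {
  std::uint32_t replicationFactor;
};
struct ShardTensor2D {
  std::uint32_t meshRows;
  std::uint32_t meshCols;
};
using DistributionStrategy =
    std::variant<NoDistribution, ReplicateTensor, ShardTensor2D>;

// Where a tensor lives in the device mesh.
struct MeshPlacement {
  std::uint32_t rows;
  std::uint32_t cols;
  std::uint32_t offsetRow;
  std::uint32_t offsetCol;
};

// Derives the placement from the strategy alone. Assumption: offset is
// always [0, 0]; strategies without a mesh shape are rejected up front.
MeshPlacement derivePlacement(const DistributionStrategy& strategy) {
  if (std::holds_alternative<NoDistribution>(strategy)) {
    return {1, 1, 0, 0};  // single device
  }
  if (const auto* shard = std::get_if<ShardTensor2D>(&strategy)) {
    return {shard->meshRows, shard->meshCols, 0, 0};
  }
  throw std::invalid_argument(
      "mesh shape can only be derived from ShardTensor2D in this sketch");
}
```

The throw plays the role of the assert mentioned above: any strategy that does not carry a mesh shape fails loudly instead of silently defaulting to device 0.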

Contributor

@pilkicTT pilkicTT left a comment

@jnie-TT I'm preparing for this change in tt-forge-fe, so I've tested this branch. The only issue I've observed is not directly related to your change, but has been exposed by it.

For tilized tensors we can have cases when FEs will get wrong stride information (when trying to allocate buffers for them).

The problem occurs when serializing the layout into the flatbuffer. It's due to the fact that we have tied stride calculation to the layout attribute, but as currently implemented, the same layout attribute can produce different strides (depending on the logical shape of the tensor). So we end up with a problem when serializing into the flatbuffer because of the way the caching mechanism works there (all tensors with the same layout attribute are serialized exactly the same).

For example, all tensors with layout #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<1x1x!tt.tile<32x32, bf16>, #dram>, <interleaved>> will end up having the same strides, even though they can have different logical shapes.

Note:
I am not sure if getting stride for tilized tensors even makes sense, so we might want to introduce a different mechanism.
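
The caching hazard can be shown with a small model. This is a sketch, not the flatbuffer serializer: the cache classes are hypothetical, and strides are assumed to be plain row-major strides computed from the logical shape. Two tensors that share one layout attribute but have different logical shapes get the same cached stride when the cache is keyed on the layout alone.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Shape = std::vector<std::int64_t>;
using Stride = std::vector<std::int64_t>;

// Row-major strides for a logical shape (assumed stride definition).
Stride rowMajorStride(const Shape& shape) {
  Stride stride(shape.size(), 1);
  for (std::size_t i = shape.size(); i-- > 1;) {
    stride[i - 1] = stride[i] * shape[i];
  }
  return stride;
}

// Models the problematic cache: the key ignores the logical shape.
class LayoutOnlyStrideCache {
 public:
  Stride get(const std::string& layoutAttr, const Shape& shape) {
    auto it = cache_.find(layoutAttr);
    if (it == cache_.end()) {
      it = cache_.emplace(layoutAttr, rowMajorStride(shape)).first;
    }
    return it->second;  // may belong to a different shape
  }

 private:
  std::map<std::string, Stride> cache_;
};

// One possible remedy: make the logical shape part of the cache key.
class LayoutAndShapeStrideCache {
 public:
  Stride get(const std::string& layoutAttr, const Shape& shape) {
    auto key = std::make_pair(layoutAttr, shape);
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      it = cache_.emplace(key, rowMajorStride(shape)).first;
    }
    return it->second;
  }

 private:
  std::map<std::pair<std::string, Shape>, Stride> cache_;
};
```

With shapes 32x32 and 16x8, both of which fit in a single 32x32 tile, the layout-only cache returns the 32x32 stride for both tensors, while the shape-aware cache returns a distinct stride for each.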

@nsmithtt
Contributor

The problem occurs when serializing the layout into the flatbuffer. It's due to the fact that we have tied stride calculation to the layout attribute, but as currently implemented, the same layout attribute can produce different strides (depending on the logical shape of the tensor).

Yikes!! That's a good catch. It seems our options are:

  • Copy tensor shape to layout in the IR to properly cache, kinda nasty to have it in multiple places
  • Symbolically represent stride in the LayoutDesc, runtime now has to calculate it
  • Move stride to the tensor and disable caching for flatbuffer tensor objects.

Open to additional ideas/alternatives

5 participants