-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use tilized dram-interleaved as default input-output layout #1744
base: main
Are you sure you want to change the base?
Conversation
1a31acb
to
d4c5383
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed the compiler portion, will take a look at the runtime part later today!
currentL1Usage -= currentL1UsagePerOp[op].l1MemUsagePerUser; | ||
currentL1UsagePerOp.erase(op); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fbajraktariTT, can you review this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @odjuricicTT, as @fbajraktariTT completed internship recently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jnie-TT I'm not sure that this extra logic is needed. Was a test failing without this temp fix? If so, can you provide more details?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@odjuricicTT there's an assert below that checks if the currentL1Usage
is 0. This error only surfaces with my changes - it's fine without my changes because we always untilize (to_layout) before returning. However it's possible now with my changes that we will return directly without any intermediate ops between the current op and the return op, and this causes issues because we wouldn't have zeroed out currentL1Usage
.
Since this function doesn't decrement l1 usage on return op, the assert will fire and say that the l1 usage is non 0. My change basically adds a check that if the consumer op is a return op, we decrement the l1 usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jnie-TT Thanks! Your solution is fine for now. Just please file the followup issue and reference it in the comment.
opSchedule[func].erase(it); | ||
opSchedule[func].insert(opSchedule[func].begin(), deviceOp); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@odjuricicTT, can you review this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jnie-TT The proper fix for this would be to add it here:
https://github.com/tenstorrent/tt-mlir/blob/main/lib/Dialect/TTNN/Analysis/DFShardingPolicy.cpp#L37
Try changing the if to check for GetDeviceOp as well as ToLayoutOp.
|
||
// TTNN Reshape does not support implicit tilization/untilization | ||
// Therefore input output layouts should be the same | ||
if (mlir::isa<ttir::ReshapeOp>(operation) && operandNumber == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we should have attributes on the op that denote these kind of capabilities instead of having this code be special cased for a specific op. @sdjordjevicTT thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should add an interface to all TTNN ops called shouldTilize
that defaults to true
and that ops can specialize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that would be awesome to have, I know a lot of eltwise ops are facing a similar issue regarding data type, where some ops can typecast implicitly whereas some ops cannot. This results in the IR being misaligned with the actual runtime output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking about this scenarios, do we have some examples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't these examples? i.e. reshape, conv2d, slice & embedding, or do you mean something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sdjordjevicTT if you mean the implicit typecast ops an example would be relational binary ops vs unary ops .
Relational operations take an output_dtype that we setting to typecast implicitly within the op:
template <BinaryOpType binary_op_type>
struct RelationalBinary {
static Tensor invoke(
uint8_t queue_id,
const Tensor &input_tensor_a_arg,
const Tensor &input_tensor_b_arg,
const std::optional<const DataType> &output_dtype = std::nullopt,
const std::optional<MemoryConfig> &memory_config = std::nullopt,
std::optional<Tensor> optional_output_tensor = std::nullopt,
std::optional<unary::FusedActivations> activations = std::nullopt,
std::optional<unary::UnaryWithParam> input_tensor_a_activation = std::nullopt);
However unary ops do not:
template <UnaryOpType... unary_op_types>
Tensor ExecuteUnary<unary_op_types...>::invoke(
const Tensor& input_tensor,
const std::optional<MemoryConfig>& memory_config,
const std::optional<Tensor>& optional_output_tensor) {
And our compiler doesn't distinguish between them, i.e. for unary ops it'll still assume the output tensor of the unary op is properly typecasted to the desired data type.
As for ops that don't support implicit tilization/untilization, some examples include reshape, concat, transpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there was a misunderstanding between us. :)
Regarding Conv, Slice, and Embedding, I'm aware that they require some inputs to be in a row-major layout. I'll address this by implementing the necessary layout workarounds. If the Metal developers decide not to support tile layout for them, then we can introduce a trait\interface to accommodate them.
Regarding the implicit conversions, I get it for the data_type, but how we are specifying whether the output is in tile\row major? By defining the optional_output_tensor? I see what can be the issue, if you have some row-major input, you want to keep it row-major output for such ops. We can think about adding the interface on an op level to support this.
I created issues on myself to follow up on this:
if (mlir::isa<ttir::Conv2dOp>(operation) || | ||
mlir::isa<ttir::SliceOp>(operation) || | ||
(mlir::isa<ttir::EmbeddingBackwardOp>(operation) && | ||
operandNumber < 2)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be cleaned up with the workarounds. We have tasks for each of these to cleanup.
d4c5383
to
676c714
Compare
if (mlir::isa<ttir::Conv2dOp>(operation) || | ||
mlir::isa<ttir::SliceOp>(operation) || | ||
(mlir::isa<ttir::EmbeddingBackwardOp>(operation) && | ||
operandNumber < 2)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be cleaned up with the workarounds. We have tasks for each of these to cleanup.
|
||
// TTNN Reshape does not support implicit tilization/untilization | ||
// Therefore input output layouts should be the same | ||
if (mlir::isa<ttir::ReshapeOp>(operation) && operandNumber == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking about this scenarios, do we have some examples?
676c714
to
a9a8eff
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments on Optimizer related changes, but looks good overall.
Requesting changes until optimizer layout overrides are fixed. I'll help with this.
@@ -4,12 +4,11 @@ | |||
// CHECK-DAG: #[[LOC_MATMUL_IN1:.*]] = loc("matmul_1_in_1_layout"(#loc3)) | |||
// CHECK-DAG: #[[LOC_MATMUL:.*]] = loc("matmul_1"(#loc3)) | |||
// CHECK-DAG: #[[IN_1_LAYOUT:.*]] = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<1x12x!tt.tile<32x32, bf16>, #l1_>, <interleaved>> | |||
|
|||
// XFAIL: * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, tt-explorer and some forge-fe tests depend on layout overrides working so this cannot be tackled in a followup PR.
@jnie-TT Do you have more context on why this is not working? I will also take a deeper look later today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@odjuricicTT I haven't looked deep into it. I suspect there may be some assumptions about the initial tensor location/layout in the optimizer that aren't valid anymore? With this change basically all initial tensors will be in dram in tile layout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// API can determine the correct devices. Enabling this workaround will assume | ||
// that a device tensor will reside in the L1/Dram of the first device (device | ||
// id 0) of the device grid. This should be removed once we add the device | ||
// grid information to the tensorDesc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there is strategy
on LayoutDesc
that will be set to ::tt::target::DistributedTensorConfig::NONE
for single chip setup. Or it will be set to some kind of multi-device distribution if set. LMK if this doesn't resolve this issue.
table LayoutDesc {
stride: [int];
oob_val: OOBVal;
core_range_set: [Dim2dRange];
memory_desc: MemoryDesc;
strategy: DistributionStrategy;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reach out to @wooseokTT if you need help/interpreting its programming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsmithtt the strategy doesn't tell us which submesh a tensor belongs to though right? I remember that when I added it, it was used to specify the tensor distribution method across multi device (replicate, shard etc.).
I can use it to distinguish between single/multichip, but I don't know the mesh shape or mesh offset that the tensor is mapped to if it's multidevice. And I need this info if I want to move a tensor to a multidevice mesh in the toLayout API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on the strategy, but for e.g. ShardTensor2D.shard_mesh
does tell you the mesh shape. I think it's always inferred that the offset is implicitly [0, 0]
, @wooseokTT feel free to correct me if I'm wrong, but this is reflected from TTNN API which doesn't support arbitrary mesh offsets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah you're right ShardTensor2D has the shard_mesh. But seems like the other ones don't... If all we're using is ShardTensor2D and offset 0, 0 then I guess I can just derive it from that. And maybe add an assert that checks the strategy must be ShardTensor2D. Does doing it this way make sense with how we're performing multichip operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jnie-TT I'm preparing for this change in tt-forge-fe
, so i've tested this branch. The only issue i've observed is not related directly to your change, but it has been exposed by it.
For tilized tensors we can have cases when FEs will get wrong stride information (when trying to allocate buffers for them).
The problem occurs when serializing layout into the flatbuffer. And its due to the fact that we have tied stride calculation to the layout attribute, but as it is currently implemented, same layout attributes can produce different strides (depending on the logical shape of the tensor). So, we end up with a problem when serializing into the flatbuffer due to the way the caching mechanism works there (all tensors with the same layout attribute will be serialized exactly the same).
For e.g., all tensors with layout: #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<1x1x!tt.tile<32x32, bf16>, #dram>, <interleaved>>
will end up having same strides. Even though they can have different logical shapes.
Note:
I am not sure if getting stride for tilized tensors even makes sense, so we might want to introduce a different mechanism.
Yikes!! That's a good catch. It seems our options are:
Open to additional ideas/alternatives |
Description
Part of the runtime stitching effort #1743.
This PR updates the default input/output layout to tiled dram-interleaved from system memory row-major.
Combined the runtime stitching APIs, this enables the user to pre-tilize and interleave tensors (such as weights) and reuse them over multiple programs, eliminating ping-ponging between host/dram, row-major/tile
IR Example
TTNN IR of simple_matmul test on main:
TTNN IR of simple_matmul test after this change:
Changes
TTNNLayout
TTIRToTTNN
Optimizer
Added a workaround that moves GetDeviceOps to the front of the op schedule.
Added a workaround that checks for ReturnOps in L1 usage calculation
Marked layout-forcing tests as XFail.
Runtime
MLIR Tests
TODOs Before Merging
Frontends need to add a
runtime::toHost
call before memcpying tensors.runtime::toHost
accepts anuntilize
flag that will untilize the tensor.Update TODO comments once proper issues are created (optimizer, runtime workaround).