You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Describe the bug
The untilize with unpadding operation parallelizes the tensors along the height. Wide tensors are mapped to few cores only which affects performance
To Reproduce
Profile any wide tensor with untilize with unpadding operation and check the number of cores
Expected behavior
Using more cores for wide tensors
The text was updated successfully, but these errors were encountered:
…adding (#17538)
### Ticket
Link to Github Issue
#17537
### Problem description
Currently, the untilize with unpadding implementation supports
parallelization only along the height dimension. This affects perf for
wide tensors, as they are mapped to a limited number of cores.
### What's changed
In this PR, we introduce support for parallelizing the untiling
operation along the width dimension, similar to tilize with padding. The
operation executes the parallelization over the dimension with the
larger number of tiles.
In future versions:
- we want the operation to support the parallelization along both
dimensions simultaneously
- we want the compute kernel to support the processing of an entire
column block at once instead of one tile at a time
For the tests added in test_to_layout.py, the kernel duration of the
previous implementation is around 1.8 to 24.8 times larger than the
current implementation
### Checklist
- [x] Post commit CI passes
https://github.com/tenstorrent/tt-metal/actions/runs/13121055787
- [ ] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new
models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
tests passes
- [ ] New/Existing tests provide coverage for changes
Describe the bug
The untilize with unpadding operation parallelizes the tensors along the height. Wide tensors are mapped to few cores only which affects performance
To Reproduce
Profile any wide tensor with untilize with unpadding operation and check the number of cores
Expected behavior
Using more cores for wide tensors
The text was updated successfully, but these errors were encountered: