Untilize with unpadding only supports parallelization over the height #17537

Open

nardoTT opened this issue Feb 4, 2025 · 0 comments
nardoTT commented Feb 4, 2025

Describe the bug
The untilize with unpadding operation parallelizes work only along the tensor height. Wide tensors are therefore mapped to only a few cores, which hurts performance.

To Reproduce
Profile the untilize with unpadding operation on any wide tensor and check the number of cores it uses.
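
For reference, a minimal reproduction sketch, assuming a standard ttnn setup (the shape and device setup are illustrative; converting a tile-padded tensor back to row-major is what dispatches untilize with unpadding):

```python
# Sketch only: the tensor shape and device id are illustrative assumptions.
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# A wide tensor: one tile row, many tile columns, and a width that is not a
# multiple of 32, so the row-major conversion has to drop the tile padding.
torch_input = torch.randn(1, 1, 32, 8190, dtype=torch.bfloat16)

tt_input = ttnn.from_torch(
    torch_input, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device
)

# The untilize-with-unpadding op runs inside this call; profile it with the
# device profiler and check how many cores it is dispatched to.
tt_output = ttnn.to_layout(tt_input, ttnn.ROW_MAJOR_LAYOUT)
torch_output = ttnn.to_torch(tt_output)

ttnn.close_device(device)
```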

Expected behavior
Wide tensors should be spread across more cores.

@nardoTT nardoTT added the bug Something isn't working label Feb 4, 2025
@nardoTT nardoTT self-assigned this Feb 4, 2025
nardoTT added a commit that referenced this issue Feb 5, 2025
…adding (#17538)

### Ticket
Link to GitHub Issue: #17537

### Problem description
Currently, the untilize with unpadding implementation supports parallelization only along the height dimension. This hurts performance for wide tensors, as they are mapped to a limited number of cores.

### What's changed
In this PR, we introduce support for parallelizing the untilize operation along the width dimension, similar to tilize with padding. The operation parallelizes over whichever dimension has the larger number of tiles.
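
As a rough illustration of this dimension-selection heuristic (placeholder names, not the actual tt-metal implementation):

```python
# Illustrative sketch: TILE_DIM and the function name are placeholders.
TILE_DIM = 32  # tiles are 32x32

def choose_parallelization_dim(height: int, width: int) -> str:
    """Parallelize over whichever dimension has more tiles."""
    tiles_h = (height + TILE_DIM - 1) // TILE_DIM
    tiles_w = (width + TILE_DIM - 1) // TILE_DIM
    return "height" if tiles_h >= tiles_w else "width"

# A wide tensor of shape (32, 8190) has 1 tile row and 256 tile columns,
# so its work is now split across cores along the width.
print(choose_parallelization_dim(32, 8190))  # -> "width"
```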
In future versions:
- we want the operation to support parallelization along both dimensions simultaneously
- we want the compute kernel to process an entire column block at once instead of one tile at a time

For the tests added in test_to_layout.py, the kernel duration of the previous implementation is around 1.8x to 24.8x that of the new implementation.


### Checklist
- [x] Post commit CI passes
https://github.com/tenstorrent/tt-metal/actions/runs/13121055787
- [ ] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) tests pass
- [ ] New/Existing tests provide coverage for changes