Lowering matmul_transpose_a with pack-peel-4-level-tiling #1036

Open
newling opened this issue Jan 16, 2025 · 0 comments

Can we use another DMA dimension in L1?

The symptom is that, before lower-to-aie, a single connection is used by 2 copies:

controlcode { 
...
%19 = amdaie.npu.circular_dma_cpy_nd %connection_B_10([0, 0, 0] [64, 8, 4] [4, 256, 1], [0, 0] [64, 32] [32, 1])  
...
%23 = amdaie.npu.circular_dma_cpy_nd %connection_B_10([0, 0, 0] [64, 8, 4] [4, 256, 1], [1, 0, 0] [1, 64, 32] [2048, 32, 1]) 
...
}

which is not allowed (each copy must have its own connection). But why are the 2 copies above not merged in the preceding iree-amdaie-dma-composition pass? Indeed, they could be combined into

%21 = amdaie.npu.circular_dma_cpy_nd %connection_A_21([0, 0, 0, 0] [2, 64, 8, 4] [0, 4, 256, 1], [0, 0] [128, 32] [32, 1]) 

but they are not, because the maximum number of dimensions reported here for the target side is 3, while the merged copy needs 4 target dimensions (the combined op above was produced by relaxing the maximum number of allowed dimensions).
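To make the merge concrete, here is a minimal Python sketch that enumerates the element addresses each access pattern touches. It assumes each bracket triple reads as [offsets] [sizes] [strides], with the target pattern first and the source pattern second; the helper name `addresses` is hypothetical and not part of iree-amd-aie.

```python
from itertools import product


def addresses(offsets, sizes, strides):
    """List the linear element addresses of a strided access pattern, in iteration order.

    Hypothetical helper for illustration only; assumes row-major iteration
    with the last dimension varying fastest.
    """
    base = sum(o * s for o, s in zip(offsets, strides))
    return [
        base + sum(i * s for i, s in zip(idx, strides))
        for idx in product(*(range(n) for n in sizes))
    ]


# Source side of the two copies (%19 and %23) and of the combined copy.
src_first = addresses([0, 0], [64, 32], [32, 1])
src_second = addresses([1, 0, 0], [1, 64, 32], [2048, 32, 1])
src_merged = addresses([0, 0], [128, 32], [32, 1])

# Target side: both copies write the same 3-d pattern; the combined copy
# repeats it twice through an extra outer dimension with stride 0.
tgt_single = addresses([0, 0, 0], [64, 8, 4], [4, 256, 1])
tgt_merged = addresses([0, 0, 0, 0], [2, 64, 8, 4], [0, 4, 256, 1])

# The combined patterns visit exactly the addresses of copy 1 followed by copy 2.
print(src_merged == src_first + src_second)
print(tgt_merged == tgt_single + tgt_single)
```

Under these assumptions the source side stays 2-dimensional after merging, while the target side needs the extra stride-0 repeat dimension, which is what pushes it from 3 to 4 dimensions.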

This copy/connection goes from L2 to L1. Are there really only 3 DMA dimensions available in L1? I'm a bit confused about this, specifically about the availability of the 'inter' dimensions. It seems there is 1 'inter' dim at all levels of the hierarchy (see here), but it is not usable in all situations (see here).

Ideally I would be able to use one more dimension for this use case. Doing so does seem to work (it gives a numerically correct result).

Alternative to increasing the number of dimensions available:

If we can only use 3 dims, I'm fairly confident that the DMA copies from L3 -> L2 -> L1 are using more permutations than needed, and that the packing can be 'linearized', which would mean we don't need as many DMA dimensions.
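As a rough illustration of why a more linear layout needs fewer dimensions, the sketch below folds adjacent dimensions whenever the outer stride equals the inner size times the inner stride. The helper `fold_contiguous_dims` is hypothetical and is not the logic used by iree-amdaie-dma-composition; it only shows the general principle on the patterns quoted above.

```python
def fold_contiguous_dims(sizes, strides):
    """Merge adjacent dimensions that together describe one contiguous run.

    Hypothetical helper for illustration only. An outer dim (s0, st0) and an
    inner dim (s1, st1) fold into (s0 * s1, st1) when st0 == s1 * st1.
    """
    folded_sizes, folded_strides = [], []
    for size, stride in zip(sizes, strides):
        if folded_sizes and folded_strides[-1] == size * stride:
            # The previous (outer) dim steps exactly over this inner dim: fold.
            folded_sizes[-1] *= size
            folded_strides[-1] = stride
        else:
            folded_sizes.append(size)
            folded_strides.append(stride)
    return folded_sizes, folded_strides


# A linear (unpermuted) pattern collapses to a single dimension.
print(fold_contiguous_dims([64, 32], [32, 1]))

# The permuted target pattern from the copies above cannot be folded at all.
print(fold_contiguous_dims([64, 8, 4], [4, 256, 1]))
```

The permuted pattern keeps all 3 of its dimensions, so any additional repeat dimension exceeds the limit, whereas a linearized pattern would leave dimensions to spare.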
