Restructure pre-packed matrix layouts so generic GEMM code can be agnostic of panel layout #511

robertknight · 2025-01-04T09:52:42Z

#510 noted that the layout of pre-packed matrices required the generic GEMM code to know about the layout of data within a packed panel. This PR revises the layout so that is no longer the case. See the last commit for details. After this change it will be possible to use a different panel layout for each kernel. For example, int8 kernels will use dot product instructions, which require a different block layout than f32 kernels using FMA instructions.
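To illustrate why the layouts differ, here is a minimal sketch of how a block might be packed for a kernel built on 4-way int8 dot product instructions. The function name `pack_i8_dot4` and the exact ordering are assumptions made for illustration, not the layout used in this repository. The idea is that each dot product lane consumes four consecutive depth (K) values for one column, so those four values need to be adjacent in memory, whereas an FMA-based f32 kernel would typically read one full row of columns per depth step.

```rust
/// Hypothetical packing routine for an int8 kernel that uses 4-way dot
/// product instructions. This is an illustrative assumption, not the
/// repository's actual layout.
///
/// `b` is a row-major `kc x nr` block (depth x columns). The output stores,
/// for each group of 4 depth steps, the 4 depth values of column 0, then the
/// 4 depth values of column 1, and so on.
fn pack_i8_dot4(b: &[i8], kc: usize, nr: usize) -> Vec<i8> {
    assert_eq!(b.len(), kc * nr, "input must be a kc x nr block");
    assert_eq!(kc % 4, 0, "this sketch assumes kc is a multiple of 4");

    let mut out = Vec::with_capacity(kc * nr);
    for k0 in (0..kc).step_by(4) {
        for col in 0..nr {
            // Gather 4 consecutive depth values for this column so they
            // occupy one 32-bit lane.
            for k in k0..k0 + 4 {
                out.push(b[k * nr + col]);
            }
        }
    }
    out
}
```

An f32 FMA kernel would have no use for this grouping, which is why the generic code should not assume anything about the ordering of elements inside a panel.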

Along the way I also added some comments about where the cache and register blocking sizes come from. These are explained in the referenced paper, but it is useful to have them more immediately accessible.

Divide pre-packed matrices into depth blocks with a size that matches the depth block size (`kc`) used during computation. This allows the generic GEMM code to be agnostic of the layout of panels within a block, as it no longer needs to slice panels along the depth dimension. Instead it uses a depth block index to pick a panel with the pre-determined block size. This in turn gives the GEMM kernel freedom to choose the layout within each panel. This partly undoes #482.
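As a rough sketch of the indexing this enables, the generic code only needs a depth block index and a panel index to locate a panel, and never looks inside it. The `PackedLayout` type, its fields, the ordering of panels within the buffer, and the padding of the final depth block to a full `kc` are all assumptions made for illustration rather than the actual types in this module.

```rust
/// Hypothetical description of a pre-packed matrix that is split into
/// depth blocks of size `kc`. Names and ordering are illustrative only.
struct PackedLayout {
    /// Depth block size used during computation.
    kc: usize,
    /// Total size of the depth (K) dimension.
    depth: usize,
    /// Number of panels in each depth block.
    n_panels: usize,
    /// Size of a panel along its non-depth dimension (e.g. `nr`).
    panel_width: usize,
}

impl PackedLayout {
    /// Number of depth blocks, rounding up for a partial final block.
    fn depth_blocks(&self) -> usize {
        (self.depth + self.kc - 1) / self.kc
    }

    /// Elements per panel. Assumes the final depth block is padded to `kc`,
    /// so every panel has the same size.
    fn panel_len(&self) -> usize {
        self.kc * self.panel_width
    }

    /// Offset of a panel, chosen by depth block index and panel index.
    /// No knowledge of the element order inside the panel is needed.
    fn panel_offset(&self, depth_block: usize, panel: usize) -> usize {
        assert!(depth_block < self.depth_blocks());
        assert!(panel < self.n_panels);
        (depth_block * self.n_panels + panel) * self.panel_len()
    }

    /// Return the opaque panel data that is handed to the kernel.
    fn panel<'a>(&self, data: &'a [f32], depth_block: usize, panel: usize) -> &'a [f32] {
        let start = self.panel_offset(depth_block, panel);
        &data[start..start + self.panel_len()]
    }
}
```

Because the slice returned by `panel` is treated as opaque, the kernel that packed the data is the only code that needs to agree with itself about the order of elements inside it.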

robertknight changed the title ~~Restructure prepacked matrix layouts so generic GEMM code can be agnostic of panel layout~~ Restructure pre-packed matrix layouts so generic GEMM code can be agnostic of panel layout Jan 4, 2025

robertknight added 3 commits January 4, 2025 10:02

Add notes about where each block and tile size value comes from

a6c39c5

These are documented in the papers referenced in other comments in this module, but add comments inline that are more immediately accessible.

Remove obsolete MAX_TILE_ELEMENTS constant

1168588

Each kernel now manages a temporary tile of an appropriate size, so the generic GEMM outer loops don't need to know the maximum size.

robertknight force-pushed the gemm-depth-block branch from 640017b to 6ec1b1b Compare January 4, 2025 10:03

robertknight merged commit c8c1bc0 into main Jan 4, 2025
2 checks passed

robertknight deleted the gemm-depth-block branch January 4, 2025 10:07
