From e63cf1b3745204f5d33928c6bb8e77c166b90bb3 Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Wed, 15 Jan 2025 00:12:59 +0800
Subject: [PATCH] [bench] update perf plots for qkv swizzle (#40)

---
 README.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index f271f71..7edee51 100644
--- a/README.md
+++ b/README.md
@@ -13,17 +13,18 @@
 🤖[WIP] **FFPA**: Yet antother **Faster Flash Prefill Attention** with **O(1) SRAM complexity** & **O(d/4) or O(1) register complexity** for large headdim (D > 256), almost **1.5x~2x** 🎉 faster than SDPA EA with or without MMA Acc F32 on many devices: [📈L20 ~1.9x↑🎉](#L1-bench-l20), [📈 A30 ~1.8x↑🎉](#L1-bench-a30), [📈3080 ~2.9x↑🎉](#L1-bench-3080), [📈4090 ~2.1x↑🎉](#L1-bench-4090).
-
+
+
 💡NOTE: This project is still in its early dev stages and now provides some kernels and benchmarks for reference. More features will be added in the future. (Welcome to 🌟👆🏻star this repo to support me ~)