Releases · DefTruth/ffpa-attn-mma
v0.0.2
What's Changed
- [Misc] Add install.sh & clear.sh by @DefTruth in #2
- [Docs] Add approximate complexity analysis by @DefTruth in #3
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #4
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #5
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #6
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #7
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #8
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #9
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #10
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #11
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #12
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #13
- [FFPA] Refactor FFPA-L1 Part-2✔️ by @DefTruth in #14
- [test] Add gen bench table func✔️ by @DefTruth in #15
- [README] Update bench, 2x faster than SDPA✔️ by @DefTruth in #16
- [README] Update README.md by @DefTruth in #17
- [README] Update README.md by @DefTruth in #18
- [README] Update README.md by @DefTruth in #19
- [Bugfix] fix prefill.cuh un-used vars by @DefTruth in #20
- [FFPA] fix some macro typos by @DefTruth in #21
- [misc] fix setup.py by @DefTruth in #22
- [misc] update L20, 4090, A30, 3080 bench by @DefTruth in #23
- [FFPA] support L1 multi-stages 3/4 by @DefTruth in #24
- [Misc] find best tflops across multi-stages by @DefTruth in #25
- [FFPA] rename pyffpa -> ffpa_attn by @DefTruth in #26
- [README] Update README.md by @DefTruth in #27
- [FFPA] L1 support prefetch QKV g2s by @DefTruth in #28
- [Bugfix] fix d < 256 accuracy errors by @DefTruth in #29
- [Feature] support L1 QKV smem separation by @DefTruth in #30
- [Feature] Add ENABLE_FFPA_SMEM_SWIZZLE_V flag by @DefTruth in #31
- [bench] Add RTX 3080 Laptop perf plots by @DefTruth in #32
- [bench] Add more bench perf plots by @DefTruth in #34
- [misc] fix bench link typos by @DefTruth in #35
- [Docs] Add Docker image -> Installation⚙️ by @DefTruth in #36
- [Feature] Add mma mode & fully QKV swizzle by @DefTruth in #37
- [bench] update perf plots for qkv swizzle by @DefTruth in #38
- [bench] update perf plots for qkv swizzle by @DefTruth in #39
- [bench] update perf plots for qkv swizzle by @DefTruth in #40
- [README] Update python test cases by @DefTruth in #41
- [README] Update README.md by @DefTruth in #42
- [feat] add force ffpa Q*K^T mma acc f16 flag by @DefTruth in #43
- [Bugfix] fix ENABLE_FFPA_FORCE_QK_F16 typo by @DefTruth in #44
- [Docs] Add FFPA L1 kernel template signature by @DefTruth in #45
- [feat] support ffpa-l1 persist Q s2r by @DefTruth in #46
- [README] Update ffpa-attn logo by @DefTruth in #47
- [Misc] update ffpa-attn title & logo by @DefTruth in #48
- [Misc] update ffpa-attn title & logo by @DefTruth in #49
- [feat] support ffpa-l1 persist Q g2r by @DefTruth in #50
- [README] Update README.md by @DefTruth in #51
- [feat] Add ffpa-l1 launch_templates by @DefTruth in #52
- [feat] ffpa-l1 persist-qkv g2s for d<=256 by @DefTruth in #53
- [feat] tune block size for L1 persist kv g2s by @DefTruth in #54
- [feat] ffpa-l1 persist-kv-s2r for small d by @DefTruth in #55
- [feat] update ffpa-l1 small d kernel launch configs by @DefTruth in #56
- [README] Update README.md by @DefTruth in #57
- [Bugfix] fix compile error w/o V s2r by @DefTruth in #58
- [feat] refactor launch templates configs by @DefTruth in #59
- [Release] Bump up to v0.0.2 by @DefTruth in #60
- [Release] Bump up to v0.0.2 by @DefTruth in #61
Full Changelog: v0.0.1...v0.0.2
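
Several entries above add compile-time feature switches (e.g. ENABLE_FFPA_SMEM_SWIZZLE_V from #31, and the force Q*K^T MMA acc f16 flag from #43, whose name typo was fixed in #44). Below is a hedged sketch of toggling such macros at build time; whether setup.py actually reads them from the environment this way is an assumption — check setup.py / install.sh for the real interface.

```python
import os
import subprocess

# Assumed mechanism (not verified against the repo's setup.py): export the
# macro names from the changelog as environment variables, then rebuild the
# extension in editable mode.
os.environ["ENABLE_FFPA_SMEM_SWIZZLE_V"] = "1"  # smem swizzle for V tiles (#31)
os.environ["ENABLE_FFPA_FORCE_QK_F16"] = "1"    # force Q*K^T MMA acc in f16 (#43/#44)
subprocess.run(["pip", "install", "-e", "."], check=True)
```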
v0.0.1.post3
What's Changed
- [Docs] Add Docker image -> Installation⚙️ by @DefTruth in #36
- [Feature] Add mma mode & fully QKV swizzle by @DefTruth in #37
- [bench] update perf plots for qkv swizzle by @DefTruth in #38
- [bench] update perf plots for qkv swizzle by @DefTruth in #39
- [bench] update perf plots for qkv swizzle by @DefTruth in #40
Full Changelog: v0.0.1.post2...v0.0.1.post3
v0.0.1.post2
What's Changed
- [misc] fix setup.py by @DefTruth in #22
- [misc] update L20, 4090, A30, 3080 bench by @DefTruth in #23
- [FFPA] support L1 multi-stages 3/4 by @DefTruth in #24
- [Misc] find best tflops across multi-stages by @DefTruth in #25
- [FFPA] rename pyffpa -> ffpa_attn by @DefTruth in #26
- [README] Update README.md by @DefTruth in #27
- [FFPA] L1 support prefetch QKV g2s by @DefTruth in #28
- [Bugfix] fix d < 256 accuracy errors by @DefTruth in #29
- [Feature] support L1 QKV smem separation by @DefTruth in #30
- [Feature] Add ENABLE_FFPA_SMEM_SWIZZLE_V flag by @DefTruth in #31
- [bench] Add RTX 3080 Laptop perf plots by @DefTruth in #32
- [bench] Add more bench perf plots by @DefTruth in #34
- [misc] fix bench link typos by @DefTruth in #35
Full Changelog: v0.0.1.post1...v0.0.1.post2
FFPA 0.0.1.post1
What's Changed
- [Misc] Add install.sh & clear.sh by @DefTruth in #2
- [Docs] Add approximate complexity analysis by @DefTruth in #3
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #4
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #5
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #6
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #7
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #8
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #9
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #10
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #11
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #12
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #13
- [FFPA] Refactor FFPA-L1 Part-2✔️ by @DefTruth in #14
- [test] Add gen bench table func✔️ by @DefTruth in #15
- [README] Update bench, 2x faster than SDPA✔️ by @DefTruth in #16
- [README] Update README.md by @DefTruth in #17
- [README] Update README.md by @DefTruth in #18
- [README] Update README.md by @DefTruth in #19
- [Bugfix] fix prefill.cuh un-used vars by @DefTruth in #20
- [FFPA] fix some macro typos by @DefTruth in #21
Full Changelog: v0.0.1...v0.0.1.post1
🎉 cuffpa-py 0.0.1 beta L1 Release
📖 FFPA L1 (Level 1): Benchmark 🎉🎉
L1 (Level 1): O(Brx16)~O(1) SRAM complexity, O(d/4) register complexity, and the same GPU HBM memory complexity as FlashAttention. Benchmark setup: B=1, H=48, N=8192, D=320-1024 (head dims FA2 does not support 👀). Notes: *=MMA Acc F32, ^=MMA Acc F16, Softmax Acc dtype is always F32, T=TFLOPS. 👇Benchmark
- 📚 NVIDIA RTX 3080 Laptop (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 13T | 16T | 12T | 16T | 15T | 15T | 15T | 15T | 15T | 15T | 15T | 15T |
FFPA L1* | 32T | 30T | 30T | 28T | 28T | 27T | 26T | 25T | 25T | 25T | 25T | 24T |
Speedup | 2.48x | 1.88x | 2.55x | 1.75x | 1.90x | 1.77x | 1.73x | 1.67x | 1.66x | 1.66x | 1.66x | 1.54x |
FFPA L1^ | 40T | 38T | 39T | 36T | 35T | 34T | 33T | 32T | 31T | 31T | 28T | 27T |
Speedup | 3.07x | 2.42x | 3.33x | 2.24x | 2.35x | 2.19x | 2.19x | 2.13x | 2.03x | 2.03x | 1.90x | 1.74x |
- 📚 NVIDIA RTX 4090 (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 82T | 92T | 83T | 84T | 78T | 80T | 78T | 80T | 78T | 80T | 78T | 79T |
FFPA L1* | 136T | 135T | 135T | 132T | 133T | 133T | 132T | 131T | 130T | 125T | 123T | 93T |
Speedup | 1.64x | 1.45x | 1.61x | 1.57x | 1.71x | 1.65x | 1.68x | 1.62x | 1.65x | 1.56x | 1.55x | 1.17x |
FFPA L1^ | 154T | 161T | 160T | 157T | 156T | 155T | 157T | 154T | 149T | 150T | 145T | 100T |
Speedup | 1.85x | 1.73x | 1.92x | 1.87x | 1.99x | 1.93x | 1.99x | 1.90x | 1.90x | 1.88x | 1.84x | 1.25x |
- 📚 NVIDIA L20 (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 56T | 63T | 57T | 58T | 55T | 56T | 54T | 55T | 54T | 55T | 54T | 56T |
FFPA L1* | 99T | 95T | 95T | 93T | 94T | 92T | 92T | 90T | 89T | 90T | 90T | 89T |
Speedup | 1.77x | 1.49x | 1.64x | 1.58x | 1.72x | 1.65x | 1.68x | 1.63x | 1.64x | 1.63x | 1.67x | 1.58x |
FFPA L1^ | 96T | 99T | 100T | 92T | 93T | 92T | 93T | 91T | 90T | 90T | 88T | 91T |
Speedup | 1.71x | 1.55x | 1.73x | 1.56x | 1.69x | 1.65x | 1.71x | 1.64x | 1.65x | 1.63x | 1.62x | 1.62x |
- 📚 NVIDIA A30 (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 25T | 25T | 24T | 23T | 24T | 24T | 23T | 22T | 22T | 21T | 21T | 18T |
FFPA L1* | 33T | 33T | 32T | 31T | 32T | 32T | 30T | 28T | 25T | 24T | 24T | 24T |
Speedup | 1.33x | 1.33x | 1.30x | 1.31x | 1.33x | 1.33x | 1.32x | 1.23x | 1.15x | 1.11x | 1.11x | 1.27x |
FFPA L1^ | 33T | 33T | 33T | 30T | 31T | 32T | 31T | 30T | 30T | 27T | 24T | 23T |
Speedup | 1.33x | 1.33x | 1.36x | 1.30x | 1.31x | 1.33x | 1.37x | 1.35x | 1.35x | 1.25x | 1.11x | 1.25x |
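
For reference, the "SDPA EA" rows above are PyTorch's scaled_dot_product_attention pinned to the memory-efficient backend, which (unlike FlashAttention-2) accepts head dims > 256. Here is a minimal sketch of how such a baseline number could be reproduced at the stated shapes; the FLOP model (4·B·H·N²·D for the two matmuls) is standard for attention, but the warmup/iteration counts are assumptions and this is not the repo's exact test harness (see the python test cases from #41). Assumes PyTorch ≥ 2.2 for torch.nn.attention.sdpa_kernel; the FFPA kernels would be timed the same way via the ffpa_attn package, whose call signature is not shown here.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Shapes from the benchmark setup above; D sweeps 320..1024 in the tables.
B, H, N, D = 1, 48, 8192, 320
q, k, v = (torch.randn(B, H, N, D, device="cuda", dtype=torch.half)
           for _ in range(3))

def bench_tflops(fn, warmup=10, iters=20):
    # Time with CUDA events and convert the average latency to TFLOPS.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    # Q@K^T and P@V are each 2*N*N*D FLOPs per head: 4*B*H*N^2*D total.
    flops = 4 * B * H * N * N * D
    return flops / (ms * 1e-3) / 1e12

# "SDPA EA": restrict SDPA to the memory-efficient (efficient-attention) backend.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    t = bench_tflops(lambda: F.scaled_dot_product_attention(q, k, v))
print(f"SDPA EA, D={D}: {t:.1f} TFLOPS")
```

The Speedup rows are simply the FFPA TFLOPS divided by the SDPA EA TFLOPS at the same D (small deviations from the quotient of the rounded table values are expected).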