Releases · DefTruth/ffpa-attn-mma
v0.0.2
What's Changed
- [Misc] Add install.sh & clear.sh by @DefTruth in #2
- [Docs] Add approximate complexity analysis by @DefTruth in #3
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #4
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #5
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #6
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #7
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #8
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #9
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #10
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #11
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #12
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #13
- [FFPA] Refactor FFPA-L1 Part-2✔️ by @DefTruth in #14
- [test] Add gen bench table func✔️ by @DefTruth in #15
- [README] Update bench, 2x faster than SDPA✔️ by @DefTruth in #16
- [README] Update README.md by @DefTruth in #17
- [README] Update README.md by @DefTruth in #18
- [README] Update README.md by @DefTruth in #19
- [Bugfix] fix prefill.cuh un-used vars by @DefTruth in #20
- [FFPA] fix some macro typos by @DefTruth in #21
- [misc] fix setup.py by @DefTruth in #22
- [misc] update L20, 4090, A30, 3080 bench by @DefTruth in #23
- [FFPA] support L1 multi-stages 3/4 by @DefTruth in #24
- [Misc] find best tflops across multi-stages by @DefTruth in #25
- [FFPA] rename pyffpa -> ffpa_attn by @DefTruth in #26
- [README] Update README.md by @DefTruth in #27
- [FFPA] L1 support prefetch QKV g2s by @DefTruth in #28
- [Bugfix] fix d < 256 accuracy errors by @DefTruth in #29
- [Feature] support L1 QKV smem separation by @DefTruth in #30
- [Feature] Add ENABLE_FFPA_SMEM_SWIZZLE_V flag by @DefTruth in #31
- [bench] Add RTX 3080 Laptop perf plots by @DefTruth in #32
- [bench] Add more bench perf plots by @DefTruth in #34
- [misc] fix bench link typos by @DefTruth in #35
- [Docs] Add Docker image -> Installation⚙️ by @DefTruth in #36
- [Feature] Add mma mode & fully QKV swizzle by @DefTruth in #37
- [bench] update perf plots for qkv swizzle by @DefTruth in #38
- [bench] update perf plots for qkv swizzle by @DefTruth in #39
- [bench] update perf plots for qkv swizzle by @DefTruth in #40
- [README] Update python test cases by @DefTruth in #41
- [README] Update README.md by @DefTruth in #42
- [feat] add force ffpa Q*K^T mma acc f16 flag by @DefTruth in #43
- [Bugfix] fix ENABLE_FFPA_FORCE_QK_F16 typo by @DefTruth in #44
- [Docs] Add FFPA L1 kernel template signature by @DefTruth in #45
- [feat] support ffpa-l1 persist Q s2r by @DefTruth in #46
- [README] Update ffpa-attn logo by @DefTruth in #47
- [Misc] update ffpa-attn title & logo by @DefTruth in #48
- [Misc] update ffpa-attn title & logo by @DefTruth in #49
- [feat] support ffpa-l1 persist Q g2r by @DefTruth in #50
- [README] Update README.md by @DefTruth in #51
- [feat] Add ffpa-l1 launch_templates by @DefTruth in #52
- [feat] ffpa-l1 persist-qkv g2s for d<=256 by @DefTruth in #53
- [feat] tune block size for L1 persist kv g2s by @DefTruth in #54
- [feat] ffpa-l1 persist-kv-s2r for small d by @DefTruth in #55
- [feat] update ffpa-l1 small d kernel launch configs by @DefTruth in #56
- [README] Update README.md by @DefTruth in #57
- [Bugfix] fix compile error w/o V s2r by @DefTruth in #58
- [feat] refactor launch templates configs by @DefTruth in #59
- [Release] Bump up to v0.0.2 by @DefTruth in #60
- [Release] Bump up to v0.0.2 by @DefTruth in #61
Full Changelog: v0.0.1...v0.0.2
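
Several entries above add compile-time feature switches (e.g. ENABLE_FFPA_SMEM_SWIZZLE_V from #31, and the force Q*K^T MMA acc f16 flag from #43, whose name typo was fixed in #44). Below is a hedged sketch of toggling such macros at build time; whether setup.py actually reads them from the environment this way is an assumption — check setup.py / install.sh for the real interface.

```python
import os
import subprocess

# Assumed mechanism (not verified against the repo's setup.py): export the
# macro names from the changelog as environment variables, then rebuild the
# extension in editable mode.
os.environ["ENABLE_FFPA_SMEM_SWIZZLE_V"] = "1"  # smem swizzle for V tiles (#31)
os.environ["ENABLE_FFPA_FORCE_QK_F16"] = "1"    # force Q*K^T MMA acc in f16 (#43/#44)
subprocess.run(["pip", "install", "-e", "."], check=True)
```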
v0.0.1.post3
What's Changed
- [Docs] Add Docker image -> Installation⚙️ by @DefTruth in #36
- [Feature] Add mma mode & fully QKV swizzle by @DefTruth in #37
- [bench] update perf plots for qkv swizzle by @DefTruth in #38
- [bench] update perf plots for qkv swizzle by @DefTruth in #39
- [bench] update perf plots for qkv swizzle by @DefTruth in #40
Full Changelog: v0.0.1.post2...v0.0.1.post3
v0.0.1.post2
What's Changed
- [misc] fix setup.py by @DefTruth in #22
- [misc] update L20, 4090, A30, 3080 bench by @DefTruth in #23
- [FFPA] support L1 multi-stages 3/4 by @DefTruth in #24
- [Misc] find best tflops across multi-stages by @DefTruth in #25
- [FFPA] rename pyffpa -> ffpa_attn by @DefTruth in #26
- [README] Update README.md by @DefTruth in #27
- [FFPA] L1 support prefetch QKV g2s by @DefTruth in #28
- [Bugfix] fix d < 256 accuracy errors by @DefTruth in #29
- [Feature] support L1 QKV smem separation by @DefTruth in #30
- [Feature] Add ENABLE_FFPA_SMEM_SWIZZLE_V flag by @DefTruth in #31
- [bench] Add RTX 3080 Laptop perf plots by @DefTruth in #32
- [bench] Add more bench perf plots by @DefTruth in #34
- [misc] fix bench link typos by @DefTruth in #35
Full Changelog: v0.0.1.post1...v0.0.1.post2
FFPA 0.0.1.post1
What's Changed
- [Misc] Add install.sh & clear.sh by @DefTruth in #2
- [Docs] Add approximate complexity analysis by @DefTruth in #3
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #4
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #5
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #6
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #7
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #8
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #9
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #10
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #11
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #12
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #13
- [FFPA] Refactor FFPA-L1 Part-2✔️ by @DefTruth in #14
- [test] Add gen bench table func✔️ by @DefTruth in #15
- [README] Update bench, 2x faster than SDPA✔️ by @DefTruth in #16
- [README] Update README.md by @DefTruth in #17
- [README] Update README.md by @DefTruth in #18
- [README] Update README.md by @DefTruth in #19
- [Bugfix] fix prefill.cuh un-used vars by @DefTruth in #20
- [FFPA] fix some macro typos by @DefTruth in #21
Full Changelog: v0.0.1...v0.0.1.post1
🎉 cuffpa-py 0.0.1 beta L1 Release
📖 FFPA L1 (Level 1): Benchmark 🎉🎉
L1 (Level 1): O(Brx16)~O(1) SRAM complexity, O(d/4) register complexity, and the same GPU HBM memory complexity as FlashAttention. Benchmark setup: B=1, H=48, N=8192, D=320-1024 (head dims FA2 does not support 👀). Notes: *=MMA Acc F32, ^=MMA Acc F16, Softmax Acc dtype is always F32, T=TFLOPS. 👇Benchmark
- 📚 NVIDIA RTX 3080 Laptop (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 13T | 16T | 12T | 16T | 15T | 15T | 15T | 15T | 15T | 15T | 15T | 15T |
FFPA L1* | 32T | 30T | 30T | 28T | 28T | 27T | 26T | 25T | 25T | 25T | 25T | 24T |
Speedup | 2.48x | 1.88x | 2.55x | 1.75x | 1.90x | 1.77x | 1.73x | 1.67x | 1.66x | 1.66x | 1.66x | 1.54x |
FFPA L1^ | 40T | 38T | 39T | 36T | 35T | 34T | 33T | 32T | 31T | 31T | 28T | 27T |
Speedup | 3.07x | 2.42x | 3.33x | 2.24x | 2.35x | 2.19x | 2.19x | 2.13x | 2.03x | 2.03x | 1.90x | 1.74x |
- 📚 NVIDIA RTX 4090 (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 82T | 92T | 83T | 84T | 78T | 80T | 78T | 80T | 78T | 80T | 78T | 79T |
FFPA L1* | 136T | 135T | 135T | 132T | 133T | 133T | 132T | 131T | 130T | 125T | 123T | 93T |
Speedup | 1.64x | 1.45x | 1.61x | 1.57x | 1.71x | 1.65x | 1.68x | 1.62x | 1.65x | 1.56x | 1.55x | 1.17x |
FFPA L1^ | 154T | 161T | 160T | 157T | 156T | 155T | 157T | 154T | 149T | 150T | 145T | 100T |
Speedup | 1.85x | 1.73x | 1.92x | 1.87x | 1.99x | 1.93x | 1.99x | 1.90x | 1.90x | 1.88x | 1.84x | 1.25x |
- 📚 NVIDIA L20 (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 56T | 63T | 57T | 58T | 55T | 56T | 54T | 55T | 54T | 55T | 54T | 56T |
FFPA L1* | 99T | 95T | 95T | 93T | 94T | 92T | 92T | 90T | 89T | 90T | 90T | 89T |
Speedup | 1.77x | 1.49x | 1.64x | 1.58x | 1.72x | 1.65x | 1.68x | 1.63x | 1.64x | 1.63x | 1.67x | 1.58x |
FFPA L1^ | 96T | 99T | 100T | 92T | 93T | 92T | 93T | 91T | 90T | 90T | 88T | 91T |
Speedup | 1.71x | 1.55x | 1.73x | 1.56x | 1.69x | 1.65x | 1.71x | 1.64x | 1.65x | 1.63x | 1.62x | 1.62x |
- 📚 NVIDIA A30 (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 25T | 25T | 24T | 23T | 24T | 24T | 23T | 22T | 22T | 21T | 21T | 18T |
FFPA L1* | 33T | 33T | 32T | 31T | 32T | 32T | 30T | 28T | 25T | 24T | 24T | 24T |
Speedup | 1.33x | 1.33x | 1.30x | 1.31x | 1.33x | 1.33x | 1.32x | 1.23x | 1.15x | 1.11x | 1.11x | 1.27x |
FFPA L1^ | 33T | 33T | 33T | 30T | 31T | 32T | 31T | 30T | 30T | 27T | 24T | 23T |
Speedup | 1.33x | 1.33x | 1.36x | 1.30x | 1.31x | 1.33x | 1.37x | 1.35x | 1.35x | 1.25x | 1.11x | 1.25x |
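
For reference, the "SDPA EA" rows above are PyTorch's scaled_dot_product_attention pinned to the memory-efficient backend, which (unlike FlashAttention-2) accepts head dims > 256. Here is a minimal sketch of how such a baseline number could be reproduced at the stated shapes; the FLOP model (4·B·H·N²·D for the two matmuls) is standard for attention, but the warmup/iteration counts are assumptions and this is not the repo's exact test harness (see the python test cases from #41). Assumes PyTorch ≥ 2.2 for torch.nn.attention.sdpa_kernel; the FFPA kernels would be timed the same way via the ffpa_attn package, whose call signature is not shown here.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Shapes from the benchmark setup above; D sweeps 320..1024 in the tables.
B, H, N, D = 1, 48, 8192, 320
q, k, v = (torch.randn(B, H, N, D, device="cuda", dtype=torch.half)
           for _ in range(3))

def bench_tflops(fn, warmup=10, iters=20):
    # Time with CUDA events and convert the average latency to TFLOPS.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    # Q@K^T and P@V are each 2*N*N*D FLOPs per head: 4*B*H*N^2*D total.
    flops = 4 * B * H * N * N * D
    return flops / (ms * 1e-3) / 1e12

# "SDPA EA": restrict SDPA to the memory-efficient (efficient-attention) backend.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    t = bench_tflops(lambda: F.scaled_dot_product_attention(q, k, v))
print(f"SDPA EA, D={D}: {t:.1f} TFLOPS")
```

The Speedup rows are simply the FFPA TFLOPS divided by the SDPA EA TFLOPS at the same D (small deviations from the quotient of the rounded table values are expected).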