What's Changed
- [Misc] Add install.sh & clear.sh by @DefTruth in #2
- [Docs] Add approximate complexity analysis by @DefTruth in #3
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #4
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #5
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #6
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #7
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #8
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #9
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #10
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #11
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #12
- [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #13
- [FFPA] Refactor FFPA-L1 Part-2✔️ by @DefTruth in #14
- [test] Add gen bench table func✔️ by @DefTruth in #15
- [README] Update bench, 2x faster than SDPA✔️ by @DefTruth in #16
- [README] Update README.md by @DefTruth in #17
- [README] Update README.md by @DefTruth in #18
- [README] Update README.md by @DefTruth in #19
- [Bugfix] fix prefill.cuh unused vars by @DefTruth in #20
- [FFPA] fix some macro typos by @DefTruth in #21
- [misc] fix setup.py by @DefTruth in #22
- [misc] update L20, 4090, A30, 3080 bench by @DefTruth in #23
- [FFPA] support L1 multi-stages 3/4 by @DefTruth in #24
- [Misc] find best tflops across multi-stages by @DefTruth in #25
- [FFPA] rename pyffpa -> ffpa_attn by @DefTruth in #26
- [README] Update README.md by @DefTruth in #27
- [FFPA] L1 support prefetch QKV g2s by @DefTruth in #28
- [Bugfix] fix d < 256 accuracy errors by @DefTruth in #29
- [Feature] support L1 QKV smem separation by @DefTruth in #30
- [Feature] Add ENABLE_FFPA_SMEM_SWIZZLE_V flag by @DefTruth in #31
- [bench] Add RTX 3080 Laptop perf plots by @DefTruth in #32
- [bench] Add more bench perf plots by @DefTruth in #34
- [misc] fix bench link typos by @DefTruth in #35
- [Docs] Add Docker image -> Installation⚙️ by @DefTruth in #36
- [Feature] Add mma mode & fully QKV swizzle by @DefTruth in #37
- [bench] update perf plots for qkv swizzle by @DefTruth in #38
- [bench] update perf plots for qkv swizzle by @DefTruth in #39
- [bench] update perf plots for qkv swizzle by @DefTruth in #40
- [README] Update python test cases by @DefTruth in #41
- [README] Update README.md by @DefTruth in #42
- [feat] add force ffpa Q*K^T mma acc f16 flag by @DefTruth in #43
- [Bugfix] fix ENABLE_FFPA_FORCE_QK_F16 typo by @DefTruth in #44
- [Docs] Add FFPA L1 kernel template signature by @DefTruth in #45
- [feat] support ffpa-l1 persist Q s2r by @DefTruth in #46
- [README] Update ffpa-attn logo by @DefTruth in #47
- [Misc] update ffpa-attn title & logo by @DefTruth in #48
- [Misc] update ffpa-attn title & logo by @DefTruth in #49
- [feat] support ffpa-l1 persist Q g2r by @DefTruth in #50
- [README] Update README.md by @DefTruth in #51
- [feat] Add ffpa-l1 launch_templates by @DefTruth in #52
- [feat] ffpa-l1 persist-qkv g2s for d<=256 by @DefTruth in #53
- [feat] tune block size for L1 persist kv g2s by @DefTruth in #54
- [feat] ffpa-l1 persist-kv-s2r for small d by @DefTruth in #55
- [feat] update ffpa-l1 small d kernel launch configs by @DefTruth in #56
- [README] Update README.md by @DefTruth in #57
- [Bugfix] fix compile error w/o V s2r by @DefTruth in #58
- [feat] refactor launch templates configs by @DefTruth in #59
- [Release] Bump up to v0.0.2 by @DefTruth in #60
- [Release] Bump up to v0.0.2 by @DefTruth in #61
Full Changelog: v0.0.1...v0.0.2