Releases: DefTruth/cuffpa-py
🎉 cuffpa-py 0.0.1 beta L1 Release
📖 FFPA L1 (Level 1): Benchmark 🎉🎉
L1: level 1, O(Br×16)≈O(1) SRAM complexity, O(d/4) register complexity, and the same GPU HBM memory complexity as FlashAttention. Benchmark settings: B=1, H=48, N=8192, D=320-1024 (FA2 not supported 👀). Notes: *=MMA Acc F32, ^=MMA Acc F16; the Softmax Acc dtype is always F32; T=TFLOPS. 👇Benchmark
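To make the T=TFLOPS unit concrete, here is a minimal sketch of how an attention throughput figure can be derived from a measured kernel time. It assumes the common forward-pass estimate of 4·B·H·N²·D FLOPs (two matmuls, Q·Kᵀ and P·V, at 2 FLOPs per multiply-add, no causal mask). The release notes do not state which formula was used, and the helper names `attention_flops` and `tflops` are hypothetical, so treat this as an illustration rather than the project's own benchmark code.

```python
def attention_flops(batch: int, heads: int, seq_len: int, head_dim: int) -> int:
    """Estimated forward FLOPs for non-causal attention (assumed formula).

    Q @ K^T costs 2 * N * N * D FLOPs per head, and P @ V costs the same,
    giving 4 * B * H * N^2 * D in total. Softmax work is ignored.
    """
    return 4 * batch * heads * seq_len * seq_len * head_dim


def tflops(flops: int, seconds: float) -> float:
    """Convert a FLOP count and an elapsed time into TFLOPS."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return flops / seconds / 1e12


# Benchmark shape from the notes above: B=1, H=48, N=8192, smallest D=320.
FLOPS_D320 = attention_flops(batch=1, heads=48, seq_len=8192, head_dim=320)

if __name__ == "__main__":
    # Hypothetical measured time, used only to show the conversion.
    elapsed_s = 0.129
    print(f"{FLOPS_D320:.3e} FLOPs -> {tflops(FLOPS_D320, elapsed_s):.1f} TFLOPS")
```

Under this assumption the D=320 configuration is about 4.1e12 FLOPs per forward pass, so a kernel sustaining 32 TFLOPS would take roughly 129 ms.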
- 📚 NVIDIA RTX 3080 Laptop (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)

| Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SDPA EA | 13T | 16T | 12T | 16T | 15T | 15T | 15T | 15T | 15T | 15T | 15T | 15T |
| FFPA L1* | 32T | 30T | 30T | 28T | 28T | 27T | 26T | 25T | 25T | 25T | 25T | 24T |
| Speedup | 2.48x | 1.88x | 2.55x | 1.75x | 1.90x | 1.77x | 1.73x | 1.67x | 1.66x | 1.66x | 1.66x | 1.54x |
| FFPA L1^ | 40T | 38T | 39T | 36T | 35T | 34T | 33T | 32T | 31T | 31T | 28T | 27T |
| Speedup | 3.07x | 2.42x | 3.33x | 2.24x | 2.35x | 2.19x | 2.19x | 2.13x | 2.03x | 2.03x | 1.90x | 1.74x |
- 📚 NVIDIA RTX 4090 (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)

| Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SDPA EA | 82T | 92T | 83T | 84T | 78T | 80T | 78T | 80T | 78T | 80T | 78T | 79T |
| FFPA L1* | 136T | 135T | 135T | 132T | 133T | 133T | 132T | 131T | 130T | 125T | 123T | 93T |
| Speedup | 1.64x | 1.45x | 1.61x | 1.57x | 1.71x | 1.65x | 1.68x | 1.62x | 1.65x | 1.56x | 1.55x | 1.17x |
| FFPA L1^ | 154T | 161T | 160T | 157T | 156T | 155T | 157T | 154T | 149T | 150T | 145T | 100T |
| Speedup | 1.85x | 1.73x | 1.92x | 1.87x | 1.99x | 1.93x | 1.99x | 1.90x | 1.90x | 1.88x | 1.84x | 1.25x |
- 📚 NVIDIA L20 (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)

| Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SDPA EA | 56T | 63T | 57T | 58T | 55T | 56T | 54T | 55T | 54T | 55T | 54T | 56T |
| FFPA L1* | 99T | 95T | 95T | 93T | 94T | 92T | 92T | 90T | 89T | 90T | 90T | 89T |
| Speedup | 1.77x | 1.49x | 1.64x | 1.58x | 1.72x | 1.65x | 1.68x | 1.63x | 1.64x | 1.63x | 1.67x | 1.58x |
| FFPA L1^ | 96T | 99T | 100T | 92T | 93T | 92T | 93T | 91T | 90T | 90T | 88T | 91T |
| Speedup | 1.71x | 1.55x | 1.73x | 1.56x | 1.69x | 1.65x | 1.71x | 1.64x | 1.65x | 1.63x | 1.62x | 1.62x |
- 📚 NVIDIA A30 (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)

| Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SDPA EA | 25T | 25T | 24T | 23T | 24T | 24T | 23T | 22T | 22T | 21T | 21T | 18T |
| FFPA L1* | 33T | 33T | 32T | 31T | 32T | 32T | 30T | 28T | 25T | 24T | 24T | 24T |
| Speedup | 1.33x | 1.33x | 1.30x | 1.31x | 1.33x | 1.33x | 1.32x | 1.23x | 1.15x | 1.11x | 1.11x | 1.27x |
| FFPA L1^ | 33T | 33T | 33T | 30T | 31T | 32T | 31T | 30T | 30T | 27T | 24T | 23T |
| Speedup | 1.33x | 1.33x | 1.36x | 1.30x | 1.31x | 1.33x | 1.37x | 1.35x | 1.35x | 1.25x | 1.11x | 1.25x |
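
The Speedup rows above are ratios of FFPA L1 throughput to the SDPA EA baseline at each head dimension. As a rough sketch, the snippet below recomputes them from the rounded A30 figures for FFPA L1* (MMA Acc F32). Because the table only shows whole-number TFLOPS, the recomputed ratios can differ slightly from the published row, which was presumably derived from unrounded measurements. The `speedups` helper is a hypothetical name introduced for illustration.

```python
HEAD_DIMS = [320, 384, 448, 512, 576, 640, 704, 768, 832, 896, 960, 1024]

# Rounded TFLOPS copied from the NVIDIA A30 table above.
SDPA_EA = [25, 25, 24, 23, 24, 24, 23, 22, 22, 21, 21, 18]
FFPA_L1_F32 = [33, 33, 32, 31, 32, 32, 30, 28, 25, 24, 24, 24]


def speedups(candidate: list, baseline: list) -> list:
    """Element-wise candidate / baseline ratio, rounded to two decimals."""
    if len(candidate) != len(baseline):
        raise ValueError("rows must have the same number of columns")
    return [round(c / b, 2) for c, b in zip(candidate, baseline)]


A30_SPEEDUPS = speedups(FFPA_L1_F32, SDPA_EA)

if __name__ == "__main__":
    for d, s in zip(HEAD_DIMS, A30_SPEEDUPS):
        print(f"D={d:4d}: {s:.2f}x")
    mean = sum(A30_SPEEDUPS) / len(A30_SPEEDUPS)
    print(f"mean speedup over all head dims: {mean:.2f}x")
```

The same helper works for any pair of rows in the tables above as long as both rows list the head dimensions in the same order.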