
chacha20: Improve SSE2 Performance by up to 90% more throughput #379

Open
nstilt1 opened this issue Jan 1, 2025 · 11 comments · May be fixed by #380

@nstilt1
Contributor

nstilt1 commented Jan 1, 2025

I wondered for a while why the NEON backend processes 4 blocks at a time while the SSE2 backend could only process one block at a time. ChatGPT simply told me that NEON is better at parallelization than SSE2, but I wasn't buying it. I decided to increase the output buffer of the SSE2 backend from 1 block to 4 blocks, and I saw a massive performance boost, reaching about 1.6 cpb.

I went ahead and cleaned up my old code and brought it to its current state, keeping most of the original chacha20 code intact.
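To make the batching idea concrete, here is a portable scalar sketch, not the crate's actual SSE2 code (function names like `chacha20_four_blocks` are illustrative): the four block states differ only in the counter word, so producing four blocks per call gives a SIMD backend (or plain out-of-order execution) four independent dependency chains to interleave.

```rust
// Constants "expa nd 3 2-by te k" per RFC 8439.
const CONSTANTS: [u32; 4] = [0x6170_7865, 0x3320_646e, 0x7962_2d32, 0x6b20_6574];

fn quarter_round(s: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize) {
    s[a] = s[a].wrapping_add(s[b]); s[d] ^= s[a]; s[d] = s[d].rotate_left(16);
    s[c] = s[c].wrapping_add(s[d]); s[b] ^= s[c]; s[b] = s[b].rotate_left(12);
    s[a] = s[a].wrapping_add(s[b]); s[d] ^= s[a]; s[d] = s[d].rotate_left(8);
    s[c] = s[c].wrapping_add(s[d]); s[b] ^= s[c]; s[b] = s[b].rotate_left(7);
}

/// One 64-byte keystream block (RFC 8439 layout: constants, key, counter, nonce).
fn chacha20_block(key: &[u32; 8], counter: u32, nonce: &[u32; 3]) -> [u32; 16] {
    let mut s = [0u32; 16];
    s[..4].copy_from_slice(&CONSTANTS);
    s[4..12].copy_from_slice(key);
    s[12] = counter;
    s[13..].copy_from_slice(nonce);
    let initial = s;
    for _ in 0..10 {
        // Column rounds, then diagonal rounds (one "double round").
        quarter_round(&mut s, 0, 4, 8, 12);
        quarter_round(&mut s, 1, 5, 9, 13);
        quarter_round(&mut s, 2, 6, 10, 14);
        quarter_round(&mut s, 3, 7, 11, 15);
        quarter_round(&mut s, 0, 5, 10, 15);
        quarter_round(&mut s, 1, 6, 11, 12);
        quarter_round(&mut s, 2, 7, 8, 13);
        quarter_round(&mut s, 3, 4, 9, 14);
    }
    // Feed-forward: add the initial state back in.
    for (w, i) in s.iter_mut().zip(initial) {
        *w = w.wrapping_add(i);
    }
    s
}

/// Four consecutive blocks per call: the states only differ in the counter
/// word, so a SIMD backend can run all four round loops in parallel.
fn chacha20_four_blocks(key: &[u32; 8], counter: u32, nonce: &[u32; 3]) -> [[u32; 16]; 4] {
    core::array::from_fn(|i| chacha20_block(key, counter + i as u32, nonce))
}

fn main() {
    // RFC 8439 §2.3.2 block-function test vector (key 00..1f, counter 1).
    let key: [u32; 8] = [
        0x0302_0100, 0x0706_0504, 0x0b0a_0908, 0x0f0e_0d0c,
        0x1312_1110, 0x1716_1514, 0x1b1a_1918, 0x1f1e_1d1c,
    ];
    let nonce: [u32; 3] = [0x0900_0000, 0x4a00_0000, 0x0000_0000];
    let blocks = chacha20_four_blocks(&key, 1, &nonce);
    assert_eq!(blocks[0][0], 0xe4e7_f110); // first output word from the RFC
    println!("first word of block 0: {:08x}", blocks[0][0]);
}
```

A real SSE2 implementation transposes this: instead of looping over blocks, it keeps the same word from all four states in one 128-bit register, so each `quarter_round` step becomes a handful of `_mm_add_epi32`/`_mm_xor_si128`/shift instructions operating on four blocks at once.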

@nstilt1 nstilt1 linked a pull request Jan 1, 2025 that will close this issue
@nstilt1 nstilt1 changed the title Improve ChaCha20 SSE2 Performance by up to 90% more throughput chacha20: Improve SSE2 Performance by up to 90% more throughput Jan 1, 2025
@tarcieri
Member

tarcieri commented Jan 2, 2025

Note the SSE2 backend is only relevant to platforms that don't have AVX2, so the relevant benchmarks would ideally happen on pre-AVX2 microarchitectures.

@nstilt1
Contributor Author

nstilt1 commented Jan 2, 2025

So... I would need to bench it on something that doesn't have AVX2? I tried disabling AVX and AVX2 with -C target-feature=-avx,-avx2 when benching, but I might be able to find a pre-AVX2 machine. Is it okay if I find one that has AVX but not AVX2?

Edit: Just found my family's old iMac with an Intel Core 2 Duo. Hopefully it has enough space for Rust.

@tarcieri
Member

tarcieri commented Jan 2, 2025

Sure, just any one that's a reasonable demonstration of where that backend would actually be selected at runtime in practice.

(Note: this is why supporting backends for legacy CPUs is hard)

@nstilt1
Contributor Author

nstilt1 commented Jan 2, 2025

The results are in: 6.2-6.4 cpb on the master branch (I had to make a few adjustments to make it compile), and 4.2-4.4 cpb on my branch. I'll go ahead and copy the output to a USB if you want to see it.

@tarcieri
Member

tarcieri commented Jan 2, 2025

Okay, so closer to a 30% improvement. Sounds good.

@nstilt1
Contributor Author

nstilt1 commented Jan 3, 2025

Should 1.6 or 4.3 cpb be in README.md? Or should I just leave it unchanged since the performance is hardware dependent?

@newpavlov
Member

> I tried disabling AVX and AVX2 with -C target-feature=-avx,-avx2 when benching, but I might be able to find a pre-AVX2 machine.

Unfortunately, this does not disable runtime detection of AVX2 since Rust currently does not have proper "negative" target features. You need to pass the chacha20_force_sse2 configuration flag for benchmarking the SSE2 backend.
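For concreteness, such a cfg flag is passed at compile time; a hedged sketch of the invocation, assuming a standard Cargo bench setup (the exact bench command for this repo may differ):

```shell
# Force the SSE2 backend via the crate's cfg flag so runtime AVX2
# detection cannot select a different backend during the benchmark.
RUSTFLAGS='--cfg chacha20_force_sse2' cargo bench
```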

> Should 1.6 or 4.3 cpb be in README.md?

Personally, I think this information should not be part of README.

@nstilt1
Contributor Author

nstilt1 commented Jan 3, 2025

> > I tried disabling AVX and AVX2 with -C target-feature=-avx,-avx2 when benching, but I might be able to find a pre-AVX2 machine.

> Unfortunately, this does not disable runtime detection of AVX2 since Rust currently does not have proper "negative" target features. You need to pass the chacha20_force_sse2 configuration flag for benchmarking the SSE2 backend.

No worries. I ran the benches with that cfg flag as well.

> > Should 1.6 or 4.3 cpb be in README.md?

> Personally, I think this information should not be part of README.

Now that I've seen how widely the performance varies across hardware, I'm inclined to agree with you about its presence in the README. I'm sure there are a few fellow nerds out there who are curious how this compares to other implementations (at least on AVX2 and NEON), but they can always bench it themselves. The AVX2 bench is a little off when benching with -C target-cpu=native on appropriate hardware, though; if we were to keep the numbers, it might be good to be slightly more detailed about benching with that flag versus a more generic benchmark (for distributed software where there's no target CPU). I don't know. Just food for thought.

@newpavlov
Member

The problem is that it's quite hard to do such benchmarks properly. Even on fixed hardware, CPB measurements are usually not reliable without using the AMD-specific RDPRU instruction. And that's before touching the differences in cache and memory speed and latency, which can considerably skew results depending on hardware and buffer sizes.

The benchmarks that we have are fine to guide optimizations during development, but I am hesitant to show their results to users.

@tarcieri
Member

tarcieri commented Jan 3, 2025

SSE2 optimizations seem like a niche feature. I don't think people running pre-AVX2 CPUs care considerably about performance.

@nstilt1
Contributor Author

nstilt1 commented Jan 3, 2025

It is still a 30% improvement. Plus, it would make `ParBlocksSize = U4` the same for all SIMD backends, which could make decisions about buffer sizes a little bit easier. And it will fill the RNG's whole buffer.
