
chacha20: Improve SSE2 Performance by up to 90% more throughput #379

Open
nstilt1 opened this issue Jan 1, 2025 · 11 comments · May be fixed by #380

@nstilt1
Contributor

nstilt1 commented Jan 1, 2025

I wondered for a while why the NEON backend processes 4 blocks at a time while the SSE2 backend could only process one block at a time. ChatGPT simply told me that NEON is better at parallelization than SSE2, but I wasn't buying it. I decided to increase the output buffer of the SSE2 backend from 1 block to 4 blocks, and I saw a massive performance boost, reaching about 1.6 cpb.

I went ahead and cleaned up my old code and brought it to its current state, keeping most of the original chacha20 code intact.
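To make the batching idea concrete, here is a portable scalar sketch, not the crate's actual SSE2 code (function names like `chacha20_four_blocks` are illustrative): the four block states differ only in the counter word, so producing four blocks per call gives a SIMD backend (or plain out-of-order execution) four independent dependency chains to interleave.

```rust
// Constants "expa nd 3 2-by te k" per RFC 8439.
const CONSTANTS: [u32; 4] = [0x6170_7865, 0x3320_646e, 0x7962_2d32, 0x6b20_6574];

fn quarter_round(s: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize) {
    s[a] = s[a].wrapping_add(s[b]); s[d] ^= s[a]; s[d] = s[d].rotate_left(16);
    s[c] = s[c].wrapping_add(s[d]); s[b] ^= s[c]; s[b] = s[b].rotate_left(12);
    s[a] = s[a].wrapping_add(s[b]); s[d] ^= s[a]; s[d] = s[d].rotate_left(8);
    s[c] = s[c].wrapping_add(s[d]); s[b] ^= s[c]; s[b] = s[b].rotate_left(7);
}

/// One 64-byte keystream block (RFC 8439 layout: constants, key, counter, nonce).
fn chacha20_block(key: &[u32; 8], counter: u32, nonce: &[u32; 3]) -> [u32; 16] {
    let mut s = [0u32; 16];
    s[..4].copy_from_slice(&CONSTANTS);
    s[4..12].copy_from_slice(key);
    s[12] = counter;
    s[13..].copy_from_slice(nonce);
    let initial = s;
    for _ in 0..10 {
        // Column rounds, then diagonal rounds (one "double round").
        quarter_round(&mut s, 0, 4, 8, 12);
        quarter_round(&mut s, 1, 5, 9, 13);
        quarter_round(&mut s, 2, 6, 10, 14);
        quarter_round(&mut s, 3, 7, 11, 15);
        quarter_round(&mut s, 0, 5, 10, 15);
        quarter_round(&mut s, 1, 6, 11, 12);
        quarter_round(&mut s, 2, 7, 8, 13);
        quarter_round(&mut s, 3, 4, 9, 14);
    }
    // Feed-forward: add the initial state back in.
    for (w, i) in s.iter_mut().zip(initial) {
        *w = w.wrapping_add(i);
    }
    s
}

/// Four consecutive blocks per call: the states only differ in the counter
/// word, so a SIMD backend can run all four round loops in parallel.
fn chacha20_four_blocks(key: &[u32; 8], counter: u32, nonce: &[u32; 3]) -> [[u32; 16]; 4] {
    core::array::from_fn(|i| chacha20_block(key, counter + i as u32, nonce))
}

fn main() {
    // RFC 8439 §2.3.2 block-function test vector (key 00..1f, counter 1).
    let key: [u32; 8] = [
        0x0302_0100, 0x0706_0504, 0x0b0a_0908, 0x0f0e_0d0c,
        0x1312_1110, 0x1716_1514, 0x1b1a_1918, 0x1f1e_1d1c,
    ];
    let nonce: [u32; 3] = [0x0900_0000, 0x4a00_0000, 0x0000_0000];
    let blocks = chacha20_four_blocks(&key, 1, &nonce);
    assert_eq!(blocks[0][0], 0xe4e7_f110); // first output word from the RFC
    println!("first word of block 0: {:08x}", blocks[0][0]);
}
```

A real SSE2 implementation transposes this: instead of looping over blocks, it keeps the same word from all four states in one 128-bit register, so each `quarter_round` step becomes a handful of `_mm_add_epi32`/`_mm_xor_si128`/shift instructions operating on four blocks at once.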

@nstilt1 nstilt1 linked a pull request Jan 1, 2025 that will close this issue
@nstilt1 nstilt1 changed the title Improve ChaCha20 SSE2 Performance by up to 90% more throughput chacha20: Improve SSE2 Performance by up to 90% more throughput Jan 1, 2025
@tarcieri
Member

tarcieri commented Jan 2, 2025

Note the SSE2 backend is only relevant to platforms that don't have AVX2, so the relevant benchmarks would ideally happen on pre-AVX2 microarchitectures.

@nstilt1
Contributor Author

nstilt1 commented Jan 2, 2025

So... I would need to bench it on something that doesn't have AVX2? I tried disabling AVX and AVX2 with -C target-feature=-avx,-avx2 when benching, but I might be able to find a pre-AVX2 machine. Is it okay if I find one that has AVX but not AVX2?

Edit: Just found my family's old iMac with an Intel Core 2 Duo. Hopefully it has enough space for Rust.

@tarcieri
Member

tarcieri commented Jan 2, 2025

Sure, just any one that's a reasonable demonstration of where that backend would actually be selected at runtime in practice.

(Note: this is why supporting backends for legacy CPUs is hard)

@nstilt1
Contributor Author

nstilt1 commented Jan 2, 2025

The results are in: 6.2-6.4 cpb on the master branch (I had to make a few adjustments to make it compile), and 4.2-4.4 cpb on my branch. I'll go ahead and copy the output to a USB if you want to see it.

@tarcieri
Member

tarcieri commented Jan 2, 2025

Okay, so closer to a 30% improvement. Sounds good.

@nstilt1
Contributor Author

nstilt1 commented Jan 3, 2025

Should 1.6 or 4.3 cpb be in README.md? Or should I just leave it unchanged since the performance is hardware dependent?

@newpavlov
Member

> I tried disabling AVX and AVX2 with -C target-feature=-avx,-avx2 when benching, but I might be able to find a pre-AVX2 machine.

Unfortunately, this does not disable runtime detection of AVX2 since Rust currently does not have proper "negative" target features. You need to pass the chacha20_force_sse2 configuration flag for benchmarking the SSE2 backend.
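For concreteness, such a cfg flag is passed at compile time; a hedged sketch of the invocation, assuming a standard Cargo bench setup (the exact bench command for this repo may differ):

```shell
# Force the SSE2 backend via the crate's cfg flag so runtime AVX2
# detection cannot select a different backend during the benchmark.
RUSTFLAGS='--cfg chacha20_force_sse2' cargo bench
```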

> Should 1.6 or 4.3 cpb be in README.md?

Personally, I think this information should not be part of README.

@nstilt1
Contributor Author

nstilt1 commented Jan 3, 2025

> > I tried disabling AVX and AVX2 with -C target-feature=-avx,-avx2 when benching, but I might be able to find a pre-AVX2 machine.

> Unfortunately, this does not disable runtime detection of AVX2 since Rust currently does not have proper "negative" target features. You need to pass the chacha20_force_sse2 configuration flag for benchmarking the SSE2 backend.

No worries. I ran the benches with that cfg flag as well.

> > Should 1.6 or 4.3 cpb be in README.md?

> Personally, I think this information should not be part of README.

Now that I've seen how widely the performance varies across hardware, I'm inclined to agree with you about its presence in the README. I'm sure there are a few fellow nerds out there who are curious how this compares to other implementations (at least on AVX2 and NEON), but they can always bench it themselves. The AVX2 bench is a little off when benching with -C target-cpu=native on appropriate hardware, though; if we were to keep the numbers, it might be good to be slightly more detailed about benching with that flag versus a more generic benchmark (for distributed software where there's no target CPU). I don't know. Just food for thought.

@newpavlov
Member

The problem is that it's quite hard to do such benchmarks properly. Even on fixed hardware, CPB measurements are usually not reliable without using the AMD-specific RDPRU instruction. And that's before touching the differences in cache and memory speed and latency, which can considerably skew results depending on hardware and buffer sizes.

The benchmarks that we have are fine to guide optimizations during development, but I am hesitant to show their results to users.

@tarcieri
Member

tarcieri commented Jan 3, 2025

SSE2 optimizations seem like a niche feature. I don't think people running pre-AVX2 CPUs care considerably about performance.

@nstilt1
Contributor Author

nstilt1 commented Jan 3, 2025

It is still a 30% improvement. Plus, it would make `ParBlocksSize = U4` the same for all SIMD backends, which could make decisions about buffer sizes a little bit easier. And it will fill the RNG's whole buffer.
