chacha20: Improve SSE2 Performance by up to 90% more throughput #379
Comments
Note the SSE2 backend is only relevant to platforms that don't have AVX2, so the relevant benchmarks would ideally happen on pre-AVX2 microarchitectures.
So... I would need to bench it on something that doesn't have AVX2? I tried disabling AVX and AVX2 with a target-feature flag.
Edit: Just found my family's old iMac with an Intel Core 2 Duo. Hopefully it has enough disk space for Rust.
Sure, just any one that's a reasonable demonstration of where that backend would actually be selected at runtime in practice. (Note: this is why supporting backends for legacy CPUs is hard.)
The results are in.
Okay, so closer to a 30% improvement. Sounds good.
Should 1.6 or 4.3 cpb be in the README?
Unfortunately, this does not disable runtime detection of AVX2, since Rust currently does not have proper "negative" target features. You need to pass the corresponding flag explicitly.
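To illustrate why compile-time `-C target-feature=-avx2` alone doesn't route execution to the SSE2 backend: backends like these are typically chosen by CPUID-based runtime detection, not by compile-time feature flags. Below is a minimal, hypothetical sketch of such dispatch (not the crate's actual code) using the standard `is_x86_feature_detected!` macro:

```rust
// Hypothetical sketch of runtime backend dispatch. Even if AVX2 is disabled
// at compile time, CPUID still reports it, so runtime detection picks the
// AVX2 path unless the detection itself is overridden.
fn backend_name() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if is_x86_feature_detected!("sse2") {
            return "sse2";
        }
    }
    // Portable fallback when no SIMD backend applies.
    "soft"
}

fn main() {
    println!("selected backend: {}", backend_name());
}
```

On any AVX2-capable machine this prints `selected backend: avx2` regardless of compile-time target features, which is why benching the SSE2 path requires either pre-AVX2 hardware or an explicit override.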
Personally, I think this information should not be part of the README.
No worries. I ran the benches with that flag.
Now that I've seen how widely performance varies across hardware, I'm inclined to agree with you about its presence in the README. However, I'm sure there are a few fellow nerds out there who are curious how this compares to other implementations (at least on AVX2 and NEON), but they can always bench it themselves. The AVX2 bench is a little off, though, when benching with that flag.
The problem is that it's quite hard to do such benchmarks properly. Even on fixed hardware, CPB measurements are usually not reliable without using the AMD-specific RDPRU feature. And that's before touching the differences between cache and memory speed and latency, which can considerably skew results depending on hardware and buffer sizes. The benchmarks that we have are fine to guide optimizations during development, but I am hesitant to show their results to users.
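As a sketch of why cycles-per-byte figures are shaky: naive CPB estimates divide wall-clock time by an assumed clock frequency, and turbo boost, throttling, and cache effects all distort the result. The sketch below is illustrative only; `CPU_HZ`, `xor_keystream`, and `estimate_cpb` are hypothetical names, and the workload is a stand-in for a real cipher:

```rust
use std::time::Instant;

// Assumption: a fixed 3 GHz clock. This constant is exactly the weak point:
// the real frequency varies with turbo/throttling, which is why hardware
// counters (e.g. AMD's RDPRU) are needed for reliable CPB numbers.
const CPU_HZ: f64 = 3.0e9;

// Stand-in workload for an actual cipher's keystream XOR.
fn xor_keystream(buf: &mut [u8]) {
    for (i, b) in buf.iter_mut().enumerate() {
        *b ^= i as u8;
    }
}

// Naive CPB estimate: elapsed seconds * assumed Hz / bytes processed.
fn estimate_cpb(len: usize, iters: u32) -> f64 {
    let mut buf = vec![0u8; len];
    let start = Instant::now();
    for _ in 0..iters {
        xor_keystream(&mut buf);
    }
    let secs = start.elapsed().as_secs_f64();
    (secs * CPU_HZ) / (len as f64 * iters as f64)
}

fn main() {
    println!("~{:.2} cpb at 16 KiB buffers", estimate_cpb(16 * 1024, 1000));
}
```

Note how buffer size is a parameter: small buffers stay in L1 cache while large ones hit memory, which is one of the skew sources mentioned above.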
SSE2 optimizations seem like a niche feature. I don't think people running pre-AVX2 CPUs care much about performance.
It is still a 30% improvement. Plus, it would make the
I wondered for a while why the NEON backend is able to process four blocks at a time while the SSE2 backend could only process one block at a time. ChatGPT simply told me that NEON is better at parallelization than SSE2, but I wasn't buying it. I decided to increase the output buffer of the SSE2 backend from one block to four blocks, and I saw a massive performance boost to about 1.6 cpb. I went ahead and cleaned up my old code and converted it to the state that it's in now, keeping most of the original chacha20 code intact.
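The multi-block idea above can be sketched in plain Rust. This is not the crate's implementation, just an illustration of the data layout wide SIMD backends use: each state word becomes a vector of four lanes, one per block, so a single pass of the round function advances four keystream blocks at once. On x86 these `[u32; 4]` lanes map directly onto 128-bit SSE2 registers.

```rust
// Four lanes: word i of blocks n, n+1, n+2, n+3 packed side by side.
type Lanes = [u32; 4];

fn add(a: Lanes, b: Lanes) -> Lanes {
    [a[0].wrapping_add(b[0]), a[1].wrapping_add(b[1]),
     a[2].wrapping_add(b[2]), a[3].wrapping_add(b[3])]
}

fn xor(a: Lanes, b: Lanes) -> Lanes {
    [a[0] ^ b[0], a[1] ^ b[1], a[2] ^ b[2], a[3] ^ b[3]]
}

fn rotl(a: Lanes, n: u32) -> Lanes {
    [a[0].rotate_left(n), a[1].rotate_left(n),
     a[2].rotate_left(n), a[3].rotate_left(n)]
}

/// One ChaCha quarter round (RFC 8439), applied to all four lanes at once.
/// Every operation here has a direct SSE2 counterpart (paddd, pxor, shifts).
fn quarter_round(a: Lanes, b: Lanes, c: Lanes, d: Lanes) -> (Lanes, Lanes, Lanes, Lanes) {
    let a = add(a, b); let d = rotl(xor(d, a), 16);
    let c = add(c, d); let b = rotl(xor(b, c), 12);
    let a = add(a, b); let d = rotl(xor(d, a), 8);
    let c = add(c, d); let b = rotl(xor(b, c), 7);
    (a, b, c, d)
}

fn main() {
    // Only the block counter differs per lane; here it sits in the `d` word.
    let counters: Lanes = [0, 1, 2, 3];
    let (a, _b, _c, d) = quarter_round(
        [0x6170_7865; 4], [0x3320_646e; 4], [0x7962_2d32; 4], counters);
    println!("lane 0: a={:08x} d={:08x}", a[0], d[0]);
}
```

NEON's four-block layout works the same way, which is why lifting the SSE2 backend from one block to four recovers comparable throughput gains.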