zeroize: use asm! to improve performance #841
base: master
Conversation
The purpose of this change is to make calls to `x.as_mut_slice().zeroize()` considerably faster, particularly for types like `[u8; n]`. We take @sopium's proposed code from #743 without significant changes.

The reason it becomes faster is that the call to `volatile_set` before this change does not appear to be easily optimizable, and (for example) leads to setting bytes one at a time instead of the compiler consolidating them into SIMD instructions.

In the modified code we don't use `volatile_set`; instead we loop over the slice setting the elements to `Default::default()`, and to ensure that the writes are not optimized out, we use an empty asm block. (There is discussion of the correct asm options to use in the issue.) Because the asm block potentially reads from the pointer and could make a syscall of some kind, the compiler cannot optimize out the zeroizing without risking observable side effects. In the improved code, we only create such an optimization barrier once, rather than after each byte that is written.

The call to `atomic_fence()` is not changed.

---

This change may give users a way to improve performance if they have to zeroize very large objects, or frequently have to zeroize many small objects. We tested code-gen in godbolt (in addition to the tests posted in the GitHub issue) and found that this change is typically enough for LLVM to start emitting SIMD instructions that zero many bytes at once.
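Roughly, the approach looks like the following sketch (the free function `zeroize_slice` and its exact shape are illustrative assumptions, not the PR's actual diff):

```rust
use core::arch::asm;

/// Illustrative sketch: overwrite every element with its default value, then
/// emit an empty `asm!` block that claims to read the slice's pointer, so the
/// compiler must assume the writes are observable and cannot elide them.
fn zeroize_slice<Z: Default>(slice: &mut [Z]) {
    for elem in slice.iter_mut() {
        *elem = Z::default();
    }
    // One optimization barrier for the whole slice, instead of a volatile
    // write per element.
    unsafe {
        asm!(
            "/* {ptr} */",
            ptr = in(reg) slice.as_mut_ptr(),
            options(nostack, readonly, preserves_flags),
        );
    }
}
```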
This should probably be feature gated to avoid a massive MSRV bump and disrupting existing users. That was very problematic the last time we bumped MSRV to add const generic support. And if all the inline ASM is doing is providing an optimization barrier, it seems like…
I'm a little worried about this part of the documentation for […]. Maybe we should just use the asm and feature gate it?
Oh wow, definitely an important detail about […]. An ASM optimization barrier seems good then, although let me run this implementation by a few people. It would definitely still be good to feature gate it in order to preserve the MSRV.
Hmm, when did that warning get added? It doesn't appear in the current stable docs. Is it new?
Maybe I looked up the wrong docs; my URL says "beta". Maybe they removed it later.
Maybe we can look up the implementation; if it's very similar to @sopium's suggested barrier, then maybe it's fine.
This is the optimization barrier @chandlerc recommended (C++ version, similar idea): https://compiler-explorer.com/z/bh9WzvTPq
To me that implies those docs were recently added. I guess we'll see what happens in the next release.
```rust
core::arch::asm!(
    "/* {ptr} */",
    ptr = in(reg) self.as_mut_ptr(),
    options(nostack, readonly, preserves_flags),
);
```
`asm!` is only stable for x86/x86-64, ARM/AArch64, and RISC-V, so its usage needs to be gated for those platforms.
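For illustration, platform gating along those lines might look roughly like this (the `cfg` list and the `asm_barrier` module name are assumptions, not the PR's code):

```rust
/// Compile the asm-based barrier only on architectures where `asm!` is
/// stable; other targets would keep the existing volatile-write path.
#[cfg(any(
    target_arch = "x86",
    target_arch = "x86_64",
    target_arch = "arm",
    target_arch = "aarch64",
    target_arch = "riscv32",
    target_arch = "riscv64",
))]
mod asm_barrier {
    /// Empty asm block that pretends to read from `ptr`, preventing earlier
    /// writes to that memory from being optimized away.
    pub unsafe fn barrier(ptr: *mut u8) {
        core::arch::asm!(
            "/* {ptr} */",
            ptr = in(reg) ptr,
            options(nostack, readonly, preserves_flags),
        );
    }
}
```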
Line 32 in f8f6f6e:

> - No FFI or inline assembly! **WASM friendly** (and tested)!

This guarantee needs to be updated if this change is merged.
Elsewhere we feature-gate `asm`, so we could perhaps maintain that guarantee so long as the `asm` feature is off.
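As a rough sketch (the `asm` feature name is taken from the comment above; the `zeroize_bytes` helper and the dispatch shape are assumptions for illustration), the guarantee could be preserved like so:

```rust
// With the feature enabled: plain writes followed by one asm barrier.
#[cfg(feature = "asm")]
fn zeroize_bytes(bytes: &mut [u8]) {
    for b in bytes.iter_mut() {
        *b = 0;
    }
    unsafe {
        core::arch::asm!(
            "/* {ptr} */",
            ptr = in(reg) bytes.as_mut_ptr(),
            options(nostack, readonly, preserves_flags),
        );
    }
}

// With the feature disabled: keep today's volatile writes, so the
// "no inline assembly" / WASM-friendly guarantee still holds.
#[cfg(not(feature = "asm"))]
fn zeroize_bytes(bytes: &mut [u8]) {
    for i in 0..bytes.len() {
        unsafe { core::ptr::write_volatile(bytes.as_mut_ptr().add(i), 0) };
    }
}
```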
Worth following some discussion here: https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/black_box.20and.20crypto