Zeroize performance on u8 arrays #743
@sopium we've done practically no optimization work, so I'm glad you're thinking about it. Inline assembly is fine, but would need to be feature gated due to the higher MSRV (we can eventually consider bumping it, but our last bump was already painful). For something like …
It boils down to LLVM not being able to optimize a loop of `write_volatile` calls. I think monomorphizing for 16/32/64-byte arrays should be fine. These are also commonly used key sizes in cryptography, so the optimization can be justified. For a `u8` slice with an unknown but largish size you can also use:

```rust
pub fn zeroize_slice(x: &mut [u8]) {
    unsafe {
        // Use __m128 on x86/x86_64.
        let (p, m, s) = x.align_to_mut::<u64>();
        for b in p {
            core::ptr::write_volatile(b, 0);
        }
        for m in m {
            core::ptr::write_volatile(m, 0);
        }
        for b in s {
            core::ptr::write_volatile(b, 0);
        }
    }
}
```
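As a quick illustration of how that helper would be used (an illustrative check, not from the thread):

```rust
fn main() {
    // Fill a buffer with non-zero data, wipe it, and confirm every byte is cleared.
    let mut buf = vec![0xAAu8; 1000];
    zeroize_slice(&mut buf);
    assert!(buf.iter().all(|&b| b == 0));
}
```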
The use of `write_volatile` should ensure the writes aren't optimized away. But inline assembly should ensure that as well, and permit a highly performant implementation.
Yes, I understand this. If it were a “normal” write in a loop, LLVM would be able to recognize it and generate efficient instructions or call `memset`.
Yeah, the typical way to implement a zeroization primitive efficiently is some sort of volatile `memset`. While Rust has bindings to LLVM's volatile memset intrinsic, it is perma-unstable and no work has been done on a stable API AFAIK: https://doc.rust-lang.org/std/intrinsics/fn.volatile_set_memory.html

But perhaps inline assembly could provide the next best thing, albeit on a target-architecture-by-architecture basis.
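To make that concrete, a per-architecture helper along those lines could look something like the sketch below. It is only an illustration of the idea, not code from the crate: the function name is made up, it covers x86_64 only, and it uses `rep stosb` as the opaque store loop.

```rust
/// Hypothetical sketch: zero a byte slice with `rep stosb`. The asm block is
/// opaque to the optimizer, so the stores cannot be elided.
#[cfg(target_arch = "x86_64")]
pub fn zeroize_bytes_x86_64(x: &mut [u8]) {
    unsafe {
        core::arch::asm!(
            "rep stosb",
            // rdi = destination pointer, rcx = byte count; both are clobbered.
            inout("rdi") x.as_mut_ptr() => _,
            inout("rcx") x.len() => _,
            // stosb stores AL, the low byte of rax.
            in("rax") 0u64,
            options(nostack, preserves_flags),
        );
    }
}
```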
We can not use …
I was suggesting using … As for splitting into aligned and unaligned parts, that's exactly what … does.
Oh, this reminds me: now that we have stable inline assembly, this can be done too, and it's not architecture dependent:

```rust
pub fn zeroize(x: &mut [u8]) {
    for b in x.iter_mut() {
        *b = 0;
    }
    unsafe {
        core::arch::asm!(
            "/* {ptr} */",
            ptr = in(reg) x.as_mut_ptr(),
            options(nostack, readonly, preserves_flags),
        );
    }
}

pub fn f() {
    let mut x = [0u8; 32];
    zeroize(&mut x);
}
```

Playground: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=32fc07b8173bd2ce233eaa9bf0d8d81a
@sopium I have a question about the use of the `asm!` options here.

I watched a talk by Chandler Carruth (major contributor to LLVM) a long time ago, where he described creating similar functions in order to defeat the optimizer for benchmarking purposes: https://www.youtube.com/watch?v=nXaxk27zwlk&t=2476s

I realize that this is a C++ talk and not Rust, but they are both going through LLVM in the end, so there's a lot of overlap. In his talk he seems to say that it's important for the optimizer to believe that the asm block may actually touch the memory behind the pointer. In your block you use the options `nostack`, `readonly`, and `preserves_flags`.
So, the optimizer has to believe that:

1. the block doesn't touch the stack (`nostack`),
2. the block may read memory but never writes it (`readonly`),
3. the block doesn't modify the flags (`preserves_flags`).

So 3 is good because it will help the efficiency of code using this block. But it's hard for me to understand: if we're promising the optimizer that flags are not changed, registers are not changed (only the `in(reg)` for the pointer is passed in), and memory is at most read, why can't it conclude that the block, and the writes before it, can be removed?
@cbeck88 `readonly` only promises that the block doesn't write memory; it may still read it, so the zeroing writes before the block have to actually happen. And because the block is not marked `pure`, the compiler must assume it has other side effects (e.g. a syscall) and cannot delete it.
I see, thank you.
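For reference, here is the barrier from the playground example again with each option annotated along the lines of that explanation (the annotations are a paraphrase of the discussion, not crate documentation):

```rust
pub fn zeroize(x: &mut [u8]) {
    // Ordinary stores the optimizer is free to vectorize.
    for b in x.iter_mut() {
        *b = 0;
    }
    unsafe {
        core::arch::asm!(
            // Empty template: the block emits no instructions at all.
            "/* {ptr} */",
            // The pointer is an input, so the block may read the freshly zeroed bytes.
            ptr = in(reg) x.as_mut_ptr(),
            // nostack: the block doesn't push to the stack or touch the red zone.
            // readonly: the block may read memory but never writes it.
            // preserves_flags: the block leaves the CPU flags alone.
            // `pure` is NOT given, so the block is assumed to have side effects and
            // cannot be deleted; since it may read the slice, the stores above must
            // actually be performed before it runs.
            options(nostack, readonly, preserves_flags),
        );
    }
}
```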
It looks to me that one barrier to actually creating a patch that does this is the need for some form of the specialization language feature. We currently have these impls:

Line 361 in 91c2c21
Line 461 in 91c2c21

If we wanted to also add impls for the special case of `u8`, they would overlap with those. I had this same problem in a hashing library: https://github.com/mobilecoinfoundation/mobilecoin/blob/b07c17fd2e1946c15b3a38c30b0fb1dc4ab516e3/crypto/digestible/src/lib.rs#L218

What I decided to do there was simply not support my trait on the primitive type `u8`. In your case you already shipped a stable API which supports `u8`.
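To illustrate why specialization comes up, the conflict looks roughly like the sketch below (simplified trait and impls, not the crate's actual code):

```rust
pub trait MyZeroize {
    fn my_zeroize(&mut self);
}

// Generic impl for arrays of any element type that can be reset to a default value...
impl<Z: Default + Copy, const N: usize> MyZeroize for [Z; N] {
    fn my_zeroize(&mut self) {
        for z in self.iter_mut() {
            *z = Z::default();
        }
    }
}

// ...plus a hand-optimized impl just for byte arrays. Without specialization,
// rustc rejects this pair with error[E0119] (conflicting implementations),
// because `u8` also satisfies `Default + Copy`.
impl<const N: usize> MyZeroize for [u8; N] {
    fn my_zeroize(&mut self) {
        self.fill(0); // fast path would go here
    }
}
```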
Good point. So do you see a path forward for this, or just wait for specialization? It seems to me we could make an optimized "zeroize_bytes" function based on the approach here, and people could either implement `Zeroize` in terms of it or call it directly.
A free function works, although it is somewhat inelegant.

It might be possible to optimize the impl on slices. Something like this (a sketch: `movups` and `slow_path` stand in for a wide volatile store and the existing element-by-element path):

```rust
const QUADWORD_SIZE: usize = 8;
const SIZE_OF_T: usize = mem::size_of::<T>();

if (SIZE_OF_T < QUADWORD_SIZE) && (QUADWORD_SIZE % SIZE_OF_T == 0) {
    const COPIES_IN_QUADWORD: usize = QUADWORD_SIZE / SIZE_OF_T;
    let quadword = [T::default(); COPIES_IN_QUADWORD];

    let mut iter = slice.chunks_exact_mut(COPIES_IN_QUADWORD);
    for chunk in iter.by_ref() {
        // write one quadword's worth of T::default() over the chunk
        movups(chunk, &quadword as *const T as *const u64);
    }
    slow_path(iter.into_remainder());
} else {
    slow_path(slice);
}
```
Interesting idea, maybe this is a good use-case for https://doc.rust-lang.org/std/primitive.slice.html#method.align_to_mut

Looking more at the code, it looks to me that we already have this impl (maybe not directly relevant to OP): Line 474 in 91c2c21

```rust
/// Impl [`Zeroize`] on slices of types that can be zeroized with [`Default`].
///
/// This impl can eventually be optimized using a memset intrinsic,
/// such as [`core::intrinsics::volatile_set_memory`]. For that reason the
/// blanket impl on slices is bounded by [`DefaultIsZeroes`].
///
/// To zeroize a mut slice of `Z: Zeroize` which does not impl
/// [`DefaultIsZeroes`], call `iter_mut().zeroize()`.
impl<Z> Zeroize for [Z]
where
    Z: DefaultIsZeroes,
{
    fn zeroize(&mut self) {
        assert!(self.len() <= isize::MAX as usize);

        // Safety:
        //
        // This is safe, because the slice is well aligned and is backed by a single allocated
        // object for at least `self.len()` elements of type `Z`.
        // `self.len()` is also not larger than an `isize`, because of the assertion above.
        // The memory of the slice should not wrap around the address space.
        unsafe { volatile_set(self.as_mut_ptr(), Z::default(), self.len()) };
        atomic_fence();
    }
}
```

It's too bad that … But maybe this is the place to do this kind of optimization work, and people can call it directly on a slice when they need the fast path.
The impls on slice types call `volatile_set`.
Here's the thing -- I bet that `volatile_set` is exactly what's preventing the compiler from generating fast code here.

Here's an idea: maybe we can refactor things so that we don't have special-casing based on `DefaultIsZeroes`. What if we made `Zeroize` look like this:

```rust
pub trait Zeroize {
    /// The bit pattern that we will set objects of this type to when zeroizing them
    fn zero_pattern() -> Self;

    /// Write the zero pattern over an object, can't be optimized away
    fn zeroize(&mut self) {
        volatile_write(self, Self::zero_pattern());
        atomic_fence();
    }
}
```

Then the idea is we could have every impl expressed in terms of `zero_pattern`. I think we might not need to wait for specialization if we did that, or something along these lines.
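Under that hypothetical trait shape, impls for padding-free primitives would just supply their pattern, e.g. (a sketch following the proposal above, not real crate code):

```rust
impl Zeroize for u8 {
    fn zero_pattern() -> Self {
        0
    }
}

impl Zeroize for bool {
    fn zero_pattern() -> Self {
        false
    }
}
```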
First, you're proposing breaking changes to a post-1.0 crate which exposes a trait-oriented API. Those would only be accepted as a last resort. But also, …
Yeah -- I see that, it's going to get more complicated when you want the implementation for … Intuitively, what I'd like to say is: what if we try to implement the general case where …? I understand it's a breaking change, just trying to think about directions that might make things simpler or better.
The big problem with any breaking change is how widely `zeroize` is used. It has ~500 downstream dependencies, and any kind of upgrade needs to be coordinated across all of them. So really, breaking changes need to be reserved for things that absolutely require them, and I don't see any reason that mandates such a drastic change.
That's not an issue. The impl of … The impl on …
I'm just trying to understand: even if we did optimize the slice impl, would users actually hit it?

Because if I have code like OP's, it's not going to call the zeroize function for slices, it's going to call the impl for the array type.

Suppose I have code like … This is not going to call the slice impl either.

In the meantime, from experiments it looks like calling it through a slice, e.g. `x.as_mut_slice().zeroize()`, will convince the compiler to pick the faster implementation, so maybe we could do the optimization we're talking about and document that that's what you have to do as a user if zeroizing one byte at a time isn't good enough? And for a struct like …, it would be nice if we could make this easier for the user of the crate somehow, but I guess I don't see a way if we're not interested in discussing zeroize 2.0 here. Thanks.
Borrow the value as a mutable slice, then call `zeroize()` on that. It's a bit of a pain, but the best we can do without specialization.

I still don't see anything worth making breaking changes over until specialization lands, and even then it would probably still make sense to have a …
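In code, the workaround being described is simply the following (the function name is illustrative; it assumes the `zeroize` crate's existing impls plus the optimized slice impl discussed above):

```rust
use zeroize::Zeroize;

fn wipe_key() {
    let mut key = [0u8; 32];

    // Dispatches to the array impl (the slow path discussed in this thread).
    key.zeroize();

    // Borrow as a mutable slice first, so the blanket `impl Zeroize for [Z]`
    // quoted above is the one that runs.
    key.as_mut_slice().zeroize();
}
```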
The purpose of the change is to make calls to `x.as_mut_slice().zeroize()` considerably faster, particularly for types like `[u8; n]`.

The reason it becomes faster is that the call to `volatile_set` before this change appears not to be easily optimizable, and (for example) leads to setting bytes one at a time, instead of the compiler consolidating them into SIMD instructions.

In the modified code, we don't use `volatile_set`; we instead loop over the slice setting the elements to `Default::default()`, and to ensure that the writes are not optimized out, we use an empty asm block. There is discussion of the correct asm options to use here in the issue. Because the asm block potentially reads from the pointer and makes a syscall of some kind, the compiler cannot optimize out the zeroizing, or it could cause observable side-effects. In the improved code, we only create such an optimization barrier once, rather than after each byte that is written. The call to `atomic_fence()` is not changed.

---

This change may help give users a way to improve performance if they have to zeroize very large objects, or frequently have to zeroize many small objects. We tested code-gen here in godbolt (in addition to the tests posted in the github issue) and found that this change is typically enough for llvm to start adding in SIMD optimizations that zero many bytes at once.
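A sketch of the shape that description implies (not the actual patch; the helper name is made up and the crate's `atomic_fence()` is approximated with `core::sync::atomic::fence`):

```rust
use core::sync::atomic::{fence, Ordering};

fn zeroize_slice_with_barrier<Z: Default + Copy>(slice: &mut [Z]) {
    // Plain stores that the optimizer is free to coalesce into SIMD writes.
    for elem in slice.iter_mut() {
        *elem = Z::default();
    }
    // A single opaque barrier after the loop, instead of one volatile write per element.
    unsafe {
        core::arch::asm!(
            "/* {ptr} */",
            ptr = in(reg) slice.as_mut_ptr(),
            options(nostack, readonly, preserves_flags),
        );
    }
    fence(Ordering::SeqCst);
}
```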
I inspected the generated assembly code and benchmarked `zeroize` for `[u8; 32]` on x86_64 and found it quite inefficient, storing one byte at a time: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=3f44f4b90e6af0eac0dcb1f649390329

On my Ryzen CPU, it takes ~7.8324 ns, or ~1 cpb. Binary code size is also quite large.

Using inline assembly (just stabilized in 1.59) and SSE2, zeroing a `[u8; 32]` takes just 3 instructions and ~492.87 ps (~16 bytes per cycle). So it might be something worth optimizing/documenting.

If you do not want to use inline assembly, maybe you should encourage using larger types or SIMD types, e.g., `[u64; 4]` or `[__m128; 2]` instead of `[u8; 32]`. Using `write_volatile` on `*mut __m128` generates equally compact and efficient code as the assembly code above.
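As a sketch of that last suggestion (illustrative only; the function name is made up):

```rust
use core::ptr::write_volatile;

fn wipe_wide_key() {
    // A 32-byte key stored as four u64 words instead of 32 individual bytes.
    let mut key = [0u64; 4];
    for word in key.iter_mut() {
        // Each volatile store now clears 8 bytes at a time.
        unsafe { write_volatile(word, 0) };
    }
}
```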