zeroize: use asm! to improve performance #841
base: master
Conversation
The purpose of this change is to make calls to `x.as_mut_slice().zeroize()` considerably faster, particularly for types like `[u8; n]`. We take @sopium's proposed code from #743 without significant changes.

The reason it becomes faster is that the call to `volatile_set` before this change does not appear to be easily optimizable, and (for example) leads to setting bytes one at a time instead of the compiler consolidating them into SIMD instructions.

In the modified code we don't use `volatile_set`; instead we loop over the slice setting the elements to `Default::default()`, and to ensure that the writes are not optimized out, we use an empty asm block. (There is discussion of the correct asm options to use in the issue.) Because the asm block potentially reads from the pointer and could make a syscall of some kind, the compiler cannot optimize out the zeroizing without risking observable side effects. In the improved code, we only create such an optimization barrier once, rather than after each byte that is written.

The call to `atomic_fence()` is not changed.

---

This change may give users a way to improve performance if they have to zeroize very large objects, or frequently have to zeroize many small objects. We tested code-gen in godbolt (in addition to the tests posted in the GitHub issue) and found that this change is typically enough for LLVM to start emitting SIMD instructions that zero many bytes at once.
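Roughly, the approach looks like the following sketch (the free function `zeroize_slice` and its exact shape are illustrative assumptions, not the PR's actual diff):

```rust
use core::arch::asm;

/// Illustrative sketch: overwrite every element with its default value, then
/// emit an empty `asm!` block that claims to read the slice's pointer, so the
/// compiler must assume the writes are observable and cannot elide them.
fn zeroize_slice<Z: Default>(slice: &mut [Z]) {
    for elem in slice.iter_mut() {
        *elem = Z::default();
    }
    // One optimization barrier for the whole slice, instead of a volatile
    // write per element.
    unsafe {
        asm!(
            "/* {ptr} */",
            ptr = in(reg) slice.as_mut_ptr(),
            options(nostack, readonly, preserves_flags),
        );
    }
}
```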
This should probably be feature gated to avoid a massive MSRV bump and disrupting existing users. That was very problematic the last time we bumped MSRV to add const generic support. And if all the inline ASM is doing is providing an optimization barrier, it seems like…
I'm a little worried about this part of the documentation for […]. Maybe we should just use the asm and feature gate it?
Oh wow, definitely an important detail about […]. An ASM optimization barrier seems good then, although let me run this implementation by a few people. It would definitely still be good to feature gate it in order to preserve the MSRV.
Hmm, when did that warning get added? It doesn't appear in the current stable docs. Is it new?
Maybe I looked up the wrong docs; my URL says "beta". Maybe they removed it later.
Maybe we can look up the implementation; if it's very similar to @sopium's suggested barrier, then maybe it's fine.
This is the optimization barrier @chandlerc recommended (C++ version, similar idea): https://compiler-explorer.com/z/bh9WzvTPq
To me that implies those docs were recently added. I guess we'll see what happens in the next release.
```rust
core::arch::asm!(
    "/* {ptr} */",
    ptr = in(reg) self.as_mut_ptr(),
    options(nostack, readonly, preserves_flags),
);
```
`asm!` is only stable for x86/x86-64, ARM/AArch64, and RISC-V, so its usage needs to be gated for those platforms.
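For illustration, platform gating along those lines might look roughly like this (the `cfg` list and the `asm_barrier` module name are assumptions, not the PR's code):

```rust
/// Compile the asm-based barrier only on architectures where `asm!` is
/// stable; other targets would keep the existing volatile-write path.
#[cfg(any(
    target_arch = "x86",
    target_arch = "x86_64",
    target_arch = "arm",
    target_arch = "aarch64",
    target_arch = "riscv32",
    target_arch = "riscv64",
))]
mod asm_barrier {
    /// Empty asm block that pretends to read from `ptr`, preventing earlier
    /// writes to that memory from being optimized away.
    pub unsafe fn barrier(ptr: *mut u8) {
        core::arch::asm!(
            "/* {ptr} */",
            ptr = in(reg) ptr,
            options(nostack, readonly, preserves_flags),
        );
    }
}
```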
Line 32 in f8f6f6e:

> - No FFI or inline assembly! **WASM friendly** (and tested)!

This guarantee needs to be updated if this change is merged.
Elsewhere we feature-gate `asm`, so we could perhaps maintain that guarantee so long as the `asm` feature is off.
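As a rough sketch (the `asm` feature name is taken from the comment above; the `zeroize_bytes` helper and the dispatch shape are assumptions for illustration), the guarantee could be preserved like so:

```rust
// With the feature enabled: plain writes followed by one asm barrier.
#[cfg(feature = "asm")]
fn zeroize_bytes(bytes: &mut [u8]) {
    for b in bytes.iter_mut() {
        *b = 0;
    }
    unsafe {
        core::arch::asm!(
            "/* {ptr} */",
            ptr = in(reg) bytes.as_mut_ptr(),
            options(nostack, readonly, preserves_flags),
        );
    }
}

// With the feature disabled: keep today's volatile writes, so the
// "no inline assembly" / WASM-friendly guarantee still holds.
#[cfg(not(feature = "asm"))]
fn zeroize_bytes(bytes: &mut [u8]) {
    for i in 0..bytes.len() {
        unsafe { core::ptr::write_volatile(bytes.as_mut_ptr().add(i), 0) };
    }
}
```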
Worth following some discussion here: https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/black_box.20and.20crypto