
Enum field align cause performance degradation about 10x #119247

Open
PureWhiteWu opened this issue Dec 23, 2023 · 12 comments
Labels
A-layout (Area: Memory layout of types)
C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such)
I-slow (Issue: Problems and improvements with respect to performance of generated code)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)

Comments

@PureWhiteWu

PureWhiteWu commented Dec 23, 2023

Hi, I'm the author of the FastStr crate, and recently I found a weird problem: the clone cost of FastStr is really high. For example, cloning an empty FastStr costs about 40ns on amd64, compared to about 4ns for a normal String.

The FastStr itself is a newtype over the inner Repr, which previously had the following layout:

#[derive(Clone)]
pub struct FastStr(Repr);

#[cfg(all(test, target_pointer_width = "64"))]
mod size_asserts {
    static_assertions::assert_eq_size!(super::FastStr, [u8; 40]); // 40 bytes
}

const INLINE_CAP: usize = 38;

#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}

Playground link for old version

After some investigation, I found that this is because the Repr::Inline variant has a really big effect on performance. After I added padding to the Repr::Inline variant (by changing the type of len from u8 to usize), the performance of cloning a Repr::Empty (and all the other variants) improved about 9x, from 40ns to 4ns. But the root cause is still not clear:

const INLINE_CAP: usize = 24; // This is because I don't want to enlarge the size of FastStr

#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: usize, buf: [u8; INLINE_CAP] },
}

Playground link for new version

A simple criterion benchmark code for the old version:

use bytes::Bytes;
use std::sync::Arc;
use criterion::{black_box, criterion_group, criterion_main, Criterion};

const INLINE_CAP: usize = 38;

#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}

fn criterion_benchmark(c: &mut Criterion) {
    let s = Repr::Empty;
    c.bench_function("empty repr", |b| b.iter(|| black_box(s.clone())));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

For a full benchmark, you may refer to: https://github.com/volo-rs/faststr/blob/main/benches/faststr.rs

Related PR: volo-rs/faststr#6
And commit: volo-rs/faststr@342bdc9

Furthermore, I've tried the following methods, but none of them helps:

  1. only changing INLINE_CAP to 24
  2. changing INLINE_CAP to 22 and adding a padding field to the Inline variant: Inline { _pad: u64, len: u8, buf: [u8; INLINE_CAP] }
  3. changing INLINE_CAP to 22 and adding a new struct Inline without the _pad field

Changing INLINE_CAP to 22 is only to avoid increasing the size of FastStr itself when adding the extra padding, so the performance difference has nothing to do with it.

Edit: related discussions users.rust-lang.org, reddit

@rustbot added the needs-triage label on Dec 23, 2023
@PureWhiteWu changed the title from "Enum struct field cause performance degradation about 10x" to "Enum field align cause performance degradation about 10x" on Dec 23, 2023
@Noratrieb added the A-layout, C-optimization, I-slow, and T-compiler labels and removed the needs-triage label on Dec 23, 2023
@Noratrieb
Member

Without looking at it in much detail, I suspect this is caused by niche optimizations triggering, introducing more complicated branches. #[repr(C)] may help as well.
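
As a rough sketch, that experiment could look like the following (whether #[repr(C)], which forces the C-style tag-plus-union layout and disables niche optimizations, actually improves the generated code here is not verified):

use bytes::Bytes;
use std::sync::Arc;

const INLINE_CAP: usize = 38;

// Same enum as in the issue, with an explicit representation added as an experiment.
// Note: #[repr(C)] typically uses a c_int-sized tag and may change the overall size.
#[derive(Clone)]
#[repr(C)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}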

@PureWhiteWu
Author

Without looking at it in much detail, I suspect this is caused by niche optimizations triggering, introducing more complicated branches. #[repr(C)] may help as well.

Hi, thanks very much for your reply! Do you mean that the alignment chosen by the compiler may cause some optimizations to be disabled?

@Noratrieb
Member

I tried to minimize it into a godbolt: https://godbolt.org/z/zaeGe9Evh

@Noratrieb
Member

Noratrieb commented Dec 23, 2023

Cleaning up the assembly by hand:

before
do_clone:
        movzx   eax, byte ptr [rsi]
        lea     rcx, [rip + .LJTI0_0]
        movsxd  rax, dword ptr [rcx + 4*rax]
        add     rax, rcx
        jmp     rax
.Empty:
        xor     eax, eax
        mov     byte ptr [rdi], al
        mov     rax, rdi
        ret
.Bytes:
        mov     rax, qword ptr [rsi + 8]
        mov     qword ptr [rdi + 8], rax
        mov     al, 1
        mov     byte ptr [rdi], al
        mov     rax, rdi
        ret
.ArcStr:
        mov     rax, qword ptr [rsi + 8]
        mov     rcx, qword ptr [rsi + 16]
        lock            inc     qword ptr [rax]
        jle     .LBB0_10
        mov     qword ptr [rdi + 8], rax
        mov     qword ptr [rdi + 16], rcx
        mov     al, 2
        mov     byte ptr [rdi], al
        mov     rax, rdi
        ret
.ArcString:
        mov     rax, qword ptr [rsi + 8]
        lock            inc     qword ptr [rax]
        jle     .LBB0_10
        mov     qword ptr [rdi + 8], rax
        mov     al, 3
        mov     byte ptr [rdi], al
        mov     rax, rdi
        ret
.StaticStr:
        movups  xmm0, xmmword ptr [rsi + 8]
        movups  xmmword ptr [rdi + 8], xmm0
        mov     al, 4
        mov     byte ptr [rdi], al
        mov     rax, rdi
        ret
.Inline:
        movzx   eax, byte ptr [rsi + 1]
        mov     rcx, qword ptr [rsi + 32]
        mov     qword ptr [rdi + 32], rcx
        movups  xmm0, xmmword ptr [rsi + 18]
        movups  xmmword ptr [rdi + 18], xmm0
        movups  xmm0, xmmword ptr [rsi + 2]
        movups  xmmword ptr [rdi + 2], xmm0
        mov     byte ptr [rdi + 1], al
        mov     al, 5
        mov     byte ptr [rdi], al
        mov     rax, rdi
        ret
.unreachable:
        ud2
        ud2
.LJTI0_0:
        .long   .LBB0_1-.LJTI0_0
        .long   .LBB0_2-.LJTI0_0
        .long   .LBB0_3-.LJTI0_0
        .long   .LBB0_5-.LJTI0_0
        .long   .LBB0_7-.LJTI0_0
        .long   .LBB0_8-.LJTI0_0
after
do_clone_after:
        mov     rax, qword ptr [rsi]
        lea     rcx, [rip + .LJTI1_0]
        movsxd  rdx, dword ptr [rcx + 4*rax]
        add     rdx, rcx
        jmp     rdx
.Empty:
        mov     qword ptr [rdi], rax
        mov     rax, rdi
        ret
.Bytes:
        mov     rcx, qword ptr [rsi + 8]
        mov     qword ptr [rdi + 8], rcx
        mov     qword ptr [rdi], rax
        mov     rax, rdi
        ret
.ArcStr:
        mov     rcx, qword ptr [rsi + 8]
        mov     rdx, qword ptr [rsi + 16]
        lock            inc     qword ptr [rcx]
        jle     .LBB1_5
        mov     qword ptr [rdi + 8], rcx
        mov     qword ptr [rdi + 16], rdx
.ArcString:
        mov     rcx, qword ptr [rsi + 8]
        lock            inc     qword ptr [rcx]
        jle     .LBB1_5
        mov     qword ptr [rdi + 8], rcx
        mov     qword ptr [rdi], rax
        mov     rax, rdi
        ret
.StaticStr:
        movups  xmm0, xmmword ptr [rsi + 8]
        movups  xmmword ptr [rdi + 8], xmm0
        mov     qword ptr [rdi], rax
        mov     rax, rdi
        ret
.Inline:
        mov     rcx, qword ptr [rsi + 8]
        mov     rdx, qword ptr [rsi + 32]
        mov     qword ptr [rdi + 32], rdx
        movups  xmm0, xmmword ptr [rsi + 16]
        movups  xmmword ptr [rdi + 16], xmm0
        mov     qword ptr [rdi + 8], rcx
        mov     qword ptr [rdi], rax
        mov     rax, rdi
        ret
.unreachable:
        ud2
        ud2
.LJTI1_0:
        .long   .LBB1_9-.LJTI1_0
        .long   .LBB1_1-.LJTI1_0
        .long   .LBB1_2-.LJTI1_0
        .long   .LBB1_4-.LJTI1_0
        .long   .LBB1_6-.LJTI1_0
        .long   .LBB1_7-.LJTI1_0

@Noratrieb
Member

Noratrieb commented Dec 23, 2023

-Zprint-type-sizes

print-type-size type: `before::Repr`: 40 bytes, alignment: 8 bytes
print-type-size     discriminant: 1 bytes
print-type-size     variant `Inline`: 39 bytes
print-type-size         field `.len`: 1 bytes
print-type-size         field `.buf`: 38 bytes
print-type-size     variant `ArcStr`: 23 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 16 bytes, alignment: 8 bytes
print-type-size     variant `StaticStr`: 23 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 16 bytes, alignment: 8 bytes
print-type-size     variant `Bytes`: 15 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 8 bytes, alignment: 8 bytes
print-type-size     variant `ArcString`: 15 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 8 bytes, alignment: 8 bytes
print-type-size     variant `Empty`: 0 bytes
print-type-size type: `after::Repr`: 40 bytes, alignment: 8 bytes
print-type-size     discriminant: 8 bytes
print-type-size     variant `Inline`: 32 bytes
print-type-size         field `.len`: 8 bytes
print-type-size         field `.buf`: 24 bytes
print-type-size     variant `ArcStr`: 16 bytes
print-type-size         field `.0`: 16 bytes
print-type-size     variant `StaticStr`: 16 bytes
print-type-size         field `.0`: 16 bytes
print-type-size     variant `Bytes`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `ArcString`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `Empty`: 0 bytes

before gets a lot more padding, which is probably correlated with the more verbose assembly.

@Noratrieb
Member

Looking at the assembly more, I think the issue here comes from the alignment allowing the discriminant to be bigger. The bigger discriminant causes less code to be emitted because it doesn't have to bother carefully setting just one byte to zero, it can just write back the discriminant that it read. Even though it would be allowed to write a full 8 byte discriminant for every variant except Inline, it never does.
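
If that's the case, a possible follow-up experiment (a sketch only, not something verified in this thread) would be to request a wider tag explicitly via a primitive representation instead of padding len:

use bytes::Bytes;
use std::sync::Arc;

const INLINE_CAP: usize = 38;

// Hypothetical variation: #[repr(u64)] asks for a u64 tag, so the compiler is
// free to read and write the discriminant as a full 8-byte value without the
// `len: usize` trick. Whether this reproduces the faster codegen, and what it
// does to the total size, would still need to be measured.
#[derive(Clone)]
#[repr(u64)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}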

@PureWhiteWu
Author

@Nilstrieb Hi, thanks very much for your investigation and explanation!
I wonder whether the alignment issue alone could cause the Empty variant's clone cost to jump from 4ns to 40ns?

@the8472
Member

the8472 commented Dec 23, 2023

You posted this in at least 3 different places. It would be good to link to the others to avoid duplicated effort.
users.rust-lang.org, reddit

@PureWhiteWu
Author

You posted this in at least 3 different places. It would be good to link to the others to avoid duplicated effort. users.rust-lang.org, reddit

Thanks very much! I have added these links to the description!

@LYF1999
Contributor

LYF1999 commented Dec 25, 2023

Here is the output using read_volatile: https://godbolt.org/z/rznoTfevb
Maybe it's because of the difference in the instructions.
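
For reference, a minimal sketch (a reconstruction, not the exact godbolt code) of how read_volatile can be used to keep the clone from being optimized away when inspecting the codegen:

use std::sync::Arc;

const INLINE_CAP: usize = 38;

// Trimmed-down Repr, as defined earlier in the thread.
#[derive(Clone)]
enum Repr {
    Empty,
    ArcStr(Arc<str>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}

#[inline(never)]
pub fn do_clone(r: &Repr) -> Repr {
    // Volatile-load the reference itself so the optimizer can't assume anything
    // about which variant it points at; std::hint::black_box would work as well.
    let r: &Repr = unsafe { std::ptr::read_volatile(&r) };
    r.clone()
}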

@LYF1999
Contributor

LYF1999 commented Dec 25, 2023

Hm....
These are the LLVM types of the two versions:

%"after::Repr" = type { i64, [4 x i64] }
%"before::Repr" = type { i8, [39 x i8] }

Maybe the layout of before::Repr is just too bad?

@alexpyattaev

Given that you only have about 6 variants in the enum, and you need the discriminant for the enum anyway, why not just make ~24 additional versions of Inline, one for each possible length? You'd save a bunch of space this way as well, and I'm quite certain the resulting type will be easier to optimize when your inline length is known at compile time.
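
A sketch of this suggestion (hypothetical variant names, with only the first few lengths written out) to make the shape concrete:

use std::sync::Arc;

// One variant per inline length, so the enum tag itself encodes the length and
// `len` no longer has to be stored next to the buffer.
#[derive(Clone)]
enum Repr {
    Empty,
    ArcStr(Arc<str>),
    StaticStr(&'static str),
    Inline0,
    Inline1([u8; 1]),
    Inline2([u8; 2]),
    Inline3([u8; 3]),
    // ...and so on, following the same pattern up to Inline24([u8; 24]).
}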
