
Enum field align cause performance degradation about 10x #119247

Open
PureWhiteWu opened this issue Dec 23, 2023 · 12 comments
Labels
A-layout (Area: Memory layout of types)
C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such)
I-slow (Issue: Problems and improvements with respect to performance of generated code)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)

Comments

@PureWhiteWu

PureWhiteWu commented Dec 23, 2023

Hi, I'm the author of the FastStr crate, and recently I found a weird problem: the clone cost of FastStr is really high. For example, cloning an empty FastStr costs about 40ns on amd64, compared to about 4ns for a normal String.

The FastStr itself is a newtype over the inner Repr, which previously had the following layout:

#[derive(Clone)]
pub struct FastStr(Repr);

#[cfg(all(test, target_pointer_width = "64"))]
mod size_asserts {
    static_assertions::assert_eq_size!(super::FastStr, [u8; 40]); // 40 bytes
}

const INLINE_CAP: usize = 38;

#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}

Playground link for old version

After some investigation, I found that this is because the Repr::Inline variant has a really big effect on performance. After I added padding to the Repr::Inline variant (by changing the type of len from u8 to usize), the performance of cloning a Repr::Empty (and all the other variants) improved about 9x, from 40ns to 4ns. But the root cause is still not clear:

const INLINE_CAP: usize = 24; // This is because I don't want to enlarge the size of FastStr

#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: usize, buf: [u8; INLINE_CAP] },
}

Playground link for new version

A simple criterion benchmark code for the old version:

use bytes::Bytes;
use std::sync::Arc;
use criterion::{black_box, criterion_group, criterion_main, Criterion};

const INLINE_CAP: usize = 38;

#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}

fn criterion_benchmark(c: &mut Criterion) {
    let s = Repr::Empty;
    c.bench_function("empty repr", |b| b.iter(|| black_box(s.clone())));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

For a full benchmark, you may refer to: https://github.com/volo-rs/faststr/blob/main/benches/faststr.rs

Related PR: volo-rs/faststr#6
And commit: volo-rs/faststr@342bdc9

Furthermore, I've tried the following methods, but none of them helps:

  1. only changing INLINE_CAP to 24
  2. changing INLINE_CAP to 22 and adding a padding field to the Inline variant: Inline { _pad: u64, len: u8, buf: [u8; INLINE_CAP] }
  3. changing INLINE_CAP to 22 and adding a new struct Inline without the _pad field

Changing INLINE_CAP to 22 is only to avoid increasing the size of FastStr itself when adding the extra padding, so the performance difference has nothing to do with it.

Edit: related discussions users.rust-lang.org, reddit

@rustbot added the needs-triage label on Dec 23, 2023
@PureWhiteWu changed the title from "Enum struct field cause performance degradation about 10x" to "Enum field align cause performance degradation about 10x" on Dec 23, 2023
@Noratrieb added the A-layout, C-optimization, I-slow, and T-compiler labels and removed the needs-triage label on Dec 23, 2023
@Noratrieb
Member

Without looking at it in much detail, I suspect this is caused by niche optimizations triggering, introducing more complicated branches. #[repr(C)] may help as well.
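
As a rough sketch, that experiment could look like the following (whether #[repr(C)], which forces the C-style tag-plus-union layout and disables niche optimizations, actually improves the generated code here is not verified):

use bytes::Bytes;
use std::sync::Arc;

const INLINE_CAP: usize = 38;

// Same enum as in the issue, with an explicit representation added as an experiment.
// Note: #[repr(C)] typically uses a c_int-sized tag and may change the overall size.
#[derive(Clone)]
#[repr(C)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}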

@PureWhiteWu
Author

Without looking at it in much detail, I suspect this is caused by niche optimizations triggering, introducing more complicated branches. #[repr(C)] may help as well.

Hi, thanks very much for your reply! Do you mean that the alignment chosen by the compiler may cause some optimizations to be disabled?

@Noratrieb
Member

I tried to minimize it into a godbolt: https://godbolt.org/z/zaeGe9Evh

@Noratrieb
Member

Noratrieb commented Dec 23, 2023

Cleaning up the assembly by hand:

before
do_clone:
        movzx   eax, byte ptr [rsi]
        lea     rcx, [rip + .LJTI0_0]
        movsxd  rax, dword ptr [rcx + 4*rax]
        add     rax, rcx
        jmp     rax
.Empty:
        xor     eax, eax
        mov     byte ptr [rdi], al
        mov     rax, rdi
        ret
.Bytes:
        mov     rax, qword ptr [rsi + 8]
        mov     qword ptr [rdi + 8], rax
        mov     al, 1
        mov     byte ptr [rdi], al
        mov     rax, rdi
        ret
.ArcStr:
        mov     rax, qword ptr [rsi + 8]
        mov     rcx, qword ptr [rsi + 16]
        lock            inc     qword ptr [rax]
        jle     .LBB0_10
        mov     qword ptr [rdi + 8], rax
        mov     qword ptr [rdi + 16], rcx
        mov     al, 2
        mov     byte ptr [rdi], al
        mov     rax, rdi
        ret
.ArcString:
        mov     rax, qword ptr [rsi + 8]
        lock            inc     qword ptr [rax]
        jle     .LBB0_10
        mov     qword ptr [rdi + 8], rax
        mov     al, 3
        mov     byte ptr [rdi], al
        mov     rax, rdi
        ret
.StaticStr:
        movups  xmm0, xmmword ptr [rsi + 8]
        movups  xmmword ptr [rdi + 8], xmm0
        mov     al, 4
        mov     byte ptr [rdi], al
        mov     rax, rdi
        ret
.Inline:
        movzx   eax, byte ptr [rsi + 1]
        mov     rcx, qword ptr [rsi + 32]
        mov     qword ptr [rdi + 32], rcx
        movups  xmm0, xmmword ptr [rsi + 18]
        movups  xmmword ptr [rdi + 18], xmm0
        movups  xmm0, xmmword ptr [rsi + 2]
        movups  xmmword ptr [rdi + 2], xmm0
        mov     byte ptr [rdi + 1], al
        mov     al, 5
        mov     byte ptr [rdi], al
        mov     rax, rdi
        ret
.unreachable:
        ud2
        ud2
.LJTI0_0:
        .long   .LBB0_1-.LJTI0_0
        .long   .LBB0_2-.LJTI0_0
        .long   .LBB0_3-.LJTI0_0
        .long   .LBB0_5-.LJTI0_0
        .long   .LBB0_7-.LJTI0_0
        .long   .LBB0_8-.LJTI0_0
after
do_clone_after:
        mov     rax, qword ptr [rsi]
        lea     rcx, [rip + .LJTI1_0]
        movsxd  rdx, dword ptr [rcx + 4*rax]
        add     rdx, rcx
        jmp     rdx
.Empty:
        mov     qword ptr [rdi], rax
        mov     rax, rdi
        ret
.Bytes:
        mov     rcx, qword ptr [rsi + 8]
        mov     qword ptr [rdi + 8], rcx
        mov     qword ptr [rdi], rax
        mov     rax, rdi
        ret
.ArcStr:
        mov     rcx, qword ptr [rsi + 8]
        mov     rdx, qword ptr [rsi + 16]
        lock            inc     qword ptr [rcx]
        jle     .LBB1_5
        mov     qword ptr [rdi + 8], rcx
        mov     qword ptr [rdi + 16], rdx
.ArcString:
        mov     rcx, qword ptr [rsi + 8]
        lock            inc     qword ptr [rcx]
        jle     .LBB1_5
        mov     qword ptr [rdi + 8], rcx
        mov     qword ptr [rdi], rax
        mov     rax, rdi
        ret
.StaticStr:
        movups  xmm0, xmmword ptr [rsi + 8]
        movups  xmmword ptr [rdi + 8], xmm0
        mov     qword ptr [rdi], rax
        mov     rax, rdi
        ret
.Inline:
        mov     rcx, qword ptr [rsi + 8]
        mov     rdx, qword ptr [rsi + 32]
        mov     qword ptr [rdi + 32], rdx
        movups  xmm0, xmmword ptr [rsi + 16]
        movups  xmmword ptr [rdi + 16], xmm0
        mov     qword ptr [rdi + 8], rcx
        mov     qword ptr [rdi], rax
        mov     rax, rdi
        ret
.unreachable:
        ud2
        ud2
.LJTI1_0:
        .long   .LBB1_9-.LJTI1_0
        .long   .LBB1_1-.LJTI1_0
        .long   .LBB1_2-.LJTI1_0
        .long   .LBB1_4-.LJTI1_0
        .long   .LBB1_6-.LJTI1_0
        .long   .LBB1_7-.LJTI1_0

@Noratrieb
Member

Noratrieb commented Dec 23, 2023

-Zprint-type-sizes

print-type-size type: `before::Repr`: 40 bytes, alignment: 8 bytes
print-type-size     discriminant: 1 bytes
print-type-size     variant `Inline`: 39 bytes
print-type-size         field `.len`: 1 bytes
print-type-size         field `.buf`: 38 bytes
print-type-size     variant `ArcStr`: 23 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 16 bytes, alignment: 8 bytes
print-type-size     variant `StaticStr`: 23 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 16 bytes, alignment: 8 bytes
print-type-size     variant `Bytes`: 15 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 8 bytes, alignment: 8 bytes
print-type-size     variant `ArcString`: 15 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 8 bytes, alignment: 8 bytes
print-type-size     variant `Empty`: 0 bytes
print-type-size type: `after::Repr`: 40 bytes, alignment: 8 bytes
print-type-size     discriminant: 8 bytes
print-type-size     variant `Inline`: 32 bytes
print-type-size         field `.len`: 8 bytes
print-type-size         field `.buf`: 24 bytes
print-type-size     variant `ArcStr`: 16 bytes
print-type-size         field `.0`: 16 bytes
print-type-size     variant `StaticStr`: 16 bytes
print-type-size         field `.0`: 16 bytes
print-type-size     variant `Bytes`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `ArcString`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `Empty`: 0 bytes

before gets a lot more padding, which is probably correlated with the more verbose assembly.

@Noratrieb
Member

Looking at the assembly more, I think the issue here comes from the alignment allowing the discriminant to be bigger. The bigger discriminant causes less code to be emitted because it doesn't have to bother carefully setting just one byte to zero, it can just write back the discriminant that it read. Even though it would be allowed to write a full 8 byte discriminant for every variant except Inline, it never does.
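
If that's the case, a possible follow-up experiment (a sketch only, not something verified in this thread) would be to request a wider tag explicitly via a primitive representation instead of padding len:

use bytes::Bytes;
use std::sync::Arc;

const INLINE_CAP: usize = 38;

// Hypothetical variation: #[repr(u64)] asks for a u64 tag, so the compiler is
// free to read and write the discriminant as a full 8-byte value without the
// `len: usize` trick. Whether this reproduces the faster codegen, and what it
// does to the total size, would still need to be measured.
#[derive(Clone)]
#[repr(u64)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}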

@PureWhiteWu
Author

@Nilstrieb Hi, thanks very much for your investigation and explanation!
I wonder whether the alignment issue alone could cause the Empty variant's clone cost to jump from 4ns to 40ns?

@the8472
Member

the8472 commented Dec 23, 2023

You posted this in at least 3 different places. It would be good to link to the others to avoid duplicated effort.
users.rust-lang.org, reddit

@PureWhiteWu
Author

You posted this in at least 3 different places. It would be good to link to the others to avoid duplicated effort. users.rust-lang.org, reddit

Thanks very much! I have added these links to the description!

@LYF1999
Contributor

LYF1999 commented Dec 25, 2023

Here is the output using read_volatile: https://godbolt.org/z/rznoTfevb
Maybe it's because of the difference in the instructions.
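
For reference, a minimal sketch (a reconstruction, not the exact godbolt code) of how read_volatile can be used to keep the clone from being optimized away when inspecting the codegen:

use std::sync::Arc;

const INLINE_CAP: usize = 38;

// Trimmed-down Repr, as defined earlier in the thread.
#[derive(Clone)]
enum Repr {
    Empty,
    ArcStr(Arc<str>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}

#[inline(never)]
pub fn do_clone(r: &Repr) -> Repr {
    // Volatile-load the reference itself so the optimizer can't assume anything
    // about which variant it points at; std::hint::black_box would work as well.
    let r: &Repr = unsafe { std::ptr::read_volatile(&r) };
    r.clone()
}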

@LYF1999
Contributor

LYF1999 commented Dec 25, 2023

Hm....
These are the LLVM types of the two versions:

%"after::Repr" = type { i64, [4 x i64] }
%"before::Repr" = type { i8, [39 x i8] }

Maybe the layout of before::Repr is just too bad?

@alexpyattaev

Given that you only have about 6 variants in the enum, and you need the discriminant for the enum anyway, why not just make ~24 additional versions of Inline, one for each possible length? You'd save a bunch of space this way as well, and I'm quite certain the resulting type will be easier to optimize when your inline length is known at compile time.
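
A sketch of this suggestion (hypothetical variant names, with only the first few lengths written out) to make the shape concrete:

use std::sync::Arc;

// One variant per inline length, so the enum tag itself encodes the length and
// `len` no longer has to be stored next to the buffer.
#[derive(Clone)]
enum Repr {
    Empty,
    ArcStr(Arc<str>),
    StaticStr(&'static str),
    Inline0,
    Inline1([u8; 1]),
    Inline2([u8; 2]),
    Inline3([u8; 3]),
    // ...and so on, following the same pattern up to Inline24([u8; 24]).
}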
