Enum field alignment causes performance degradation of about 10x #119247
Without looking at it in much detail, I suspect this is caused by niche optimizations triggering, introducing more complicated branches.
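For readers unfamiliar with the term, a niche optimization lets rustc encode an enum's discriminant in invalid bit patterns of a variant's payload instead of in a separate tag. A quick, self-contained illustration (not specific to this issue):

```rust
// Option<&u8> needs no extra tag byte: the all-zero bit pattern is invalid for
// a reference, so rustc uses it to represent None (a "niche" optimization).
fn main() {
    assert_eq!(
        std::mem::size_of::<Option<&u8>>(),
        std::mem::size_of::<&u8>()
    );
}
```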
Hi, thanks very much for your reply! Do you mean that the alignment chosen by the compiler may lead to some optimizations being disabled?
I tried to minimize it into a godbolt: https://godbolt.org/z/zaeGe9Evh
Cleaning up the assembly by hand:

before:

do_clone:
movzx eax, byte ptr [rsi]
lea rcx, [rip + .LJTI0_0]
movsxd rax, dword ptr [rcx + 4*rax]
add rax, rcx
jmp rax
.Empty:
xor eax, eax
mov byte ptr [rdi], al
mov rax, rdi
ret
.Bytes:
mov rax, qword ptr [rsi + 8]
mov qword ptr [rdi + 8], rax
mov al, 1
mov byte ptr [rdi], al
mov rax, rdi
ret
.ArcStr:
mov rax, qword ptr [rsi + 8]
mov rcx, qword ptr [rsi + 16]
lock inc qword ptr [rax]
jle .LBB0_10
mov qword ptr [rdi + 8], rax
mov qword ptr [rdi + 16], rcx
mov al, 2
mov byte ptr [rdi], al
mov rax, rdi
ret
.ArcString:
mov rax, qword ptr [rsi + 8]
lock inc qword ptr [rax]
jle .LBB0_10
mov qword ptr [rdi + 8], rax
mov al, 3
mov byte ptr [rdi], al
mov rax, rdi
ret
.StaticStr:
movups xmm0, xmmword ptr [rsi + 8]
movups xmmword ptr [rdi + 8], xmm0
mov al, 4
mov byte ptr [rdi], al
mov rax, rdi
ret
.Inline:
movzx eax, byte ptr [rsi + 1]
mov rcx, qword ptr [rsi + 32]
mov qword ptr [rdi + 32], rcx
movups xmm0, xmmword ptr [rsi + 18]
movups xmmword ptr [rdi + 18], xmm0
movups xmm0, xmmword ptr [rsi + 2]
movups xmmword ptr [rdi + 2], xmm0
mov byte ptr [rdi + 1], al
mov al, 5
mov byte ptr [rdi], al
mov rax, rdi
ret
.unreachable:
ud2
ud2
.LJTI0_0:
.long .LBB0_1-.LJTI0_0
.long .LBB0_2-.LJTI0_0
.long .LBB0_3-.LJTI0_0
.long .LBB0_5-.LJTI0_0
.long .LBB0_7-.LJTI0_0
.long .LBB0_8-.LJTI0_0

after:

do_clone_after:
mov rax, qword ptr [rsi]
lea rcx, [rip + .LJTI1_0]
movsxd rdx, dword ptr [rcx + 4*rax]
add rdx, rcx
jmp rdx
.Empty:
mov qword ptr [rdi], rax
mov rax, rdi
ret
.Bytes:
mov rcx, qword ptr [rsi + 8]
mov qword ptr [rdi + 8], rcx
mov qword ptr [rdi], rax
mov rax, rdi
ret
.ArcStr:
mov rcx, qword ptr [rsi + 8]
mov rdx, qword ptr [rsi + 16]
lock inc qword ptr [rcx]
jle .LBB1_5
mov qword ptr [rdi + 8], rcx
mov qword ptr [rdi + 16], rdx
.ArcString:
mov rcx, qword ptr [rsi + 8]
lock inc qword ptr [rcx]
jle .LBB1_5
mov qword ptr [rdi + 8], rcx
mov qword ptr [rdi], rax
mov rax, rdi
ret
.StaticStr:
movups xmm0, xmmword ptr [rsi + 8]
movups xmmword ptr [rdi + 8], xmm0
mov qword ptr [rdi], rax
mov rax, rdi
ret
.Inline:
mov rcx, qword ptr [rsi + 8]
mov rdx, qword ptr [rsi + 32]
mov qword ptr [rdi + 32], rdx
movups xmm0, xmmword ptr [rsi + 16]
movups xmmword ptr [rdi + 16], xmm0
mov qword ptr [rdi + 8], rcx
mov qword ptr [rdi], rax
mov rax, rdi
ret
.unreachable:
ud2
ud2
.LJTI1_0:
.long .LBB1_9-.LJTI1_0
.long .LBB1_1-.LJTI1_0
.long .LBB1_2-.LJTI1_0
.long .LBB1_4-.LJTI1_0
.long .LBB1_6-.LJTI1_0
.long .LBB1_7-.LJTI1_0
The before version gets a lot more padding, which is probably correlated with the more verbose assembly.
Looking at the assembly more, I think the issue here comes from the alignment allowing the discriminant to be bigger. The bigger discriminant causes less code to be emitted, because the compiler doesn't have to bother carefully setting just one byte to zero; it can just write back the discriminant that it read. Even though it would be allowed to write a full 8 byte discriminant for every variant except
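To make that concrete, here is a small, self-contained sketch (these are not the real FastStr types; the buffer size of 22 is an assumption). With a u8 len, the Inline payload is 1-aligned and starts right after the 1-byte tag, which matches the careful mov byte ptr [rdi], al in the before listing; with a usize len, the payload is 8-aligned, the bytes after the tag are padding, and the after listing can store the tag with a single mov qword ptr [rdi], rax.

```rust
// Simplified sketch, not the real FastStr types; the capacity of 22 is assumed.
#[allow(dead_code)]
enum UnalignedRepr {
    Empty,
    // Payload is 1-aligned and begins right after the 1-byte tag, so Clone
    // must write the tag back as exactly one byte.
    Inline { len: u8, buf: [u8; 22] },
}

#[allow(dead_code)]
enum AlignedRepr {
    Empty,
    // The usize raises the payload alignment to 8, leaving bytes after the tag
    // that Clone is free to overwrite with a full 8-byte store.
    Inline { len: usize, buf: [u8; 22] },
}

fn main() {
    use std::mem::{align_of, size_of};
    println!("UnalignedRepr: size = {}, align = {}", size_of::<UnalignedRepr>(), align_of::<UnalignedRepr>());
    println!("AlignedRepr:   size = {}, align = {}", size_of::<AlignedRepr>(), align_of::<AlignedRepr>());
}
```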
@Nilstrieb Hi, thanks very much for your investigation and explanation! |
You posted this in at least 3 different places. It would be good to link to the others to avoid duplicated effort.
Thanks very much! I have added these links to the description! |
there is the output of |
hm.... maybe the layout of
Given that you only have like 6 variants in the enum, and you need the discriminant for the enum anyway, why not just make like 24 additional versions of Inline, one for each possible length? You'd save a bunch of size this way as well, and I'm quite certain the resulting thing will be easier to optimize when your inline length is known at compile time.
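A rough sketch of that suggestion, using a made-up, heavily trimmed enum (variant names, count, and the accessor are illustrative, not the actual FastStr code): one Inline variant per possible length, so the length lives in the discriminant and is a compile-time constant in every match arm.

```rust
// Hypothetical length-specialized inline variants; only a few lengths are
// shown, and the heap-backed variants are trimmed for brevity.
#[allow(dead_code)]
enum Repr {
    Empty,
    StaticStr(&'static str),
    Inline1([u8; 1]),
    Inline2([u8; 2]),
    Inline3([u8; 3]),
    // ... up to InlineN([u8; INLINE_CAP]) ...
}

impl Repr {
    // In each Inline arm the length is a compile-time constant, so copies and
    // slicing can be fully unrolled by the optimizer.
    fn as_bytes(&self) -> &[u8] {
        match self {
            Repr::Empty => &[],
            Repr::StaticStr(s) => s.as_bytes(),
            Repr::Inline1(b) => b,
            Repr::Inline2(b) => b,
            Repr::Inline3(b) => b,
        }
    }
}

fn main() {
    let r = Repr::Inline2(*b"hi");
    assert_eq!(r.as_bytes(), &b"hi"[..]);
}
```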
Hi, I'm the author of the FastStr crate, and recently I found a weird problem: the clone cost of FastStr is really high. For example, cloning an empty FastStr costs about 40ns on amd64, compared to about 4ns for a normal String.

FastStr itself is a newtype over the inner Repr, which previously had the following layout: Playground link for old version
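The layout code block didn't survive in this copy of the issue, so here is a rough, hedged reconstruction of the old Repr; the variant names are taken from the assembly labels above, while the field types and the INLINE_CAP value are assumptions (see the playground link for the real definition):

```rust
use std::sync::Arc;

const INLINE_CAP: usize = 22; // placeholder value; the real one may differ

// Rough reconstruction of the old Repr (variant names from the assembly above,
// field types assumed).
#[allow(dead_code)]
enum Repr {
    Empty,
    Bytes(Arc<[u8]>), // stand-in for the crate's actual byte-buffer type
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    // The problematic variant: the u8 len keeps the payload 1-aligned.
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}

fn main() {
    println!("size_of::<Repr>() = {}", std::mem::size_of::<Repr>());
}
```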
After some time of investigation, I found that this is because the Repr::Inline variant has a really big effect on the performance. After I added padding to the Repr::Inline variant (changing the type of len from u8 to usize), cloning a Repr::Empty (and every other variant as well) got about 9x faster, from 40ns to 4ns. But the root cause is still not clear: Playground link for new version
A simple criterion benchmark code for the old version:
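The benchmark code itself did not survive in this copy of the issue; a minimal criterion benchmark along these lines (the FastStr constructor used here is an assumption, check the crate's actual API) would look like:

```rust
// Sketch of a simple clone benchmark with criterion; put it under benches/
// with `harness = false` in Cargo.toml.
use criterion::{criterion_group, criterion_main, Criterion};
use faststr::FastStr;
use std::hint::black_box;

fn bench_clone(c: &mut Criterion) {
    let fast = FastStr::new(""); // assumed constructor
    let normal = String::new();

    c.bench_function("clone empty FastStr", |b| b.iter(|| black_box(fast.clone())));
    c.bench_function("clone empty String", |b| b.iter(|| black_box(normal.clone())));
}

criterion_group!(benches, bench_clone);
criterion_main!(benches);
```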
For a full benchmark, you may refer to: https://github.com/volo-rs/faststr/blob/main/benches/faststr.rs
Related PR: volo-rs/faststr#6
And commit: volo-rs/faststr@342bdc9
Furthermore, I've tried the following methods, but none of them helps:

- changed INLINE_CAP to 24
- changed INLINE_CAP to 22 and added a padding field to the Inline variant: Inline { _pad: u64, len: u8, buf: [u8; INLINE_CAP] },
- changed INLINE_CAP to 22 and added a new struct Inline without the _pad field

Changing INLINE_CAP to 22 is only to avoid increasing the size of FastStr itself when adding the extra padding, so it has nothing to do with the performance.

Edit: related discussions on users.rust-lang.org and reddit