Alternative or addition: instructions with overflow/carry flags #6

alexcrichton · 2024-08-14T15:22:51Z

Perhaps the primary alternative to this proposal that would try to solve the same original problem would be to add instructions that manipulate/expose the overflow flag directly from native hardware. This is lower level than 128-bit operations themselves and the theory is that 128-bit operations could be built from these lower level instructions. The original proposal explicitly did not propose this due to complexity of optimizing these instructions and the ease of achieving performance on desired benchmarks with 128-bit operations. In the CG meeting however it was brought up that these lower-level primitives might be a better choice.

This issue is intended to be a discussion location for continuing to flesh out the flags-vs-128-bit-ops story. More thorough rationale will be needed in later phases to either explicitly use flag-based ops or not. This will likely be informed through experience in implementations in other engines in addition to performance numbers of relevant benchmarks.

alexcrichton · 2024-08-20T18:57:59Z

I've opened #10 with some more discussion behind the background of this proposal. It notably outlines the possible shape of these instructions plus why they had bad performance in Wasmtime with the initial implementation.

alexcrichton · 2024-08-26T20:29:00Z

One concern I might have with going exclusively with overflow/carry flags is that platforms like RISC-V don't have the same instructions as x86/aarch64 and they might be slower. For example implementing i64.add_with_carry_u on RISC-V to achieve 128-bit addition might end up being slower than what RISC-V already generates today. I'm not aware of other platforms missing these instructions, though, so RISC-V may be the only one affected and I additionally do not have a machine to benchmark on to confirm this hypothesis one way or another.

waterlens · 2024-09-03T14:42:18Z

One concern I might have with going exclusively with overflow/carry flags is that platforms like RISC-V don't have the same instructions as x86/aarch64 and they might be slower. For example implementing i64.add_with_carry_u on RISC-V to achieve 128-bit addition might end up being slower than what RISC-V already generates today. I'm not aware of other platforms missing these instructions, though, so RISC-V may be the only one affected and I additionally do not have a machine to benchmark on to confirm this hypothesis one way or another.

It is more likely to be a problem with the design of RISC-V ISA itself, as suggested by this discussion. It could be a big sacrifice for WASM if we give up such common operations on other hardware platforms, just because RISC-V people currently don't care about this.

alexcrichton · 2024-09-04T20:48:36Z

That's a good point! I should clarify that it's something I think is worth measuring, but I don't think it's a showstopper preventing such an alternative such as this.

Orthogonally I wanted to flesh out some more numbers on this of what I'm measuring locally. Using the benchmarks in https://github.com/alexcrichton/wasm-benchmark-i128 I measured the slowdown relative-to-native of wasm today, with this proposal as-is with 128-bit ops, and then with this alternative instead of overflow/carry flags (plus i64.mul_wide_*). Each column in the table is a different wasm binary benchmarked and each row is a benchmark. Lower is better as it means it's closer to native. This was collected with Wasmtime 25ish and on Linux x86_64.

Benchmark	Wasm Today	Wasm + 128-bit-arithmetic	Wasm + overflow/carry/mul_wide
blind-sig	617%	74%	79%
lehmer	362%	12%	-7%
mul-bignum	391%	30%	111%
fib_10000	122%	15%	100%
shl	93%	92%	93%
shr	98%	98%	95%
div-bignum	345%	160%	234%
sort signed	63%	64%	64%
sort unsigned	72%	72%	69%

Summary,:

blind-sig - both approaches are roughly equivalent as this is a multiplication-heavy benchmark
lehmer - this shows mostly that mul_wide can sometimes optimize better than mul128. I'll note this in Is mul128 going to be as fast as widening multiplication? #11.
mul-bignum / div-bignum - these benchmarks have more addition than previous ones and show that the overflow/carry approach is significantly slower than add128
fib_10000 - this shows add128 performing much better than overflow/carry
others - not affected

The most notable result is that add128 is performing much better in addition operations than overflow/carry operations. This may be counterintuitive to expectation. I've outlined the reason as to why in the explainer a bit but to summarize that here the most general lowerings of overflow/carry ops require moving the carry flag out of the EFLAGS register and back into it. That in turn causes significant slowdown if it's not optimized away, and Wasmtime/Cranelift do not have a means of optimizing it away. In the 128-bit op case that more closely matches the original program semantics

Some recent [benchmarking] had a surprising result I wasn't trying to dig for. Notably as summarized in #11 some more use cases of widening multiplication being more optimal than 128-by-128 bit multiplication have started to arise. Coupled with local benchmarking confirming that both on x64 and aarch64 that widening multiplication has more support in LLVM for more optimal lowerings and was easier to implement in Wasmtime than the 128-by-128 bit multiplication once various optimizations were implemented. In the end `i64.mul128`, which was primarily motivated by "feels cleaner" and "should have the same performance" as widening multiplication, does not appear to have the expected performance/implementation tradeoff. Getting an as-performant `i64.mul128` instruction relative to `i64.mul_wide_{s,u}` has required more work than expected and so the balance of concerns has me now tipping away from `i64.mul128`, despite it being "less clean" compared to the add/sub opcodes proposed in this PR. Closes #11 [benchmarking]: #6 (comment)

alexcrichton · 2024-09-20T22:05:57Z

I've posted some more thoughts to record in the overview in #18

waterlens · 2024-09-22T15:25:19Z

Basically, I agree with your opinion. The correct modeling of how the carry flag works in the compiler is rare. There is a trade-off between less effort and optimal performance. The instructions to be added can not be perfect semantically, especially considering there aren't many people dedicated to the work. The current speedup is already considerable, so please go ahead and push this proposal upstream!

sparker-arm · 2024-12-11T10:15:59Z

Hi @alexcrichton

most general lowerings of overflow/carry ops require moving the carry flag out of the EFLAGS register and back into it

I'm not an x86 person, so could you clarify how your are lowering these operations? Can't a 'wide' add be lowered to add and adc, without having to explicitly move data from flags? Or do you just mean this is happening an a micro-arch level?

I mainly ask because the mul_wide operations sound good, but I expected to see add/sub with the same signature.

alexcrichton · 2024-12-11T15:44:40Z

Hello! And sure! I might also point towards the overview docs which is a summary of the phase 2 presentation I last gave to the CG.

You're correct that i64.add128 gets lowered to add + adc (and add + adds on aarch64). The problem that I was describing there was i64.add_overflow_s, a separate instruction, which produces two values: the addition result + overflow bit. That overflow bit, in the most general case, needs to be put in a general purpose register which is the "move out of EFLAGS". Later if that's then combined with i64.add_with_carry_s which takes three inputs (e.g. lhs/rhs/overflow bit) then to use adc (or adds) then the overflow bit needs to be "put back in EFLAGS" somehow since adc and adds don't actually take three register input operands, only 2 and one is implicitly in the flags register.

sparker-arm · 2024-12-11T16:08:24Z

Ah yes, I understand now. Flags are a pain... :)

Thanks.

alexcrichton mentioned this issue Aug 23, 2024

Should 128-bit comparison operators be added? #4

Closed

alexcrichton mentioned this issue Aug 28, 2024

128-bit arithmetic WebAssembly/design#1522

Open

alexcrichton mentioned this issue Sep 4, 2024

Is mul128 going to be as fast as widening multiplication? #11

Closed

alexcrichton mentioned this issue Sep 5, 2024

Remove i64.mul128, add i64.mul_wide_{s,u} #13

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alternative or addition: instructions with overflow/carry flags #6

Alternative or addition: instructions with overflow/carry flags #6

alexcrichton commented Aug 14, 2024

alexcrichton commented Aug 20, 2024

alexcrichton commented Aug 26, 2024

waterlens commented Sep 3, 2024

alexcrichton commented Sep 4, 2024

alexcrichton commented Sep 20, 2024

waterlens commented Sep 22, 2024 •

edited

Loading

sparker-arm commented Dec 11, 2024

alexcrichton commented Dec 11, 2024

sparker-arm commented Dec 11, 2024

Alternative or addition: instructions with overflow/carry flags #6

Alternative or addition: instructions with overflow/carry flags #6

Comments

alexcrichton commented Aug 14, 2024

alexcrichton commented Aug 20, 2024

alexcrichton commented Aug 26, 2024

waterlens commented Sep 3, 2024

alexcrichton commented Sep 4, 2024

alexcrichton commented Sep 20, 2024

waterlens commented Sep 22, 2024 • edited Loading

sparker-arm commented Dec 11, 2024

alexcrichton commented Dec 11, 2024

sparker-arm commented Dec 11, 2024

waterlens commented Sep 22, 2024 •

edited

Loading