Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BasicCABI] Pass compiler-rt 128-bit return values in memory #223

Closed
wants to merge 1 commit into from

Conversation

aheejin
Copy link
Member

@aheejin aheejin commented Mar 19, 2024

Currently our LLVM Wasm backend returns 128-bit values as two i64s in case multivalue is enabled:
https://github.com/llvm/llvm-project/blob/a5f576e5961ecc099bd7ccf8565da090edc84b0d/llvm/lib/Target/WebAssembly/WebAssemblyRuntimeLibcallSignatures.cpp#L697-L700

But given that neither emscripten nor wasm-sdk seem to provide a multivalue version of compiler-rt, it looks this has not been working so far and the reason we haven't heard complaints was likely that no one was using compiler-rt with multivalue enabled.

Maintaining and providing two different versions of compiler-rt is a cumbersome thing for toolchains, and emscripten already has to provide multiple versions of many libraries (e.g. threaded vs. non-threaded, debug vs. release, exception-enabled vs. disabled, ...).

Also enabling the multivalue return on several compiler-rt functions that have a 128-bit return value wouldn't affect performance in a meaningful way, given that there are not many of them.

I had a chance to chat with several people who contribute here offline this morning, and it looked we agreed that there is not much benefit to enabling multivalue return in compiler-rt functions. One thing I'm not sure is whether we decided to disable 128-bit multivalue returns for only compiler-rt functions or for all user functions. This PR currently says we do that only for compiler-rt; please let me know if you think we should do otherwise.

Currently our LLVM Wasm backend returns 128-bit values as two `i64`s in
case multivalue is enabled:
https://github.com/llvm/llvm-project/blob/a5f576e5961ecc099bd7ccf8565da090edc84b0d/llvm/lib/Target/WebAssembly/WebAssemblyRuntimeLibcallSignatures.cpp#L697-L700

But given that neither emscripten nor wasm-sdk seem to provide a
multivalue version of compiler-rt, it looks this has not been working so
far and the reason we haven't heard complaints was likely that no one
was using compiler-rt with multivalue enabled.

Maintaining and providing two different versions of compiler-rt is a
cumbersome thing for toolchains, and emscripten already has to provide
multiple versions of many libraries (e.g. threaded vs. non-threaded,
debug vs. release, exception-enabled vs. disabled, ...).

Also enabling the multivalue return on several compiler-rt functions
that have a 128-bit return value wouldn't affect performance in a
meaningful way, given that there are not many of them.

I had a chance to chat with several people who contribute here offline
this morning, and it looked we agreed that there is not much benefit to
enabling multivalue return in compiler-rt functions. One thing I'm not
sure is whether we decided to disable 128-bit multivalue returns for
only compiler-rt functions or for all user functions. This PR currently
says we do that only for compiler-rt; please let me know if you think we
should do otherwise.
@tlively
Copy link
Member

tlively commented Mar 19, 2024

One thing I'm not sure is whether we decided to disable 128-bit multivalue returns for only compiler-rt functions or for all user functions. This PR currently says we do that only for compiler-rt; please let me know if you think we should do otherwise.

Since the general principle we've been discussing is that changing feature flags should never change the ABI, I think it makes sense to disable 128-bit multivalue returns everywhere. If we want, we can continue allowing them in the experimental multivalue ABI, though, since that is a separate ABI.

@fitzgen
Copy link
Contributor

fitzgen commented Mar 19, 2024

Also enabling the multivalue return on several compiler-rt functions that have a 128-bit return value wouldn't affect performance in a meaningful way, given that there are not many of them.

In Wasmtime, we've seen a lot of bottlenecks on compiler-rt's __muilti3, especially in various cryptography functions, and the two big reasons this is a bottleneck is because (1) we don't do inlining in Cranelift and assume that the toolchain has already done it, which isn't generally the case for compiler-rt functions because they are so late in the pipeline IIUC, and (2) because the returns are passed through memory rather than on the Wasm stack, and we can't generally prove that the memory accesses won't trap or that the stores aren't otherwise unobserved, so we have to perform them.

More details/investigation by @jameysharp over here: bytecodealliance/wasmtime#4077 (comment)

It would be a shame if we precluded returning i28s directly from compiler-rt, and standardized (2) as a permanent bottleneck, barring new core wasm instructions specifically for eliminating __multi3: WebAssembly/design#1495.

@TerrorJack
Copy link
Contributor

This will also hurt users that do compile their own entire sysroot with experimental mv abi and expect compiler-rt to not have in-memory overhead with i128s.

@sbc100
Copy link
Member

sbc100 commented Mar 19, 2024

which isn't generally the case for compiler-rt functions because they are so late in the pipeline IIUC,

Doesn't wasm-opt take care of inlining that function? Or is there some toolchain that doesn't include running wasm-opt on the final binary?

@jedisct1
Copy link
Member

Doesn't wasm-opt take care of inlining that function? Or is there some toolchain that doesn't include running wasm-opt on the final binary?

The Zig toolchain always recompiles the sysroot with the same flags (including wasm extensions) as the rest of the code, and doesn't run wasm-opt.

[1] In case of parameters, `long long double`, `__int128`, and `__float128` are
passed directly as two `i64` values. In case of return values, for
non-compiler-rt functions, they are passed directly as two `i64` values if
[multivalue](https://github.com/WebAssembly/multi-value) is enabled and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is true is it? Enable multi-value doesn't change the ABI, right?

We do have an experimental multi-value ABI but its not official, and we probably don't want to be mentioning it at all here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix this once we decide what to do with the feature flag / experimental flag / ABI.

@sbc100
Copy link
Member

sbc100 commented Mar 19, 2024

Doesn't wasm-opt take care of inlining that function? Or is there some toolchain that doesn't include running wasm-opt on the final binary?

The Zig toolchain always recompiles the sysroot with the same flags (including wasm extensions) as the rest of the code, and doesn't run wasm-opt.

So is there room to add a wasm-opt step there to optimize the final output? Or does zig already take care of things like inlining itself? Regardless of the toolchain it should be possible to do whole program optimizing such as inlining __multi3, right?

@jedisct1
Copy link
Member

So is there room to add a wasm-opt step there to optimize the final output?

This is very unlikely. The toolchain is self-contained, and doesn't spawn external commands.

Or does zig already take care of things like inlining itself? Regardless of the toolchain it should be possible to do whole program optimizing such as inlining __multi3, right?

For apps written in Zig, it can take care of inlining. Not for apps written in C/C++.

__int128 mul(__int128 a, __int128 b) {  return a * b; }
zig cc -O3 -c --target=wasm32-freestanding -mcpu=baseline+multivalue a.c
(import "env" "__multi3" (func $fimport$0 (param i64 i64 i64 i64) (result i64 i64)))

@sbc100
Copy link
Member

sbc100 commented Mar 19, 2024

So is there room to add a wasm-opt step there to optimize the final output?

This is very unlikely. The toolchain is self-contained, and doesn't spawn external commands.

Or does zig already take care of things like inlining itself? Regardless of the toolchain it should be possible to do whole program optimizing such as inlining __multi3, right?

For apps written in Zig, it can take care of inlining. Not for apps written in C/C++.

__int128 mul(__int128 a, __int128 b) {  return a * b; }
zig cc -O3 -c --target=wasm32-freestanding -mcpu=baseline+multivalue a.c

Can I ask what baseline+multivalue does? Does it turn simple enable the multi-value feature, or does it turn on the experimental multi-value C ABI in clang?

Our current plan for clang/llvm is that just turning on the feature shouldn't effect the C ABI at all.

@jedisct1
Copy link
Member

Can I ask what baseline+multivalue does? Does it turn simple enable the multi-value feature, or does it turn on the experimental multi-value C ABI in clang?

baseline is a synonym for mvp and adding +multivalue to the target CPU does the same thing as -mmultivalue.

array | indirect | N/A |

[1] In case of parameters, `long long double`, `__int128`, and `__float128` are
passed directly as two `i64` values. In case of return values, for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[1] appears once in the "Parameter" column, so I'm confused about seeing "in case of return values" here. Should that part be in [2]?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it should go to [2]. Given that whether we are gonna do this or not is unclear, will fix it later (if we go this route)

@kripken
Copy link
Member

kripken commented Mar 19, 2024

The Zig toolchain always recompiles the sysroot with the same flags (including wasm extensions) as the rest of the code, and doesn't run wasm-opt.

LTO is another way to get such functions inlined. Does Zig support that?

Otherwise, in general, a huge advantage of wasm over typical targets is exactly that it is very simple to inline calls even without LTO. So even if a toolchain like Zig doesn't want to run an external tool like wasm-opt (fair enough) then it can implement such inlining itself, without much code or effort. (Inlining heuristics can be tricky, but for a specific toolchain like Zig it could just say "always inline __muilti3" which it controls anyhow.)

For those reasons I'd suggest not worrying much about the inlining aspect here.

@aheejin
Copy link
Member Author

aheejin commented Mar 19, 2024

These are my current thoughts:

I agree with #223 (comment) that we shouldn't treat compiler-rt and other libraries differently. One big motivator for this PR was it's hard for toolchains to maintain two different versions of libraries, and it turns out not only compiler-rt but other libraries like libc and libc++ also have i128 return values. But from the compiler's point of view, those library functions, many of which are not listed in https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/RuntimeLibcalls.def, are not any different from user functions, so there's no way of disabling multivalue returns for only libraries and not user functions. If we disable it only for the runtime libcalls listed in https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/RuntimeLibcalls.def, it doesn't end up solving the two-libraries problem in the toolchain. (In practice emscripten already maintains several different versions of certain libraries (e.g. EH vs. non-EH, multithreaded vs. non-multhreaded, ...), so adding multivalue vs. non-multivalue will double the number of already existing combinations of builds.)

Unless you are not building all your libraries from scratch by yourself, which Zig seems to do, disabling multivalue return doesn't regress the status quo for most of the users, given that neither emscripten nor wasi-sdk currently provides multivalue returning versions of libraries. But I also understand the concern that it's suboptimal to preclude the possibility of returning multivalue in the future for better performance. I had assumed the usage of those functions wouldn't affect the performance in a meaningful way but apparently there were cases they did.

I chatted w/ @dschuff offline and we agreed that while enabling a certain feature can (and usually do) change the output code, it might not be desirable for that to change the ABI, and changing the ABI should be dependent on a separate flag. And we don't need to distinguish compiler-rt vs. non-compiler-rt or library vs. user code. We already have an experimental flag that enables multivalue for struct return values: -target-abi -Xclang experimental-mv. Currently this only controls structs and not 128-bit values, which are controlled by -mmultivalue instead, which we think should be changed.

So our suggestion is, how about making the lowering of 128-bit return values also depend on -target-abi -Xclang experimental-mv? This is a clang flag and transferring this info available to the backend may involve some hack but I think it's doable (@tlively, do you think it's unreasonably hard?) .

Downsides of this approach can be 1. current users of multivalue-returning feature have to opt in to use that option 2.
if they want to enable multivalue returns for 128-bit values, they have to enable multivalue returns for structs too, because the single option controls both. I think this can be a reasonable compromise and I'm not really sure if we want one more experimental flag that exclusively controls 128-bit value returns. One thing I'd like to fix before doing this is, according to @tlively, we don't have a limit on the size of struct that can be returned by multivalue, so if we have a function that returns a struct with 100 i32s, it will return 100 values via multivalue. I think we should put some limit on this, which can be some arbitrary number that sounds reasonable, e.g., 4 or something. I remember @tlively did some experiment on this a long time ago and don't remember the result was, but I don't think we want to invest more time on redoing the experiments to pick an optimal number now. @tlively, do you have a suggestion on a reasonable limit?

I'd like to hear what people think about this way forward (making -target-abi -Xclang experimental-mv control all multivalue return). Thank you for all the feedback!

@TerrorJack
Copy link
Contributor

As long as -mmultivalue -Xclang -target-abi -Xclang experimental-mv does not regress from current behavior of always passing i128 via multi value (even for compiler-rt when compiled from source) then it's fine for us. It would be even nicer to mention this in the ABI document, along with a warning sign that says it's highly experimental and require recompiling sysroot, but that's just my two cents.

I think we should put some limit on this

That's better discussed on another thread, but my option is such limits, if any, should be treated as "Implementation Limitations" in core wasm spec.

@alexcrichton
Copy link
Collaborator

If I might throw out another possible alternative, could it perhaps also be possible to write this down in source itself?

For example compiler-rt is specific to a toolchain version (or so I believe) which means that even if the default C ABI is to not use multi-value compiler-rt itself could use multi-value. That would enable defining __multi3, for example, using a multi-value-based-return without affecting everyone else's C ABI. This wouldn't require multiple builds with/without multivalue, instead it's a decision one day of whether __multi3, and only __multi3 (or other compiler-rt intrinsics) should use multi-value for their return values.

This would also solve a problem of being able to intermix multi-value-returns-and-not. For example the outermost function might want to use multi-value as it's trying to match an exact API "shape" of the wasm module being created, but everything else just wants to interoperate with the system which is likely not going to use multi-value for some time now. Enabling the two ABIs to be present in the same module would help avoid making multi-value an all-or-nothing decision.

In such a world I would imagine that -target-abi -Xclang experimental-mv also would not change, it's switching the default ABI away to something else.

@sbc100
Copy link
Member

sbc100 commented Mar 19, 2024

In Wasmtime, we've seen a lot of bottlenecks on compiler-rt's __muilti3, especially in various cryptography functions

Is the problem here that the cryptography functions are compiling with the assumption that they have native int128 support? Perhaps if they would prefer a different codepath if they knew that int128 were being software emulated? i.e. is there some HAVE_NATIVE_INT128 macro that they are getting wrong for wasm?

@aheejin
Copy link
Member Author

aheejin commented Mar 19, 2024

@TerrorJack

I think we should put some limit on this

That's better discussed on another thread, but my option is such limits, if any, should be treated as "Implementation Limitations" in core wasm spec.

The limitation we want to impose here is not the "limitation" in the core spec sense, which is used for the maximum number of something that the engine implementation supports:
https://webassembly.github.io/spec/core/appendix/implementation.html
https://webassembly.github.io/spec/js-api/index.html#limits

What I suggested, the number of items in a struct that will be returned as multivalue, which I assume to be less than 10, is more of an internal parameter that drives the optimization. It's not that the implementation cannot handle 1000 return values; it's just that it's bad for the code quality and we wouldn't want to do that. I think it might be worth mentioning it in this ABI document, but I don't think that should go into the core spec.

@aheejin
Copy link
Member Author

aheejin commented Mar 19, 2024

@alexcrichton

If I might throw out another possible alternative, could it perhaps also be possible to write this down in source itself?

For example compiler-rt is specific to a toolchain version (or so I believe) which means that even if the default C ABI is to not use multi-value compiler-rt itself could use multi-value. That would enable defining __multi3, for example, using a multi-value-based-return without affecting everyone else's C ABI. This wouldn't require multiple builds with/without multivalue, instead it's a decision one day of whether __multi3, and only __multi3 (or other compiler-rt intrinsics) should use multi-value for their return values.

If __multi3's return type changes, wouldn't the caller (user code) need know about it to match the signature? If __multi3 is user code we would have a declaration in the user code somewhere, but if it's compiler-rt, we don't know the signature, and that's why we have the list of compiler-rt functions and their signatures here: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/WebAssembly/WebAssemblyRuntimeLibcallSignatures.cpp Doesn't this count as a change of ABI?

@sbc100
Copy link
Member

sbc100 commented Mar 19, 2024

@alexcrichton

If I might throw out another possible alternative, could it perhaps also be possible to write this down in source itself?
For example compiler-rt is specific to a toolchain version (or so I believe) which means that even if the default C ABI is to not use multi-value compiler-rt itself could use multi-value. That would enable defining __multi3, for example, using a multi-value-based-return without affecting everyone else's C ABI. This wouldn't require multiple builds with/without multivalue, instead it's a decision one day of whether __multi3, and only __multi3 (or other compiler-rt intrinsics) should use multi-value for their return values.

If __multi3's return type changes, wouldn't the caller (user code) need know about it to match the signature?

I believe the libcall function such as __multi3 are not designed to be directly callable from C code. They only need to be callable from the llvm backend, when it generates calls to them programatically.

So in theory we can use whatever ABI we like and I think it is allowed to differ from the normal C ABI. See https://github.com/llvm/llvm-project/blob/12a2bc301fe83eea3b214428827d712c8cfb28a9/compiler-rt/lib/builtins/int_lib.h#L23-L31

@fitzgen
Copy link
Contributor

fitzgen commented Mar 20, 2024

I believe the libcall function such as __multi3 are not designed to be directly callable from C code. They only need to be callable from the llvm backend, when it generates calls to them programatically.

So in theory we can use whatever ABI we like and I think it is allowed to differ from the normal C ABI. See https://github.com/llvm/llvm-project/blob/12a2bc301fe83eea3b214428827d712c8cfb28a9/compiler-rt/lib/builtins/int_lib.h#L23-L31

Using a custom, libcall-specific ABI that avoids touching memory for __multi3 seems like it would be ideal in this case.

@sbc100
Copy link
Member

sbc100 commented Mar 20, 2024

I believe the libcall function such as __multi3 are not designed to be directly callable from C code. They only need to be callable from the llvm backend, when it generates calls to them programatically.
So in theory we can use whatever ABI we like and I think it is allowed to differ from the normal C ABI. See https://github.com/llvm/llvm-project/blob/12a2bc301fe83eea3b214428827d712c8cfb28a9/compiler-rt/lib/builtins/int_lib.h#L23-L31

Using a custom, libcall-specific ABI that avoids touching memory for __multi3 seems like it would be ideal in this case.

To be clear, are you proposing that this libcall-specific ABI would vary with -mmulti-value, so toolchains would all need to provider specific -mmulti-value versions of their libcall library? (regardless of whether we ever ship a new multi-value C ABI).

I addition to the complexity of shipping and selecting such as version of compiler-rt, I think building it could be very tricky since it would there is no value we could use COMPILER_RT_ABI that would cause the compiler to generate this ABI. We would need to come up with kind of intermediate ABI that compiler-rt could be compiled with. That sound like a fairly significant amount of work.

@fitzgen
Copy link
Contributor

fitzgen commented Mar 20, 2024

I was assuming that in a world with universal support for multi-value (effectively the state of the world now, but I recognize that this is a separate, larger discussion) that the __multi3/libcall ABI would always use a multi-value or v128 return (doesn't matter which one, so long as it is not using a return pointer) regardless what C ABI (today's vs a future multi-value ABI) is in use.

If I understand the constraints correctly, and given that assumption of universal multi-value support regardless of C ABI in use, this wouldn't require maintaining multiple compiler-rt versions nor concerns around selecting the correct version. Is that correct?

@sbc100
Copy link
Member

sbc100 commented Mar 20, 2024

I was assuming that in a world with universal support for multi-value (effectively the state of the world now, but I recognize that this is a separate, larger discussion) that the __multi3/libcall ABI would always use a multi-value or v128 return (doesn't matter which one, so long as it is not using a return pointer) regardless what C ABI (today's vs a future multi-value ABI) is in use.

If I understand the constraints correctly, and given that assumption of universal multi-value support regardless of C ABI in use, this wouldn't require maintaining multiple compiler-rt versions nor concerns around selecting the correct version. Is that correct?

True, once multi-value is universally available we could just assume it when we build compiler-rt.

In emscripten we will want to continue to support targets without multivalue (i.e. user of -mno-multivalue), however the way we tend to handle this kind thing today is to build with the feature enabled and then use a binaryen pass to lower it away. So that shouldn't be a problem for emscripten anyway.

I am still curious to hear about from you regarding the importance of optimizing the software emulation of int128: #223 (comment). Are you sure we really need to care about it enough to invent a new ABI that differs from the normal C ABI?

@jedisct1
Copy link
Member

In Wasmtime, we've seen a lot of bottlenecks on compiler-rt's __muilti3, especially in various cryptography functions

Is the problem here that the cryptography functions are compiling with the assumption that they have native int128 support? Perhaps if they would prefer a different codepath if they knew that int128 were being software emulated? i.e. is there some HAVE_NATIVE_INT128 macro that they are getting wrong for wasm?

The assumption is that if the TI mode is supported or the __i128t type is available, we should use it, even if this is a software emulation, because on native targets, this will be inlined. A different code path would reimplement the same thing.

The intuition is that WebAssembly would behave the same way. The fact that this is not the case, even with LTO, was not expected https://gist.github.com/jedisct1/dfc18913680c24824595b50b8eb04db3

@fitzgen
Copy link
Contributor

fitzgen commented Mar 20, 2024

Is the problem here that the cryptography functions are compiling with the assumption that they have native int128 support? Perhaps if they would prefer a different codepath if they knew that int128 were being software emulated? i.e. is there some HAVE_NATIVE_INT128 macro that they are getting wrong for wasm?

I'm personally not sure about these particulars. @jameysharp might have a better idea, as he was the one doing the original investigation.

@sbc100
Copy link
Member

sbc100 commented Mar 20, 2024

because on native targets, this will be inlined

Can you explain what you mean here? Have you seen native platform target that provide bitcode/LTO-able versions of compiler-rt?

As far as I know there are no native targets that provide LTO-able versions of compiler-rt functions. The reason I think this is that in emscripten (where are pushing the limits of what we make LTO-able) we still haven't found a way to make compiler-rt function compile as bitcode (i.e. LTO-able). See https://github.com/emscripten-core/emscripten/blob/10827283603d0d8dba0a813827b5ee5f82d1adf5/tools/system_libs.py#L922-L924

@aheejin
Copy link
Member Author

aheejin commented Mar 20, 2024

@sbc100

True, once multi-value is universally available we could just assume it when we build compiler-rt.

In emscripten we will want to continue to support targets without multivalue (i.e. user of -mno-multivalue), however the way we tend to handle this kind thing today is to build with the feature enabled and then use a binaryen pass to lower it away. So that shouldn't be a problem for emscripten anyway.

Can we lower multivalue away completely? We can do that within a function or across functions in static linking + non-exported/imported functions, but can we do that with imported/exported functions or dynamic linking?

@sbc100
Copy link
Member

sbc100 commented Mar 20, 2024

@sbc100

True, once multi-value is universally available we could just assume it when we build compiler-rt.
In emscripten we will want to continue to support targets without multivalue (i.e. user of -mno-multivalue), however the way we tend to handle this kind thing today is to build with the feature enabled and then use a binaryen pass to lower it away. So that shouldn't be a problem for emscripten anyway.

Can we lower multivalue away completely? We can do that within a function or across functions in static linking + non-exported/imported functions, but can we do that with imported/exported functions or dynamic linking?

Before we enable multi-value by default we would need some way to lower it away for users targeting older browsers. If we can't lower it away then that would be a one reason not to support the use of multi-value for libcalls purely on the basis of the feature exsiting (-mmulti-value)

@jedisct1
Copy link
Member

because on native targets, this will be inlined

Can you explain what you mean here? Have you seen native platform target that provide bitcode/LTO-able versions of compiler-rt?

I haven't, but at least on mips64, riscv64, aarch64 and x86_64, a 128-bit multiplication doesn't call the compiler-rt implementation at all.

$ for target in mips64-linux riscv64-linux aarch64-linux x86_64-linux; do
       zig cc -S -target "$target" -O3 a.c && ug -q __multi3 a.s && echo "[$target]: __multi3 called";
   done

(empty)

But on WebAssembly, even if the __multi3 is inlined, will WebAssembly compilers understand the pattern and be able to take advantage of CPU instructions for wide multiplications?

@aheejin
Copy link
Member Author

aheejin commented Mar 20, 2024

@alexcrichton

If I might throw out another possible alternative, could it perhaps also be possible to write this down in source itself?
For example compiler-rt is specific to a toolchain version (or so I believe) which means that even if the default C ABI is to not use multi-value compiler-rt itself could use multi-value. That would enable defining __multi3, for example, using a multi-value-based-return without affecting everyone else's C ABI. This wouldn't require multiple builds with/without multivalue, instead it's a decision one day of whether __multi3, and only __multi3 (or other compiler-rt intrinsics) should use multi-value for their return values.

If __multi3's return type changes, wouldn't the caller (user code) need know about it to match the signature?

I believe the libcall function such as __multi3 are not designed to be directly callable from C code. They only need to be callable from the llvm backend, when it generates calls to them programatically.

So in theory we can use whatever ABI we like and I think it is allowed to differ from the normal C ABI. See https://github.com/llvm/llvm-project/blob/12a2bc301fe83eea3b214428827d712c8cfb28a9/compiler-rt/lib/builtins/int_lib.h#L23-L31

Ah I see. Yeah I think code annotations can be one way to support different ABIs for certain performance-critical functions. But if we decide to do this in future, it seems also compatible with the clang option I suggested in #223 (comment). Also given that some people (especially who own the whole toolchain) would want to enable multivalue lowering for all potential candidates and wouldn't want to annotate every function that has multiple or 128-bit return values, having that kind of option can be convenient.

@dschuff
Copy link
Member

dschuff commented Mar 21, 2024

I think it does make sense in general to allow the calling convention ABI to be determined on a per-function basis, with a clang attribute you can attach to a function declaration that propagates down through the IR via an attribute or some similar mechanism. This is how ABIs work on ARM, which has a similar proliferation of optional features and calling conventions (e.g. you can mark a function with __attribute__((pcs("aapcs"))) to make it use the AAPCS calling convention). The nice thing about that is that, as @alexcrichton mentioned above, you can then mix functions or object files with different conventions and everything works as long as the declarations are correct.
I think this would be a good idea in general. It doesn't quite solve the problem for compiler-rt functions on its own because compiler-rt functions don't have any declarations, they are just emitted directly by the compiler. So if we wanted to support more than 1 ABI for compiler-rt we'll still need some kind of flag to control which ABI it uses.
Right now, the only people using multivalue who would regress with what @aheejin proposed above are already using the experimental multivalue ABI for everything. So as a first step, I do think it would make sense to make the expirimental-mv-abi flag control compiler-rt as well. That will decouple the calling convention from just enabling the feature, which I think everyone agrees is desirable.

Then we can address the best way to optimize the performance and use of the calling convention and performance of __multi3 and other compiler-rt functions. We could decide to use MV by default if available upstream, which would be going back to what we have today, except that for emscripten we'd ether provide some other flag to control it, or use a Binaryen pass to lower it away. Or we can decide to decouple the flags, or decide to just tweak the current MV ABI and use it in more toolchains, but even going back-and-forth like this would still be better since it decouples the timeline for that work from blocking exnref (which is what precipitated all this in the first place).

Separately I think we should also just add widening multiply, add-with-carry, and maybe a few other related instructions to wasm. Those have always been fairly uncontroversial but just punted until MV is available everywhere, so maybe that time is now. That's obviously a bit slower process than these toolchain changes could be, so maybe we do both.

aheejin added a commit to aheejin/llvm-project that referenced this pull request Mar 21, 2024
Emscripten EH/SjLj uses invoke wrappers in JS, such as `invoke_vi`, from
which user functions are called indirectly, to emulate exceptions and
setjmp/longjmp.

But in case the invoked function returns multiple values or a 128-bit
value, its the JS invoke wrappers cannot return multivalue because JS
doesn't support that. So we should not enable multivalue returns for the
JS invoke wrappers and also the functions called by those JS wrappers
because their signature has to match with the JS wrapper. For example,
if `func` returns `{i32, i32}` and we have
```ll
  invoke {i32, i32} @func() ...
```
while LowerEmscriptenEHSjLj will lower it down to something like
```ll
  %0 = call { i32, i32 } @"__invoke_{i32.i32}_void"(ptr @func)
  ...
```
we should eventually lower both the invoke wrapper (whose name will be
changed later to `invoke_vi`) and `func` down to a signature that
indirectly returns multiple values by memory parameter, because JS
invoke wrappers do support multiple return values.

So we need to disable multivalue returns for JS invoke wrappers and
functions called by them. I think we have three ways to do that:

1. Make a set and add all functions that are invoked by JS invoke
   wrappers in LowerEmscriptenEHSjLj and pass it to the backend using an
   auxiliary data structure. We have a precedent of this kind of
   structure already, which is used for Wasm EH:
   https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/CodeGen/WasmEHFuncInfo.h
   We can even add this set to `WasmEHFuncInfo` maybe, given that this
   is also a way of handling exceptions in Wasm.
   - Pros: Most precise
   - Cons: Auxiliary structure needed

2. Unless we record the invoked functions in LowerEmscriptenEHSjLj like
   1, we don't have a way of precisely knowing the set of invoked
   functions. But they are all indirectly invoked, so in the backend we
   can check whether a given function is ever indirectly used (i.e., its
   pointer taken) by traversing its `users()` in the IR and if it is,
   we don't allow multivalue returns for them.
   - Pros: No auxiliary structure
   - Cons: IR checking overhead. More conservative than 1.

3. Disallow all multivalue returns when Emscripten EH or SjLj is
   enabled.
   - Pros: Simplest
   - Cons: Most conservative.

This PR is doing 3. While it is the most conservative and possibly
disallowing multivalue returns from more functions than needed, I chose
this because it is the simplest, and given that hopefully more people
will adopt Wasm EH going forward, I don't think there will be many
people who would use multivalue and Emscripten EH/SjLj together and want
the whatever performance benefit that multivalue return can bring, given
that Emscripten EH/SjLj has already a huge performance cost.

This is separate from whether we should make the multivalue return
dependent on the multivalue feature or something else like a clang flag,
which is being partly discussed in
WebAssembly/tool-conventions#223.
Whichever way we decide on that front, we still need to disable
multivalue returns in case Emscripten EH/SjLj is used.
@waterlens
Copy link

I think it does make sense in general to allow the calling convention ABI to be determined on a per-function basis, with a clang attribute you can attach to a function declaration that propagates down through the IR via an attribute or some similar mechanism. This is how ABIs work on ARM, which has a similar proliferation of optional features and calling conventions (e.g. you can mark a function with __attribute__((pcs("aapcs"))) to make it use the AAPCS calling convention). The nice thing about that is that, as @alexcrichton mentioned above, you can then mix functions or object files with different conventions and everything works as long as the declarations are correct. I think this would be a good idea in general. It doesn't quite solve the problem for compiler-rt functions on its own because compiler-rt functions don't have any declarations, they are just emitted directly by the compiler. So if we wanted to support more than 1 ABI for compiler-rt we'll still need some kind of flag to control which ABI it uses. Right now, the only people using multivalue who would regress with what @aheejin proposed above are already using the experimental multivalue ABI for everything. So as a first step, I do think it would make sense to make the expirimental-mv-abi flag control compiler-rt as well. That will decouple the calling convention from just enabling the feature, which I think everyone agrees is desirable.

Then we can address the best way to optimize the performance and use of the calling convention and performance of __multi3 and other compiler-rt functions. We could decide to use MV by default if available upstream, which would be going back to what we have today, except that for emscripten we'd ether provide some other flag to control it, or use a Binaryen pass to lower it away. Or we can decide to decouple the flags, or decide to just tweak the current MV ABI and use it in more toolchains, but even going back-and-forth like this would still be better since it decouples the timeline for that work from blocking exnref (which is what precipitated all this in the first place).

Separately I think we should also just add widening multiply, add-with-carry, and maybe a few other related instructions to wasm. Those have always been fairly uncontroversial but just punted until MV is available everywhere, so maybe that time is now. That's obviously a bit slower process than these toolchain changes could be, so maybe we do both.

I agree with @dschuff that we should push things like widening multiplication or similar things into wasm. There's a related issue: WebAssembly/design#1495.

@tlively
Copy link
Member

tlively commented Mar 25, 2024

So our suggestion is, how about making the lowering of 128-bit return values also depend on -target-abi -Xclang experimental-mv? This is a clang flag and transferring this info available to the backend may involve some hack but I think it's doable (@tlively, do you think it's unreasonably hard?) .

No, I agree it's hacky but doable.

I remember @tlively did some experiment on this a long time ago and don't remember the result was, but I don't think we want to invest more time on redoing the experiments to pick an optimal number now. @tlively, do you have a suggestion on a reasonable limit?

I did some experiments that showed performance wins from multivalue function return, but those experiments did not say what an optimal size would be. Basically multivalue return is always faster than returning through memory, but at some point the potential code size bloat would not be worth it any more. Dan's suggestion of 4 seems reasonable to me.

aheejin added a commit to aheejin/llvm-project that referenced this pull request Apr 12, 2024
Multivalue feature of WebAssembly has been standardized for several
years now. I think it makes sense to be able to enable it in the feature
section by default for our clang/llvm-produced binaries so that the
multivalue feature can be used as necessary when necessary within our
toolchain and also when running other optimizers (e.g. wasm-opt) after
the LLVM code generation.

But some WebAssembly toolchains, such as Emscripten, do not provide
both mulvalue-returning and not-multivalue-returning versions of
libraries. Also allowing the uses of multivalue in the features section
does not necessarily mean we generate them whenever we can to the
fullest, which is a different code generation / optimization option.

So this makes the lowering of multivalue returns conditional on the use
of 'experimental-mv' target ABI. This ABI is turned off by default and
turned on by passing `-Xclang -target-abi -Xclang experimental-mv` to
`clang`, or `-target-abi experimental-mv` to `clang -cc1` or `llc`.

But the purpose of this PR is not tying the multivalue lowering to this
specific 'experimental-mv'. 'experimental-mv' is just one multivalue ABI
we currently have, and it is still experimental, meaning it is not very
well optimized or tuned for performance. (e.g. it does not have the
limitation of the max number of multivalue-lowered values, which can be
detrimental to performance.) We may change the name of this ABI, or
improve it, or add a new multivalue ABI in the future. Also I heard that
WASI is planning to add their multivalue ABI soon. So the plan is,
whenever any one of multivalue ABIs is enabled, we enable the lowering
of multivalue returns in the backend. We currently have only
'experimental-mv' in the repo so we only check for that in this PR.

Related past discussions:
 llvm#82714
WebAssembly/tool-conventions#223 (comment)
@aheejin
Copy link
Member Author

aheejin commented Apr 12, 2024

I submitted a PR to do what I proposed above: llvm/llvm-project#88492

This makes the lowering multivalue returns in the backend conditional to the use of 'experimental-mv' option. But I plan to extend this to other multivalue ABIs as they are added; if WASI adds their multivalue ABI, the backend can also detect that ABI and enable multivalue return lowering.

@jameysharp
Copy link

Is the problem here that the cryptography functions are compiling with the assumption that they have native int128 support? Perhaps if they would prefer a different codepath if they knew that int128 were being software emulated? i.e. is there some HAVE_NATIVE_INT128 macro that they are getting wrong for wasm?

I'm personally not sure about these particulars. @jameysharp might have a better idea, as he was the one doing the original investigation.

In case this still matters: I was testing a Rust implementation of RSA, compiled to WebAssembly and then run on Wasmtime on x86-64. What I found was a ~10% improvement in the speed of RSA operations by changing the underlying multi-precision integer library from 64-bit limbs to 32-bit, which fortunately it had a build-time configuration option for.

When that library did multiplication, it needed double-width temporaries, so with 64-bit limbs it used 128-bit multiplies implemented by libcalls to __multi3. With 32-bit limbs it used 64-bit multiplies which are natively available in WebAssembly. I opened an issue with that library about preferring 32-bit limbs on WebAssembly targets but I don't think it went anywhere.

I believe, but haven't proven yet, that in Wasmtime we have the ability to write optimization rules which will transform the arithmetic in __multi3 back into a native 128-bit multiply. But there's no way we can optimize away the memory accesses that result from returning the pair of results on the shadow stack, because by the time we receive this code, we're constrained by WebAssembly semantics to faithfully implement those writes to linear memory. A post-compilation inlining pass has the same problem. So I'm in favor of any plan that allows these libcalls to use multivalue returns.

jedisct1 added a commit to jedisct1/openssl-wasm that referenced this pull request Apr 12, 2024
aheejin added a commit to llvm/llvm-project that referenced this pull request Apr 23, 2024
…88492)

Multivalue feature of WebAssembly has been standardized for several
years now. I think it makes sense to be able to enable it in the feature
section by default for our clang/llvm-produced binaries so that the
multivalue feature can be used as necessary when necessary within our
toolchain and also when running other optimizers (e.g. wasm-opt) after
the LLVM code generation.

But some WebAssembly toolchains, such as Emscripten, do not provide both
mulvalue-returning and not-multivalue-returning versions of libraries.
Also allowing the uses of multivalue in the features section does not
necessarily mean we generate them whenever we can to the fullest, which
is a different code generation / optimization option.

So this makes the lowering of multivalue returns conditional on the use
of 'experimental-mv' target ABI. This ABI is turned off by default and
turned on by passing `-Xclang -target-abi -Xclang experimental-mv` to
`clang`, or `-target-abi experimental-mv` to `clang -cc1` or `llc`.

But the purpose of this PR is not tying the multivalue lowering to this
specific 'experimental-mv'. 'experimental-mv' is just one multivalue ABI
we currently have, and it is still experimental, meaning it is not very
well optimized or tuned for performance. (e.g. it does not have the
limitation of the max number of multivalue-lowered values, which can be
detrimental to performance.) We may change the name of this ABI, or
improve it, or add a new multivalue ABI in the future. Also I heard that
WASI is planning to add their multivalue ABI soon. So the plan is,
whenever any one of multivalue ABIs is enabled, we enable the lowering
of multivalue returns in the backend. We currently have only
'experimental-mv' in the repo so we only check for that in this PR.

Related past discussions:
 #82714
WebAssembly/tool-conventions#223 (comment)
@sunfishcode
Copy link
Member

@aheejin What is the current status of this PR?

@aheejin
Copy link
Member Author

aheejin commented Dec 11, 2024

@sunfishcode We ended up doing llvm/llvm-project#88492 instead.
Relevant comments:
#223 (comment)
#223 (comment)

Will close this.

@aheejin aheejin closed this Dec 11, 2024
@aheejin aheejin deleted the compiler_rt_abi branch December 11, 2024 02:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.