-
Notifications
You must be signed in to change notification settings - Fork 695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Partially out-of-bound writes #1490
Comments
As one point, deterministic behaviour helps to generally ensure that different implementation decisions in different engines don't lead to observably divergent behaviours (even on a single underlying system). Presumably this would be relevant here when comparing engines implementing load/store with explicit bounds check vs guard pages. How do the browsers using guard pages currently achieve conformant behaviour on Arm? |
I'm not sure for Firefox: here's a lengthy thread about the issues they've found and it's not just Arm specific. For V8, though I don't think I've actually enabled it yet for Chromium, we can rely on trap handlers for loads but have to perform manual bounds checks for stores.
Sure, but I'm still struggling to understand in what situations it is useful, I don't understand how it helps once we trap. Can it help in some debug situations? |
Sorry, I rushed to write the previous comment and didn't work through the reasoning properly. On the Web, it's possible to catch Wasm traps in JS and continue execution (and observe the state of the memory). We have a lot of previous experience with tiny edge-cases of the semantics being relied on by some Web program so there would be some inertia against introducing further non-determinism here. That being said, if browsers currently aren't (or even can't) implementing this conformantly (as the Firefox issue above suggests), that would be an argument that we could relax things, since existing Web programs couldn't be relying on conformant behaviour. |
Okay, thanks. From bug trackers though, it seems it only became 'a problem' once the necessary tests were added to the wasm spec test-suite. And I'd just like re-iterate that most (all) wasm runtimes only pass this test, on Arm hardware, because of microarchitecture details (luck!) so it's not just browsers who don't rely on conformant programs... it's basically everything. It seems it is both difficult to implement two policies and it also mean giving up ~10-20% execution performance when using explicit checks. |
Yeah, if being truly conformant requires engines to totally give up on the guard page strategy, I would (personally) count that as "can't implement conformantly" and (personally) argue that the semantics should be relaxed. @eqrion am I understanding correctly that today in "release" Firefox, it's possible to craft a program that (in some platform) can observably write bytes to memory in the case of a partially OOB write? |
Even if we were to relax this, we would need to fix a behaviour in deterministic mode, which some platforms have a hard reliance on. Is the only solution for them to not use guard pages, or not use Arm? |
I don't have an answer of how else to get around it... Maybe the penalty of explicit checks is worth the absolute determinism? Having to exclude an underlying hardware implementation doesn't sound very wasmy. Would you mind sending me some link(s) to the aforementioned platforms? I would really like to know about some concrete use-cases. EDIT: On a single aarch64 Linux system, I found the execution time overhead of explicit checks, for stores only, to be ~5% across an arbitrary set of benchmarks. |
One particular sensitive class of examples is Blockchain platforms running Wasm, because they rely on consensus between replicated distributed computations, possibly across heterogeneous hardware. That is only meaningful if all executions are guaranteed to be fully deterministic, including in cases where traps occur. There is bunch of these platforms, such as Dfinity, Parity, etc. |
Thank you! I feared blockchain maybe the answer :) So, sorry to get a little side-tracked, but I think I'm getting lost among the jargon... I see you are/were affiliated with Dfinity, so maybe you know for them, does their platform run on aarch64? And do you know whether most of these platforms are using wasmtime/cranelift? (IIRC, cranelift does not produce conforming code.) I can see that Parity have their own interpreter, do you know if rolling-your-own interpreters/compilers is a popular choice? |
Does the blockchain use case require consensus on failed executions? I'd have expected that a result is thrown out if it ends in a trap, and that that is effectively the way to get to deterministic outcomes for that case. |
@sparker-arm Can you clarify which arm architectures might be affected? Does this apply to modern processors and arm64, or is this limited to certain classes of arm processors? |
To be pedantic, no arm architecture provides the guaranteed semantics. It is implementation defined. The problems, generally, do not arise during testing because it's either running in an emulator or on a microarchitecture that does provide the semantics. For arm designed cores, it can be loosely described as the big cores match wasm but the small cores do not. I don't have an exhaustive list. Apple silicon doesn't suffer from this inconsistency. |
@sparker-arm, afaik, Dfinity currently only runs on x64 nodes, but it is not intended to stay like that. For their use case at least, an interpreter would be too slow, because it's a 3rd gen blockchain that is meant to support full-featured on-chain applications. @tschneidereit, consensus also applies to trapping executions, since they have to materialise as a rejected message at the call site (for distributed message calls), so are an integral and well-defined part of the execution model. That said, I'm sure there are other use cases for deterministic execution that do not care about the failure case. Reproducible execution/builds could fall into that category (not sure). |
FWIW, the overheads can be much greater than that, depending on the benchmark. I've measured ~1.55x slow downs in both Wasmtime and SpiderMonkey on real world programs ( |
@sparker-arm Is there any feature flag or other way to detect whether the underlying hardware does a partial write or not (other than, say, the runtime performing one and catching the signal at startup)? Is it at least guaranteed that the hardware behaves the same way consistently? (Would I'm also curious about other architectures: does anyone know offhand if e.g. RISC-V or other ISAs guarantee anything about partially-trapping writes? |
@sparker-arm one other thought: is it sufficient, architecturally, to do a load (and throw away the result) to a stored-to address before every store? We would then take the fault on the load instead, without side-effects. Or does one need a fence to ensure the store doesn't reorder in that case? |
Yup! As my alter ego 'grubbymits' accidentally mentioned, I believe almost all configurations of big and small would be incompatible. I have a feeling there maybe a single big armv8 that behaves differently to the rest. |
I don't know, I need to speak to one of our memory model experts. I suspect that it may not work without atomics. |
Thinking about the load-before-store idea a bit more: a page fault is a precise exception, in the sense that all side-effects of earlier instructions should occur, and none of later instructions should, correct? (The concern here is just about precision within the sub-operations of a single instruction's write.) In other words, if I have a load that faults on page A, and then later in the program, a store to address B, the architecture should unambiguously state that the store to B does not occur (or may occur speculatively, but we're only concerned with final architectural state here)? Otherwise, you couldn't restart after a page fault, because A and B might alias. |
Even if this does work, I don't really think we should be coming up with hacks to get WebAssembly to run well on a very prominent architecture. It seems clear that the design decision was made without having enough information, and it appears that most runtimes continue to ignore the spec, so maybe the spec should change... From the crypto use-case, it seems they all provide an SDK so they can fully control the target wasm feature set, and so they can disable anything that could introduce in-determinism. If this is the main case for fully deterministic memory, I wonder if we could have this as an opt-in feature? |
@sparker-arm, deterministic mode (introduced with the Relaxed SIMD proposal, which is similarly messy) is what I was getting at. But we'll still need to specify some concrete behaviour for that. I am trying to understand if this mode would even be implementable on Arm without severe performance hits, and what the cheapest semantics would be for that. |
Well, is there any way that we can implement alignment guarantees in WebAssembly? (I'm not sure I understand the utility of alignment hints.) |
Sadly, the answer to all your questions is no. And yes, architecturally speaking there is no requirement that a particular implementation/microarchitecture/CPU behaves consistently, and that is even if we ignore the existence of technologies such as big.LITTLE (e.g. a single-core system). |
It'd still be useful from my perspective at least to have a definitive answer on the load-before-store idea: if the spec doesn't change, then this is the only reasonable option I see to keep competitive Wasm performance on aarch64, unless I'm missing some other option! |
Also, to be precise, on AArch64 the behaviour is not implementation-defined in the sense that an implementation is supposed to choose a specific behaviour and to stick to it - architecturally the contents of the in-bound portion of the memory affected by a store become unknown (so in theory they could become zero, for instance, even if the value to be stored is non-zero). |
@rossberg I'm still trying to understand the mechanics for consensus, do the linear memory of the instances have to be compared? |
@sparker-arm, sort of. In the abstract, each node computes some form of cryptographic hash of the state changes it has performed, including the memory contents. Virtual memory techniques to identify dirtied pages at the end of execution make that practical and fairly efficient even for large memories, because the number of touched pages is bounded by the gas limit. That said, we shouldn't narrow the consideration to the obscure needs of blockchains. Portable, deterministic execution and the absence of undefined behaviour (even where non-determinism is allowed) has been one of the original design goals and a big selling point of Wasm. We somehow need to maintain it. Demoting determinism to an opt-in mode already is a compromise that not everybody was happy with. |
Okay, so I'm now trying to reason what is different about this case of non-determinism compared to the existing standard, and incoming extensions. Given that wasm can produce non-deterministic NaN values, this case doesn't feel wildly different to me. I see that NaN values are normalised with instrumentation in the Dfinity SDK/runtime? Could partial writes be handled in the same way? Maybe even by breaking up stores into byte stores? It just seems there's existing precedence for working around the limitations for runtimes that need it. |
It's in fact Wasmtime that does the NaN normalisation, it has a configuration option for that. That roughly corresponds to the idea of having a deterministic mode, which this engine (and possibly others) supports. It would be okay-ish to only have deterministic behaviour of OOB writes in deterministic mode. But even then we need to resolve two questions:
|
All little cores, AFAIK. The big cores from our Sophia design team would also break wasm, which I believe is the Cortex-A73 and A75.
Just wondering whether you have use for it outside of debug? I would definitely like clarification of what the state of the store should be when a program is reduced to a single trap. |
@sparker-arm, not sure. It would be a value that was neither in the memory before the write nor was written to it. Although thin-air semantics typically means arbitrary value, while this would be a fixed one. @cfallin, (i) does not answer the question what to do in deterministic mode, (ii) would be a severe breaking change -- it is completely legal today to reenter Wasm and access memory after traps, and it appears highly unlikely that no existing web page or application does that in the case of OOB. It would also have quite non-trivial consequences for the relaxed memory model, similar to "shrinking" a live memory. |
But is the state of memory, after a trap, actually well-defined by the spec? This is what I've found:
This doesn't sound like the memory can be inspected to figure out what the program did before the trap executed. |
Yes, absolutely. The (non-normative) bits you quote just give a general intuition that a trap aborts the current computations, including any effect the current instruction might otherwise have. The normative specification for executing stores is here. In particular, the bounds check is performed and results in a trap (step 12a) before any modification of the state is performed (step 15). (This spec becomes much more complicated with the relaxed memory model in Wasm 3+, but the result is the same as fas as single-threaded memory is concerned. For shared memory, it also specifies that no mutation can be observed by concurrent reads, and that OOB checks are implicitly atomic.) |
Thanks, but it's the notion of 'ultimately reducing the entire program to a single trap instruction', which I'm struggling with. Surely a program which is reduced to a single trap shouldn't have any affect on memory at all? |
Ah, yes, I can see how that is confusing. "Program" is meant here in a more formalistic sense, being the current sequence of instructions executed (since entering Wasm), which is "reduced" to a result per a small-step reduction semantics. It doesn't mean that any actual code disappears, that still exists in the function store of the abstract machine and can be reinvoked. |
Allowing partially OOB writes to nondeterministically write any value to the in-bounds region, then tightening that up to say they always write zeroes to the in-bounds region under the deterministic profile sounds ok to me. As an alternative for the deterministic profile, would it be feasible to say that the in-bounds portion of the write will succeed? On the trapping path, the write would essentially have to be repeated byte-by-byte for the in-bounds region, but at least there's no need to know what the previous value in memory was under that scheme. |
@tlively, I think the problem is that the trap handler has no access to the value that was attempted to be written. |
I was assuming that the written value would be easier to provide to the trap handler than the previous in-memory value, but maybe it's still too complicated / slow to arrange for that. |
At least in Wasmtime, it's conceivable that we could include enough information in the trap metadata to replay the store byte-by-byte without having to include a full instruction disassembler in the segfault handler, starting from the register state. That would allow a "zero-cost" approach (in the same sense as for exceptions) where there's no action to save the store data on the common path. I think we'd still prefer the fixed-value (zeroes) approach to avoid that complexity if it's palatable, though I do understand that "in-bound part of the store succeeds" is much cleaner semantics. |
@cfallin, yeah, I can believe that there is a plausible implementation strategy for some engines on some hardware platforms, but as long as it can't easily be done on all relevant hardware platforms, it would seem unwise to prescribe this semantics. |
@cfallin AFAICT It's possible to do this without metadata by decoding the faulting machine store instruction and getting the supplied value from either an immediate or a register (maybe memory, if it's a RMW, but that probably won't work if its a RMW of the same memory location, since per discussion thread the contents of the memory location are now unspecified on arm). Also, given that many different kind of store instructions could trap, the instruction decoder for that could be quite unwieldy. Another option is to use metadata to encode the original (bytecode) PC and then replay the bytecode in an interpreter or JIT a stub. Those are probably out of wasmtime's design criteria, though. Although, given that there is probably a finite number of combinations of stores + regs, precompiling all such stubs might be possible. |
Indeed.
I'm not sure how this would work, without still having to decode the instruction to figure out what data needs to be stored where? Storing zeros removes the need to decide out what to store, but we'd still need to decode to find out the start address of the unaligned access. (Though this is trivial for half-word stores). To avoid any decoding, would it be feasible to generate a side-table to describe each machine-level store? |
It also looks like partial writes are a potential problem for POWER too. From Power ISA Version 3.1B, Book II, Chapter 2 Any other Load or Store instruction may be partially When an instruction is aborted after being partially |
Right, I meant that there'd be enough metadata to go back to executing the instruction in the interpreter, which means both the bytecode PC and the bytecode's inputs (value stack operands), which constitute the address and value stored. If that were compressed, it basically amounts to a sidetable as you describe. Thanks for the additional context for POWER. The upshot of that seems to be basically the same as ARM. |
…ng hardware. As discussed at in bytecodealliance#7237 and WebAssembly/design#1490, some instruction-set architectures do not guarantee that a store that "partially traps" (overlaps multiple pages, only one of which disallows the store) does not also have some side-effects. In particular, the part of the store that *is* legal might succeed. This has fascinating implications for virtual memory-based WebAssembly heap implementations: when a store is partially out-of-bounds, it should trap (there is no "partially" here: if the last byte is out of bounds, the store is out of bounds). A trapping store should not alter the final memory state, which is observable by the outside world after the trap. Yet, the simple lowering of a Wasm store to a machine store instruction could violate this expectation, in the presence of "store tearing" as described above. Neither ARMv8 (aarch64) nor RISC-V guarantee lack of store-tearing, and we have observed it in tests on RISC-V. This PR implements the idea first proposed [here], namely to prepend a load of the same size to every store. The idea is that if the store will trap, the load will as well; and precise exception handling, a well-established principle in all modern ISAs, guarantees that instructions beyond a trapping instruction have no effect. This is off by default, and is mainly meant as an option to study the impact of this idea and to allow for precise trap execution semantics on affected machines unless/until the spec is clarified. On an Apple M2 Max machine (aarch64), this was measured to have a 2% performance impact when running `spidermonkey.wasm` with a simple recursive Fibonacci program. It can be used via the `-Ccranelift-ensure_precise_store_traps=true` flag to Wasmtime. [here]: WebAssembly/design#1490 (comment)
…ng hardware. As discussed at in bytecodealliance#7237 and WebAssembly/design#1490, some instruction-set architectures do not guarantee that a store that "partially traps" (overlaps multiple pages, only one of which disallows the store) does not also have some side-effects. In particular, the part of the store that *is* legal might succeed. This has fascinating implications for virtual memory-based WebAssembly heap implementations: when a store is partially out-of-bounds, it should trap (there is no "partially" here: if the last byte is out of bounds, the store is out of bounds). A trapping store should not alter the final memory state, which is observable by the outside world after the trap. Yet, the simple lowering of a Wasm store to a machine store instruction could violate this expectation, in the presence of "store tearing" as described above. Neither ARMv8 (aarch64) nor RISC-V guarantee lack of store-tearing, and we have observed it in tests on RISC-V. This PR implements the idea first proposed [here], namely to prepend a load of the same size to every store. The idea is that if the store will trap, the load will as well; and precise exception handling, a well-established principle in all modern ISAs, guarantees that instructions beyond a trapping instruction have no effect. This is off by default, and is mainly meant as an option to study the impact of this idea and to allow for precise trap execution semantics on affected machines unless/until the spec is clarified. On an Apple M2 Max machine (aarch64), this was measured to have a 2% performance impact when running `spidermonkey.wasm` with a simple recursive Fibonacci program. It can be used via the `-Ccranelift-ensure_precise_store_traps=true` flag to Wasmtime. [here]: WebAssembly/design#1490 (comment)
I've gone ahead and implemented the load-before-store idea in Wasmtime and measured a 2% perf impact on Apple aarch64 hardware (which doesn't see torn stores, but is what I have at hand). It would still be useful to continue discussing the spec change options above, but if it remains as-is, it seems we could implement "precise post-trap memory state" for those who need it with reasonable-ish overhead. |
…ng hardware. As discussed at in bytecodealliance#7237 and WebAssembly/design#1490, some instruction-set architectures do not guarantee that a store that "partially traps" (overlaps multiple pages, only one of which disallows the store) does not also have some side-effects. In particular, the part of the store that *is* legal might succeed. This has fascinating implications for virtual memory-based WebAssembly heap implementations: when a store is partially out-of-bounds, it should trap (there is no "partially" here: if the last byte is out of bounds, the store is out of bounds). A trapping store should not alter the final memory state, which is observable by the outside world after the trap. Yet, the simple lowering of a Wasm store to a machine store instruction could violate this expectation, in the presence of "store tearing" as described above. Neither ARMv8 (aarch64) nor RISC-V guarantee lack of store-tearing, and we have observed it in tests on RISC-V. This PR implements the idea first proposed [here], namely to prepend a load of the same size to every store. The idea is that if the store will trap, the load will as well; and precise exception handling, a well-established principle in all modern ISAs, guarantees that instructions beyond a trapping instruction have no effect. This is off by default, and is mainly meant as an option to study the impact of this idea and to allow for precise trap execution semantics on affected machines unless/until the spec is clarified. On an Apple M2 Max machine (aarch64), this was measured to have a 2% performance impact when running `spidermonkey.wasm` with a simple recursive Fibonacci program. It can be used via the `-Ccranelift-ensure_precise_store_traps=true` flag to Wasmtime. [here]: WebAssembly/design#1490 (comment)
Hi all: I've been following this one for a while now, and it seems to me to be a real challenge! I'm trying to understand the rationale for mandating the behaviour in the WebAssembly specification came about - regardless on whether there are implementation that depend on the behaviour. Partially OOB store behaviour - introduction of the required behaviour to the spec:Following up how the mandated behaviour was introduced to the specification, I'm not sure of the history, but it feels like there is a case of 'Inception' going on: WebAssembly/spec#438 , WebAssembly/spec#439 , #1025 These seem to form a sort of loop: However, I would hazard that that the uncontroversial clarification is anything but uncontroversial, given the amount of debate this has generated, and that at least three target architectures cannot efficiently implement it? Architecture vs. implementation & probabilistic behavioursIn terms of the (specifically) architectural specs in question, the behaviour in point is architecturally permitted - Whether the behaviour occurs or not isn't just about the particular implementation of the Architecture: The behaviour that is problematic is also likely to be probabilistic dependent on current system state. I also believe that Architecturally, x86 has some complicated cases in which it may be permitted to have similar behaviour (unaligned writes are not guaranteed atomicity, for example) - this is something that we probably need chapter and verse on from Book 1 or 3 of the x86 spec... Architectural litmus test tools like Hurd7 help to check for permitted behaviours across all the mentioned architectures - Notably, however, this particular case is also a challenge to represent in Hurd7 syntax at this moment in time, though as I understand it, it is being worked on. This could be potentially used to check the permitted an actual (it also generates sample code for execution on targets) behaviours across the architectures - but note the probabilistic effect is still also in play for on target tests. Ways forward?Given the apparent history of the introduction of this behaviour to the specification, and the impact on performance that other schemes for approaching it have been demonstrated to have: Is it not reasonable on the balance of the design principals of WebAssembly to revert this tightening of requirements in the face of mounting evidence contrary to the assertion that introduced it? (convoluted sentence there, but the problem is convoluted!) If the answer is that 'it can't be reverted because all use cases rely on the behaviour', then perhaps It does belong. Challenges, and some questions:There are evidently issues when dealing with uses of WebAssembly (Some Cryptocurrency/Smart Contracts) that require deterministic behaviours to remain externally visible after a trap has occurred. As I understand it The WebAssembly core specification defines the behaviour from the design point of view of internal consistency and determinism. At the point of trap, this viewpoint is no longer present, and the rules that apply to it cease to apply - hence it is valid to suggest a 'fix up' of the state for a failed trap on partially out of bounds, but this solution raises the question for me: The requirement for deterministic state in the case of partially OOB writes is from that same external frame of reference, not internal. Hoping that some of this leads to clarity and a good way forwards, many thanks to all on here for the discussion and hard work! |
I may be misreading what you mean here, but to clarify: the core spec specifies the observable behaviour of external invocations of Wasm functions in a given "store". And it specifies and guarantees precisely the state that a Wasm store is in after a trap — a trap only aborts the current computation, it does not invalidate the store it ran in. That means that future external invocations of Wasm functions in the same context that previously trapped still have well-defined behaviour. This is absolutely crucial. Currently, this also is deterministic wrt to partially OOB traps. That could be relaxed, but at least in the deterministic profile it definitely needs to remain so. |
Do you have a view on how profiles would be implemented? Could this be a runtime parameter that is passed during module instantiation? |
@sparker-arm, profiles are not chosen or configured dynamically or on a per-application basis. Profiles are a means to support the diversity of (disjoint) ecosystems that want to include Wasm but cannot support certain features. As such, a profile is selected globally by an ecosystem that supports Wasm. For example, a Wasm infrastructure for embedded devices might not be able to support GC or SIMD. Ecosystems like blockchains running Wasm code need determinism. All applications build for such an ecosystem are then bound by this system-wide choice of profile. |
Right, sorry, I only now realise that you have a proposal for them! I will take a look. |
@rossberg So, what's the status of the proposal? I see you have a PR open to add the deterministic profile. Is the suggestion here to add a deterministic marker to the currently defined semantics of OOB stores? |
The deterministic profile already was integrated into the relaxed SIMD proposal. As for the OOB issue at hand, at least in the deterministic profile, the semantics has to remain as strict as it is right now. There has not been a resolution whether to relax it in general. |
Okay, thanks for the clarification. |
The semantics of partially OOB writes was clarified in #1025. This has proved problematic for runtimes running on Arm, where there is no such guarantee in the architecture. I think all now browsers implement the spec properly, but I'm not sure if any of the wasm-specific non-browser runtimes do.
I could see that having deterministic memory contents is a nice property, but I'm not sure of what the real advantages are, could someone enlighten me?
And do you think there's any room to relax the spec in this regard, possibly even through an extension like relaxed-simd?
The text was updated successfully, but these errors were encountered: