Driver bug triggered in ZeroInitializeWorkgroupMemory #4591

Open
raphlinus opened this issue Oct 27, 2023 · 0 comments
Labels
  • backend: vulkan (Issues with Vulkan)
  • external: driver-bug (A driver is causing the bug, though we may still want to work around it)

raphlinus commented Oct 27, 2023

Description
Running Vello on an AMD 5700 XT triggers a shader miscompilation at the driver level, causing incorrect behavior and ultimately a device-lost error.

Repro steps

git clone -b oh_eighteen https://github.com/DJMcNab/vello.git
cd vello
cargo run -p with_winit

Note: this is PR #398 of linebender/vello. The same thing happens on main, but this branch brings us to wgpu 0.18, and I figured it would be more helpful to work on the most recent versions.

Expected vs observed behavior
The example typically displays a couple of frames, sometimes correctly and sometimes corrupted, then exits with a device-lost error. The expected behavior is to display a tiger test image and performance statistics.

Extra materials
We tracked this down to a very buggy implementation of ZeroInitializeWorkgroupMemory in the AMD driver. The core problem is that it zeroes the workgroup-shared memory and then proceeds to user code without a barrier, so some invocations can enter user code while others are still zeroing. A secondary problem is that the zeroing is extremely inefficient; it appears that every thread zeroes the entire array.
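
To see why the missing barrier is a correctness bug and not merely a performance one, here is a rough CPU analogy in Rust (purely illustrative, not the driver's actual code): each worker zeroes a strided slice of a shared array, and all workers must rendezvous at a barrier before any of them runs user code that assumes the array is zeroed. The step the driver effectively omits corresponds to barrier.wait() below.

```rust
// Rough CPU analogy, not the driver's code: N workers share one array;
// each zeroes its own strided slice, then all meet at a barrier before
// "user code" runs.
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

fn main() {
    const WORKERS: usize = 4;
    const LEN: usize = 64;
    // Pretend the shared array starts with garbage, like uninitialized LDS.
    let shared = Arc::new(Mutex::new(vec![1u32; LEN]));
    let barrier = Arc::new(Barrier::new(WORKERS));

    let handles: Vec<_> = (0..WORKERS)
        .map(|tid| {
            let shared = Arc::clone(&shared);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                // Zeroing prologue: each worker clears only its strided
                // slice, instead of every worker clearing the whole array.
                {
                    let mut buf = shared.lock().unwrap();
                    for i in (tid..LEN).step_by(WORKERS) {
                        buf[i] = 0;
                    }
                }
                // The step the driver omits: without this rendezvous, a
                // fast worker could enter user code while a slow one is
                // still zeroing.
                barrier.wait();
                // User code may now safely assume the array is all zero.
                assert!(shared.lock().unwrap().iter().all(|&x| x == 0));
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}
```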

One of the offending shaders is draw_reduce. The post-processed WGSL is attached, as is the SPIR-V output. Note that the SPIR-V contains no zeroing logic, since spv::ZeroInitializeWorkgroupMemoryMode::Native was selected in adapter.rs.
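
For reference, a minimal sketch of how that mode is passed to naga's SPIR-V backend (assuming the naga API as of roughly wgpu 0.18; the exact field and type paths are assumptions, not copied from adapter.rs):

```rust
// Sketch, assuming naga's SPIR-V backend API circa wgpu 0.18.
use naga::back::spv;

fn backend_options() -> spv::Options {
    spv::Options {
        // `Native` tells naga to rely on the driver (via
        // VK_KHR_zero_initialize_workgroup_memory / Vulkan 1.3) to zero
        // workgroup variables, so the emitted SPIR-V carries no zeroing
        // loop of its own; hence none appears in the attached .spv.
        zero_initialize_workgroup_memory: spv::ZeroInitializeWorkgroupMemoryMode::Native,
        ..Default::default()
    }
}
```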

I captured the ISA using the Radeon Developer Panel via select-all and copy (Ctrl-A, Ctrl-C), after choosing inputs that let the example run without crashing long enough to capture a trace. There may be a better way to do this; if so, please let me know. In any case, three things are wrong:

  • There is no s_barrier between the zeroing logic and the user code
  • It appears that all invocations in the workgroup zero the entire array. If this happened at the SPIR-V level, the conflicting writes would constitute a data race and thus undefined behavior; perhaps the behavior is defined at the ISA level, but it is certainly a performance problem if nothing else.
  • Speaking of performance problems, almost a thousand lines of ISA to zero an array is clearly not a good idea. The code is just bad; among other things, it repeatedly zeroes v[4:7] using the v_lshlrev_b64 instruction.

It makes sense to work around the broken driver by disabling ZeroInitializeWorkgroupMemoryMode::Native on affected drivers, and also to escalate the bug to AMD.
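
A sketch of what that workaround could look like (illustrative only; the driver_has_broken_zero_init helper and the plumbing are hypothetical, not the actual wgpu-hal code in adapter.rs):

```rust
// Illustrative sketch of the proposed workaround, not actual wgpu-hal code.
use naga::back::spv::ZeroInitializeWorkgroupMemoryMode;

/// Hypothetical quirk check: true for drivers known to mis-implement
/// native workgroup-memory zeroing (here, the affected AMD driver).
fn driver_has_broken_zero_init(vendor_id: u32, _driver_version: u32) -> bool {
    const VENDOR_ID_AMD: u32 = 0x1002; // AMD's PCI vendor ID
    // Real code would also gate on the specific driver version range.
    vendor_id == VENDOR_ID_AMD
}

fn pick_zero_init_mode(
    vendor_id: u32,
    driver_version: u32,
    native_supported: bool,
) -> ZeroInitializeWorkgroupMemoryMode {
    if native_supported && !driver_has_broken_zero_init(vendor_id, driver_version) {
        ZeroInitializeWorkgroupMemoryMode::Native
    } else {
        // Polyfill makes naga emit the zeroing loop and the barrier
        // itself, sidestepping the driver's broken implementation.
        ZeroInitializeWorkgroupMemoryMode::Polyfill
    }
}
```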

amd_bug_files.zip

Platform
Windows 10. AMD Radeon 5700 XT running driver 2.0.233, API version 1.3.217. This is running on Vulkan, selected via wgpu's PRIMARY backend default. With DX12 selected, the example runs, but with pathologically slow shader compile times.

cwfitzgerald added the external: driver-bug and backend: vulkan labels on Oct 27, 2023