Driver bug triggered in ZeroInitializeWorkgroupMemory #4591
Labels
backend: vulkan
Issues with Vulkan
external: driver-bug
A driver is causing the bug, though we may still want to work around it
Description
Running Vello on an AMD 5700 XT triggers a shader miscompilation at the driver level, which causes incorrect behavior (resulting in device lost).
Repro steps
git clone -b oh_eighteen https://github.com/DJMcNab/vello.git
cd vello
cargo run -p with_winit
Note: this is PR #398 of linebender/vello. The same thing happens on main, but this branch brings us to wgpu 0.18, and I figured it would be more helpful to work on the most recent versions.
Expected vs observed behavior
The example typically displays a couple of frames, some correct and some corrupted, then exits with a device lost error. The expected behavior is to display a tiger test image and performance statistics.
Extra materials
We tracked this down to a very buggy implementation of ZeroInitializeWorkgroupMemory in the AMD driver; the core problem is that it's zeroing the workgroup-shared memory and then proceeding to user code without a barrier. A secondary problem is that it's doing so extremely inefficiently; it appears all threads are zeroing the entire array.
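To make the ordering requirement concrete, here is a minimal CPU-side analogy in Rust. It is not the driver's code or anything wgpu generates: `run_workgroup` is a hypothetical helper that models one workgroup with `std::thread` and `std::sync::Barrier`. Each simulated thread zeroes only its own strided share of the array, and a barrier separates the zeroing from the "user code", which is the step the driver appears to skip.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

/// Hypothetical model of one workgroup: `wg_size` threads share `len` words.
/// Returns, per thread, how many non-zero words it saw after zeroing.
/// Assumes `wg_size > 0`.
fn run_workgroup(wg_size: usize, len: usize) -> Vec<usize> {
    // Start from garbage, as uninitialized workgroup memory might be.
    let shared: Arc<Vec<AtomicU32>> =
        Arc::new((0..len).map(|_| AtomicU32::new(0xDEAD_BEEF)).collect());
    let barrier = Arc::new(Barrier::new(wg_size));

    let handles: Vec<_> = (0..wg_size)
        .map(|local_id| {
            let shared = Arc::clone(&shared);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                // Cooperative zeroing: thread i clears i, i + wg_size, ...
                // (rather than every thread clearing the whole array).
                for i in (local_id..len).step_by(wg_size) {
                    shared[i].store(0, Ordering::Relaxed);
                }
                // No thread proceeds until all threads have finished zeroing.
                barrier.wait();
                // "User code": every word must now read as zero.
                shared
                    .iter()
                    .filter(|w| w.load(Ordering::Relaxed) != 0)
                    .count()
            })
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Without the `barrier.wait()` call, a fast thread could reach the "user code" while other threads are still clearing their share, which is the CPU counterpart of the race described above.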
One of the offending shaders is draw_reduce. The post-processed WGSL is attached, as is the SPIR-V output. Note that the SPIR-V does not contain any zeroing logic, as spv::ZeroInitializeWorkgroupMemoryMode::Native was selected in adapter.rs.
I captured the ISA using Radeon Developer Panel, doing ctrl-A, ctrl-C (and choosing inputs so it would run without crashing so I could capture a trace). Maybe there's a better way to do it; if so, please let me know. In any case, three things are wrong:
- There is no s_barrier between the zeroing logic and the user code.
- [...] v_lshlrev_b64 instruction.
It makes sense to work around the broken driver by disabling ZeroInitializeWorkgroupMemoryMode::Native, and also to escalate the bug to AMD.
amd_bug_files.zip
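The workaround suggested above, avoiding the native zero-initialization mode on the affected driver, could be sketched as follows. Everything here is a stand-in for illustration: `ZeroInitMode`, `DriverInfo`, and `choose_zero_init_mode` are hypothetical names, not the types used in adapter.rs, and excluding all AMD devices by PCI vendor ID is a deliberately conservative assumption, not a statement of which driver versions are affected.

```rust
/// Hypothetical stand-in for the mode selected in adapter.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ZeroInitMode {
    /// Ask the driver to zero workgroup memory.
    Native,
    /// Assumed alternative: the generated shader zeroes the memory itself.
    Polyfill,
}

/// Hypothetical summary of what the backend knows about the device.
#[derive(Debug, Clone, Copy)]
struct DriverInfo {
    vendor_id: u32,
    supports_native_zero_init: bool,
}

/// PCI vendor ID assigned to AMD.
const AMD_VENDOR_ID: u32 = 0x1002;

/// Use the native mode only when it is supported and the vendor is not AMD.
fn choose_zero_init_mode(driver: DriverInfo) -> ZeroInitMode {
    if driver.supports_native_zero_init && driver.vendor_id != AMD_VENDOR_ID {
        ZeroInitMode::Native
    } else {
        ZeroInitMode::Polyfill
    }
}
```

A real fix would probably want a narrower check (for example, on driver name or version) once the affected range is known, so that fixed drivers can return to the native path.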
Platform
Windows 10. AMD Radeon 5700 XT running driver 2.0.233, API version 1.3.217. This is running in Vulkan through the PRIMARY default. With DX12 selected, the example runs but with pathologically slow shader compile times.