
Add GPU sorting article #33

Merged: 4 commits merged into main from the gpu_sorting branch on Jan 21, 2024
Conversation

raphlinus (Contributor):

No description provided.

Analysis of the DeviceRadixSort code, including critiques of correctness/portability.
@DJMcNab (Member) left a comment:

I've not dug into the literature or any of the code, but here are some stylistic nits and requests for clarification.
Overall, this seems like a useful collection of links, though.

Incidentally, this review process exemplifies why I really want us to adopt the one-sentence-per-line convention.
Zulip thread: https://xi.zulipchat.com/#narrow/stream/181284-blogging/topic/One.20sentence.20per.20line

* Onesweep uses 8 bit digits (so 4 passes for a 32 bit key), while FFX uses 4.
* Onesweep uses [warp-local multi-split] (WLMS) for ranking, while FFX uses two 2-bit LSD passes.

Both original code bases use subgroups extensively. The FFX implementation works with a subgroup size of 16 or greater, and will produce incorrect results if deployed for a smaller subgroup size. Onesweep depends on a hardcoded subgroup (warp) size of 32 and would be difficult at best to make agile.
Member:

Is agile a standard term for this? I don't think it's one I've seen before, but I'm not very into the literature (yet?)
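
To make the digit-width comparison in the quoted bullets concrete, here is a minimal CPU-side sketch of an LSD radix sort in Rust (illustrative only; this is not code from FFX, Onesweep, or the article). With 8-bit digits a 32-bit key takes 32/8 = 4 passes; with 4-bit digits it would take 8. Each pass is a histogram, an exclusive prefix sum, and a stable scatter, which is roughly the work the GPU variants parallelize (WLMS, for instance, is a subgroup-level way of computing the ranks used in the scatter).

```rust
// Least-significant-digit radix sort with 8-bit digits: 32 / 8 = 4 passes.
fn radix_sort_u32(keys: &mut Vec<u32>) {
    const DIGIT_BITS: u32 = 8;
    const BUCKETS: usize = 1 << DIGIT_BITS;
    let mut scratch = vec![0u32; keys.len()];
    for pass in 0..(32 / DIGIT_BITS) {
        let shift = pass * DIGIT_BITS;
        // 1. Histogram of the current digit.
        let mut hist = [0usize; BUCKETS];
        for &k in keys.iter() {
            hist[((k >> shift) & (BUCKETS as u32 - 1)) as usize] += 1;
        }
        // 2. Exclusive prefix sum turns counts into output offsets.
        let mut sum = 0;
        for h in hist.iter_mut() {
            let count = *h;
            *h = sum;
            sum += count;
        }
        // 3. Stable scatter into the scratch buffer.
        for &k in keys.iter() {
            let d = ((k >> shift) & (BUCKETS as u32 - 1)) as usize;
            scratch[hist[d]] = k;
            hist[d] += 1;
        }
        // An even number of passes leaves the sorted data back in `keys`.
        std::mem::swap(keys, &mut scratch);
    }
}

fn main() {
    let mut data = vec![0xdeadbeef, 42, 7, u32::MAX, 1 << 20, 3];
    radix_sort_u32(&mut data);
    assert!(data.windows(2).all(|w| w[0] <= w[1]));
    println!("{data:?}");
}
```

The scatter being stable within each pass is what makes sorting digit by digit from the least significant end correct overall.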


### WebGPU experiment

Raph did an [experiment](https://github.com/googlefonts/compute-shader-101/pull/31) of a hybrid algorithm largely based on FFX, but adapted to WebGPU, and with a version of WLMS. It achieves approximately 1G elements/s on M1 Max.
Member:

Similarly I'm not sure the abbreviation 'WLMS' is that helpful. Reading through this once to review, I lost the context of that almost immediately, and it wasn't easy for me to find scanning back again.

Member:

It's warp-level multi-split from the Onesweep paper.

Member:

Indeed, I got that eventually, but it wasn't easy to context switch it back in. I was only skimming the bullet points, and didn't expect it to appear again.


### WebGPU experiment

Raph did an [experiment](https://github.com/googlefonts/compute-shader-101/pull/31) of a hybrid algorithm largely based on FFX, but adapted to WebGPU, and with a version of WLMS. It achieves approximately 1G elements/s on M1 Max.
@DJMcNab (Member), Jan 20, 2024:

How useful is elements/s as a metric?

My naïve understanding is that sorting speed has to depend on the total number of elements being sorted in one go (edit: as sorting has to be superlinear). Is this one sort of 1 billion items per second, or e.g. 100 sorts of 10 million items each?

Of course, I could be misunderstanding here - I've not followed the links thoroughly. In particular, this might already be explained if it's a segmented sort.

raphlinus (Author):

Comparison sorting algorithms are superlinear (n log n is a lower bound), but radix sort is typically linear. I've added a sentence.
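
For context, a sketch of the standard argument behind this reply (not taken from the article): a comparison sort must distinguish all $n!$ input orderings, while an LSD radix sort performs a fixed number of linear passes determined only by the key width $b$ and the digit width $d$:

$$
\text{comparisons} \;\ge\; \log_2(n!) \;=\; \Theta(n \log n),
\qquad
T_{\text{radix}}(n) \;=\; O\!\left(\left\lceil b/d \right\rceil \cdot n\right).
$$

So a single elements/s figure is at least well defined for a radix sort, provided the problem size is reported with it.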


Attempts to push this experiment to 8 bits per digit have not yet yielded sustainable performance improvement.

Several people have pointed out the Onesweep-inspired sort from Lichtso, part of the [splatter] Gaussian Splat implementation. However, as discussed in [splatter#2], it is an approximate sort only, and may rely on luck that the particular GPU will process atomics within a subgroup in order. In addition, because it is one-pass, on GPUs without a forward progress guarantee (of which Apple Silicon is a particularly notable example), the algorithm may deadlock or experience extended stalls. The experiment mentioned above has neither of these shortcomings. Note that it *does* achieve 8 bits per pass; it is entirely likely that a high performance implementation could draw inspiration from it, more so after subgroups land in WebGPU and thus real WLMS is possible.
Member:

In the last sentence, I'm having to guess that 'it' refers to Lichtso, rather than your experiment


Aras-P has been doing lots of experiments with sorting in his [UnityGaussianSplatting] implementation, all written in the Unity flavor of HLSL. [UnityGaussianSplatting#82] adds something called DeviceRadixSort which shows a modest performance improvement (the discussion thread also speaks to the difficulty of implementing such things portably). This moves to an 8 bit digit. It uses a subgroup-based implementation of WLMS, and appears to make some attempt to be agile in subgroup (wave) size. That said, on code examination it seems likely to fail on subgroup sizes of 64 or above (older AMD cards and Pixel 4, among others).

Another correctness concern is the [lack of the subgroup barrier](https://github.com/aras-p/UnityGaussianSplatting/blob/81a03b6fbddbecd056edeadff124870569b07c11/package/Shaders/DeviceRadixSort.hlsl#L334-L338) between the read of the histogram and its update (by different threads in the same warp). According to the Vulkan memory model, this is a data race and thus undefined behavior. Such a barrier doesn't exist in HLSL, so for strict correctness it would need to be upgraded to a `GroupMemoryBarrierWithGroupSync`, with some performance loss. Another correctness concern in the code is the assumption that `WaveGetLaneCount` will return the same value for different compute shaders on the same GPU, which may not be true on Intel in particular.
Member:

I'd probably prefer this to be an issue in the repo we can point to instead, which includes this same code link - it doesn't quite feel right to be "publishing" such a specific complaint, with no avenue for the author to be aware of this or add anything to it once it is resolved.

raphlinus (Author):

Yeah, I'll be refactoring this.


## Bitonic sort

[Bitonic sort] is often proposed as it is conceptually fairly simple the parallelism is easy to exploit, but when applied to large problems it is clear that the number of passes is unacceptably large; typically in the dozens where a radix sort would have 4 or 8.
Member:

in "is often proposed as it is conceptually fairly simple the parallelism is easy to exploit" I think there's a missing conjunctive of some kind
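
For scale on the pass count in the quoted paragraph (a standard fact about sorting networks, not from the article): a bitonic network over $n = 2^k$ elements runs

$$
\frac{k(k+1)}{2} \;\text{ compare-exchange stages}, \qquad \text{e.g. } n = 2^{20} \Rightarrow \frac{20 \cdot 21}{2} = 210,
$$

and even when the small-stride stages are fused inside a workgroup, dozens of global passes remain, versus $32/8 = 4$ or $32/4 = 8$ digit passes for the radix sorts discussed above.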


[Forma] has a sorting implementation called [conveyor_sort]. This is a merge sort and is in vanilla WebGPU. Performance has not been characterized yet.

[segmented sort]: https://moderngpu.github.io/segsort.html
Member:

I'd probably prefer to have these in order of appearance, but if we don't have automated tooling for that, then there's no need to.


* Onesweep uses a single pass scan for the digit histograms, while FFX uses a traditional multi-dispatch tree reduction approach.
* Onesweep uses 8 bit digits (so 4 passes for a 32 bit key), while FFX uses 4.
* Onesweep uses [warp-local multi-split] (WLMS) for ranking, while FFX uses two 2-bit LSD passes.
Member:

WLMS is warp-level multi-split, rather than warp-local multi-split, in the Onesweep paper.

raphlinus (Author):

I always get this confused, thanks for checking.

A number of stylistic changes, and a rework of critiques of DeviceRadixSort; those have been moved to the PR adding them to UnityGaussianSplatting. In addition, there's a link to the Zulip thread.
Add a sentence explaining the current state of DeviceRadixSort with respect to portability to different subgroup sizes.
raphlinus merged commit db5ea6b into main on Jan 21, 2024 (1 check passed).
raphlinus deleted the gpu_sorting branch on January 21, 2024 at 17:27.