
Add GPU sorting article #33

Merged: 4 commits merged into main from the gpu_sorting branch on Jan 21, 2024
Conversation

raphlinus (Contributor):

No description provided.

Analysis of the DeviceRadixSort code, including critiques of correctness/portability.
@DJMcNab (Member) left a comment:

I've not dug into the literature or any of the code, but here are some stylistic nits and requests for clarification.
Overall, this seems like a useful collection of links, though.

Incidentally, this review process exemplifies why I really want us to adopt the one-sentence-per-line convention.
Zulip thread: https://xi.zulipchat.com/#narrow/stream/181284-blogging/topic/One.20sentence.20per.20line

* Onesweep uses 8 bit digits (so 4 passes for a 32 bit key), while FFX uses 4.
* Onesweep uses [warp-local multi-split] (WLMS) for ranking, while FFX uses two 2-bit LSD passes.

Both original code bases use subgroups extensively. The FFX implementation works with a subgroup size of 16 or greater, and will produce incorrect results if deployed for a smaller subgroup size. Onesweep depends on a hardcoded subgroup (warp) size of 32 and would be difficult at best to make agile.
Member:

Is agile a standard term for this? I don't think it's one I've seen before, but I'm not very into the literature (yet?)
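
To make the digit-width comparison in the quoted bullets concrete, here is a minimal CPU-side sketch of an LSD radix sort in Rust (illustrative only; this is not code from FFX, Onesweep, or the article). With 8-bit digits a 32-bit key takes 32/8 = 4 passes; with 4-bit digits it would take 8. Each pass is a histogram, an exclusive prefix sum, and a stable scatter, which is roughly the work the GPU variants parallelize (WLMS, for instance, is a subgroup-level way of computing the ranks used in the scatter).

```rust
// Least-significant-digit radix sort with 8-bit digits: 32 / 8 = 4 passes.
fn radix_sort_u32(keys: &mut Vec<u32>) {
    const DIGIT_BITS: u32 = 8;
    const BUCKETS: usize = 1 << DIGIT_BITS;
    let mut scratch = vec![0u32; keys.len()];
    for pass in 0..(32 / DIGIT_BITS) {
        let shift = pass * DIGIT_BITS;
        // 1. Histogram of the current digit.
        let mut hist = [0usize; BUCKETS];
        for &k in keys.iter() {
            hist[((k >> shift) & (BUCKETS as u32 - 1)) as usize] += 1;
        }
        // 2. Exclusive prefix sum turns counts into output offsets.
        let mut sum = 0;
        for h in hist.iter_mut() {
            let count = *h;
            *h = sum;
            sum += count;
        }
        // 3. Stable scatter into the scratch buffer.
        for &k in keys.iter() {
            let d = ((k >> shift) & (BUCKETS as u32 - 1)) as usize;
            scratch[hist[d]] = k;
            hist[d] += 1;
        }
        // An even number of passes leaves the sorted data back in `keys`.
        std::mem::swap(keys, &mut scratch);
    }
}

fn main() {
    let mut data = vec![0xdeadbeef, 42, 7, u32::MAX, 1 << 20, 3];
    radix_sort_u32(&mut data);
    assert!(data.windows(2).all(|w| w[0] <= w[1]));
    println!("{data:?}");
}
```

The scatter being stable within each pass is what makes sorting digit by digit from the least significant end correct overall.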


### WebGPU experiment

Raph did an [experiment](https://github.com/googlefonts/compute-shader-101/pull/31) of a hybrid algorithm largely based on FFX, but adapted to WebGPU, and with a version of WLMS. It achieves approximately 1G elements/s on M1 Max.
Member:

Similarly I'm not sure the abbreviation 'WLMS' is that helpful. Reading through this once to review, I lost the context of that almost immediately, and it wasn't easy for me to find scanning back again.

Member:

It's warp-level multi-split from the Onesweep paper.

Member:

Indeed, I got that eventually, but it wasn't easy to context switch it back in. I was only skimming the bullet points, and didn't expect it to appear again.


### WebGPU experiment

Raph did an [experiment](https://github.com/googlefonts/compute-shader-101/pull/31) of a hybrid algorithm largely based on FFX, but adapted to WebGPU, and with a version of WLMS. It achieves approximately 1G elements/s on M1 Max.
@DJMcNab (Member), Jan 20, 2024:

How useful is elements/s as a metric?

My naïve understanding is that sorting speed has to depend on the total number of elements being sorted in one go (edit: as sorting has to be superlinear). Is this one sort of 1 billion items per second, or e.g. 100 sorts of 10 million items each?

Of course, I could be misunderstanding here - I've not followed the links thoroughly. In particular, this might already be explained if it's a segmented sort.

raphlinus (Author):

Comparison sorting algorithms are superlinear (n log n is a lower bound), but radix sort is typically linear. I've added a sentence.
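
For context, a sketch of the standard argument behind this reply (not taken from the article): a comparison sort must distinguish all $n!$ input orderings, while an LSD radix sort performs a fixed number of linear passes determined only by the key width $b$ and the digit width $d$:

$$
\text{comparisons} \;\ge\; \log_2(n!) \;=\; \Theta(n \log n),
\qquad
T_{\text{radix}}(n) \;=\; O\!\left(\left\lceil b/d \right\rceil \cdot n\right).
$$

So a single elements/s figure is at least well defined for a radix sort, provided the problem size is reported with it.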


Attempts to push this experiment to 8 bits per digit have not yet yielded sustainable performance improvement.

Several people have pointed out the Onesweep-inspired sort from Lichtso, part of the [splatter] Gaussian Splat implementation. However, as discussed in [splatter#2], it is an approximate sort only, and may rely on luck that the particular GPU will process atomics within a subgroup in order. In addition, because it is one-pass, on GPUs without a forward progress guarantee (of which Apple Silicon is a particularly notable example), the algorithm may deadlock or experience extended stalls. The experiment mentioned above has neither of these shortcomings. Note that it *does* achieve 8 bits per pass; it is entirely likely that a high performance implementation could draw inspiration from it, more so after subgroups land in WebGPU and thus real WLMS is possible.
Member:

In the last sentence, I'm having to guess that 'it' refers to Lichtso, rather than your experiment


Aras-P has been doing lots of experiments with sorting in his [UnityGaussianSplatting] implementation, all written in the Unity flavor of HLSL. [UnityGaussianSplatting#82] adds something called DeviceRadixSort which shows a modest performance improvement (the discussion thread also speaks to the difficulty of implementing such things portably). This moves to an 8 bit digit. It uses a subgroup-based implementation of WLMS, and appears to make some attempt to be agile in subgroup (wave) size. That said, on code examination it seems likely to fail on subgroup sizes of 64 or above (older AMD cards and Pixel 4, among others).

Another correctness concern is the [lack of the subgroup barrier](https://github.com/aras-p/UnityGaussianSplatting/blob/81a03b6fbddbecd056edeadff124870569b07c11/package/Shaders/DeviceRadixSort.hlsl#L334-L338) between the read of the histogram and its update (by different threads in the same warp). According to the Vulkan memory model, this is a data race and thus undefined behavior. Such a barrier doesn't exist in HLSL, so for strict correctness it would need to be upgraded to a `GroupMemoryBarrierWithGroupSync`, with some performance loss. Another correctness concern in the code is the assumption that `WaveGetLaneCount` will return the same value for different compute shaders on the same GPU, which may not be true on Intel in particular.
Member:

I'd probably prefer this to be an issue in the repo we can point to instead, which includes this same code link - it doesn't quite feel right to be "publishing" such a specific complaint, with no avenue for the author to be aware of this or add anything to it once it is resolved.

raphlinus (Author):

Yeah, I'll be refactoring this.


## Bitonic sort

[Bitonic sort] is often proposed as it is conceptually fairly simple the parallelism is easy to exploit, but when applied to large problems it is clear that the number of passes is unacceptably large; typically in the dozens where a radix sort would have 4 or 8.
Member:

in "is often proposed as it is conceptually fairly simple the parallelism is easy to exploit" I think there's a missing conjunctive of some kind
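
For scale on the pass count in the quoted paragraph (a standard fact about sorting networks, not from the article): a bitonic network over $n = 2^k$ elements runs

$$
\frac{k(k+1)}{2} \;\text{ compare-exchange stages}, \qquad \text{e.g. } n = 2^{20} \Rightarrow \frac{20 \cdot 21}{2} = 210,
$$

and even when the small-stride stages are fused inside a workgroup, dozens of global passes remain, versus $32/8 = 4$ or $32/4 = 8$ digit passes for the radix sorts discussed above.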


[Forma] has a sorting implementation called [conveyor_sort]. This is a merge sort and is in vanilla WebGPU. Performance has not been characterized yet.

[segmented sort]: https://moderngpu.github.io/segsort.html
Member:

I'd probably prefer to have these in order of appearance, but if we don't have automated tooling for that, then there's no need to.


* Onesweep uses a single pass scan for the digit histograms, while FFX uses a traditional multi-dispatch tree reduction approach.
* Onesweep uses 8 bit digits (so 4 passes for a 32 bit key), while FFX uses 4.
* Onesweep uses [warp-local multi-split] (WLMS) for ranking, while FFX uses two 2-bit LSD passes.
Member:

WLMS is warp-level multi-split, rather than warp-local multi-split, in the Onesweep paper.

raphlinus (Author):

I always get this confused, thanks for checking.

A number of stylistic changes, and a rework of critiques of DeviceRadixSort; those have been moved to the PR adding them to UnityGaussianSplatting. In addition, there's a link to the Zulip thread.
Add a sentence explaining the current state of DeviceRadixSort with respect to portability to different subgroup sizes.
raphlinus merged commit db5ea6b into main on Jan 21, 2024 (1 check passed).
raphlinus deleted the gpu_sorting branch on January 21, 2024 at 17:27.