exposing warp-level semantics #420
Comments
There currently is no support in KA for wavefront/warp level programming. Two immediate questions:
If the goal is to expose warp level reduce operations, maybe we can get away with defining a workgroup x-ref: #419 |
I'm struggling to find much at all on warp-level semantics for Metal or even oneAPI. It seems like OpenCL just ignores it(?): https://stackoverflow.com/questions/42259118/is-there-any-guarantee-that-all-of-threads-in-wavefront-opencl-always-synchron To be honest, I haven't seen an application that really needs the Here's a question I don't have an answer to. Do other (non-NVIDIA) cards even need Do other architectures (Intel, AMD, Metal), even allow this? I guess they might in the future if they don't already. That means for the short term that we would need CUDA-specific tooling where |
Coming back to my question: What's the reason you want to access this functionality? Generally speaking, I don't think warpsize is something we should expose in KA, but there are of course workgroup operations we are missing. Reduction is the core one. #421 is introducing the notion of a subgroup, but I want to understand the reasoning behind that better. Exposing functionality for one backend only carries the risk that the user writes a kernel that is actually not portable. |
Reading through pages such as https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-0/sub-groups-and-simd-vectorization.html and https://intel.github.io/llvm-docs/cuda/opencl-subgroup-vs-cuda-crosslane-op.html, I was under the impression that subgroups do map pretty closely to warps/wavefronts? If that's the case, then having a cross-platform abstraction for working with them seems useful. |
KernelAbstractions is not oneAPI, so the meaning of subgroup needs to be defined clearly and independently. It often comes down to: can we expose these semantics without too much of a performance loss on other hardware? Users are always free to use CUDA.jl directly, but writing a kernel should have a reasonable expectation of performance across all backends. KernelAbstractions is a common denominator, not a superset of behavior. |
I would guess the outlier here is Metal (and parallel CPU) then? I think AMD (wavefronts), CUDA (warps), and Intel (subgroups) all have some concept of warp-level operations; however, I agree with @vchuravy here. None of the warp-level semantics seem standardized enough to put them into KA at this time. What is the plan with #421, though? I mean, if it's already introducing a subgroup, I guess we can use that for the other backends? On my end, I was trying to do a simple port of: JuliaMolSim/Molly.jl#133 so we could completely remove the CUDA backend. |
For the CPU I had long hoped to use SIMD.jl or a compiler pass to perform vectorization. Would a subgroupsize of 1 be legal? |
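To make the "subgroup size of 1" question concrete, here is a rough model in plain Python (not KA, CUDA.jl, or any backend API) of a shuffle-down tree reduction written against a subgroup-size parameter. The function names `shuffle_down`, `subgroup_reduce_sum`, and `reduce_by_subgroups` are made up for illustration, and the sketch assumes the subgroup size is a power of two that divides the data length. With a size of 1 the shuffle loop never runs and each lane just returns its own value, so the same code still gives the right answer.

```python
def shuffle_down(values, delta):
    """Model of a shuffle-down: lane i reads the value held by lane i + delta.

    Lanes whose source lane would be out of range keep their own value.
    """
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i] for i in range(n)]


def subgroup_reduce_sum(values):
    """Tree reduction across one subgroup; lane 0 ends up holding the sum.

    Assumes len(values) is a power of two (1 is allowed).
    """
    lanes = list(values)
    delta = len(lanes) // 2
    while delta >= 1:
        shifted = shuffle_down(lanes, delta)
        lanes = [own + other for own, other in zip(lanes, shifted)]
        delta //= 2
    return lanes[0]


def reduce_by_subgroups(data, subgroup_size):
    """Split data into subgroups, reduce each one, then combine the partial sums.

    Assumes subgroup_size is a power of two that divides len(data).
    """
    partials = [
        subgroup_reduce_sum(data[start:start + subgroup_size])
        for start in range(0, len(data), subgroup_size)
    ]
    return sum(partials)


data = list(range(64))
for size in (1, 8, 32, 64):
    print(size, reduce_by_subgroups(data, size))
```

Every subgroup size prints the same total, which is the property a portable kernel would rely on: the subgroup size changes how much work the shuffles do, not the result.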
Yes, which is why I found the second link interesting. Digging around a bit more turned up some pages from the SYCL spec (1, 2, 3), which appear to be trying to standardize this. I have no idea how integration on the AMD and Nvidia side works (if at all), but perhaps it could serve as inspiration for creating a common denominator interface in KA. |
It also looks like Vulkan is trying to standardize the terminology as well: https://www.khronos.org/blog/vulkan-subgroup-tutorial. Their API is supposed to be similar to OpenCL for compute, but I cannot find such topics in OpenCL. For me, I can obviously see a use for warp reduce, scan, etc. I also find myself wanting to get It's just that
But I don't think it will provide any wrong results on any of these platforms if there are dummy calls. I mean the |
It's kind of hidden away and took a while for me to find, but the OpenCL spec does touch on sub-groups (they like the hyphen) in a few places. https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_mapping_work_items_onto_an_ndrange introduces them and subsequent sections like https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#execution-model-sync look relevant. There's also some more info about the actual kernel-level API in the OpenCL C spec, e.g. https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#subgroup-functions. |
I think having some "subgroup sync" op would be helpful (it could fall back on a full sync if not) |
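A toy model of that fallback idea, in plain Python threads rather than any GPU API: `Workgroup` and `sync_subgroup` are hypothetical names, and `threading.Barrier` stands in for the hardware barriers. When a "native" subgroup barrier is available, only the lanes of one subgroup wait on each other; otherwise the call degrades to a barrier over the whole workgroup. One assumption worth stating loudly: the fallback is only safe if every thread in the workgroup reaches the call, which a true subgroup sync would not require.

```python
import threading


class Workgroup:
    """Toy workgroup whose subgroup sync falls back to a full barrier."""

    def __init__(self, size, subgroup_size, native_subgroups):
        self.subgroup_size = subgroup_size
        self.full_barrier = threading.Barrier(size)
        if native_subgroups:
            # One barrier per subgroup, mimicking a hardware subgroup sync.
            self.sub_barriers = [
                threading.Barrier(subgroup_size)
                for _ in range(size // subgroup_size)
            ]
        else:
            self.sub_barriers = None

    def sync_subgroup(self, tid):
        if self.sub_barriers is None:
            # Fallback: synchronize the whole workgroup instead.
            self.full_barrier.wait()
        else:
            self.sub_barriers[tid // self.subgroup_size].wait()


def run(size, subgroup_size, native_subgroups):
    """Each thread publishes a value, syncs, then reads its subgroup neighbor."""
    wg = Workgroup(size, subgroup_size, native_subgroups)
    shared = [None] * size
    out = [None] * size

    def work(tid):
        shared[tid] = tid * 10
        wg.sync_subgroup(tid)
        base = (tid // subgroup_size) * subgroup_size
        partner = base + (tid - base + 1) % subgroup_size
        out[tid] = shared[partner]

    threads = [threading.Thread(target=work, args=(t,)) for t in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


print(run(8, 4, native_subgroups=True))
print(run(8, 4, native_subgroups=False))
```

Both runs produce the same neighbor values; the fallback just waits on more threads than strictly necessary.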
I tend to think of "warps on a CPU" as the SIMD vector size. The semantics are quite similar:
It might also make sense to let the caller specify (statically) the SIMD vector size for a kernel and pass this as an optimization hint to the compiler. Alternatively, the compiler could choose a SIMD vector size statically (depending on the target CPU capabilities). I understand that the match to "warp size" isn't perfect since the SIMD vector size depends on the type (64-bit vs 32-bit). |
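The type dependence is just register width divided by element width. A tiny Python illustration (the helper name `simd_lanes` is hypothetical, and the 256-bit and 512-bit widths are example register sizes rather than anything KA defines):

```python
def simd_lanes(register_bits, element_bits):
    """Number of elements that fit in one SIMD register of the given width."""
    return register_bits // element_bits


# Example register widths in bits; which one applies depends on the target CPU.
for register_bits in (256, 512):
    for element_bits in (32, 64):
        lanes = simd_lanes(register_bits, element_bits)
        print(f"{register_bits}-bit register, {element_bits}-bit elements: {lanes} lanes")
```

So a CPU "warp size" defined this way would halve when a kernel switches from 32-bit to 64-bit elements, unlike a GPU warp, whose size does not depend on the element type.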
I had a request from a user to use warp-level semantics from CUDA: `sync_warp`, `warpsize`, and stuff here: https://cuda.juliagpu.org/stable/api/kernel/#Warp-level-functions. They seem to be available here: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/warp_ops/index.html, but I don't know where they exist in AMDGPU.jl or how to use them in KA.
They might be available, but I couldn't find "warp" or "wavefront" or anything else in either the AMDGPU or KernelAbstractions docs. I mean, there was this page: https://amdgpu.juliagpu.org/stable/wavefront_ops/ ... but it's a bit sparse ^^
If this is already available in KA, I'm happy to add a bit to the docs explaining how they are used. If it is not available, I guess I need to put some PRs forward for CUDA(kernels), ROC(kernels), and here with the new syntax.
Related discussion: JuliaMolSim/Molly.jl#147
Putting it here because I think I found kinda what I was looking for for AMDGPU here: https://github.com/JuliaGPU/AMDGPU.jl/blob/master/test/device/wavefront.jl
wavefrontsize = warpsize
wfred = wavefront reduce
wfscan = wavefront scan
wfany = ???
wfall = ???
wfsame = ???
warp_sync