-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KernelAbstractions support #147
base: master
Are you sure you want to change the base?
Conversation
Ah, I guess while I'm here, I'll briefly explain the differences with CUDA syntactically:
The tricky thing about this PR was removing the CUDA dependency outside of the kernels. There is still one call in zygote.jl I gotta figure out: https://github.com/leios/Molly.jl/blob/KA_support/src/zygote.jl#L698 |
Great work so far. Making the code compatible with generic array types is a nice change, and having the kernels work on different devices would be a selling point of the package. I would be interested to see the performance of the kernels compared to the CUDA versions. Also whether it plays nicely with Enzyme. Good luck with the runtime errors. |
I think I can finish this up today or else early next week (emphasis on think), but to quickly answer the questions:
|
Great. Not urgent, but how well does KernelAbstractions.jl deal with warp-level code, e.g. |
That's a good question. We can probably expose the APIs available from CUDA, but I am not sure how AMDGPU deals with these. We would also just need to figure out what that corresponds to on parallel CPU. I think these are the tools we need: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/warp_ops/index.html Ah, as an important note (that I somehow failed to mention before), an advantage of KA is that it also provides a parallel CPU implentation, so the kernels can be written once and executed everywhere. I didn't do that in this PR because that brings up design questions related to Molly internals. |
Yeah we can discuss that after this PR. I would be okay with switching if there was no performance hit. Judging from discussion on the linked PR there is not currently warp support in KA. It may be necessary to leave that CUDA kernel in and have a separate KA kernel for other backends until warp support comes to KA. |
Ok, so a couple of quick notes here:
|
That is okay.
I wouldn't worry about this. Currently I only merge stuff that I am able to maintain, or where I think I can skill up to the point of maintaining it. The changes here seem reasonable and worth merging once any errors and performance regressions are fixed. There is a wider question about whether KernelAbstractions.jl will continue to be maintained compared to CUDA.jl, but it seems to have decent traction now. |
Yeah, the plan is for KA to be used even within GPUArrays, so it's not going anywhere anytime soon. Speaking of which, the "correct" course of action for KA in Molly would be to get the KA in GPUArrays first and then use that to implement any missing features on the GPUArrays level. Would it be better for me to separate this PR then? Maybe one doing the generic Array stuff and then another with the KA support? |
I would try and get this PR working as is. Only if that becomes difficult would it be worth splitting out and merging the generic array support. If KA is here for the long haul then there is a benefit to switching the kernels even if only CUDA works currently. Because then when changes happen elsewhere, AMDGPU will work without any changes required in Molly. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #147 +/- ##
==========================================
+ Coverage 74.32% 74.99% +0.66%
==========================================
Files 35 38 +3
Lines 5028 4998 -30
==========================================
+ Hits 3737 3748 +11
+ Misses 1291 1250 -41 ☔ View full report in Codecov by Sentry. |
Getting around to this and noticed a bunch of segfaults in the CPU tests. I then found that there's a strange conflict between
But only if
st:
Note that using a single thread "fixes" the issue. It seems to be a UCX / MPI issue, but I am not loading them and neither are in the Manifest. |
This looks exactly like https://juliaparallel.org/MPI.jl/stable/knownissues/#Multi-threading-and-signal-handling What is |
The fix mentioned there seems to work:
Note
|
Wild... What is |
Is it a linux thing like libpthread? |
# This triggers an error but it isn't printed | ||
# See https://discourse.julialang.org/t/error-handling-in-cuda-kernels/79692 | ||
# for how to throw a more meaningful error | ||
error("wrong force unit returned, was expecting $F but got $(unit(f[1]))") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interpolation here is particularly tricky. I would avoid that if at all possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear: are you referring to the error(...)
call in an inlined function within an @kernel
? Or carrying the units through to this stage in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error("Oops")
is fine, error("Oops, $F")
is sadly not since it string interpolation is really tough on the GPU compiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove that, no problem.
Right, having an issue right now with zygote tests. Not really sure hot to go about debugging it, so I'll paste it here and then think about it
|
As a note, it looks like #182 is close-ish to being done? If so, it might be best to rework this PR once those changes are in. As it stands, this PR can work as a branch for any non-CUDA GPU (As long as the user does not need differentiable sims). |
Thanks for keeping going on this. Yes #182 will get merged soon, I was waiting on full Enzyme GPU broadcasting support to avoid GPU differentiable simulations breaking, but I might just merge it and wait for that to arrive later. Consequently I wouldn't worry about the Zygote error. I'm happy to help update this PR after #182, I realise that the ground has been moving under you. |
Yeah, no worries. I should go ahead and close the #99 PR since this one supersedes it. On my end, I am happy to wait a little longer and rebase up when you are happy with #182. I have other projects to work on in the mean time. Could you link the Enzyme issue? Is it this one? EnzymeAD/Enzyme.jl#1672 |
Okay great. The example I posted on JuliaGPU/CUDA.jl#2471 doesn't work as it stands, that and similar issues around reduction and broadcasting are the problem. |
@jgreener64 if there isn't an open issue on Enzyme.jl associated with it it is very likely that Billy and I will lose track of it. |
Billy has favoured moving CUDA-specific issues to CUDA.jl, e.g. EnzymeAD/Enzyme.jl#1454 (comment), in either case I can find some MWEs in the next few days of what is erroring and make sure they are tracked in issues. |
#182 is now merged. |
Particular issue is this one: JuliaGPU/CUDA.jl#2455 Which is essentially that we haven't yet implemented the CUDA.jl reduction of broadcast derivative (just added the reduction of array function in cuda.jl) |
Coming back to this. AMD tests pass up to the end of the simulation tests. Jotting this down for later because it's a precision issue that might not be easy to solve:
|
Great, let me know when this is ready for review. Don't worry about the gradient tests for now, I am currently dealing with some issues there. Do you know what effect these changes have on performance, e.g. for the protein system at https://juliamolsim.github.io/Molly.jl/dev/documentation/#Simulating-a-protein on GPU with no barostat? |
I can take a look at the difference in performance as well, but it should be the same for CUDA devices because all of that is now in a separate extension. |
Ah I see. Would it cause slowdown to remove the CUDA code entirely and just use the KA code? |
The answer should be "no," but there are certain features that are supported by CUDA, but not by KA, such as the warp-level semantics used for When we swapped over to KA for GPUArrays, there was not a noticeable difference in performance for the kernels there. |
Great, in that case maybe remove all the CUDA code from the extension except the code that uses the warps. |
Ok, will do. Still the following left:
|
Ok, the CUDA tests were all passing locally just a second ago. Note that I had to leave both the nonl and nl kernels in the CUDA extension. |
Ok, I still need to do the performance checks, but the PR looks to finally be out of the draft stage. A few important notes before merging:
|
The GPU error from the gradient tests:
|
Right, so the tests pass locally, but running the KA version of the Note that this does not affect AMD (my testing machine) and doesn't affect CUDA execution because the Here is the error:
|
That's strange, but I wouldn't worry as we are working to change that kernel anyway. |
98134c8
to
2a26c03
Compare
Not sure why I am getting precision issues on the CPU tests after rebasing up. Also, I have some bad news about performance and would actually recommend against merging this pr at this time because this branch appears to be 30% slower. I am not sure why that is the case at this time, but this branch seems to have a larger number of allocations and memory footprint. Details below: Master:
This branch:
Average of 10 runs
Script:
|
Good news. I'm an idiot. I seem to have run the tests on two different GPUs before. When running on the same GPU, the performance is the same: All runs:
Averages:
So now we just need to:
|
The KernelAbstractions branch now compiles, so I thought I would put forward a quick draft PR while I figure out all the runtime bugs.
Notes:
CUDA
has been removed and replaced withKernelAbstractions
andGPUArrays
. As an important note here, GPUArrays is not strictly necessary except to replicate the behavior of the boolean GPU flag (ieisa(a, AbstractGPUArray)
).