[GPU_X] Unit tests failing with "cudaErrorInvalidDeviceFunction: invalid device function" #46864
Comments
assign heterogeneous |
cms-bot internal usage |
A new Issue was created by @iarspider. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
On what machines are the tests running? |
A grid node with an NVIDIA GPU:
|
Could you also run |
FWIW, the test has succeeded in 14_2_X (at least between 11-27-2300 and 12-03-2300). |
|
What's very curious is that the alpaka-based tests all pass in the IBs 🤔
|
Is there a way to log in interactively on a node where the test fails? |
@fwyzard, you can do the following to log in to the grid GPU node (where a dummy job is running to hold the node).
The node is available for the next 20 hours. Once you log out of it, it will be deallocated automatically. |
Mhm, it didn't like me, I got kicked out immediately:
Can I request a similar slot myself? |
Yes, just use condor to request a GPU resource. |
Add the following to the condor job to get a GPU:
|
OK, I can reproduce the problem. |
Rebuilding CMSSW without the |
So running |
However LD_PRELOAD=libHeterogeneousTestCUDADevice.so cmsRun src/HeterogeneousTest/CUDAKernel/test/testCUDATestKernelAdditionModule.py still fails. |
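A minimal sketch of how the LD_PRELOAD check above could be scripted, so that the same test can be run with and without the preloaded library and the exit codes compared. The helper names `preload_env` and `run_with_preload` are hypothetical and not part of CMSSW; the library name and the config path are the ones quoted above, and `cmsRun` is assumed to be available on the PATH of a set-up CMSSW environment.

```python
import os
import subprocess


def preload_env(library, base_env=None):
    """Return a copy of the environment with `library` first in LD_PRELOAD.

    Hypothetical helper: existing LD_PRELOAD entries are kept, separated
    by a colon (the dynamic loader accepts colons or spaces).
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


def run_with_preload(command, library=None):
    """Run `command`, optionally preloading `library`; return its exit code."""
    env = preload_env(library) if library else dict(os.environ)
    return subprocess.run(command, env=env).returncode


if __name__ == "__main__":
    # Command and library name as quoted in the comment above.
    command = [
        "cmsRun",
        "src/HeterogeneousTest/CUDAKernel/test/testCUDATestKernelAdditionModule.py",
    ]
    plain = run_with_preload(command)
    preloaded = run_with_preload(command, "libHeterogeneousTestCUDADevice.so")
    print(f"without LD_PRELOAD: exit code {plain}")
    print(f"with LD_PRELOAD:    exit code {preloaded}")
```

Building the environment in a separate function keeps the preload logic checkable on a machine without a GPU; only the `__main__` part needs the failing node.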
However, rebuilding just the four
|
The least intrusive changes that I could come up with are
For example, the snippet
should become
because
In my test, it was enough to patch the build rules for shared libraries (not tests and not plugins). I don't know whether the best approach is to patch those as well anyway. Also, I don't really know if this is the proper fix, but at least with these changes all the |
@smuzaffar, what do you think? Do these changes seem reasonable? |
By the way, it looks like we can also add
So we would use
|
Two unit tests, HeterogeneousTest/CUDAKernel/testCudaDeviceAdditionKernel and HeterogeneousTest/CUDAWrapper/testCudaDeviceAdditionWrapper, have been failing in the GPU_X IBs since at least CMSSW_15_0_GPU_X_2024-11-27-2300: