[GPU_X] Unit tests failing with "cudaErrorInvalidDeviceFunction: invalid device function" #46864

Open
iarspider opened this issue Dec 4, 2024 · 23 comments

Comments

@iarspider
Contributor

Two unit tests, HeterogeneousTest/CUDAKernel/testCudaDeviceAdditionKernel and HeterogeneousTest/CUDAWrapper/testCudaDeviceAdditionWrapper, have been failing in the GPU_X IBs since at least CMSSW_15_0_GPU_X_2024-11-27-2300:

  REQUIRE_NOTHROW( cms::cudatest::wrapper_add_vectors_f(in1_d, in2_d, out_d, size) )
due to unexpected exception with message:
  
src/HeterogeneousTest/CUDAWrapper/src/DeviceAdditionWrapper.cu, line 17:
  cudaCheck(cudaGetLastError());
  cudaErrorInvalidDeviceFunction: invalid device function
@iarspider
Contributor Author

assign heterogeneous

@cmsbuild
Contributor

cmsbuild commented Dec 4, 2024

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

cmsbuild commented Dec 4, 2024

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Dec 4, 2024

A new Issue was created by @iarspider.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor

fwyzard commented Dec 4, 2024

On what machines are the tests running?

@iarspider
Contributor Author

A grid node with an NVIDIA GPU:

+ nvidia-smi
Wed Dec  4 00:52:08 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-32GB           Off |   00000000:07:00.0 Off |                    0 |
| N/A   35C    P0             26W /  250W |       3MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
+ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  550.54.15  Tue Mar  5 22:23:56 UTC 2024
GCC version:  gcc version 11.4.1 20231218 (Red Hat 11.4.1-3) (GCC) 

@fwyzard
Contributor

fwyzard commented Dec 4, 2024

Could you also run cudaComputeCapabilities?

@makortel
Contributor

makortel commented Dec 4, 2024

FWIW, the test has succeeded in 14_2_X (at least between 11-27-2300 and 12-03-2300).

@iarspider
Contributor Author

@fwyzard

+ nvidia-smi
Fri Dec  6 07:34:37 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100S-PCIE-32GB          Off |   00000000:07:00.0 Off |                    0 |
| N/A   40C    P0             25W /  250W |       3MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
+ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  550.54.15  Tue Mar  5 22:23:56 UTC 2024
GCC version:  gcc version 11.4.1 20231218 (Red Hat 11.4.1-3) (GCC) 
+ cudaComputeCapabilities
   0     7.0    Tesla V100S-PCIE-32GB
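
For context, CUDA documents cudaErrorInvalidDeviceFunction as meaning that the requested device function does not exist or is not compiled for the proper device architecture, so one thing worth ruling out is whether the build targets cover compute capability 7.0. Below is a minimal Python sketch of that compatibility rule. The function name and any target list passed to it are hypothetical and are not taken from the CMSSW build configuration; architecture-specific suffixes such as "sm_90a" are not handled.

```python
def can_run(device_cc, targets):
    """Return True if any build target can serve a device of capability device_cc.

    device_cc: (major, minor) tuple, e.g. (7, 0) for the V100S above.
    targets:   strings such as "sm_70" (native code) or "compute_70" (PTX).
    """
    for target in targets:
        kind, _, digits = target.partition("_")
        # "70" -> (7, 0), "86" -> (8, 6), "100" -> (10, 0)
        cc = (int(digits[:-1]), int(digits[-1]))
        if kind == "sm":
            # Native code runs on the same major version with minor >= target.
            if cc[0] == device_cc[0] and cc[1] <= device_cc[1]:
                return True
        elif kind == "compute":
            # PTX can be JIT-compiled for any device at least as new.
            if cc <= device_cc:
                return True
    return False
```

For the node above, can_run((7, 0), targets) would be called with whatever list of targets the build actually uses. If it returns True, the architecture is probably not the problem and the cause lies elsewhere.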

@fwyzard
Contributor

fwyzard commented Dec 6, 2024

What's very curious is that the alpaka-based tests all pass in the IBs 🤔

===== Test "testAlpakaDeviceAdditionKernelCudaAsync" ====
===============================================================================
All tests passed (1048577 assertions in 1 test case)


---> test testAlpakaDeviceAdditionKernelCudaAsync succeeded
TestTime:0
^^^^ End Test testAlpakaDeviceAdditionKernelCudaAsync ^^^^
>> Tests for package HeterogeneousTest/AlpakaKernel ran.

@fwyzard
Contributor

fwyzard commented Dec 6, 2024

Is there a way to log in interactively to a node where the test fails?
It's hard to debug otherwise :(

@smuzaffar
Contributor

@fwyzard, you can do the following to log in to the grid GPU node (where a dummy job is running to hold the node):

ssh lxplus
~cmsbuild/public/lxplus
export _CONDOR_SCHEDD_HOST=bigbird21.cern.ch
export _CONDOR_CREDD_HOST=bigbird21.cern.ch
condor_ssh_to_job -auto-retry 487779.0

The node is available for the next 20 hours. Once you log out of the node, it will be deallocated automatically.

@fwyzard
Contributor

fwyzard commented Dec 6, 2024

Mhm, it didn't like me, I got kicked out immediately:

lxplus962:~> export _CONDOR_SCHEDD_HOST=bigbird21.cern.ch
lxplus962:~> export _CONDOR_CREDD_HOST=bigbird21.cern.ch
lxplus962:~> condor_ssh_to_job -auto-retry 487779.0
Welcome to [email protected]!
Your condor job is running with pid(s) 3240265 3241835.
b9g47n2106:dir_3240263> Connection to condor-job.b9g47n2106.cern.ch closed by remote host.
Connection to condor-job.b9g47n2106.cern.ch closed.

Can I request a similar slot myself?

@smuzaffar
Contributor

Yes, just use condor to request a GPU resource.

@smuzaffar
Contributor

Add the following to the condor job description to get a GPU:

request_GPUs = 1
Requirements = (TARGET.OpSysAndVer =?= "AlmaLinux9")
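
To show where those two lines would sit, here is a minimal Python sketch that assembles a complete submit description around them. Only the request_GPUs and Requirements lines come from this thread; the helper name, the /bin/sleep placeholder job and its argument are assumptions chosen purely for illustration.

```python
def gpu_submit_description(executable, arguments=""):
    """Build an HTCondor submit description that asks for one GPU on AlmaLinux 9."""
    lines = [
        f"executable   = {executable}",
        f"arguments    = {arguments}",
        # The two lines below are the ones quoted in this thread.
        "request_GPUs = 1",
        'Requirements = (TARGET.OpSysAndVer =?= "AlmaLinux9")',
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical placeholder job that just holds the slot for an hour.
    print(gpu_submit_description("/bin/sleep", "3600"))
```

The printed text can be saved to a file and passed to condor_submit.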

@fwyzard
Contributor

fwyzard commented Dec 6, 2024

OK, I can reproduce the problem.

@fwyzard
Contributor

fwyzard commented Jan 16, 2025

Rebuilding CMSSW without the -Wl,--as-needed flag fixes these tests.

@smuzaffar
Contributor

So running LD_PRELOAD=libHeterogeneousTestCUDADevice.so ./test/el8_amd64_gcc12/testCudaDeviceAdditionKernel works in 15.0.X. Maybe there is no strong link dependency on HeterogeneousTestCUDADevice here?
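
One way to check this hypothesis is to look at the DT_NEEDED entries of the test binary. With -Wl,--as-needed the linker omits the entry for any shared library that does not satisfy a symbol reference, and a library missing from that list is not loaded at start-up unless something like LD_PRELOAD forces it. Below is a minimal Python sketch, assuming that readelf from binutils is available on the node; the helper names are illustrative.

```python
import re
import subprocess

# Matches dynamic-section lines such as:
#   0x0000000000000001 (NEEDED)   Shared library: [libfoo.so]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[([^\]]+)\]")


def needed_libraries(readelf_output):
    """Extract library names from the text printed by `readelf -d`."""
    return NEEDED_RE.findall(readelf_output)


def has_needed(binary, library):
    """Run readelf on `binary` and report whether `library` is a DT_NEEDED entry."""
    result = subprocess.run(
        ["readelf", "-d", binary], capture_output=True, text=True, check=True
    )
    return library in needed_libraries(result.stdout)
```

If has_needed("./test/el8_amd64_gcc12/testCudaDeviceAdditionKernel", "libHeterogeneousTestCUDADevice.so") returns False, that would be consistent with the dependency having been dropped at link time.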

@fwyzard
Contributor

fwyzard commented Jan 16, 2025

However

LD_PRELOAD=libHeterogeneousTestCUDADevice.so cmsRun src/HeterogeneousTest/CUDAKernel/test/testCUDATestKernelAdditionModule.py

still fails.

@fwyzard
Contributor

fwyzard commented Jan 16, 2025

However, rebuilding just the four HeterogeneousTest/CUDA* packages without -Wl,--as-needed does fix all tests:

Pass    6s ... HeterogeneousTest/CUDAKernel/testCudaDeviceAdditionKernel
Pass    6s ... HeterogeneousTest/CUDAWrapper/testCudaDeviceAdditionWrapper
Pass    6s ... HeterogeneousTest/CUDADevice/testCudaDeviceAddition
Pass    6s ... HeterogeneousTest/CUDAOpaque/testCudaDeviceAdditionOpaque
Pass   16s ... HeterogeneousTest/CUDAOpaque/testCUDATestOpaqueAdditionModule
Pass   16s ... HeterogeneousTest/CUDADevice/testCUDATestDeviceAdditionModule
Pass   16s ... HeterogeneousTest/CUDAKernel/testCUDATestKernelAdditionModule
Pass   16s ... HeterogeneousTest/CUDAWrapper/testCUDATestWrapperAdditionModule
Pass   17s ... HeterogeneousTest/CUDAOpaque/testCUDATestAdditionModules

@fwyzard
Contributor

fwyzard commented Jan 16, 2025

The least intrusive changes that I could come up with are

  • when the file static/libPackage_nv.a does not exist, do not link it
  • when the file static/libPackage_nv.a does exist, and a library links -lPackage_nv -lPackage, then do not pass -Wl,--as-needed for those libraries

For example, the snippet

  -lHeterogeneousTestCUDAWrapper_nv \
  -lHeterogeneousTestCUDAKernel_nv \
  -lHeterogeneousTestCUDADevice_nv \
  -lHeterogeneousTestCUDAWrapper \
  -lHeterogeneousTestCUDAKernel \
  -lHeterogeneousTestCUDADevice \
  -lcudart \
  -lcudadevrt \
  -lnvToolsExt \
  -lcuda

should become

  -Wl,--push-state \
  -Wl,--no-as-needed \
  -lHeterogeneousTestCUDAWrapper_nv \
  -lHeterogeneousTestCUDAKernel_nv \
  -lHeterogeneousTestCUDAWrapper \
  -lHeterogeneousTestCUDAKernel \
  -Wl,--pop-state \
  -lHeterogeneousTestCUDADevice \
  -lcudart \
  -lcudadevrt \
  -lnvToolsExt \
  -lcuda

because

  • libHeterogeneousTestCUDADevice_nv.a does not exist, so it should be dropped
  • libHeterogeneousTestCUDAWrapper_nv.a and libHeterogeneousTestCUDAKernel_nv.a exist and are linked, so they and the corresponding non-_nv libraries should not use --as-needed
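
The two rules above can be sketched as a small Python function that rewrites a list of linker flags. This is only an illustration of the proposal, not the actual build-rule change: the function name is hypothetical, the set of existing _nv archives is passed in by hand rather than checked on disk, and the protected libraries are simply hoisted to the front of the list.

```python
PUSH = ["-Wl,--push-state", "-Wl,--no-as-needed"]
POP = ["-Wl,--pop-state"]


def rewrite_link_flags(flags, existing_nv):
    """Apply the two proposed rules to a list of linker flags.

    flags:       linker arguments in command-line order.
    existing_nv: names of packages whose static lib<Package>_nv.a exists
                 (hypothetical input; a real build would check the file system).
    """
    # Packages for which an existing _nv archive is requested on this link line.
    linked_nv = {
        f[2:-3]
        for f in flags
        if f.startswith("-l") and f.endswith("_nv") and f[2:-3] in existing_nv
    }

    protected, rest = [], []
    for flag in flags:
        if flag.startswith("-l") and flag.endswith("_nv"):
            if flag[2:-3] in linked_nv:
                protected.append(flag)
            # Otherwise the archive does not exist: drop the flag (first rule).
        elif flag.startswith("-l") and flag[2:] in linked_nv:
            # Companion non-_nv library: keep it out of --as-needed (second rule).
            protected.append(flag)
        else:
            rest.append(flag)

    if not protected:
        return rest
    return PUSH + protected + POP + rest
```

Called with the original flag list from the example and existing_nv = {"HeterogeneousTestCUDAWrapper", "HeterogeneousTestCUDAKernel"}, this returns the modified list shown above. Using --push-state and --pop-state keeps --as-needed in effect for all the other libraries.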

In my test it was enough to patch the build rules for shared libraries (not tests and not plugins). I don't know whether it would be better to patch those as well anyway.

Also, I don't really know if this is the proper fix, but at least with these changes all the HeterogeneousTest/CUDA* tests build and pass.

@fwyzard
Contributor

fwyzard commented Jan 16, 2025

@smuzaffar what do you think? Do these changes seem reasonable?

@fwyzard
Contributor

fwyzard commented Jan 16, 2025

By the way, it looks like we can also add -Wl,-z,noexecstack to suppress these warnings:

/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02872/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/../lib/gcc/x86_64-redhat-linux-gnu/12.3.1/../../../../x86_64-redhat-linux-gnu/bin/ld.bfd: warning: DeviceAdditionKernel.cu_nv.o: missing .note.GNU-stack section implies executable stack
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02872/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/../lib/gcc/x86_64-redhat-linux-gnu/12.3.1/../../../../x86_64-redhat-linux-gnu/bin/ld.bfd: NOTE: This behaviour is deprecated and will be removed in a future version of the linker

So we would use

  -Wl,-z,noexecstack \
  -Wl,--push-state \
  -Wl,--no-as-needed \
  -lHeterogeneousTestCUDAWrapper_nv \
  -lHeterogeneousTestCUDAKernel_nv \
  -lHeterogeneousTestCUDAWrapper \
  -lHeterogeneousTestCUDAKernel \
  -Wl,--pop-state \
  -lHeterogeneousTestCUDADevice \
  -lcudart \
  -lcudadevrt \
  -lnvToolsExt \
  -lcuda
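
To see which objects trigger the warning before and after adding the flag, a build log can be scanned for the message. Below is a minimal Python sketch, assuming the warning format quoted above; the function name is illustrative.

```python
import re

# Matches e.g. "ld.bfd: warning: Foo.cu_nv.o: missing .note.GNU-stack section ..."
EXECSTACK_RE = re.compile(
    r"warning: (\S+): missing \.note\.GNU-stack section implies executable stack"
)


def objects_missing_gnu_stack(build_log):
    """Return the sorted, de-duplicated object names named in execstack warnings."""
    return sorted(set(EXECSTACK_RE.findall(build_log)))
```

An empty result after rebuilding with -Wl,-z,noexecstack would suggest that the warnings are gone.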
