
Help Needed: NVIDIA Docker Error - libnvidia-ml.so.1 Not Found in Container #848

4833R11Y45 opened this issue Jan 7, 2025

Hi everyone,
I’ve been struggling with an issue while trying to run Docker containers with GPU support on my Ubuntu 24.04 system. Despite following all the recommended steps, I keep encountering the following error when running a container with the NVIDIA runtime:
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

Here’s a detailed breakdown of my setup and the troubleshooting steps I’ve tried so far:

System Details:

OS: Ubuntu 24.04
GPU: NVIDIA L4
Driver Version: 535.183.01
CUDA Version (Driver): 12.2
NVIDIA Container Toolkit Version: 1.17.3
Docker Version: Latest stable version from Docker’s official repository.

What I’ve Tried:

Verified NVIDIA Driver Installation:

nvidia-smi works perfectly and shows the GPU details.
The driver version is compatible with CUDA 12.2.
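
For reference, the exact nvidia-smi query I used to confirm the driver version (standard flags, nothing custom):
nvidia-smi --query-gpu=name,driver_version --format=csv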

Reinstalled NVIDIA Container Toolkit:

Followed the official NVIDIA guide to install and configure the NVIDIA Container Toolkit.
Reinstalled it multiple times using:
sudo apt-get install --reinstall -y nvidia-container-toolkit
sudo systemctl restart docker

Verified the installation with nvidia-container-cli info, which correctly reports the driver version and GPU details.
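
For completeness, I also re-ran the runtime configuration step from the install guide, which registers the nvidia runtime in /etc/docker/daemon.json:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker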

Checked for libnvidia-ml.so.1:

The library exists on the host system at /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1.
Verified its presence using:
find /usr -name libnvidia-ml.so.1
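
I also checked the dynamic linker cache on the host, since (as far as I understand) nvidia-container-cli locates driver libraries through the ldcache:
ldconfig -p | grep libnvidia-ml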

Tried Running Different CUDA Images:

Tried running containers with various CUDA versions:
docker run --rm --gpus all nvidia/cuda:12.2.0-runtime-ubuntu22.04 nvidia-smi
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

Both fail with the same error:
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
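
In case it helps others debug, the container CLI can also enumerate the driver libraries it plans to inject (I believe the flag is --libraries):
nvidia-container-cli list --libraries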

Manually Mounted NVIDIA Libraries:

Tried explicitly mounting the directory containing libnvidia-ml.so.1 into the container:
docker run --rm --gpus all -v /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu nvidia/cuda:12.2.0-runtime-ubuntu22.04 nvidia-smi

Still encountered the same error.
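
For completeness, I also tried a variant that refreshes the container's loader cache after the bind mount, in case a stale cache inside the container was the problem (the bash -c wrapper is my own addition):
docker run --rm --gpus all -v /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu nvidia/cuda:12.2.0-runtime-ubuntu22.04 bash -c "ldconfig && nvidia-smi"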

Checked NVIDIA Container Runtime Logs:

Enabled debugging in /etc/nvidia-container-runtime/config.toml and checked the logs:
cat /var/log/nvidia-container-toolkit.log
cat /var/log/nvidia-container-runtime.log

The logs show that the NVIDIA runtime is initializing correctly, but the container fails to load libnvidia-ml.so.1.
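
For reference, these are the debug keys I uncommented in config.toml (the default paths shipped with the toolkit, if I read the file correctly):
[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"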

Reinstalled NVIDIA Drivers:

Reinstalled the NVIDIA drivers using:
sudo ubuntu-drivers autoinstall
sudo reboot

Verified the installation with nvidia-smi, which works fine.
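
After the reboot I also made sure the kernel modules were actually loaded (standard checks, nothing toolkit-specific):
lsmod | grep nvidia
dkms status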

Tried Prebuilt NVIDIA Base Images:

Attempted to use a prebuilt NVIDIA base image:
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

Still encountered the same error.
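
One avenue I haven't fully explored is the toolkit's CDI mode. Based on my reading of the toolkit docs (so treat this as a sketch; the --device syntax also needs a recent Docker with CDI enabled), the steps would be:
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
docker run --rm --device nvidia.com/gpu=all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi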

Logs and Observations:

The NVIDIA container runtime seems to detect the GPU and initialize correctly.
The error consistently points to libnvidia-ml.so.1 not being found inside the container, even though it exists on the host system.
The issue persists across different CUDA versions and container images.
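
One more data point: as noted above, Docker came from Docker's official repository rather than the Ubuntu snap (I've seen the snap-confined Docker mentioned as a culprit for exactly this error, since it can't see host driver libraries). I double-checked which binary and runtimes are in use with:
which docker
docker info | grep -i runtime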

Questions:

Why is the NVIDIA container runtime unable to mount libnvidia-ml.so.1 into the container, even though it exists on the host system?
Is this a compatibility issue with Ubuntu 24.04, the NVIDIA drivers, or the NVIDIA Container Toolkit?
Has anyone else faced a similar issue, and how did you resolve it?

I’ve spent hours troubleshooting this and would greatly appreciate any insights or suggestions. Thanks in advance for your help!

TL;DR:
Getting libnvidia-ml.so.1 not found error when running Docker containers with GPU support on Ubuntu 24.04. Tried reinstalling drivers, NVIDIA Container Toolkit, and manually mounting libraries, but the issue persists. Need help resolving this.
