Why there is no GPU resource allocatable on a GPU cloud instance #834

Closed
shizhouhu opened this issue Jul 19, 2024 · 7 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@shizhouhu

When I describe the node, no GPU resource is listed under Capacity or Allocatable. Why?

Capacity:
  cpu:                48
  ephemeral-storage:  574137520Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263603720Ki
  pods:               110
Allocatable:
  cpu:                48
  ephemeral-storage:  529125137556
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263501320Ki
  pods:               110

(this is the node description)
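
For reference, once the device plugin has registered GPUs with the kubelet, the node normally reports an extended resource named nvidia.com/gpu under both Capacity and Allocatable; that entry is absent from the output above. Below is a minimal sketch for checking this from a script. The helper name gpu_allocatable is hypothetical, and the node name in the usage comment is an assumption inferred from the static pod names in this thread (ubuntu-2288h-v5).

```shell
#!/usr/bin/env bash
# gpu_allocatable (hypothetical helper): reads `kubectl describe node` output
# on stdin and prints the nvidia.com/gpu value from the Allocatable section,
# or 0 if the resource is not listed there.
gpu_allocatable() {
  awk '
    /^Allocatable:/ { in_alloc = 1; next }   # entering the Allocatable section
    /^[^ ]/         { in_alloc = 0 }         # any other top-level key ends it
    in_alloc && $1 == "nvidia.com/gpu:" { print $2; found = 1 }
    END { if (!found) print 0 }
  '
}

# Usage (node name is an assumption, substitute your own):
#   kubectl describe node ubuntu-2288h-v5 | gpu_allocatable
```

A result of 0 on a node where nvidia-smi sees the cards suggests that the device plugin could not enumerate them, rather than a driver problem.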

  1. I have installed the NVIDIA driver.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       Off | 00000000:86:00.0 Off |                    0 |
| N/A   28C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P4                       Off | 00000000:87:00.0 Off |                    0 |
| N/A   29C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla P4                       Off | 00000000:AF:00.0 Off |                    0 |
| N/A   32C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla P4                       Off | 00000000:D8:00.0 Off |                    0 |
| N/A   31C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

(this is the nvidia-smi output for the Tesla P4 GPUs)

  2. I have installed the NVIDIA Container Toolkit and configured it for the containerd runtime.
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

(this is the containerd config for nvidia container runtime)

  3. I have installed the NVIDIA Kubernetes device plugin (nvidia-device-plugin).

NAMESPACE      NAME                                      READY   STATUS    RESTARTS      AGE
kube-flannel   kube-flannel-ds-x2pzs                     1/1     Running   2 (16h ago)   7d18h
kube-system    coredns-66f779496c-2k9mg                  1/1     Running   2 (16h ago)   7d18h
kube-system    coredns-66f779496c-nr6tz                  1/1     Running   2 (16h ago)   7d18h
kube-system    etcd-ubuntu-2288h-v5                      1/1     Running   3 (16h ago)   7d18h
kube-system    kube-apiserver-ubuntu-2288h-v5            1/1     Running   3 (16h ago)   7d18h
kube-system    kube-controller-manager-ubuntu-2288h-v5   1/1     Running   3 (16h ago)   7d18h
kube-system    kube-proxy-p6gk9                          1/1     Running   2 (16h ago)   7d18h
kube-system    kube-scheduler-ubuntu-2288h-v5            1/1     Running   3 (16h ago)   7d18h
kube-system    metrics-server-6875467c8d-k6sd6           1/1     Running   2 (16h ago)   2d15h
kube-system    nvidia-device-plugin-daemonset-57kxg      1/1     Running   0             10h

(this is the pod list; the NVIDIA device plugin pod for k8s is running)

Does anyone know what the problem is? Thanks.

@jaffe-fly

Having the same problem

@jaffe-fly

You need to install GFD (GPU Feature Discovery) or label your node.

@Bugaoxingxx

Add the --set-as-default parameter when generating the containerd config:

nvidia-ctk runtime configure --runtime=containerd --set-as-default
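
The flag presumably matters because, with it, nvidia-ctk writes default_runtime_name = "nvidia" into the containerd CRI section, so pods that do not request a runtime class (the device plugin pod included) start under the NVIDIA runtime and can load the driver libraries. containerd needs a restart afterwards for the change to take effect. A rough sketch for verifying the result follows; the helper name default_runtime is hypothetical, and /etc/containerd/config.toml is assumed to be the config path.

```shell
#!/usr/bin/env bash
# default_runtime (hypothetical helper): reads a containerd config on stdin
# and prints the value of default_runtime_name, or nothing if it is not set.
default_runtime() {
  sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\(.*\)".*/\1/p'
}

# Usage (config path is an assumption):
#   sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
#   sudo systemctl restart containerd
#   default_runtime < /etc/containerd/config.toml    # should print the runtime name
```

If this prints runc or nothing, the device plugin pod is likely still running under the stock runtime; after fixing the config and restarting containerd, the plugin pod may also need to be deleted so the DaemonSet recreates it.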

@shizhouhu
Author

"You need to install GFD or label your node."

Thanks, will try.

@shizhouhu
Author

"Add the --set-as-default parameter when generating the containerd config:"

nvidia-ctk runtime configure --runtime=containerd --set-as-default

Thanks.


This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 17, 2024

This issue was automatically closed due to inactivity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 17, 2025

3 participants