H100 gpus do not reach 100% #114

Open
kk0nrad opened this issue Oct 25, 2024 · 2 comments
kk0nrad commented Oct 25, 2024

I'm trying to stress-test a set of H100 GPUs with gpu_burn, but I can't get them to reach 100% utilization.

root@ainode01:~# nvidia-smi
Fri Oct 25 11:20:13 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:19:00.0 Off |                    0 |
| N/A   35C    P0            134W /  700W |   72790MiB /  81559MiB |      2%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:3B:00.0 Off |                    0 |
| N/A   30C    P0            109W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:4C:00.0 Off |                    0 |
| N/A   30C    P0            114W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                    0 |
| N/A   35C    P0            113W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9B:00.0 Off |                    0 |
| N/A   37C    P0            113W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:BB:00.0 Off |                    0 |
| N/A   32C    P0            110W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:CB:00.0 Off |                    0 |
| N/A   34C    P0            110W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DB:00.0 Off |                    0 |
| N/A   30C    P0            113W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    272544      C   ./gpu_burn                                  72780MiB |
|    1   N/A  N/A    272720      C   ./gpu_burn                                  72780MiB |
|    2   N/A  N/A    272722      C   ./gpu_burn                                  72780MiB |
|    3   N/A  N/A    272724      C   ./gpu_burn                                  72780MiB |
|    4   N/A  N/A    272726      C   ./gpu_burn                                  72780MiB |
|    5   N/A  N/A    272728      C   ./gpu_burn                                  72780MiB |
|    6   N/A  N/A    272730      C   ./gpu_burn                                  72780MiB |
|    7   N/A  N/A    272732      C   ./gpu_burn                                  72780MiB |
+-----------------------------------------------------------------------------------------+
root@ainode01:~#

About a minute later:

root@ainode01:~# nvidia-smi
Fri Oct 25 11:21:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:19:00.0 Off |                    0 |
| N/A   39C    P0            145W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0            140W /  700W |   72790MiB /  81559MiB |      4%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:4C:00.0 Off |                    0 |
| N/A   34C    P0            140W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                    0 |
| N/A   41C    P0            138W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9B:00.0 Off |                    0 |
| N/A   41C    P0            148W /  700W |   72790MiB /  81559MiB |     11%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:BB:00.0 Off |                    0 |
| N/A   35C    P0            133W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:CB:00.0 Off |                    0 |
| N/A   37C    P0            133W /  700W |   72790MiB /  81559MiB |      2%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DB:00.0 Off |                    0 |
| N/A   32C    P0            135W /  700W |   72790MiB /  81559MiB |     11%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    272544      C   ./gpu_burn                                  72780MiB |
|    1   N/A  N/A    272720      C   ./gpu_burn                                  72780MiB |
|    2   N/A  N/A    272722      C   ./gpu_burn                                  72780MiB |
|    3   N/A  N/A    272724      C   ./gpu_burn                                  72780MiB |
|    4   N/A  N/A    272726      C   ./gpu_burn                                  72780MiB |
|    5   N/A  N/A    272728      C   ./gpu_burn                                  72780MiB |
|    6   N/A  N/A    272730      C   ./gpu_burn                                  72780MiB |
|    7   N/A  N/A    272732      C   ./gpu_burn                                  72780MiB |
+-----------------------------------------------------------------------------------------+
root@ainode01:~#

What am I missing?
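For anyone comparing snapshots like the two above, the same readings can be collected in a machine-readable form instead of being read off the table by eye. The following is only a minimal sketch, assuming Python 3 and nvidia-smi on PATH. The function names and the 90% threshold are illustrative choices and are not part of gpu_burn or nvidia-smi; the query fields themselves (index, utilization.gpu, power.draw, power.limit) are standard nvidia-smi fields.

```python
import subprocess

# Standard nvidia-smi query fields; everything else here is illustrative.
QUERY = "index,utilization.gpu,power.draw,power.limit"


def parse_gpu_csv(text):
    """Parse output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`.

    Assumes every field is numeric; a value such as "[N/A]" would raise ValueError.
    """
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, util, draw, limit = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "util_pct": float(util),
            "power_w": float(draw),
            "limit_w": float(limit),
        })
    return rows


def underused(rows, min_util_pct=90.0):
    """Return indices of GPUs whose utilization is below the (arbitrary) threshold."""
    return [row["index"] for row in rows if row["util_pct"] < min_util_pct]


def query_gpus():
    """Run nvidia-smi once and return parsed rows (needs an NVIDIA driver installed)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling underused(query_gpus()) while gpu_burn is running would list the GPUs that stay below the threshold. With readings like the ones in this post, every GPU would be listed.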

@kk0nrad changed the title from "H100 gpus does not reach 100%" to "H100 gpus do not reach 100%" on Oct 25, 2024
@yamakenjp

I'm getting 100% utilization:

ubuntu@test:~$ nvidia-smi
Fri Nov  8 17:28:14 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:18:00.0 Off |                    0 |
| N/A   46C    P0            698W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:2A:00.0 Off |                    0 |
| N/A   46C    P0            698W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:3A:00.0 Off |                    0 |
| N/A   51C    P0            702W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                    0 |
| N/A   45C    P0            698W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9A:00.0 Off |                    0 |
| N/A   44C    P0            700W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:AB:00.0 Off |                    0 |
| N/A   47C    P0            699W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:BA:00.0 Off |                    0 |
| N/A   49C    P0            700W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DB:00.0 Off |                    0 |
| N/A   46C    P0            699W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      7445      C   ./gpu_burn                                  72780MiB |
|    1   N/A  N/A      7479      C   ./gpu_burn                                  72780MiB |
|    2   N/A  N/A      7482      C   ./gpu_burn                                  72780MiB |
|    3   N/A  N/A      7484      C   ./gpu_burn                                  72780MiB |
|    4   N/A  N/A      7486      C   ./gpu_burn                                  72780MiB |
|    5   N/A  N/A      7488      C   ./gpu_burn                                  72780MiB |
|    6   N/A  N/A      7490      C   ./gpu_burn                                  72780MiB |
|    7   N/A  N/A      7492      C   ./gpu_burn                                  72780MiB |
+-----------------------------------------------------------------------------------------+

Compile option: make COMPUTE=90

Run command: ./gpu_burn -d 3600

kk0nrad commented Nov 10, 2024

It turned out to be a problem with the license server. nvidia-smi -q | grep -i lic pointed me in the right direction: it reported the hardware as unregistered. After a reboot, the GPUs work for a while even if they haven't been registered. Fixing the issue I had with the license server fixed this for good. BTW: there's no need to compile with COMPUTE=90; it just works as it should now.
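The same check can be scripted, for example to run it on several nodes. This is a rough sketch, assuming Python 3 and nvidia-smi on PATH; the function names are made up for illustration. It only mirrors the grep -i lic filter and makes no assumption about the exact wording nvidia-smi uses for the license state, which may differ between driver versions.

```python
import subprocess


def license_lines(report):
    """Return the lines of an `nvidia-smi -q` report that contain 'lic'.

    Case-insensitive substring match, equivalent to `grep -i lic`.
    """
    return [line.strip() for line in report.splitlines() if "lic" in line.lower()]


def read_license_lines():
    """Run `nvidia-smi -q` and return its license-related lines.

    Requires an NVIDIA driver; raises CalledProcessError if nvidia-smi fails.
    """
    result = subprocess.run(
        ["nvidia-smi", "-q"], capture_output=True, text=True, check=True
    )
    return license_lines(result.stdout)
```

Printing the result of read_license_lines() shows whatever the driver reports about licensing. An empty list simply means the report contains no line matching 'lic'.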
