H100 gpus do not reach 100% #114
Comments
I'm getting 100%.
Compile option: make COMPUTE=90
Run: ./gpu_burn -d 3600
It turned out to be a problem with the license server. Running nvidia-smi -q | grep -i lic pointed me in the right direction: it reported the hardware as unregistered. After a reboot, the GPUs work for a while even if they haven't been registered. Fixing the issue I had with the license server fixed this for good. BTW: there is no need to compile with COMPUTE=90; it just works as it should now.
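For anyone who wants to script the check mentioned above, here is a minimal sketch that mimics nvidia-smi -q | grep -i lic in Python. The function names license_lines and query_license_lines are made up for illustration. The exact wording of the license lines depends on the driver and licensing setup, so the sketch only filters lines and does not try to interpret them.

```python
import subprocess


def license_lines(report: str) -> list[str]:
    """Return the lines of an `nvidia-smi -q` report that contain "lic".

    This is a case-insensitive substring match, the same filter that
    `grep -i lic` applies. (Hypothetical helper, not part of any NVIDIA tool.)
    """
    return [line.strip() for line in report.splitlines() if "lic" in line.lower()]


def query_license_lines() -> list[str]:
    """Run `nvidia-smi -q` and return only the license-related lines.

    Requires the NVIDIA driver and nvidia-smi to be installed on the host.
    """
    result = subprocess.run(
        ["nvidia-smi", "-q"],
        capture_output=True,
        text=True,
        check=True,  # raise if nvidia-smi exits with an error
    )
    return license_lines(result.stdout)
```

On a GPU node, printing each entry of query_license_lines() should show the same lines as the grep command. If the list is empty, the driver may simply not report any licensing information on that system.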
I'm trying to test a bunch of H100 GPUs, but I am unable to reach 100% utilization.
root@ainode01:~# nvidia-smi
Fri Oct 25 11:20:13 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 35C P0 134W / 700W | 72790MiB / 81559MiB | 2% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
| N/A 30C P0 109W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 |
| N/A 30C P0 114W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 35C P0 113W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 |
| N/A 37C P0 113W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 |
| N/A 32C P0 110W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 34C P0 110W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 30C P0 113W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 272544 C ./gpu_burn 72780MiB |
| 1 N/A N/A 272720 C ./gpu_burn 72780MiB |
| 2 N/A N/A 272722 C ./gpu_burn 72780MiB |
| 3 N/A N/A 272724 C ./gpu_burn 72780MiB |
| 4 N/A N/A 272726 C ./gpu_burn 72780MiB |
| 5 N/A N/A 272728 C ./gpu_burn 72780MiB |
| 6 N/A N/A 272730 C ./gpu_burn 72780MiB |
| 7 N/A N/A 272732 C ./gpu_burn 72780MiB |
+-----------------------------------------------------------------------------------------+
root@ainode01:~#
About a minute later:
root@ainode01:~# nvidia-smi
Fri Oct 25 11:21:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 39C P0 145W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
| N/A 32C P0 140W / 700W | 72790MiB / 81559MiB | 4% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 |
| N/A 34C P0 140W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 41C P0 138W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 |
| N/A 41C P0 148W / 700W | 72790MiB / 81559MiB | 11% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 |
| N/A 35C P0 133W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 37C P0 133W / 700W | 72790MiB / 81559MiB | 2% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 32C P0 135W / 700W | 72790MiB / 81559MiB | 11% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 272544 C ./gpu_burn 72780MiB |
| 1 N/A N/A 272720 C ./gpu_burn 72780MiB |
| 2 N/A N/A 272722 C ./gpu_burn 72780MiB |
| 3 N/A N/A 272724 C ./gpu_burn 72780MiB |
| 4 N/A N/A 272726 C ./gpu_burn 72780MiB |
| 5 N/A N/A 272728 C ./gpu_burn 72780MiB |
| 6 N/A N/A 272730 C ./gpu_burn 72780MiB |
| 7 N/A N/A 272732 C ./gpu_burn 72780MiB |
+-----------------------------------------------------------------------------------------+
root@ainode01:~#
What am I missing?
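Comparing two full nvidia-smi snapshots by eye is tedious. As a rough sketch, the same per-GPU numbers can be collected in CSV form with nvidia-smi --query-gpu and parsed in Python. The helper names and the 90% threshold below are assumptions chosen for illustration, not something gpu_burn or nvidia-smi defines.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi: GPU index, utilization (%), power draw (W).
QUERY_FIELDS = "index,utilization.gpu,power.draw"


def parse_samples(csv_text: str) -> list[dict]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    samples = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(record) != 3:
            continue  # skip blank or malformed lines
        try:
            index, util, power = (field.strip() for field in record)
            samples.append(
                {"gpu": int(index), "util_pct": int(util), "power_w": float(power)}
            )
        except ValueError:
            continue  # skip values the driver could not report, e.g. "[N/A]"
    return samples


def idle_gpus(samples: list[dict], threshold_pct: int = 90) -> list[int]:
    """Return indices of GPUs below the (hypothetical) utilization threshold."""
    return sorted({s["gpu"] for s in samples if s["util_pct"] < threshold_pct})


def sample_once() -> list[dict]:
    """Take one utilization sample from every GPU via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_samples(result.stdout)
```

Calling idle_gpus(sample_once()) every few seconds while gpu_burn is running would list the GPUs that stay near 0%, which for the output above would be all eight.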