Commit

fix(api/device): fix passing invalid device handle to NVML functions (#…
XuehaiPan authored Jan 13, 2025
1 parent aa9148d commit d623531
Showing 7 changed files with 220 additions and 153 deletions.
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/bug-report.yaml
@@ -12,7 +12,7 @@ body:
```bash
pip3 install --upgrade pipx
-pipx run --spec git+https://github.com/XuehaiPan/nvitop.git nvitop
+PYTHONFAULTHANDLER=1 pipx run --spec git+https://github.com/XuehaiPan/nvitop.git nvitop
```
- type: checkboxes
@@ -128,7 +128,7 @@ body:
id: logs
attributes:
label: Logs
-description: Run nvitop with `LOGLEVEL=DEBUG nvitop` and paste the output here.
+description: Run nvitop with `PYTHONFAULTHANDLER=1 LOGLEVEL=DEBUG nvitop` and paste the output here.
render: text

- type: textarea
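
Reviewer note: the templates now ask reporters to set `PYTHONFAULTHANDLER=1`. Per the Python documentation, a non-empty value makes the interpreter call `faulthandler.enable()` at startup, so a hard crash (for example a segmentation fault inside a native library) prints a Python traceback to stderr. The sketch below shows the programmatic equivalent; the helper names `fault_handler_requested` and `enable_fault_handler` are hypothetical and are not part of nvitop.

```python
import faulthandler
import os
import sys


def fault_handler_requested(environ=None):
    """Return True if PYTHONFAULTHANDLER is set to a non-empty string.

    Hypothetical helper for illustration; not an nvitop function.
    """
    environ = os.environ if environ is None else environ
    return bool(environ.get('PYTHONFAULTHANDLER'))


def enable_fault_handler(file=None):
    """Enable the fault handler in-process, like PYTHONFAULTHANDLER=1 does at startup."""
    if not faulthandler.is_enabled():
        # `file` must expose fileno(); stderr is the default target.
        faulthandler.enable(file=file if file is not None else sys.stderr, all_threads=True)


if __name__ == '__main__':
    if not fault_handler_requested():
        enable_fault_handler()
    print('faulthandler enabled:', faulthandler.is_enabled())
```

Setting the environment variable is the simpler route for bug reports because it needs no code change and takes effect before any module is imported.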
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/feature-request.yaml
@@ -12,7 +12,7 @@ body:
```bash
pip3 install --upgrade pipx
-pipx run --spec git+https://github.com/XuehaiPan/nvitop.git nvitop
+PYTHONFAULTHANDLER=1 pipx run --spec git+https://github.com/XuehaiPan/nvitop.git nvitop
```
- type: checkboxes
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/questions.yaml
@@ -12,7 +12,7 @@ body:
```bash
pip3 install --upgrade pipx
-pipx run --spec git+https://github.com/XuehaiPan/nvitop.git nvitop
+PYTHONFAULTHANDLER=1 pipx run --spec git+https://github.com/XuehaiPan/nvitop.git nvitop
```
- type: checkboxes
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -25,7 +25,7 @@ repos:
- id: debug-statements
- id: double-quote-string-fixer
- repo: https://github.com/astral-sh/ruff-pre-commit
-rev: v0.9.0
+rev: v0.9.1
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -21,6 +21,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

+- Fix passing invalid device handle (e.g., GPU is lost) to NVML functions by [@XuehaiPan](https://github.com/XuehaiPan) in [#146](https://github.com/XuehaiPan/nvitop/pull/146).
- Fix CUDA device selection tool `nvisel` by [@XuehaiPan](https://github.com/XuehaiPan).

### Removed
5 changes: 2 additions & 3 deletions nvitop-exporter/nvitop_exporter/exporter.py
@@ -24,8 +24,7 @@

from prometheus_client import REGISTRY, CollectorRegistry, Gauge, Info

-from nvitop import Device, MiB, MigDevice, PhysicalDevice, host
-from nvitop.api.process import GpuProcess
+from nvitop import Device, GpuProcess, MiB, MigDevice, PhysicalDevice, host
from nvitop_exporter.utils import get_ip_address


Expand Down Expand Up @@ -602,7 +601,6 @@ def update_device(self, device: Device) -> None: # pylint: disable=too-many-loc
for pid, process in device.processes().items():
with process.oneshot():
username = process.username()
-alive_pids.add((pid, username))
if (pid, username) not in host_snapshots: # noqa: SIM401,RUF100
host_snapshot = host_snapshots[pid, username] = process.host_snapshot()
else:
@@ -659,6 +657,7 @@ def update_device(self, device: Device) -> None: # pylint: disable=too-many-loc
username=username,
).set(value)

+alive_pids.update(host_snapshots)
for pid, username in previous_alive_pids.difference(alive_pids):
for collector in (
self.process_info,
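
Reviewer note: the `exporter.py` hunks move the bookkeeping of live `(pid, username)` keys out of the per-process loop and into a single `alive_pids.update(host_snapshots)` after it. Because iterating a dict yields its keys, this adds every key present in `host_snapshots` to the set. The sketch below illustrates that pattern in isolation. It is a simplified stand-in, not the exporter's code: `prune_stale` is a hypothetical name, and a plain dict replaces the labelled Prometheus collectors.

```python
from __future__ import annotations


def prune_stale(
    previous_alive: set[tuple[int, str]],
    host_snapshots: dict[tuple[int, str], object],
    metrics: dict[tuple[int, str], float],
) -> set[tuple[int, str]]:
    """Drop metrics for keys that were alive last round but have no snapshot now.

    Hypothetical helper: `metrics` stands in for per-process labelled collectors.
    Returns the new set of alive (pid, username) keys.
    """
    alive: set[tuple[int, str]] = set()
    # set.update() on a dict adds the dict's keys, here (pid, username) tuples.
    alive.update(host_snapshots)
    for key in previous_alive.difference(alive):
        # A key seen previously but missing now belongs to a process that is gone.
        metrics.pop(key, None)
    return alive


if __name__ == '__main__':
    metrics = {(1, 'alice'): 1.0, (2, 'bob'): 2.0}
    previous = {(1, 'alice'), (2, 'bob')}
    snapshots = {(1, 'alice'): object()}  # process 2 has exited
    alive = prune_stale(previous, snapshots, metrics)
    print(sorted(alive), sorted(metrics))
```

Deriving the alive set from the snapshot mapping in one step keeps the two collections from drifting apart, which can happen when a key is added to the set before its snapshot is successfully taken.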