At the moment, hardware-observer determines which tools should be deployed on a node only at installation and when an operator runs the `redetect-hardware` action.
This forces the operator to run the action manually under certain conditions and prevents the charm from auto-healing, for example when a tool was installed by mistake due to a bug that has since been fixed, or when a host is reconfigured from a vGPU to a GPU-passthrough setup (DCGM should be deployed in the first case, but not in the second).
I propose we consider making the detection algorithm smarter so that it can run more frequently. At the same time, we have to be careful not to remove monitoring for a piece of hardware that disappeared because it broke rather than because it was removed.
For example:

- at config change and charm upgrades:
  - redetect hardware
  - add any new hardware to the `stored_tools` cache
  - don't remove any cached tool, unless it's DCGM
- only allow completely replacing the cached tools via the `redetect-hardware` action
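
The rules above could be sketched roughly as follows. This is only an illustration of the proposed merge policy, not the charm's actual code: `merge_tools`, `AUTO_REMOVABLE`, and the tool names used as plain strings are all hypothetical, and the real `stored_tools` cache may use a different type.

```python
from typing import FrozenSet, Set

# Hypothetical: tools that may be dropped automatically when no longer detected.
# The proposal only names DCGM; the identifier string is an assumption.
AUTO_REMOVABLE: FrozenSet[str] = frozenset({"dcgm"})


def merge_tools(cached: Set[str], detected: Set[str], full_replace: bool = False) -> Set[str]:
    """Return the new tool cache given the cached and freshly detected tools.

    full_replace=True models the redetect-hardware action, which is the only
    path allowed to discard cached tools wholesale.
    """
    if full_replace:
        return set(detected)

    # Keep every cached tool unless it is auto-removable and no longer detected,
    # so hardware that vanished because it broke stays monitored.
    kept = {tool for tool in cached if tool in detected or tool not in AUTO_REMOVABLE}

    # Always add newly detected tools.
    return kept | detected
```

Under this sketch, a config-change or upgrade hook would call `merge_tools(cached, detected)`, while the `redetect-hardware` action would pass `full_replace=True`.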