Periodically redetect hardware #381

Open
aieri opened this issue Dec 18, 2024 · 2 comments

Labels
enhancement New feature or request

Comments

aieri commented Dec 18, 2024

At the moment, hardware-observer determines which tools should be deployed on a node only at installation and when an operator runs the redetect-hardware action.

This forces the operator to manually run the action under certain conditions, and prevents the charm from autohealing, for example when a tool was mistakenly installed due to a bug that no longer exists, or when a host is reconfigured from a vGPU to a GPU-passthrough setup (DCGM should be deployed in the first case, but not in the second).

I propose we consider making the detection algorithm smarter, so that it can be run more frequently. At the same time we have to be careful not to remove monitoring for a piece of hardware that has disappeared because it broke, and not because it has been removed.

For example:

  • at config change and charm upgrades:
      • redetect hardware
      • add any new hardware to the stored_tools cache
      • don't remove any cached tool, unless it's DCGM
  • only allow completely replacing the cached tools via the redetect-hardware action
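
The merge policy proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the charm's actual code: the function name `merge_tools`, the `AUTO_REMOVABLE` constant, and the tool name strings are all hypothetical, and the cached tools are modelled as a plain set rather than the charm's real stored_tools structure.

```python
from typing import FrozenSet, Iterable, Set

# Hypothetical: tools that may be dropped automatically once they are no
# longer detected. The proposal only names DCGM as such an exception.
AUTO_REMOVABLE: FrozenSet[str] = frozenset({"dcgm"})


def merge_tools(
    cached: Iterable[str],
    detected: Iterable[str],
    full_replace: bool = False,
    auto_removable: FrozenSet[str] = AUTO_REMOVABLE,
) -> Set[str]:
    """Return the new set of cached tools after a redetection run.

    full_replace=True models the redetect-hardware action: the cache is
    replaced wholesale by whatever was detected.
    full_replace=False models config-changed / upgrade hooks: newly
    detected tools are added, and cached tools are kept even when they
    are no longer detected, unless they are in auto_removable.
    """
    cached_set, detected_set = set(cached), set(detected)
    if full_replace:
        return detected_set
    # Tools that were cached but did not show up in this detection run.
    missing = cached_set - detected_set
    # Keep missing tools (the hardware may simply be broken), except the
    # ones explicitly allowed to be removed automatically.
    kept = cached_set - (missing & auto_removable)
    return kept | detected_set
```

With this sketch, `merge_tools({"ipmi_sel", "dcgm"}, {"ipmi_sel", "smartctl"})` would drop `dcgm`, add `smartctl`, and keep `ipmi_sel`, while a tool whose hardware has vanished stays cached until an operator runs the action (`full_replace=True`). The tool names here are placeholders chosen for the example.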

aieri added the enhancement label on Dec 18, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/SOLENG-1001.

This message was autogenerated

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/SOLENG-1000.

This message was autogenerated
