Keep image up-to-date with its components #8

Closed
jgiannuzzi opened this issue Mar 14, 2022 · 3 comments

Comments

@jgiannuzzi
Contributor

Describe the improvement request

The following components should be kept up-to-date in the image:

  • Ubuntu base image
  • GitHub Actions Runner
  • NVIDIA drivers

I think that a daily scheduled workflow could be used to check for updates to each of those components, and create a PR with the new version number set in variables.auto.pkrvars.hcl when an update is detected. We may also want to have a workflow running on PRs that builds and tests the image, but does not save it.
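
As a rough illustration of the "set the new version number in variables.auto.pkrvars.hcl" step, here is a minimal Python sketch. It assumes the file holds simple `name = "value"` string assignments; the variable names used in the example (`runner_version`, `nvidia_driver_version`) are hypothetical and may not match the real file.

```python
import re


def set_pkrvar(text, name, value):
    """Return `text` with the string variable `name` set to `value`.

    Assumes one `name = "value"` assignment per line (an assumption about
    the file layout). Raises KeyError if the variable is missing, so a typo
    in the name does not silently leave the file unchanged.
    """
    pattern = re.compile(
        r'^([ \t]*' + re.escape(name) + r'[ \t]*=[ \t]*)"[^"]*"',
        re.MULTILINE,
    )
    # A function replacement avoids backslash-escape surprises in `value`.
    new_text, count = pattern.subn(lambda m: m.group(1) + '"' + value + '"', text)
    if count == 0:
        raise KeyError(name)
    return new_text
```

If the returned text equals the input, nothing changed and the workflow would not need to open a PR.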

Here is how I think the check for updates could be done for each component:

Ubuntu base image

Use gcloud to query for the latest image in the ubuntu-2004-lts family

gcloud compute images list --filter family=ubuntu-2004-lts --format "value(NAME)"
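
If the command ever prints more than one name, the workflow needs a rule for choosing between them. The sketch below assumes the names end in a `-vYYYYMMDD` build stamp, optionally followed by a single letter (e.g. `ubuntu-2004-focal-v20220308`); that naming pattern and the helper name `newest_image` are assumptions for illustration only.

```python
import re

# Assumed naming pattern: "...-vYYYYMMDD" with an optional one-letter suffix.
_STAMP = re.compile(r"-v(\d{8})([a-z]?)$")


def newest_image(names):
    """Return the name with the latest build stamp, or None if none match."""
    candidates = []
    for raw in names:
        name = raw.strip()
        match = _STAMP.search(name)
        if match:
            # Sort by date first, then by the optional letter suffix.
            candidates.append((match.group(1), match.group(2), name))
    return max(candidates)[2] if candidates else None
```

It would be called on the command output split into lines, e.g. `newest_image(output.splitlines())`.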

GitHub Actions Runner

Use the GitHub REST API to query for the latest release (the v prefix will need to be removed)

gh api /repos/actions/runner/releases/latest | jq -r .tag_name
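
A small Python sketch of the "remove the v prefix and compare" step follows. It assumes tags are plain dotted integers such as `v2.294.0`; the helper names are illustrative, not part of any existing tooling.

```python
def strip_v(tag):
    """Drop a leading "v" from a release tag such as "v2.294.0"."""
    return tag[1:] if tag.startswith("v") else tag


def parse_version(version):
    """Turn "2.294.0" into (2, 294, 0) so comparison is numeric, not textual."""
    return tuple(int(part) for part in version.split("."))


def needs_update(current, latest_tag):
    """True when the latest release tag is newer than the pinned version."""
    return parse_version(strip_v(latest_tag)) > parse_version(current)
```

Comparing integer tuples rather than strings matters once a component reaches two or three digits: as strings, "2.9.0" would sort after "2.10.0".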

NVIDIA drivers

The decision to move from one major version to another should be made by a human (e.g. by updating the scheduled workflow).
The latest datacenter driver versions can be found at https://docs.nvidia.com/datacenter/tesla/index.html. We should probably parse this HTML page (e.g. with https://pypi.org/project/beautifulsoup4/ or https://developer.mozilla.org/en-US/docs/Web/API/DOMParser/parseFromString) and determine the latest version for a given major version (e.g. 470 -> 470.103.01 at the time of writing).
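
One possible shape for that parsing step is sketched below. It uses Python's standard-library `html.parser` in place of beautifulsoup4 to stay dependency-free, and it assumes the version numbers appear as plain text somewhere in the page; the actual page structure has not been checked here, and `latest_for_major` is a hypothetical helper name.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def _as_tuple(version):
    """Turn "470.103.01" into (470, 103, 1) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def latest_for_major(html, major):
    """Return the highest "<major>.x[.y]" version in the page text, or None."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join(collector.chunks)
    # The lookarounds keep "1470.1" or "5.470.1" from matching major 470.
    pattern = re.compile(r"(?<![\d.])" + str(int(major)) + r"\.\d+(?:\.\d+)*(?!\d)")
    versions = pattern.findall(text)
    return max(versions, key=_as_tuple) if versions else None
```

The numeric key is what makes 470.103.01 rank above 470.82.01; a plain string comparison would get that pair wrong.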

We don't need all 3 components done before we can roll out this feature. It seems to me that the first 2 are low-hanging fruit, and we would benefit immediately from having them always up-to-date.

@pavlovic-ivan
Owner

Hi @jgiannuzzi. Thanks for this issue, and I agree. However, what do you say about splitting this into three separate issues, one per component version? I would like to implement the GitHub Actions Runner version fix ASAP, and if I go that way, I would close this ticket... It looks like the GitHub Actions Runner version is causing an actual problem, and maybe GCP wasn't actually preempting instances. It looks like the Runner process wasn't able to start at all, because 2.292.0 is an "old" version; 2.294.0 is the latest. I did some manual setups in AWS, tried to register runner 2.292.0, and it failed with the error "Runner too old". This might be the reason why ILGPU cannot run CUDA jobs at the moment.

So the runner version fix looks like a high priority to me.

What do you think?

@jgiannuzzi
Contributor Author

Totally agree. Go ahead and split the issue!

@pavlovic-ivan
Owner

Closing this issue. Scheduled workflow check for new version of runners. Two new issues added for the Ubuntu version and the NVIDIA drivers:
