diff --git a/images/nvidia-container-toolkit/README.md b/images/nvidia-container-toolkit/README.md index 71668de58e..082c81ac9d 100644 --- a/images/nvidia-container-toolkit/README.md +++ b/images/nvidia-container-toolkit/README.md @@ -25,4 +25,29 @@ docker pull cgr.dev/chainguard/nvidia-container-toolkit:latest ``` - + + +## Usage + +```sh +helm repo add nvidia https://helm.ngc.nvidia.com/nvidia +helm upgrade --install gpu-operator nvidia/gpu-operator \ + -n gpu-operator \ + --create-namespace \ + --set toolkit.repository=cgr.dev/chainguard \ + --set toolkit.image=nvidia-container-toolkit \ + --set toolkit.version=latest +``` + +* Refer to the [values.yaml](https://github.com/NVIDIA/gpu-operator/blob/master/deployments/gpu-operator/values.yaml) file for more configuration options. + +> [!WARNING] +> Make sure the `gpu-operator` chart is up to date, and use the latest operator tag that is within the compatibility matrix. + +> [!IMPORTANT] +> You need GPU nodes to run the operator, as it will schedule Deployments and DaemonSets on nodes with GPUs. + +> [!NOTE] +> To learn more about how we test this image, refer to the [TESTING.md](./TESTING.md) file. + + diff --git a/images/nvidia-container-toolkit/TESTING.md b/images/nvidia-container-toolkit/TESTING.md new file mode 100644 index 0000000000..b3da97c116 --- /dev/null +++ b/images/nvidia-container-toolkit/TESTING.md @@ -0,0 +1,267 @@ +# Testing nvidia-container-toolkit + +This document describes how to test our `nvidia-container-toolkit` images on a real GKE cluster. + +It follows the official installation instructions: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/google-gke.html + +We will use the `gpu-operator` Helm chart to deploy and test the `nvidia-container-toolkit` images. + +## Prerequisites + +* `gcloud` +* `helm` +* `gpu-operator` + +## Installation + +1. 
Set up your GKE cluster: + +```shell +gcloud beta container clusters create gpu-cluster \ + --project \ + --zone us-west1-a \ + --release-channel "regular" \ + --machine-type "n1-standard-4" \ + --accelerator "type=nvidia-tesla-t4,count=1" \ + --image-type "UBUNTU_CONTAINERD" \ + --disk-type "pd-standard" \ + --disk-size "30" \ + --no-enable-intra-node-visibility \ + --metadata disable-legacy-endpoints=true \ + --max-pods-per-node "110" \ + --num-nodes "1" \ + --logging=SYSTEM,WORKLOAD \ + --monitoring=SYSTEM \ + --enable-ip-alias \ + --no-enable-intra-node-visibility \ + --default-max-pods-per-node "110" \ + --no-enable-master-authorized-networks \ + --tags=nvidia-ingress-all +``` + +2. Apply the following `ResourceQuota`: + +```shell +cat < [!TIP] +> If you want to test an image that you've built locally, you'll need to build it and push it to [ArtifactRegistry](https://cloud.google.com/artifact-registry) first. Then replace the `repository` and `image` values in the `values.yaml` file. Alternatively, you can use the `latest` tag from the Chainguard registry. + +3. Use one of the following methods to test the `nvidia-container-toolkit` image: + +a. Pull the image from the Chainguard registry: + +* Create your `values.yaml`: + +```shell +cat < values.yaml +toolkit: + repository: cgr.dev/chainguard + image: nvidia-container-toolkit + version: latest +EOF +``` + +b. Push the locally built image to ArtifactRegistry: + +* Ensure the registry exists: + +```shell +gcloud artifacts repositories list +crane ls -docker.pkg.dev// +``` + +* Build the image: +```shell +TF_VAR_target_repository=-docker.pkg.dev///nvidia-container-toolkit TF_VAR_archs='["amd64"]' make image/nvidia-container-toolkit +``` + +> [!WARNING] +> `amd64` is used for testing because the GKE cluster uses the same architecture. 
+ +* Check the image with [crane](https://github.com/google/go-containerregistry/blob/main/cmd/crane/README.md): + +```shell +$ crane ls -docker.pkg.dev///nvidia-container-toolkit +``` + +* Update the values file: + +```shell +cat < values.yaml +toolkit: + repository: -docker.pkg.dev// + image: nvidia-container-toolkit + version: latest +EOF +``` + +* Alternatively, use the following steps to update the image after deployment in order to test newer builds: + +```shell + +kubectl set image -n gpu-operator daemonset/nvidia-container-toolkit-daemonset nvidia-container-toolkit-ctr="-docker.pkg.dev///nvidia-container-toolkit" +kubectl rollout restart daemonset -n gpu-operator nvidia-container-toolkit-daemonset +``` + +> [!WARNING] +> Don't forget to set `imagePullPolicy` to `Always` if you are going to use the `latest` tag. Also make sure to use the correct architecture for the image. + +4. Install the `gpu-operator` Helm chart: + +```shell +helm repo add nvidia https://nvidia.github.io/gpu-operator +helm upgrade --install gpu-operator nvidia/gpu-operator \ + -n gpu-operator \ + --create-namespace \ + -f values.yaml +``` + +> [!IMPORTANT] +> You will only see `gpu-operator-node-feature-discovery-*` workloads for the first few minutes. The rest of the workloads will be created after everything is initialized and validated. Expect every Pod to be in the `Running` state. It usually takes about 10 minutes for everything to be up and running. + +5. 
Ensure everything is up and running: + +```shell +NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE gpu-feature-discovery-k8q6x 1/1 Running 0 5m43s +gpu-operator-7bbf8bb6b7-g7z99 1/1 Running 0 6m10s +gpu-operator-node-feature-discovery-gc-79d6d968bb-qvnlr 1/1 Running 0 8m55s +gpu-operator-node-feature-discovery-master-6d9f8d497c-ktfzs 1/1 Running 0 8m55s +gpu-operator-node-feature-discovery-worker-7xqrk 1/1 Running 0 8m55s +nvidia-container-toolkit-daemonset-sscl9 1/1 Running 0 5m43s +nvidia-cuda-validator-95ph2 0/1 Completed 0 42s +nvidia-dcgm-exporter-2jfqc 1/1 Running 0 5m43s +nvidia-device-plugin-daemonset-2t62r 1/1 Running 0 5m43s +nvidia-driver-daemonset-9cnfm 1/1 Running 0 5m59s +nvidia-operator-validator-zt2mt 1/1 Running 0 5m43s +``` + +6. Check the logs if something is wrong: + +```shell +kubectl logs daemonset/nvidia-container-toolkit-daemonset -c nvidia-container-toolkit-ctr -f +``` + +* Logs should be similar to the following: + + +
+ +Logs + +``` + +Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init) +time="2024-05-03T17:25:19Z" level=info msg="Parsing arguments" +time="2024-05-03T17:25:19Z" level=info msg="Starting nvidia-toolkit" +time="2024-05-03T17:25:19Z" level=info msg="Verifying Flags" +time="2024-05-03T17:25:19Z" level=info msg=Initializing +time="2024-05-03T17:25:19Z" level=info msg="Installing toolkit" +time="2024-05-03T17:25:19Z" level=info msg="disabling device node creation since --cdi-enabled=false" +time="2024-05-03T17:25:19Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'" +time="2024-05-03T17:25:19Z" level=info msg="Removing existing NVIDIA container toolkit installation" +time="2024-05-03T17:25:19Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'" +time="2024-05-03T17:25:19Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'" +time="2024-05-03T17:25:19Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'" +time="2024-05-03T17:25:19Z" level=info msg="Finding library libnvidia-container.so.1 (root=)" +time="2024-05-03T17:25:19Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'" +time="2024-05-03T17:25:19Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container.so.1' => '/usr/lib/libnvidia-container.so.1.15.0'" +time="2024-05-03T17:25:19Z" level=info msg="Installing '/usr/lib/libnvidia-container.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.15.0'" +time="2024-05-03T17:25:19Z" level=info msg="Installed '/usr/lib/libnvidia-container.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.15.0'" +time="2024-05-03T17:25:19Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.15.0'" +time="2024-05-03T17:25:19Z" level=info msg="Finding library 
libnvidia-container-go.so.1 (root=)" +time="2024-05-03T17:25:19Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container-go.so.1'" +time="2024-05-03T17:25:19Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container-go.so.1' => '/usr/lib/libnvidia-container-go.so.1.15.0'" +time="2024-05-03T17:25:19Z" level=info msg="Installing '/usr/lib/libnvidia-container-go.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.15.0'" +time="2024-05-03T17:25:19Z" level=info msg="Installed '/usr/lib/libnvidia-container-go.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.15.0'" +time="2024-05-03T17:25:19Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1' -> 'libnvidia-container-go.so.1.15.0'" +time="2024-05-03T17:25:19Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime' to /usr/local/nvidia/toolkit" +time="2024-05-03T17:25:19Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'" +time="2024-05-03T17:25:19Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'" +time="2024-05-03T17:25:19Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'" +time="2024-05-03T17:25:19Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.cdi' to /usr/local/nvidia/toolkit" +time="2024-05-03T17:25:19Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.cdi' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'" +time="2024-05-03T17:25:19Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'" +time="2024-05-03T17:25:19Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi'" +time="2024-05-03T17:25:19Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.legacy' to /usr/local/nvidia/toolkit" 
+time="2024-05-03T17:25:19Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.legacy' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'" +time="2024-05-03T17:25:19Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'" +time="2024-05-03T17:25:19Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy'" +time="2024-05-03T17:25:19Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'" +time="2024-05-03T17:25:19Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli' to /usr/local/nvidia/toolkit" +time="2024-05-03T17:25:19Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'" +time="2024-05-03T17:25:19Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-cli.real'" +time="2024-05-03T17:25:19Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'" +time="2024-05-03T17:25:19Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-runtime-hook'" +time="2024-05-03T17:25:19Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime-hook' to /usr/local/nvidia/toolkit" +time="2024-05-03T17:25:19Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime-hook' to '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'" +time="2024-05-03T17:25:19Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'" +time="2024-05-03T17:25:19Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook'" +time="2024-05-03T17:25:19Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-toolkit' -> 'nvidia-container-runtime-hook'" +time="2024-05-03T17:25:19Z" level=info msg="Installing executable '/usr/bin/nvidia-ctk' to /usr/local/nvidia/toolkit" 
+time="2024-05-03T17:25:19Z" level=info msg="Installing '/usr/bin/nvidia-ctk' to '/usr/local/nvidia/toolkit/nvidia-ctk.real'" +time="2024-05-03T17:25:19Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-ctk.real'" +time="2024-05-03T17:25:19Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-ctk'" +time="2024-05-03T17:25:19Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'" +time="2024-05-03T17:25:19Z" level=info msg="Skipping unset option: nvidia-container-runtime.log-level" +time="2024-05-03T17:25:19Z" level=info msg="Skipping unset option: nvidia-container-runtime.mode" +time="2024-05-03T17:25:19Z" level=info msg="Skipping unset option: nvidia-container-runtime.modes.cdi.annotation-prefixes" +time="2024-05-03T17:25:19Z" level=info msg="Skipping unset option: nvidia-container-runtime.runtimes" +time="2024-05-03T17:25:19Z" level=info msg="Skipping unset option: nvidia-container-cli.debug" +time="2024-05-03T17:25:19Z" level=info msg="Skipping unset option: nvidia-container-runtime.debug" +Using config: +accept-nvidia-visible-devices-as-volume-mounts = false +accept-nvidia-visible-devices-envvar-when-unprivileged = true + +[nvidia-container-cli] + ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real" + path = "/usr/local/nvidia/toolkit/nvidia-container-cli" + root = "/run/nvidia/driver" + +[nvidia-container-runtime] + + [nvidia-container-runtime.modes] + + [nvidia-container-runtime.modes.cdi] + default-kind = "management.nvidia.com/gpu" + +[nvidia-container-runtime-hook] + path = "/usr/local/nvidia/toolkit/nvidia-container-runtime-hook" + skip-mode-detection = true + +[nvidia-ctk] + path = "/usr/local/nvidia/toolkit/nvidia-ctk" +time="2024-05-03T17:25:19Z" level=info msg="Setting up runtime" +time="2024-05-03T17:25:19Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]" +time="2024-05-03T17:25:19Z" level=info msg="Successfully parsed 
arguments" +time="2024-05-03T17:25:19Z" level=info msg="Starting 'setup' for containerd" +time="2024-05-03T17:25:19Z" level=info msg="Loading config from /runtime/config-dir/config.toml" +time="2024-05-03T17:25:19Z" level=info msg="Flushing config to /runtime/config-dir/config.toml" +time="2024-05-03T17:25:19Z" level=info msg="Sending SIGHUP signal to containerd" +time="2024-05-03T17:25:19Z" level=info msg="Successfully signaled containerd" +time="2024-05-03T17:25:19Z" level=info msg="Completed 'setup' for containerd" +time="2024-05-03T17:25:19Z" level=info msg="Waiting for signal" +``` +
+ +7. Tear down the cluster: + +```shell +gcloud container clusters delete gpu-cluster +``` diff --git a/images/nvidia-container-toolkit/config/main.tf b/images/nvidia-container-toolkit/config/main.tf index 6e8ad6a108..32f06facab 100644 --- a/images/nvidia-container-toolkit/config/main.tf +++ b/images/nvidia-container-toolkit/config/main.tf @@ -38,6 +38,7 @@ variable "extra_packages" { "libnvidia-container", "nvidia-cuda-cudart-12", "nvidia-cuda-nvml-dev-12", + "bash", ] } @@ -60,7 +61,30 @@ output "config" { }, work-dir = "/work" entrypoint = { - command = "nvidia-toolkit" + command = "/work/nvidia-toolkit" } - }) + paths = [{ + path = "/bin" + type = "directory" + uid = module.accts.uid + gid = module.accts.gid + permissions = 493 + recursive = true + }, { + path = "/run/nvidia" + type = "directory" + uid = module.accts.uid + gid = module.accts.gid + permissions = 493 + recursive = true + }, { + path = "/host" + type = "directory" + uid = module.accts.uid + gid = module.accts.gid + permissions = 493 + recursive = true + }] + } + ) } diff --git a/images/nvidia-container-toolkit/main.tf b/images/nvidia-container-toolkit/main.tf index 59f0328902..f39222b198 100644 --- a/images/nvidia-container-toolkit/main.tf +++ b/images/nvidia-container-toolkit/main.tf @@ -15,23 +15,22 @@ module "nvidia-container-toolkit" { name = basename(path.module) target_repository = var.target_repository config = module.config.config - # build-dev = true + build-dev = true } -# module "test" { -# source = "./tests" -# digest = module.nvidia-container-toolkit.image_ref -# } +module "test" { + source = "./tests" + digest = module.nvidia-container-toolkit.image_ref +} resource "oci_tag" "latest" { - # depends_on = [module.test] + depends_on = [module.test] digest_ref = module.nvidia-container-toolkit.image_ref tag = "latest" } -# resource "oci_tag" "latest-dev" { -# depends_on = [module.test] -# digest_ref = module.nvidia-container-toolkit.dev_ref -# tag = "latest-dev" -# } - +resource "oci_tag" 
"latest-dev" { + depends_on = [module.test] + digest_ref = module.nvidia-container-toolkit.dev_ref + tag = "latest-dev" +} diff --git a/images/nvidia-container-toolkit/tests/01-smoke.sh b/images/nvidia-container-toolkit/tests/01-smoke.sh new file mode 100755 index 0000000000..2477cab582 --- /dev/null +++ b/images/nvidia-container-toolkit/tests/01-smoke.sh @@ -0,0 +1,57 @@ +#!/usr/bin/env bash + +set -o errexit -o nounset -o errtrace -o pipefail -x + +CONTAINER_NAME="nvidia-container-toolkit-$(uuidgen)" + +# The $ROOT environment variable is normally set by the `gpu-operator`. Pass an arbitrary value to test the script. +docker run \ + -d --rm \ + --name "${CONTAINER_NAME}" \ + --privileged \ + -e ROOT="/run/nvidia" \ + "${IMAGE_NAME}" + +# Stop container when script exits +trap "docker stop ${CONTAINER_NAME}" EXIT + +sleep 3 + +# Check if container is still running +for ((i = 0; i < 10; i++)); do + if docker ps --filter "name=${CONTAINER_NAME}" --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then + break + fi + echo "Waiting for container to start..." + sleep 3 +done || { + echo "FAILED: ${CONTAINER_NAME} is not running." + docker ps --all + exit 1 +} + +function dump_logs_and_exit { + echo "Dumping container logs and exiting..." + container_logs=$(docker logs "${CONTAINER_NAME}") + echo "Dumping container logs: ${container_logs}" + exit 1 +} + +sleep 10 + +logs=$(docker logs "${CONTAINER_NAME}" 2>&1) + +# Log messages the toolkit should have emitted while installing and configuring itself +true_asserts=("Installing NVIDIA container toolkit config" "Setting up runtime" "Successfully parsed arguments") + +# This image is intended to run on a host with NVIDIA GPU drivers installed, so without them it prints an error message +true_asserts+=("unable to dial: dial unix /var/run/docker.sock") + +for assert in "${true_asserts[@]}"; do + if ! echo "$logs" | grep -q "$assert"; then + echo "AssertTrue failed: $assert" + dump_logs_and_exit + fi +done + +echo "All assertions passed." 
diff --git a/images/nvidia-container-toolkit/tests/EXAMPLE_TEST.sh b/images/nvidia-container-toolkit/tests/EXAMPLE_TEST.sh deleted file mode 100644 index 348ce1cc10..0000000000 --- a/images/nvidia-container-toolkit/tests/EXAMPLE_TEST.sh +++ /dev/null @@ -1,5 +0,0 @@ -#!/usr/bin/env bash - -set -o errexit -o nounset -o errtrace -o pipefail -x - -# TODO: Implement this test. diff --git a/images/nvidia-container-toolkit/tests/main.tf b/images/nvidia-container-toolkit/tests/main.tf index 139178af93..05e64f7492 100644 --- a/images/nvidia-container-toolkit/tests/main.tf +++ b/images/nvidia-container-toolkit/tests/main.tf @@ -1,6 +1,7 @@ terraform { required_providers { - oci = { source = "chainguard-dev/oci" } + oci = { source = "chainguard-dev/oci" } + imagetest = { source = "chainguard-dev/imagetest" } } } @@ -8,11 +9,10 @@ variable "digest" { description = "The image digest to run tests over." } -// Invoke a script with the test. -// $IMAGE_NAME is populated with the image name by digest. -// TODO: Update or remove this test as appropriate. -data "oci_exec_test" "manifest" { - digest = var.digest - script = "./EXAMPLE_TEST.sh" - working_dir = path.module +data "oci_string" "ref" { input = var.digest } + +# TODO: Convert this to imagetest_harness_container when ready +data "oci_exec_test" "runs" { + digest = var.digest + script = "${path.module}/01-smoke.sh" }
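
A note on the readiness loop in `01-smoke.sh`: in bash, a `for` loop's exit status is that of the last command it ran, which here would be `sleep 3`, so the `|| { ... }` failure branch may never fire even when the container is not running. The sketch below shows one way to make a polling loop report a timeout explicitly. `retry_until` is a hypothetical helper, not part of this PR, and the demo uses `true` and `false` as stand-ins so it runs without Docker.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# retry_until ATTEMPTS DELAY CMD...
# Hypothetical helper (not part of the PR): runs CMD up to ATTEMPTS times,
# sleeping DELAY seconds after each failed try.
# Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
retry_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 0; i < attempts; i++)); do
    if "$@"; then
      return 0
    fi
    sleep "${delay}"
  done
  return 1
}

# Demo with stand-in commands so the sketch runs without Docker.
if retry_until 3 0 true; then
  echo "ready"
fi
if ! retry_until 2 0 false; then
  echo "timed out"
fi
```

In the smoke test, the `docker ps ... | grep -q ...` check could be wrapped in a small function and passed to such a helper (for example with 10 attempts and a 3 second delay), with the existing failure block attached after `||`. That wiring is an assumption about how the script might be adapted, not something the PR does.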