The following guidance assumes Azure Kubernetes Service(AKS) is used to host the Kubernetes cluster. If you want to use Elastic Kubernetes Service (EKS) instead, please follow the installation guide here
Before you begin, ensure you have the following tools installed:
- Azure CLI to provision Azure resources
- Helm to install this operator
- kubectl to view Kubernetes resources
- git to clone this repo locally
- jq to process JSON files
Important Note: Ensure you use a release branch of the repository for a stable version of the installation.
If you do not already have an AKS cluster, run the following Azure CLI commands to create one:
export RESOURCE_GROUP="myResourceGroup"
export MY_CLUSTER="myCluster"
export LOCATION="eastus"
az group create --name $RESOURCE_GROUP --location $LOCATION
az aks create --resource-group $RESOURCE_GROUP --name $MY_CLUSTER --enable-oidc-issuer --enable-workload-identity --enable-managed-identity --generate-ssh-keys
Connect to the AKS cluster.
az aks get-credentials --resource-group $RESOURCE_GROUP --name $MY_CLUSTER
If you do not have kubectl
installed locally, you can install using the following Azure CLI command.
az aks install-cli
Be sure you've cloned this repo and connected to your AKS cluster before attempting to install the Helm charts.
Install the Workspace controller.
helm install workspace ./charts/kaito/workspace --namespace kaito-workspace --create-namespace
Note that if you have installed another node provisioning controller that supports Karpenter-core APIs, the following steps for installing gpu-provisioner
can be skipped.
The gpu-provisioner controller requires the workload identity feature to acquire the access token to the AKS cluster.
Run the following commands only if your AKS cluster does not already have the Workload Identity and OIDC issuer features enabled.
export RESOURCE_GROUP="myResourceGroup"
export MY_CLUSTER="myCluster"
az aks update -g $RESOURCE_GROUP -n $MY_CLUSTER --enable-oidc-issuer --enable-workload-identity --enable-managed-identity
The identity kaitoprovisioner
is created for the gpu-provisioner controller. It is assigned Contributor role for the managed cluster resource to allow changing $MY_CLUSTER
(e.g., provisioning new nodes in it).
export SUBSCRIPTION=$(az account show --query id -o tsv)
export IDENTITY_NAME="kaitoprovisioner"
az identity create --name $IDENTITY_NAME -g $RESOURCE_GROUP
export IDENTITY_PRINCIPAL_ID=$(az identity show --name $IDENTITY_NAME -g $RESOURCE_GROUP --subscription $SUBSCRIPTION --query 'principalId' -o tsv)
az role assignment create --assignee $IDENTITY_PRINCIPAL_ID --scope /subscriptions/$SUBSCRIPTION/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.ContainerService/managedClusters/$MY_CLUSTER --role "Contributor"
Install the Node provisioner controller.
# get additional values for helm chart install
export GPU_PROVISIONER_VERSION=0.2.1
curl -sO https://raw.githubusercontent.com/Azure/gpu-provisioner/main/hack/deploy/configure-helm-values.sh
chmod +x ./configure-helm-values.sh && ./configure-helm-values.sh $MY_CLUSTER $RESOURCE_GROUP $IDENTITY_NAME
helm install gpu-provisioner --values gpu-provisioner-values.yaml --set settings.azure.clusterName=$MY_CLUSTER --wait \
https://github.com/Azure/gpu-provisioner/raw/gh-pages/charts/gpu-provisioner-$GPU_PROVISIONER_VERSION.tgz --namespace gpu-provisioner --create-namespace
The federated identity credential between the managed identity kaitoprovisioner
and the service account used by the gpu-provisioner controller is created.
export AKS_OIDC_ISSUER=$(az aks show -n $MY_CLUSTER -g $RESOURCE_GROUP --subscription $SUBSCRIPTION --query "oidcIssuerProfile.issuerUrl" -o tsv)
az identity federated-credential create --name kaito-federatedcredential --identity-name $IDENTITY_NAME -g $RESOURCE_GROUP --issuer $AKS_OIDC_ISSUER --subject system:serviceaccount:"gpu-provisioner:gpu-provisioner" --audience api://AzureADTokenExchange --subscription $SUBSCRIPTION
Then the gpu-provisioner can access the managed cluster using a trust token with the same permissions of the kaitoprovisioner
identity.
Note that before finishing this step, the gpu-provisioner controller pod will constantly fail with the following message in the log:
panic: Configure azure client fails. Please ensure federatedcredential has been created for identity XXXX.
The pod will reach running state once the federated credential is created.
You can run the following commands to verify the installation of the controllers were successful.
Check status of the Helm chart installations.
helm list -n kaito-workspace
helm list -n gpu-provisioner
Check status of the workspace
.
kubectl describe deploy workspace -n kaito-workspace
Check status of the gpu-provisioner
.
kubectl describe deploy gpu-provisioner -n gpu-provisioner
If you see that the gpu-provisioner
deployment is not running after some time, it's possible that some values incorrect in your values.ovveride.yaml
.
Run the following command to check gpu-provisioner
pod logs for additional details.
kubectl logs --selector=app.kubernetes.io\/name=gpu-provisioner -n gpu-provisioner
helm uninstall gpu-provisioner
helm uninstall workspace