added kueue example
Signed-off-by: Dmitry Shmulevich <[email protected]>
dmitsh committed May 10, 2024
1 parent f72e040 commit 3add7bd
Showing 9 changed files with 197 additions and 1 deletion.
36 changes: 36 additions & 0 deletions charts/overrides/kwok/pod-complete.yml
@@ -0,0 +1,36 @@
apiVersion: kwok.x-k8s.io/v1alpha1
kind: Stage
metadata:
  name: pod-complete
spec:
  next:
    statusTemplate: |
      {{`{{ $now := Now }}
      {{ $root := . }}
      containerStatuses:
      {{ range $index, $item := .spec.containers }}
      {{ $origin := index $root.status.containerStatuses $index }}
      - image: {{ $item.image | Quote }}
        name: {{ $item.name | Quote }}
        ready: false
        restartCount: 0
        started: false
        state:
          terminated:
            exitCode: 0
            finishedAt: {{ $now | Quote }}
            reason: Completed
            startedAt: {{ $now | Quote }}
      {{ end }}
      phase: Succeeded`}}
  resourceRef:
    apiGroup: v1
    kind: Pod
  selector:
    matchExpressions:
    - key: .metadata.deletionTimestamp
      operator: DoesNotExist
    - key: .status.phase
      operator: In
      values:
      - Running
7 changes: 7 additions & 0 deletions docs/deployment.md
@@ -40,11 +40,16 @@ KWOK_REPO=kubernetes-sigs/kwok
KWOK_LATEST_RELEASE="v0.5.2"

kubectl apply -f "https://github.com/${KWOK_REPO}/releases/download/${KWOK_LATEST_RELEASE}/kwok.yaml"
```

Next, deploy and adjust the stages.
```bash
kubectl apply -f "https://github.com/${KWOK_REPO}/releases/download/${KWOK_LATEST_RELEASE}/stage-fast.yaml"

kubectl apply -f https://github.com/${KWOK_REPO}/raw/main/kustomize/stage/pod/chaos/pod-init-container-running-failed.yaml
kubectl apply -f https://github.com/${KWOK_REPO}/raw/main/kustomize/stage/pod/chaos/pod-container-running-failed.yaml

kubectl apply -f charts/overrides/kwok/pod-complete.yml
```
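
The `pod-complete` stage applied above selects pods that are in the `Running` phase and are not being deleted, then moves them to `Succeeded`. The following Python sketch models only that selector logic on plain dictionaries. It is a simplified illustration, not KWOK's implementation: the helper names `lookup` and `matches` and the sample pods are hypothetical, and KWOK itself evaluates the `key` fields with its own expression engine rather than the dotted-path lookup assumed here.

```python
from typing import Any

# Selector copied from charts/overrides/kwok/pod-complete.yml.
POD_COMPLETE_SELECTOR = [
    {"key": ".metadata.deletionTimestamp", "operator": "DoesNotExist"},
    {"key": ".status.phase", "operator": "In", "values": ["Running"]},
]


def lookup(obj: dict, path: str) -> Any:
    """Follow a dotted path such as '.status.phase'; return None if a key is missing."""
    node: Any = obj
    for key in path.strip(".").split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def matches(pod: dict, expressions: list) -> bool:
    """Return True only if the pod satisfies every expression (logical AND)."""
    for expr in expressions:
        value = lookup(pod, expr["key"])
        operator = expr["operator"]
        if operator == "DoesNotExist":
            if value is not None:
                return False
        elif operator == "In":
            if value not in expr["values"]:
                return False
        else:
            # This sketch only models the two operators used by pod-complete.
            raise ValueError(f"unsupported operator: {operator}")
    return True


# Hypothetical sample pods, reduced to the fields the selector inspects.
running_pod = {"metadata": {"name": "job0-0"}, "status": {"phase": "Running"}}
deleting_pod = {
    "metadata": {"name": "job0-1", "deletionTimestamp": "2024-05-10T00:00:00Z"},
    "status": {"phase": "Running"},
}
pending_pod = {"metadata": {"name": "job0-2"}, "status": {"phase": "Pending"}}

for pod in (running_pod, deleting_pod, pending_pod):
    print(pod["metadata"]["name"], matches(pod, POD_COMPLETE_SELECTOR))
```

Under these assumptions, only the running pod that has no `deletionTimestamp` is selected for completion.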

For configuring virtual nodes, you need to provide the `values.yaml` file to define the type and quantity of nodes you wish to create. You also have the option to enhance node configurations by adding annotations, labels, and conditions. For guidance, refer to the [values-example.yaml](../charts/virtual-nodes/values-example.yaml) file.
@@ -62,6 +67,8 @@ To deploy the nodes in `values-example.yaml`, use the Helm command:
helm install virtual-nodes charts/virtual-nodes -f charts/virtual-nodes/values-example.yaml
```

> :warning: **Warning:** Ensure you deploy virtual nodes as the final step before launching `knavigator`. If you deploy any components after virtual nodes are created, the pods for these components might be assigned to virtual nodes, which could disrupt their functionality.
## Running Knavigator

Knavigator can be deployed inside a Kubernetes cluster or used externally from outside the cluster.
39 changes: 39 additions & 0 deletions docs/examples/kueue/kueue.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
# Example of running `kueue` with `knavigator`

## Preparatory step

To ensure proper installation of `kueue`, verify that your cluster does not contain any virtual nodes. If the `kueue` controller pod is scheduled onto a virtual node, `kueue` will not function properly. If virtual nodes are already deployed, remove them first:

```bash
helm delete virtual-nodes
```

## Install kueue

Install kueue by following these [instructions](https://kueue.sigs.k8s.io/docs/installation/):

```bash
KUEUE_VERSION=v0.6.2
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/$KUEUE_VERSION/manifests.yaml
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/$KUEUE_VERSION/prometheus.yaml
```

## Deploy cluster and local queues

```bash
kubectl apply -f docs/examples/kueue/queues.yml
```

## Deploy virtual nodes

In this example we deploy 4 GPU nodes. Refer to [values.yaml](values.yaml) for more details.

```bash
helm install virtual-nodes charts/virtual-nodes -f docs/examples/kueue/values.yaml
```

## Run kueue job

```bash
./bin/knavigator -tasks resources/tests/kueue/test-job.yml
```
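
As a rough sanity check, the test job's total request can be compared against the `nominalQuota` values in `queues.yml`. The Python sketch below does that arithmetic using the numbers from `test-job.yml` and `queues.yml`. It assumes the job's demand is the per-pod request multiplied by `parallelism` and that nothing else is consuming the queue; the helper names `parse_quantity` and `fits` are hypothetical, and the parser handles only the quantity suffixes listed in the code.

```python
from decimal import Decimal

# Multipliers for the Kubernetes quantity suffixes this sketch understands.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "k": 10**3, "M": 10**6, "G": 10**9,
    "m": Decimal("0.001"),
}


def parse_quantity(quantity) -> Decimal:
    """Convert a quantity such as '100m', '512M' or '36Gi' to a plain number."""
    text = str(quantity)
    # Try two-character suffixes first so that 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def fits(per_pod: dict, parallelism: int, quota: dict) -> dict:
    """For each resource, report whether parallelism * request stays within quota."""
    return {
        name: parse_quantity(request) * parallelism <= parse_quantity(quota[name])
        for name, request in per_pod.items()
    }


# nominalQuota values from docs/examples/kueue/queues.yml.
quota = {"cpu": "4", "memory": "36Gi", "nvidia.com/gpu": "4"}
# Per-pod overrides and parallelism from resources/tests/kueue/test-job.yml.
per_pod = {"cpu": "100m", "memory": "512M", "nvidia.com/gpu": "1"}
parallelism = 3

result = fits(per_pod, parallelism, quota)
print(result)
```

With these values the job asks for 0.3 CPU, 1536M of memory and 3 GPUs, all within the quota of 4 CPU, 36Gi and 4 GPUs, so under the stated assumptions the workload should be admissible.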
30 changes: 30 additions & 0 deletions docs/examples/kueue/queues.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# cluster-queue.yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 4
      - name: "memory"
        nominalQuota: 36Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
spec:
  clusterQueue: cluster-queue
30 changes: 30 additions & 0 deletions docs/examples/kueue/values.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

nodes:
- type: dgxa100.80g
  count: 4
  annotations: {}
  labels:
    nvidia.com/gpu.count: "8"
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
  conditions:
  - message: Filesystem is not read-only
    reason: FilesystemIsNotReadOnly
    status: "False"
    type: ReadonlyFilesystem
  - message: kernel has no deadlock
    reason: KernelHasNoDeadlock
    status: "False"
    type: KernelDeadlock
4 changes: 4 additions & 0 deletions docs/getting_started.md
@@ -58,3 +58,7 @@ Run a test jobset with a driver and workers:
```shell
./bin/knavigator -tasks ./resources/tests/k8s/test-jobset-with-driver.yml
```

### Kueue

Refer to [this document](./examples/kueue/kueue.md) for detailed instructions on running `kueue` with `knavigator`.
27 changes: 27 additions & 0 deletions resources/templates/kueue/job.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: "{{._NAME_}}"
  namespace: {{.namespace}}
  labels:
    kueue.x-k8s.io/queue-name: {{.queueName}}
spec:
  completions: {{.completions}}
  parallelism: {{.parallelism}}
  completionMode: {{.completionMode}}
  template:
    spec:
      containers:
      - name: test
        image: {{.image}}
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            cpu: "{{.cpu}}"
            memory: {{.memory}}
            nvidia.com/gpu: "{{.gpu}}"
          requests:
            cpu: "{{.cpu}}"
            memory: {{.memory}}
            nvidia.com/gpu: "{{.gpu}}"
      restartPolicy: Never
23 changes: 23 additions & 0 deletions resources/tests/kueue/test-job.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
name: test-kueue-job
description: submit and validate a kueue job
tasks:
- id: job
  type: SubmitObj
  params:
    count: 1
    grv:
      group: batch
      version: v1
      resource: jobs
    template: "resources/templates/kueue/job.yml"
    nameformat: "job{{._ENUM_}}"
    overrides:
      queueName: team-a-queue
      namespace: default
      parallelism: 3
      completions: 3
      completionMode: Indexed
      image: ubuntu
      cpu: 100m
      memory: 512M
      gpu: 1
2 changes: 1 addition & 1 deletion resources/tests/volcano/test-job.yml
@@ -27,7 +27,7 @@ tasks:
   type: CheckPod
   params:
     refTaskId: job
-    status: Completed
+    status: Running
     nodeLabels:
       nodeType: gpu
     timeout: 5s
