update benchmark configs (#100)
Signed-off-by: Dmitry Shmulevich <[email protected]>
dmitsh authored Aug 17, 2024
1 parent 8ed1cdd commit 320d1ec
Showing 31 changed files with 157 additions and 107 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -53,3 +53,4 @@ Here's a demo showing how to install and configure `Knavigator`, and run an exam
- [Getting started](docs/getting_started.md)
- [Task management](docs/task_management.md)
- [Metrics and Dashboards](docs/metrics.md)
- [Benchmarking](resources/benchmarks/README.md)
12 changes: 6 additions & 6 deletions resources/benchmarks/README.md
@@ -18,36 +18,36 @@ Run:ai requires additional customization and thus has a separate workflow

## Gang Scheduling Benchmark Test

The gang-scheduling benchmark workflow operates on 32 virtual GPU nodes, submitting a burst of 53 jobs with replica numbers ranging from 1 to 32 in a [predetermined order](gang-scheduling/workflows/run-test-common.yml).
The gang-scheduling benchmark workflow operates on 32 virtual GPU nodes, submitting a burst of 53 jobs with replica numbers ranging from 1 to 32 in a [predetermined order](gang-scheduling/workflows/run-test.yaml).

#### Example

To run the benchmark test for Kueue:

```bash
./bin/knavigator -workflow 'resources/benchmarks/gang-scheduling/workflows/{config-kueue.yml,run-test-common.yml}'
./bin/knavigator -workflow 'resources/benchmarks/gang-scheduling/workflows/{config-kueue.yaml,run-test.yaml}'
```

#### Run:ai

```bash
./bin/knavigator -workflow resources/benchmarks/gang-scheduling/workflows/run-test-runai.yml
./bin/knavigator -workflow resources/benchmarks/gang-scheduling/workflows/runai-test.yaml
```

## Scaling Benchmark Test

The scaling benchmark workflow operates on 500 virtual GPU nodes, submitting [two workloads](workflows/run-test-common.yml) one after another. The first workload is a job with 500 replicas, the second workload is 500 single node jobs started simultaneously.
The scaling benchmark operates on 500 virtual GPU nodes and consists of two workflows. The first [workflow](scaling/workflows/run-test-multi.yaml) submits a job with 500 replicas; the second [workflow](scaling/workflows/run-test-single.yaml) submits a batch of 500 single-node jobs.

### Example

To run the benchmark test for Volcano:

```bash
./bin/knavigator -workflow 'resources/benchmarks/scaling/workflows/{config-volcano.yml,run-test-common.yml}'
./bin/knavigator -workflow 'resources/benchmarks/scaling/workflows/{config-nodes.yaml,config-volcano.yaml,run-test-multi.yaml}'
```
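The example above runs only the multi-node workflow. A minimal sketch of the single-node variant follows, assuming `run-test-single.yaml` can be substituted for `run-test-multi.yaml` in the same pattern; that combination is not shown in this README, so treat it as an assumption. The snippet only builds and prints the command.

```bash
# Assumption: run-test-single.yaml composes with the same config files
# as run-test-multi.yaml.
dir='resources/benchmarks/scaling/workflows'
pattern="$dir/{config-nodes.yaml,config-volcano.yaml,run-test-single.yaml}"

# Print the command rather than run it. The pattern stays quoted so the
# shell passes the braces to knavigator without expanding them itself.
printf "./bin/knavigator -workflow '%s'\n" "$pattern"
```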

### Run:ai

```bash
./bin/knavigator -workflow resources/benchmarks/scaling/workflows/run-test-runai.yml
./bin/knavigator -workflow 'resources/benchmarks/scaling/workflows/{config-nodes.yaml,config-runai.yaml,runai-test-single.yaml}'
```
@@ -1,21 +1,22 @@
name: config-kueue
description: register, deploy and configure kueue custom resources
tasks:
- id: register-cluster-queue
type: RegisterObj
params:
template: "resources/templates/kueue/cluster-queue.yml"
template: "resources/templates/kueue/cluster-queue.yaml"
- id: register-local-queue
type: RegisterObj
params:
template: "resources/templates/kueue/local-queue.yml"
template: "resources/templates/kueue/local-queue.yaml"
- id: register-resource-flavor
type: RegisterObj
params:
template: "resources/templates/kueue/resource-flavor.yml"
template: "resources/templates/kueue/resource-flavor.yaml"
- id: register
type: RegisterObj
params:
template: "resources/benchmarks/templates/kueue/job.yml"
template: "resources/benchmarks/templates/kueue/job.yaml"
nameFormat: "job{{._ENUM_}}"
podNameFormat: "{{._NAME_}}-[0-9]-.*"
podCount: "{{.replicas}}"
@@ -1,9 +1,10 @@
name: config-volcano
description: register, deploy and configure volcano custom resources
tasks:
- id: register
type: RegisterObj
params:
template: "resources/benchmarks/templates/volcano/job.yml"
template: "resources/benchmarks/templates/volcano/job.yaml"
nameFormat: "j{{._ENUM_}}"
podNameFormat: "{{._NAME_}}-test-[0-9]+"
podCount: "{{.replicas}}"
@@ -1,9 +1,10 @@
name: config-yunikorn
description: register, deploy and configure yunikorn custom resources
tasks:
- id: register
type: RegisterObj
params:
template: "resources/benchmarks/templates/yunikorn/job.yml"
template: "resources/benchmarks/templates/yunikorn/job.yaml"
nameFormat: "job{{._ENUM_}}"
podNameFormat: "{{._NAME_}}-.*"
podCount: "{{.replicas}}"
@@ -12,14 +12,14 @@ tasks:
- id: register-trainingworkload
type: RegisterObj
params:
template: "resources/benchmarks/templates/runai/trainingworkload.yml"
template: "resources/benchmarks/templates/runai/trainingworkload.yaml"
nameFormat: "twl{{._ENUM_}}"
podNameFormat: "{{._NAME_}}-0-0"
podCount: 1
- id: register-distributedworkload
type: RegisterObj
params:
template: "resources/benchmarks/templates/runai/distributedworkload.yml"
template: "resources/benchmarks/templates/runai/distributedworkload.yaml"
nameFormat: "dwl{{._ENUM_}}"
podNameFormat: "{{._NAME_}}-(launcher-[a-z0-9]+|worker-[0-9]+)"
podCount: "{{.workers}} + 1"
@@ -1,21 +1,22 @@
name: config-kueue
description: register, deploy and configure kueue custom resources
tasks:
- id: register-cluster-queue
type: RegisterObj
params:
template: "resources/templates/kueue/cluster-queue.yml"
template: "resources/templates/kueue/cluster-queue.yaml"
- id: register-local-queue
type: RegisterObj
params:
template: "resources/templates/kueue/local-queue.yml"
template: "resources/templates/kueue/local-queue.yaml"
- id: register-resource-flavor
type: RegisterObj
params:
template: "resources/templates/kueue/resource-flavor.yml"
template: "resources/templates/kueue/resource-flavor.yaml"
- id: register
type: RegisterObj
params:
template: "resources/benchmarks/templates/kueue/job.yml"
template: "resources/benchmarks/templates/kueue/job.yaml"
nameFormat: "job{{._ENUM_}}"
podNameFormat: "{{._NAME_}}-[0-9]-.*"
podCount: "{{.replicas}}"
12 changes: 12 additions & 0 deletions resources/benchmarks/scaling/workflows/config-nodes.yaml
@@ -0,0 +1,12 @@
name: config-nodes
description: create 500 virtual GPU nodes
tasks:
- id: configure
  type: Configure
  params:
    nodes:
    - type: dgxa100.80g
      count: 500
      labels:
        nvidia.com/gpu.count: "8"
    timeout: 5m
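
As a rough sanity check on the cluster size this workflow creates, the arithmetic can be sketched as follows. It assumes the `nvidia.com/gpu.count` label reflects the number of GPUs each virtual node advertises; the variable names are illustrative only.

```bash
nodes=500          # "count" in config-nodes.yaml
gpus_per_node=8    # "nvidia.com/gpu.count" label (assumed to be GPUs per node)

total_gpus=$((nodes * gpus_per_node))
echo "total virtual GPUs: $total_gpus"
```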
17 changes: 17 additions & 0 deletions resources/benchmarks/scaling/workflows/config-runai.yaml
@@ -0,0 +1,17 @@
name: config-runai
description: register, deploy and configure run:ai custom resources
tasks:
- id: register-trainingworkload
  type: RegisterObj
  params:
    template: "resources/benchmarks/templates/runai/trainingworkload.yaml"
    nameFormat: "twl{{._ENUM_}}"
    podNameFormat: "{{._NAME_}}-0-0"
    podCount: 1
- id: register-mpi
  type: RegisterObj
  params:
    template: "resources/benchmarks/templates/runai/mpijob.yaml"
    nameFormat: "mpijob{{._ENUM_}}"
    podNameFormat: "{{._NAME_}}-(launcher-[a-z0-9]+|worker-[0-9]+)"
    podCount: "{{.workers}} + 1"
@@ -1,9 +1,10 @@
name: config-volcano
description: register, deploy and configure volcano custom resources
tasks:
- id: register
type: RegisterObj
params:
template: "resources/benchmarks/templates/volcano/job.yml"
template: "resources/benchmarks/templates/volcano/job.yaml"
nameFormat: "j{{._ENUM_}}"
podNameFormat: "{{._NAME_}}-test-[0-9]+"
podCount: "{{.replicas}}"
@@ -1,9 +1,10 @@
name: config-yunikorn
description: register, deploy and configure yunikorn custom resources
tasks:
- id: register
type: RegisterObj
params:
template: "resources/benchmarks/templates/yunikorn/job.yml"
template: "resources/benchmarks/templates/yunikorn/job.yaml"
nameFormat: "job{{._ENUM_}}"
podNameFormat: "{{._NAME_}}-.*"
podCount: "{{.replicas}}"
31 changes: 0 additions & 31 deletions resources/benchmarks/scaling/workflows/run-test-common.yml

This file was deleted.

11 changes: 11 additions & 0 deletions resources/benchmarks/scaling/workflows/run-test-multi.yaml
@@ -0,0 +1,11 @@
name: test-scaling-multi-node-job
description: deploy a 500-replicas job
tasks:
- id: job
  type: SubmitObj
  params:
    refTaskId: register
    count: 1
    params:
      replicas: 500
      ttl: 2m
47 changes: 0 additions & 47 deletions resources/benchmarks/scaling/workflows/run-test-runai.yml

This file was deleted.

11 changes: 11 additions & 0 deletions resources/benchmarks/scaling/workflows/run-test-single.yaml
@@ -0,0 +1,11 @@
name: test-scaling-single-node-jobs
description: deploy 500 single-replica jobs
tasks:
- id: job
  type: SubmitObj
  params:
    refTaskId: register
    count: 500
    params:
      replicas: 1
      ttl: 2m
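
The two scaling workflows (`run-test-multi.yaml` and `run-test-single.yaml`) appear designed to produce the same number of pods by different routes. A small sketch of that arithmetic, assuming one pod per replica (as the `podCount: "{{.replicas}}"` setting in the scheduler config files suggests):

```bash
# count * replicas for each workflow; one pod per replica is an assumption.
multi_pods=$((1 * 500))     # run-test-multi.yaml: 1 job, 500 replicas
single_pods=$((500 * 1))    # run-test-single.yaml: 500 jobs, 1 replica each

echo "multi-node workflow pods:  $multi_pods"
echo "single-node workflow pods: $single_pods"
```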
11 changes: 11 additions & 0 deletions resources/benchmarks/scaling/workflows/runai-test-multi.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
name: test-scaling
description: deploy a 500-replicas job
tasks:
- id: job
  type: SubmitObj
  params:
    refTaskId: register-mpi
    count: 1
    params:
      workers: 499
      ttl: 2m
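
The `workers: 499` value looks odd next to the "500-replicas" description until the launcher pod is counted. A minimal sketch of the arithmetic, based on the `podCount: "{{.workers}} + 1"` expression in `config-runai.yaml` and the `nvidia.com/gpu: 8` limits in `mpijob.yaml`; the comparison with cluster capacity assumes 500 nodes with 8 GPUs each, as configured in `config-nodes.yaml`.

```bash
workers=499
launchers=1
pods=$((workers + launchers))          # mirrors podCount: "{{.workers}} + 1"

gpus_per_pod=8                         # nvidia.com/gpu limit for launcher and workers
gpus_requested=$((pods * gpus_per_pod))
cluster_gpus=$((500 * 8))              # assumed capacity from config-nodes.yaml

echo "pods: $pods, GPUs requested: $gpus_requested of $cluster_gpus"
```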
10 changes: 10 additions & 0 deletions resources/benchmarks/scaling/workflows/runai-test-single.yaml
@@ -0,0 +1,10 @@
name: test-scaling
description: deploy 500 single-replica jobs
tasks:
- id: job
  type: SubmitObj
  params:
    refTaskId: register-trainingworkload
    count: 500
    params:
      ttl: 2m
49 changes: 49 additions & 0 deletions resources/benchmarks/templates/runai/mpijob.yaml
@@ -0,0 +1,49 @@
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: "{{._NAME_}}"
  namespace: runai-<RUNAI_PROJECT>
  labels:
    project: <RUNAI_PROJECT>
    runai/queue: <RUNAI_PROJECT>
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          annotations:
            pod-complete.stage.kwok.x-k8s.io/delay: {{.ttl}}
            pod-complete.stage.kwok.x-k8s.io/jitter-delay: {{.ttl}}
        spec:
          schedulerName: runai-scheduler
          containers:
          - image: runai/mpi-launcher:latest
            name: mpi-launcher
            resources:
              limits:
                cpu: 100m
                memory: 250M
                nvidia.com/gpu: 8
    Worker:
      replicas: {{.workers}}
      template:
        metadata:
          annotations:
            pod-complete.stage.kwok.x-k8s.io/delay: {{.ttl}}
            pod-complete.stage.kwok.x-k8s.io/jitter-delay: {{.ttl}}
          labels:
            app: {{._NAME_}}
        spec:
          schedulerName: runai-scheduler
          containers:
          - image: runai/mpi-worker:latest
            name: mpi-worker
            resources:
              limits:
                cpu: 100m
                memory: 250M
                nvidia.com/gpu: 8
File renamed without changes.
File renamed without changes.
File renamed without changes.
8 changes: 4 additions & 4 deletions resources/workflows/kueue/test-job.yml
@@ -18,19 +18,19 @@ tasks:
- id: register-cluster-queue
type: RegisterObj
params:
template: "resources/templates/kueue/cluster-queue.yml"
template: "resources/templates/kueue/cluster-queue.yaml"
- id: register-local-queue
type: RegisterObj
params:
template: "resources/templates/kueue/local-queue.yml"
template: "resources/templates/kueue/local-queue.yaml"
- id: register-resource-flavor
type: RegisterObj
params:
template: "resources/templates/kueue/resource-flavor.yml"
template: "resources/templates/kueue/resource-flavor.yaml"
- id: register-job
type: RegisterObj
params:
template: "resources/templates/kueue/job.yml"
template: "resources/templates/kueue/job.yaml"
nameFormat: "job{{._ENUM_}}"
podNameFormat: "{{._NAME_}}-[0-9]-.*"
podCount: "{{.parallelism}}"