update K8s helm chart and docs (#59)
Signed-off-by: Dmitry Shmulevich <[email protected]>
dmitsh authored Jan 27, 2025
1 parent 9166017 commit 879fb8c
Showing 5 changed files with 26 additions and 19 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/docker.yml
@@ -44,7 +44,7 @@ jobs:
# Login against a Docker registry except on PR
# https://github.com/docker/login-action
- name: Log into registry ${{ env.REGISTRY }}
- uses: docker/login-action@343f7c4344506bcbf9b4de18042ae17996df046d # v3.0.0
+ uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
@@ -54,21 +54,21 @@ jobs:
# https://github.com/docker/metadata-action
- name: Extract Docker metadata
id: meta
- uses: docker/metadata-action@96383f45573cb7f253c731d3b3ab81c87ef81934 # v5.0.0
+ uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=semver,pattern={{version}}
type=semver,pattern={{major}}.{{minor}}
type=semver,pattern={{major}}
type=ref,event=branch
type=sha,priority=100,prefix=,suffix=,format=short
# Build and push Docker image with Buildx (don't push on PR)
# https://github.com/docker/build-push-action
- name: Build and push Docker image
id: build-and-push
- uses: docker/build-push-action@0565240e2d4ab88bba5387d719585280857ece09 # v5.0.0
+ uses: docker/build-push-action@v5
with:
context: .
push: ${{ github.event_name != 'pull_request' }}
2 changes: 1 addition & 1 deletion charts/node-observer/values.yaml
@@ -8,7 +8,7 @@ image:
repository: ghcr.io/nvidia/topograph
pullPolicy: IfNotPresent
# Overrides the image tag whose default is the chart appVersion.
- tag: "main"
+ tag: main

imagePullSecrets: []
nameOverride: ""
2 changes: 1 addition & 1 deletion charts/topograph/values.yaml
@@ -8,7 +8,7 @@ image:
repository: ghcr.io/nvidia/topograph
pullPolicy: IfNotPresent
# Overrides the image tag whose default is the chart appVersion.
- tag: "main"
+ tag: main

imagePullSecrets: []
nameOverride: ""
6 changes: 3 additions & 3 deletions cmd/topograph/main.go
@@ -33,9 +33,9 @@ import (
var GitTag string

func main() {
- var c string
+ var cfg string
var version bool
- flag.StringVar(&c, "c", "/etc/topograph/topograph-config.yaml", "config file")
+ flag.StringVar(&cfg, "c", "/etc/topograph/topograph-config.yaml", "config file")
flag.BoolVar(&version, "version", false, "show the version")

klog.InitFlags(nil)
@@ -47,7 +47,7 @@ func main() {
os.Exit(0)
}

- if err := mainInternal(c); err != nil {
+ if err := mainInternal(cfg); err != nil {
klog.Error(err.Error())
os.Exit(1)
}
27 changes: 17 additions & 10 deletions docs/k8s.md
@@ -4,7 +4,7 @@ Topograph is a tool designed to enhance scheduling decisions in Kubernetes clust

### Overview

- Topograph's primary objective is to assist the Kubernetes scheduler in making intelligent pod placement decisions based on the cluster's network topology. It achieves this by:
+ Topograph's primary objective is to assist the Kubernetes scheduler in making intelligent pod placement decisions based on the cluster network topology. It achieves this by:

1. Interacting with Cloud Service Providers (CSPs)
2. Extracting cluster topology information
@@ -16,12 +16,19 @@ Topograph performs the following key actions:

1. **ConfigMap Creation**: Generates a ConfigMap containing topology information. This ConfigMap is not currently utilized but serves as an example for potential future integration with the scheduler or other systems.

- 2. **Node Labeling**: Applies labels to nodes that define their position within the cloud topology. For example, if a node connects to switch S1, which connects to switch S2, and then to switch S3, Topograph will apply the following labels to the node:
+ 2. **Node Labeling**: Applies labels to nodes that define their position within the cloud network topology:
+ - `accelerator`: Network interconnect for direct accelerator communication (e.g., Multi-node NVLink interconnect between NVIDIA GPUs)
+ - `block`: Rack-level switches connecting hosts in one or more racks as a block.
+ - `spine`: Spine-level switches connecting multiple blocks inside a datacenter.
+ - `datacenter`: Zonal switches connecting multiple datacenters inside an availability zone.
+
+ For example, if a node belongs to NVLink domain `nvl1` and connects to switch `s1`, which connects to switch `s2`, and then to switch `s3`, Topograph will apply the following labels to the node:

```
- topology.kubernetes.io/network-level-1: S1
- topology.kubernetes.io/network-level-2: S2
- topology.kubernetes.io/network-level-3: S3
+ network.topology.kubernetes.io/accelerator: nvl1
+ network.topology.kubernetes.io/block: s1
+ network.topology.kubernetes.io/spine: s2
+ network.topology.kubernetes.io/datacenter: s3
```
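
To illustrate the new labeling scheme, here is a minimal Go sketch that maps an NVLink domain and an ordered switch path onto the four label keys introduced in this commit. The helper `topologyLabels`, its parameters, and the constant names are hypothetical and are not taken from the Topograph source; only the label keys and the example values come from the doc change above.

```go
package main

import "fmt"

// Label keys as listed in the updated docs/k8s.md.
const (
	labelAccelerator = "network.topology.kubernetes.io/accelerator"
	labelBlock       = "network.topology.kubernetes.io/block"
	labelSpine       = "network.topology.kubernetes.io/spine"
	labelDatacenter  = "network.topology.kubernetes.io/datacenter"
)

// topologyLabels is a hypothetical helper (not part of Topograph).
// switches is assumed to be ordered from the switch closest to the node
// (block) up to the farthest (datacenter). Extra entries are ignored.
func topologyLabels(accelerator string, switches []string) map[string]string {
	labels := make(map[string]string)
	if accelerator != "" {
		labels[labelAccelerator] = accelerator
	}
	tiers := []string{labelBlock, labelSpine, labelDatacenter}
	for i, sw := range switches {
		if i >= len(tiers) {
			break
		}
		labels[tiers[i]] = sw
	}
	return labels
}

func main() {
	labels := topologyLabels("nvl1", []string{"s1", "s2", "s3"})
	// Print in a fixed order, since Go map iteration order is random.
	for _, key := range []string{labelAccelerator, labelBlock, labelSpine, labelDatacenter} {
		fmt.Printf("%s: %s\n", key, labels[key])
	}
}
```

With the example inputs from the doc (`nvl1`, `s1`, `s2`, `s3`), the sketch produces the same four key/value pairs shown in the label block above.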

### Use of Topograph
@@ -46,7 +53,7 @@ closer network proximity.
operator: In
values:
- myapp
- topologyKey: topology.kubernetes.io/network-level-2
+ topologyKey: network.topology.kubernetes.io/spine
- weight: 90
podAffinityTerm:
labelSelector:
@@ -55,15 +62,15 @@ closer network proximity.
operator: In
values:
- myapp
- topologyKey: topology.kubernetes.io/network-level-1
+ topologyKey: network.topology.kubernetes.io/block
```
- Pods are prioritized to be placed on nodes sharing the label `topology.kubernetes.io/network-level-1`.
+ Pods are prioritized to be placed on nodes sharing the label `network.topology.kubernetes.io/block`.
These nodes are connected to the same network switch, ensuring the lowest latency for communication.

- Nodes with the label `topology.kubernetes.io/network-level-2` are next in priority.
+ Nodes with the label `network.topology.kubernetes.io/spine` are next in priority.
Pods on these nodes will still be relatively close, but with slightly higher latency.

- In the three-tier network, all nodes will share the same `topology.kubernetes.io/network-level-3` label,
+ In the three-tier network, all nodes will share the same `network.topology.kubernetes.io/datacenter` label,
so it doesn’t need to be included in pod affinity settings.
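
The priority order described above (block first, then spine, with datacenter shared by all nodes) can be sketched as a lookup of the narrowest tier at which two nodes carry the same label value. This is only an illustration of the proximity idea, not how the Kubernetes scheduler or Topograph evaluates affinity; the function `sharedTier` and the sample node label maps are hypothetical.

```go
package main

import "fmt"

// Common prefix of the label keys from the updated docs/k8s.md.
const labelPrefix = "network.topology.kubernetes.io/"

// sharedTier is a hypothetical helper: it returns the narrowest tier
// ("block", "spine", or "datacenter") whose label value is identical on
// both nodes, or "" if the nodes share no tier.
func sharedTier(a, b map[string]string) string {
	// Ordered from lowest latency (closest) to highest (farthest).
	for _, tier := range []string{"block", "spine", "datacenter"} {
		key := labelPrefix + tier
		va, ok := a[key]
		if ok && va != "" && va == b[key] {
			return tier
		}
	}
	return ""
}

func main() {
	// Hypothetical nodes: different block switches, same spine switch.
	nodeA := map[string]string{
		labelPrefix + "block":      "s1",
		labelPrefix + "spine":      "s2",
		labelPrefix + "datacenter": "s3",
	}
	nodeB := map[string]string{
		labelPrefix + "block":      "s4",
		labelPrefix + "spine":      "s2",
		labelPrefix + "datacenter": "s3",
	}
	fmt.Println(sharedTier(nodeA, nodeB)) // prints "spine"
}
```

In the affinity example, the block term carries the higher weight (90), which corresponds to `sharedTier` checking `block` before `spine`.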

Since the default Kubernetes scheduler places one pod at a time, the placement may vary depending on where