KEP-2170: Add PyTorch DDP Fashion MNIST training example
Signed-off-by: Antonin Stefanutti <[email protected]>
astefanutti committed Jan 22, 2025
1 parent 1dfa40c commit 5aa5902
Showing 3 changed files with 597 additions and 0 deletions.
112 changes: 112 additions & 0 deletions examples/pytorch/mnist-ddp/README.md
@@ -0,0 +1,112 @@
# PyTorch DDP Fashion MNIST Training Example

This example demonstrates how to train a neural network to classify images
using the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset
and [PyTorch DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).

You can run this example either with the provided Jupyter notebook
or by running the Python script directly.

In either case, you need to install the Kubeflow Training v2 control plane
on your Kubernetes cluster if it is not already deployed:

```console
kubectl apply --server-side -k "https://github.com/kubeflow/training-operator.git/manifests/v2/overlays/standalone?ref=master"
```

## Jupyter Notebook

You can set up your environment by running the following commands:

```console
python -m venv .venv
source .venv/bin/activate
pip install jupyter
```

And start the notebook by running:

```console
jupyter notebook examples/pytorch/mnist-ddp/mnist.ipynb
```

You can then open the notebook in your web browser and follow the instructions.

## Python Script

### Setup

You need to set up the Python environment on your local machine or client:

```console
python -m venv .venv
source .venv/bin/activate
pip install git+https://github.com/kubeflow/training-operator.git@master#subdirectory=sdk_v2
```

You can refer to the [training operator documentation](https://www.kubeflow.org/docs/components/training/installation/)
for more information.

### Usage

```console
python mnist.py --help
usage: mnist.py [-h] [--batch-size N] [--test-batch-size N] [--epochs N] [--lr LR] [--lr-gamma G] [--lr-period P] [--seed S] [--log-interval N] [--save-model]
                [--backend {gloo,nccl}] [--num-workers N] [--worker-resources RESOURCE QUANTITY] [--runtime NAME]

PyTorch DDP Fashion MNIST Training Example

options:
  -h, --help            show this help message and exit
  --batch-size N        input batch size for training [100]
  --test-batch-size N   input batch size for testing [100]
  --epochs N            number of epochs to train [10]
  --lr LR               learning rate [1e-1]
  --lr-gamma G          learning rate decay factor [0.5]
  --lr-period P         learning rate decay period in step size [20]
  --seed S              random seed [0]
  --log-interval N      how many batches to wait before logging training metrics [10]
  --save-model          saving the trained model [False]
  --backend {gloo,nccl}
                        Distributed backend [nccl]
  --num-workers N       Number of workers [1]
  --worker-resources RESOURCE QUANTITY
                        Resources per worker [cpu: 1, memory: 2Gi, nvidia.com/gpu: 1]
  --runtime NAME        the training runtime [torch-distributed]
```
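
The `--worker-resources RESOURCE QUANTITY` entry in the usage text suggests a flag that takes two values and can be repeated. The following is a minimal sketch of how such a flag could be declared with `argparse`. It is not the actual code of `mnist.py`: the function names `build_parser` and `resources_per_worker` are hypothetical, and falling back to the documented defaults only when the flag is absent is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only --num-workers and --worker-resources."""
    parser = argparse.ArgumentParser(description="worker resources sketch")
    parser.add_argument("--num-workers", type=int, default=1, metavar="N")
    # nargs=2 consumes a RESOURCE and a QUANTITY; action="append" lets the flag
    # be repeated, collecting one [resource, quantity] pair per occurrence.
    # default=None avoids argparse appending user values to a default list.
    parser.add_argument(
        "--worker-resources",
        nargs=2,
        action="append",
        metavar=("RESOURCE", "QUANTITY"),
        default=None,
    )
    return parser


def resources_per_worker(pairs):
    """Turn the collected pairs into a dict (assumed behavior, see lead-in)."""
    if pairs is None:
        # Defaults as listed in the help text above.
        return {"cpu": "1", "memory": "2Gi", "nvidia.com/gpu": "1"}
    # Quantities stay strings because values such as "16Gi" are not numbers.
    return {resource: quantity for resource, quantity in pairs}


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--num-workers", "4", "--worker-resources", "cpu", "4"]
    )
    print(args.num_workers, resources_per_worker(args.worker_resources))
```

Keeping quantities as strings leaves their interpretation to Kubernetes, which already understands suffixes such as `Gi`.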

### Example

Train the model on 4 worker nodes using 1 NVIDIA GPU each:

```console
python mnist.py \
--num-workers 4 \
--worker-resources "nvidia.com/gpu" 1 \
--worker-resources cpu 4 \
--worker-resources memory 16Gi \
--epochs 100 \
--batch-size 100 \
--lr 1e-1 \
--lr-period 25 \
--lr-gamma 0.7
```
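
To see what `--lr 1e-1 --lr-period 25 --lr-gamma 0.7` means over 100 epochs, here is a small sketch of a step decay schedule. It assumes the script multiplies the learning rate by `--lr-gamma` every `--lr-period` epochs, as PyTorch's `StepLR` scheduler does. The help text suggests this, but the README does not confirm it. The function name `step_lr` is hypothetical.

```python
def step_lr(base_lr: float, gamma: float, period: int, epoch: int) -> float:
    """Learning rate at a 0-based epoch under step decay (assumed schedule)."""
    # The rate is multiplied by gamma once for every completed period.
    return base_lr * gamma ** (epoch // period)


# Values taken from the example command above.
for epoch in (0, 25, 50, 75, 99):
    print(f"epoch {epoch:3d}: lr = {step_lr(0.1, 0.7, 25, epoch):.4f}")
```

Under this assumption the rate starts at 0.1 and drops to 0.07, 0.049, and 0.0343 at epochs 25, 50, and 75.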

At the end of each epoch, each worker prints its local metrics in its own logs, and the global metrics
are gathered and printed in the rank 0 worker's logs.

When the training completes, you should see the following at the end of the rank 0 worker logs:

```text
--------------- Epoch 50 Evaluation ---------------
Local rank 0:
- Loss: 0.0039
- Accuracy: 2258/2500 (90%)
Global metrics:
- Loss: 0.004262
- Accuracy: 9023/10000 (90.23%)
---------------------------------------------------
```
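
The relationship between the local and global metrics in the log above can be sketched in plain Python. This is only an illustration of the arithmetic and not how `mnist.py` gathers its metrics, which this README does not show. The `LocalMetrics` type, the `global_metrics` function, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LocalMetrics:
    """Evaluation totals computed by one worker on its shard of the test set."""
    loss_sum: float  # sum of per-sample losses
    correct: int     # number of correctly classified samples
    samples: int     # number of samples in the shard


def global_metrics(per_worker):
    """Combine per-worker totals into (average loss, correct, samples)."""
    # Summing totals first, then dividing, weights each worker by its shard size.
    total_loss = sum(m.loss_sum for m in per_worker)
    total_correct = sum(m.correct for m in per_worker)
    total_samples = sum(m.samples for m in per_worker)
    return total_loss / total_samples, total_correct, total_samples


# Made-up numbers for two workers with four samples each.
workers = [LocalMetrics(2.0, 3, 4), LocalMetrics(1.0, 4, 4)]
avg_loss, correct, samples = global_metrics(workers)
print(f"Loss: {avg_loss:.6f}, Accuracy: {correct}/{samples}")
```

In a distributed job, the same sums would typically be computed across processes with a collective such as `torch.distributed.all_reduce`, so that every rank, or only rank 0, can print the result.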
141 changes: 141 additions & 0 deletions examples/pytorch/mnist-ddp/mnist.ipynb
@@ -0,0 +1,141 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# PyTorch DDP Fashion MNIST Training Example"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"This example demonstrates how to train a neural network to classify images using the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset and [PyTorch DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Install the Kubeflow Training Python SDK\n",
"\n",
"You need to install the Kubeflow Training SDK to run this Notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create the Kubeflow Training Client"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from kubeflow.training import Trainer, TrainingClient\n",
"from mnist import train_fashion_mnist"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client = TrainingClient()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Start the Train Job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"job_name = client.train(\n",
" runtime_ref=\"torch-distributed\",\n",
" trainer=Trainer(\n",
" func=train_fashion_mnist,\n",
" func_args={\n",
" \"backend\": \"nccl\",\n",
" \"batch_size\": 100,\n",
" \"test_batch_size\": 100,\n",
" \"epochs\": 100,\n",
" \"lr\": 1e-1,\n",
" \"lr_gamma\": 0.7,\n",
" \"lr_period\": 25,\n",
" \"seed\": 0,\n",
" \"log_interval\": 10,\n",
" \"save_model\": False,\n",
" },\n",
" num_nodes=4,\n",
" resources_per_node={\n",
" \"nvidia.com/gpu\": 1,\n",
" },\n",
" ),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Watch the Train Job Logs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client.get_job_logs(job_name, follow=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}