KEP-2170: Add PyTorch DDP MNIST training example #2387

astefanutti · 2025-01-14T13:35:04Z

What this PR does / why we need it:

This PR adds an example that demonstrates how to train MNIST with PyTorch DDP using the training operator and SDK v2.

Checklist:

Docs included if any changes are user facing

google-oss-prow · 2025-01-14T13:35:23Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coveralls · 2025-01-14T13:40:27Z

Pull Request Test Coverage Report for Build 12904336983

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 100.0%

Totals
Change from base Build 12862208924:	0.0%
Covered Lines:	85
Relevant Lines:	85

💛 - Coveralls

andreyvelich

Thank you for creating this example @astefanutti. We will use it as a getting started example!
However, we discussed before that we want to keep all of our examples as Jupyter Notebooks: #2213.
At least initially, before we see a need to have other examples.

That way, Data Scientists and ML Engineers can quickly take them and run them locally or inside Kubeflow Notebooks.

Additionally, we are planning to build the testing infra using Papermill to make sure these Notebooks are runnable.
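
For illustration only, a notebook check along these lines could be sketched as follows. It is not the project's actual test infrastructure: the helper names `find_notebooks`, `run_notebook`, and `run_all` are hypothetical, and the only Papermill call assumed is `papermill.execute_notebook(input_path, output_path)`, which raises an error when a cell fails.

```python
from pathlib import Path


def find_notebooks(root: str) -> list[Path]:
    """Return all notebooks under root, skipping Jupyter checkpoint copies."""
    return sorted(
        path
        for path in Path(root).rglob("*.ipynb")
        if ".ipynb_checkpoints" not in path.parts
    )


def run_notebook(path: Path, out_dir: Path) -> Path:
    """Execute one notebook with Papermill and return the executed copy."""
    # Imported lazily so that notebook discovery works without Papermill installed.
    import papermill as pm

    out_dir.mkdir(parents=True, exist_ok=True)
    output = out_dir / path.name
    # Papermill raises an exception if any cell in the notebook fails.
    pm.execute_notebook(str(path), str(output))
    return output


def run_all(root: str, out_dir: str) -> list[Path]:
    """Execute every notebook found under root, stopping at the first failure."""
    return [run_notebook(path, Path(out_dir)) for path in find_notebooks(root)]
```

Calling `run_all("examples", "executed")` would then execute each notebook under a hypothetical `examples` directory and keep the executed copies for inspection.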

cc @kubeflow/wg-training-leads @Electronic-Waste @akshaychitneni @shravan-achar

andreyvelich · 2025-01-14T14:06:26Z

Also, we should keep this example super easy, with the training function as small as possible, similar to this getting started example: https://www.kubeflow.org/docs/components/training/getting-started/#getting-started-with-pytorchjob
That will make it easier to understand.
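
As a rough idea of the size being discussed, a small PyTorch DDP training function might look like the sketch below. It is not the code from this PR. The function name `train_ddp` is hypothetical, random tensors stand in for 28x28 grayscale images, the CPU-only `gloo` backend is assumed, and the launcher (for example `torchrun`) is assumed to set `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`.

```python
def train_ddp(epochs: int = 1) -> float:
    """Train a tiny classifier with DDP and return the last batch loss."""
    # Imports live inside the function so it stays self-contained.
    import torch
    import torch.distributed as dist
    from torch import nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    # Assumption: the launcher provides the rendezvous environment variables.
    dist.init_process_group(backend="gloo")

    # Assumption: synthetic data shaped like 28x28 grayscale images, 10 classes.
    torch.manual_seed(0)
    images = torch.randn(256, 1, 28, 28)
    labels = torch.randint(0, 10, (256,))
    dataset = TensorDataset(images, labels)

    # DistributedSampler gives each rank its own shard of the dataset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DDP(nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    loss = torch.tensor(0.0)
    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle differently on every epoch
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()  # DDP averages gradients across ranks here
            optimizer.step()

    dist.destroy_process_group()
    return loss.item()
```

Saved to a script that calls `train_ddp()`, this could be launched with something like `torchrun --nproc_per_node=2 script.py`, where the script name is a placeholder.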

astefanutti · 2025-01-14T14:14:12Z

> However, we discussed before that we want to keep all of our examples as Jupyter Notebooks: #2213.
> At least initially, before we see a need to have other examples.

@andreyvelich awesome, sorry if I missed that.

Do I understand correctly that you initially want the examples to be created under /test/e2e/notebooks?

astefanutti · 2025-01-14T14:19:59Z

> Also, we should keep this example super easy, with the training function as small as possible, similar to this getting started example: https://www.kubeflow.org/docs/components/training/getting-started/#getting-started-with-pytorchjob
> That will make it easier to understand.

I can see it's nice to have an example as small as possible. I can certainly remove the "evaluation" part to make it shorter.

That being said, I'd assume evaluation is a critical part of training for any Data Scientist or ML Engineer, so its value is high, and it adds little complexity and no concepts beyond those the training section already introduces.
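
To give a sense of how little an evaluation step adds on top of DDP training, here is a minimal sketch. It is not the evaluation code from this PR: the function name `evaluate` and its parameters are hypothetical, and it assumes a classifier whose output is one score per class. Each rank counts its own correct predictions, and a single `all_reduce` sums the counts when a process group is initialized.

```python
def evaluate(model, loader, device="cpu") -> float:
    """Return classification accuracy, aggregated across ranks when DDP is active."""
    # Imports live inside the function so it stays self-contained.
    import torch
    import torch.distributed as dist

    model.eval()
    # counts[0] = correct predictions, counts[1] = samples seen on this rank
    counts = torch.zeros(2, device=device)
    with torch.no_grad():
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            predictions = model(inputs).argmax(dim=1)
            counts[0] += (predictions == targets).sum()
            counts[1] += targets.numel()

    # Sum the per-rank counts; skipped when running as a single process.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(counts, op=dist.ReduceOp.SUM)
    return (counts[0] / counts[1]).item()
```

The only distributed concept here is the `all_reduce` call, which training with DDP already relies on for gradients.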

andreyvelich · 2025-01-14T14:24:21Z

> Do I understand correctly that you initially want the examples to be created under /test/e2e/notebooks?

We can still use the /examples folder for them. If we need additional test suites, we can use the /test/e2e/notebooks folder for those.
For example, we can keep the script to run Notebooks in the e2e/notebooks folder: https://github.com/kubeflow/training-operator/blob/master/scripts/run-notebook.sh

astefanutti · 2025-01-14T14:46:52Z

Thanks, that all makes sense. Keeping examples in the examples directory makes them easier to find, obviously :)

I can turn this into a Jupyter notebook. I don't see one for the v1 MNIST example in any case.

Would it be useful if I proceed with that, or would you rather have it done as part of the e2e testing?

andreyvelich · 2025-01-14T14:53:21Z

> I can turn this into a Jupyter notebook. I don't see one for the v1 MNIST example in any case.
> Would it be useful if I proceed with that, or would you rather have it done as part of the e2e testing?

Sure, go ahead! We can create the E2Es once you have the Notebook ready.

Do you want to take the FashionMNIST example as a reference: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb ?
I think FashionMNIST might be more representative than MNIST (it is also the first example that PyTorch shows in its tutorials: https://pytorch.org/tutorials/beginner/introyt/trainingyt.html?highlight=nn%20crossentropyloss)

astefanutti · 2025-01-14T15:00:16Z

> Do you want to take the FashionMNIST example as a reference: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb ?
> I think FashionMNIST might be more representative than MNIST (it is also the first example that PyTorch shows in its tutorials: https://pytorch.org/tutorials/beginner/introyt/trainingyt.html?highlight=nn%20crossentropyloss)

Actually I hesitated when I started :)

I agree with you. Let's move it to use FashionMNIST.

andreyvelich · 2025-01-19T23:53:25Z

Hi @astefanutti, did you get a chance to convert your example into a Jupyter Notebook so we can use it as the Getting Started example?

astefanutti · 2025-01-20T08:51:16Z

@andreyvelich yes I'm on it, I should be able to push it quickly.

review-notebook-app · 2025-01-20T15:10:43Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

astefanutti · 2025-01-21T09:56:10Z

@andreyvelich PTAL

Signed-off-by: Antonin Stefanutti <[email protected]>

google-oss-prow bot added the size/L label Jan 14, 2025

google-oss-prow bot requested review from jinchihe and kuizhiqing January 14, 2025 13:35

astefanutti force-pushed the pr-10 branch 3 times, most recently from c953498 to dced478 · January 14, 2025 13:57

andreyvelich reviewed Jan 14, 2025

astefanutti force-pushed the pr-10 branch from dced478 to 051fcb5 · January 14, 2025 14:06

astefanutti force-pushed the pr-10 branch 2 times, most recently from 6b542e9 to 5f45584 · January 14, 2025 14:11

astefanutti marked this pull request as draft January 14, 2025 14:24

google-oss-prow bot added the do-not-merge/work-in-progress label Jan 14, 2025

andreyvelich mentioned this pull request Jan 20, 2025

[WIP] Training: Initial Documentation for Kubeflow Trainer V2 kubeflow/website#3958

Open

4 tasks

astefanutti force-pushed the pr-10 branch from 5f45584 to f9c2724 · January 20, 2025 15:10

google-oss-prow bot added size/XL and removed size/L labels Jan 20, 2025

astefanutti force-pushed the pr-10 branch from f9c2724 to cb1be8e · January 20, 2025 15:15

astefanutti marked this pull request as ready for review January 20, 2025 15:15

google-oss-prow bot removed the do-not-merge/work-in-progress label Jan 20, 2025

astefanutti force-pushed the pr-10 branch 2 times, most recently from 77fbfdc to 95788ea · January 20, 2025 17:47

andreyvelich mentioned this pull request Dec 19, 2024

KEP-2170: Kubeflow Training V2 API #2170

Open

19 tasks

andreyvelich mentioned this pull request Jan 21, 2025

Add more AI/ML Training Examples #2040

Open

8 tasks

KEP-2170: Add PyTorch DDP Fashion MNIST training example

5aa5902

Signed-off-by: Antonin Stefanutti <[email protected]>

astefanutti force-pushed the pr-10 branch from 95788ea to 5aa5902 · January 22, 2025 08:40
