KEP-2170: Add PyTorch DDP MNIST training example #2387
c953498 to dced478
Thank you for creating this example @astefanutti. We will use it as a getting started example!
However, we discussed before that we want to keep all of our examples as Jupyter Notebooks: #2213,
at least initially, until we see a need for other kinds of examples.
That way, Data Scientists and ML Engineers can quickly take them and execute them locally or inside Kubeflow Notebooks.
Additionally, we are planning to build the testing infra using Papermill to make sure these Notebooks are runnable.
cc @kubeflow/wg-training-leads @Electronic-Waste @akshaychitneni @shravan-achar
Also, we should keep this example super easy, with the training function as small as possible, similar to this getting started example: https://www.kubeflow.org/docs/components/training/getting-started/#getting-started-with-pytorchjob |
6b542e9 to 5f45584
@andreyvelich awesome, sorry if I missed that. Do I understand correctly that you initially want the examples to be created under |
I can see it's nice to have an example that's as small as possible, and I can certainly remove the "evaluation" part to make it shorter. That said, I'd assume evaluation is a critical part of training for any Data Scientist or ML Engineer, so its value is high, and it adds little complexity and no foreign concepts beyond what the train section already has. |
We can still use the |
Thanks, that all makes sense. Keeping examples in the I can turn this into a Jupyter notebook. I don't see one for the v1 MNIST example in any case. Would it be useful if I proceed with that, or would you rather have that done as part of the e2e testing? |
Sure, go ahead! We can create the E2Es once you have the Notebook ready. Do you want to take the FashionMNIST example as a reference: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb? |
Actually I hesitated when I started :) I agree with you. Let's move it to use FashionMNIST. |
Hi @astefanutti, did you get a chance to convert your example into a Jupyter Notebook so we can use it as the Getting Started example? |
@andreyvelich yes I'm on it, I should be able to push it quickly. |
77fbfdc to 95788ea
@andreyvelich PTAL |
Signed-off-by: Antonin Stefanutti <[email protected]>
What this PR does / why we need it:
This PR adds an example that demonstrates how to train MNIST with PyTorch DDP using the training operator and SDK v2.
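For readers new to DDP, here is a minimal sketch of the idea behind data-parallel training, not the code from this PR. It is plain Python with no PyTorch, and every name in it (`local_gradient`, `all_reduce_mean`, `train`) is hypothetical and chosen for illustration only. It assumes equally sized shards and a toy one-parameter model: each simulated worker computes a gradient on its own shard, the gradients are averaged (the role an all-reduce plays in DDP), and every replica applies the same update.

```python
# Pure-Python sketch of data-parallel gradient averaging (the idea behind DDP).
# All names are hypothetical; this is not the PyTorch or Kubeflow SDK API.

def local_gradient(w, shard):
    """Gradient of the mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker would receive this average."""
    return sum(values) / len(values)


def train(data, world_size=2, lr=0.05, steps=200):
    # Each simulated worker (rank) gets an equally sized, disjoint shard.
    shards = [data[rank::world_size] for rank in range(world_size)]
    w = 0.0  # every replica starts from the same weight
    for _ in range(steps):
        grads = [local_gradient(w, shard) for shard in shards]
        w -= lr * all_reduce_mean(grads)  # identical update on every replica
    return w


data = [(float(x), 3.0 * x) for x in range(1, 5)]  # samples of y = 3x
w = train(data)
print(round(w, 3))
```

With equally sized shards, the averaged gradient equals the full-batch gradient, which is why the replicas stay in sync while each one only reads its own part of the data.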
Checklist: