Add DeepSpeed Example with MPI Operator #2091
I believe that both examples (training-operator and mpi-operator) would be worth it. However, I think we should add an example for each case: a PyTorchJob with DeepSpeed and torchrun, and an MPIJob v2 with DeepSpeed and mpirun.
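For illustration only, here is a minimal sketch of the kind of DeepSpeed training entrypoint that both examples could share; only the launcher (torchrun for the PyTorchJob, mpirun for the MPIJob v2) would differ. The model, script name, and config values are hypothetical placeholders, not taken from any existing example, and the DeepSpeed JSON config passed on the command line is assumed to define the optimizer and batch sizes.

```python
# Hypothetical sketch of a shared DeepSpeed entrypoint (e.g. train.py) that either
# a PyTorchJob (launched with torchrun/deepspeed) or an MPIJob v2 (launched with
# mpirun) could run. The --deepspeed_config JSON is assumed to define the
# optimizer and train_batch_size; all names here are illustrative only.
import argparse

import deepspeed
import torch
import torch.nn as nn


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()

    model = nn.Linear(128, 10)  # toy model standing in for a real network

    # deepspeed.initialize builds the distributed engine (and the optimizer
    # declared in the DeepSpeed config) around the model.
    engine, _, _, _ = deepspeed.initialize(
        args=args, model=model, model_parameters=model.parameters()
    )

    for _ in range(10):
        x = torch.randn(32, 128, device=engine.device)
        loss = engine(x).pow(2).mean()
        engine.backward(loss)
        engine.step()


if __name__ == "__main__":
    main()
```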
Sure, that sounds great @tenzen-y!
It sounds great, but I guess there are no significant performance differences between the two approaches, since DeepSpeed uses the NCCL backend even if we launch with mpirun.
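As a point of reference (not from this thread): the launcher only decides how the ranks are spawned; the collective backend is chosen when distributed state is initialized, and DeepSpeed defaults to NCCL. A minimal sketch:

```python
# Sketch: whether the processes are started by torchrun or mpirun, the
# communication backend is picked here, and it defaults to NCCL.
import deepspeed
import torch.distributed as dist

deepspeed.init_distributed()   # equivalent to dist_backend="nccl"
print(dist.get_backend())      # -> "nccl" on a GPU cluster
```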
I'm working on an equivalent example for the Flux Operator, but a quick question: will it work OK to test without a GPU? I've been trying to get just 3 nodes, each with one NVIDIA GPU, on Google Cloud, and I never get the allocation.
Ah - this looks more promising. https://github.com/kubeflow/mpi-operator/pull/567/files |
@tenzen-y Does DeepSpeed only support the NCCL backend? E.g., can we not run it with CPUs?
TBH, I don't have any experience with CPU-only runs. But at first glance, DeepSpeed seems to support PyTorch without a GPU: https://github.com/microsoft/DeepSpeed/blob/master/.github/workflows/cpu-torch-latest.yml
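For a CPU-only smoke test, one option is to initialize the distributed state with the Gloo backend instead of NCCL; a hedged sketch, assuming the rest of the training script is backend-agnostic:

```python
# Sketch: ask DeepSpeed to initialize torch.distributed with the Gloo backend so
# a cluster without NVIDIA GPUs can still exercise the example end to end.
import deepspeed

deepspeed.init_distributed(dist_backend="gloo")
```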
This statement is generally correct in almost all cases in the NCCL context. Though, I have a few experiences to share for those using an MPI-style setup and suffering from performance issues.
Overall,
Related: #2040
As we have discussed multiple times, the Kubeflow community is looking for examples of how to use the MPI Operator and DeepSpeed.
We should add an example to the MPI Operator (https://github.com/kubeflow/mpi-operator/tree/master/examples/v2beta1) or the Training Operator (https://github.com/kubeflow/training-operator/tree/master/examples).
Some pending PRs can be found here as reference:
/good-first-issue
/help
/area example
/cc @alculquicondor @kubeflow/wg-training-leads @kuizhiqing