Add DeepSpeed Example with MPI Operator #2091
I believe that both examples (training-operator and mpi-operator) would be worth it. However, I think we should add an example for each case: a PyTorchJob with DeepSpeed and torchrun, and an MPIJob v2 with DeepSpeed and mpirun.
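For illustration only, here is a minimal sketch of the kind of DeepSpeed training entrypoint that both examples could share; only the launcher (torchrun for the PyTorchJob, mpirun for the MPIJob v2) would differ. The model, script name, and config values are hypothetical placeholders, not taken from any existing example, and the DeepSpeed JSON config passed on the command line is assumed to define the optimizer and batch sizes.

```python
# Hypothetical sketch of a shared DeepSpeed entrypoint (e.g. train.py) that either
# a PyTorchJob (launched with torchrun/deepspeed) or an MPIJob v2 (launched with
# mpirun) could run. The --deepspeed_config JSON is assumed to define the
# optimizer and train_batch_size; all names here are illustrative only.
import argparse

import deepspeed
import torch
import torch.nn as nn


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()

    model = nn.Linear(128, 10)  # toy model standing in for a real network

    # deepspeed.initialize builds the distributed engine (and the optimizer
    # declared in the DeepSpeed config) around the model.
    engine, _, _, _ = deepspeed.initialize(
        args=args, model=model, model_parameters=model.parameters()
    )

    for _ in range(10):
        x = torch.randn(32, 128, device=engine.device)
        loss = engine(x).pow(2).mean()
        engine.backward(loss)
        engine.step()


if __name__ == "__main__":
    main()
```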
Sure, that sounds great @tenzen-y!
It sounds great, but I guess there are no significant performance differences between the two approaches, since DeepSpeed uses the NCCL backend even if we launch with mpirun.
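As a point of reference (not from this thread): the launcher only decides how the ranks are spawned; the collective backend is chosen when distributed state is initialized, and DeepSpeed defaults to NCCL. A minimal sketch:

```python
# Sketch: whether the processes are started by torchrun or mpirun, the
# communication backend is picked here, and it defaults to NCCL.
import deepspeed
import torch.distributed as dist

deepspeed.init_distributed()   # equivalent to dist_backend="nccl"
print(dist.get_backend())      # -> "nccl" on a GPU cluster
```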
I'm working on an equivalent example for the Flux Operator, but a quick question: will it work OK to test without a GPU? I've been trying to get just 3 nodes, each with one NVIDIA GPU, on Google Cloud, and I never get the allocation.
Ah - this looks more promising. https://github.com/kubeflow/mpi-operator/pull/567/files |
@tenzen-y Does DeepSpeed only support the NCCL backend? E.g., can we not run it with CPUs?
TBH, I don't have any experience with CPU-only runs. But at first glance, DeepSpeed seems to support PyTorch without a GPU: https://github.com/microsoft/DeepSpeed/blob/master/.github/workflows/cpu-torch-latest.yml
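For a CPU-only smoke test, one option is to initialize the distributed state with the Gloo backend instead of NCCL; a hedged sketch, assuming the rest of the training script is backend-agnostic:

```python
# Sketch: ask DeepSpeed to initialize torch.distributed with the Gloo backend so
# a cluster without NVIDIA GPUs can still exercise the example end to end.
import deepspeed

deepspeed.init_distributed(dist_backend="gloo")
```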
This statement is generally correct in almost all cases in the NCCL context. Though, I have a few experiences to share for those using an MPI-style setup and suffering from performance issues.
Overall,
Related: #2040
As we have discussed multiple times, the Kubeflow community is looking for examples of how to use the MPI Operator and DeepSpeed.
We should add an example to the MPI Operator (https://github.com/kubeflow/mpi-operator/tree/master/examples/v2beta1) or the Training Operator (https://github.com/kubeflow/training-operator/tree/master/examples).
Some pending PRs can be found here as reference:
/good-first-issue
/help
/area example
/cc @alculquicondor @kubeflow/wg-training-leads @kuizhiqing