
Get help for distributed model training on MI250 #30

Open
OswaldHe opened this issue Oct 2, 2024 · 6 comments

OswaldHe commented Oct 2, 2024

Hi,

I would like to test a program for distributed LLM training on mi2508x, and I want to use model parallelism to distribute the parameters across GPUs. Is there a framework I should use to achieve that? I used DeepSpeed (https://github.com/microsoft/DeepSpeed), but its ZeRO stage-3 actually increases memory consumption on all GPUs compared with ZeRO stage-2, which only partitions the optimizer states and gradients. Are there any resources, recommendations, or examples specifically for AMD GPUs?

Thank you,
Zifan
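
For reference: ZeRO stage-2 shards the optimizer states and gradients across ranks, while stage-3 additionally shards the parameters themselves. A minimal sketch of the kind of stage-3 setup being discussed is below; the model, batch size, learning rate, and script name are placeholders, not a tested MI250 recipe.

```python
# Minimal ZeRO stage-3 sketch: parameters, gradients, and optimizer states are all
# partitioned across GPUs. Run with the DeepSpeed launcher, e.g.
#   deepspeed --num_gpus=8 your_script.py
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,                   # stage 2 partitions only optimizer states and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

model = torch.nn.Linear(4096, 4096)   # placeholder for an actual LLM

# deepspeed.initialize wraps the model in an engine that handles the sharding.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

x = torch.randn(1, 4096, device=model_engine.device)  # stand-in for a real batch
loss = model_engine(x).sum()
model_engine.backward(loss)           # the engine handles gradient partitioning
model_engine.step()
```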

@tom-papatheodore (Collaborator)

Hey Zifan-

I'm not personally very familiar with all the different options for model-parallel distributed training, but you might check the ROCm blogs since they have a lot of LLM examples on AMD GPUs: https://rocm.blogs.amd.com/blog/category/applications-models.html

-Tom

OswaldHe (Author) commented Oct 7, 2024

Hi Tom,

Thank you. I found what I needed on the ROCm blog and will get back to you if it doesn't work.

Zifan

@tom-papatheodore (Collaborator)

Hey @OswaldHe-

I know you said you found what you needed, but I figured it wouldn't hurt to share this as well...

Here is a blog post that describes how AMD trained a small language model (SLM) on AMD GPUs with distributed training: https://www.amd.com/en/developer/resources/technical-articles/introducing-amd-first-slm-135m-model-fuels-ai-advancements.html

At the bottom, in the Call to Actions section, there is a link to the GitHub repository where you can reproduce the model yourself. I don't think you necessarily want to do that, but it should provide an example of using PyTorch FSDP for multi-node distributed training.

-Tom
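
For reference, here is a rough single-node sketch of the PyTorch FSDP approach mentioned above (it is not taken from the linked blog post). The model is a placeholder, and the script assumes the usual torchrun-provided environment variables; on ROCm, the "nccl" backend is backed by RCCL, and HIP devices are driven through the torch.cuda API.

```python
# Minimal FSDP sketch: parameters, gradients, and optimizer state are sharded across ranks.
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    dist.init_process_group(backend="nccl")   # RANK/WORLD_SIZE/MASTER_ADDR come from the launcher
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(              # placeholder for an actual LLM
        torch.nn.Linear(4096, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()

    model = FSDP(model)                        # default FULL_SHARD strategy
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    x = torch.randn(8, 4096, device="cuda")    # stand-in for a real batch
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Run with something like `torchrun --nproc_per_node=8 your_script.py`; multi-node jobs additionally need `--nnodes` and a rendezvous endpoint.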

@Alexis-BX

Hi @OswaldHe
Could you share how you got DeepSpeed working, please?
I had DeepSpeed training running on another MI250 cluster, but here I am getting launcher errors...
Thanks for the help!

OswaldHe (Author) commented Nov 8, 2024

Hi @Alexis-BX

I just installed Hugging Face Accelerate and DeepSpeed in a conda environment, following https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed.
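
For reference, a minimal sketch of driving the Accelerate + DeepSpeed combination from Python, instead of (or in addition to) the interactive `accelerate config` flow in the linked guide. The zero stage, model, and hyperparameters are placeholders rather than a tuned setup.

```python
# Minimal sketch of Hugging Face Accelerate's DeepSpeed integration driven from Python.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=1)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)

model = torch.nn.Linear(4096, 4096)             # placeholder for an actual LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
dataset = TensorDataset(torch.randn(64, 4096))  # stand-in for real training data
dataloader = DataLoader(dataset, batch_size=8)

# prepare() wraps model/optimizer/dataloader in the DeepSpeed engine; the dataloader's
# batch size is used to fill in DeepSpeed's train_micro_batch_size_per_gpu.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for (batch,) in dataloader:
    loss = model(batch).sum()
    accelerator.backward(loss)                  # use accelerator.backward, not loss.backward
    optimizer.step()
    optimizer.zero_grad()
```

Launched with `accelerate launch your_script.py`, this should behave like a hand-written DeepSpeed JSON config.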

@Alexis-BX

Hi
Thanks for the reply.
I already have both installed, thanks.
Which launcher do you use, please?
I usually use the pdsh launcher, but it does not seem to be installed here.
(Or if you just want to post your DeepSpeed command/config file, I can look through it, thanks!)
