
Get help for distributed model training on MI250 #30

Open
OswaldHe opened this issue Oct 2, 2024 · 6 comments

OswaldHe commented Oct 2, 2024

Hi,

I would like to test a program for distributed LLM training on mi2508x, and I want to use model parallelism to distribute the parameters across GPUs. Is there a framework I should use to achieve that? I used DeepSpeed (https://github.com/microsoft/DeepSpeed), but its ZeRO stage-3 actually increases memory consumption on all GPUs compared with ZeRO stage-2, which only partitions the optimizer states and gradients. Are there any resources, recommendations, or examples specifically for AMD GPUs?

Thank you,
Zifan
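
For reference: ZeRO stage-2 shards the optimizer states and gradients across ranks, while stage-3 additionally shards the parameters themselves. A minimal sketch of the kind of stage-3 setup being discussed is below; the model, batch size, learning rate, and script name are placeholders, not a tested MI250 recipe.

```python
# Minimal ZeRO stage-3 sketch: parameters, gradients, and optimizer states are all
# partitioned across GPUs. Run with the DeepSpeed launcher, e.g.
#   deepspeed --num_gpus=8 your_script.py
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,                   # stage 2 partitions only optimizer states and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

model = torch.nn.Linear(4096, 4096)   # placeholder for an actual LLM

# deepspeed.initialize wraps the model in an engine that handles the sharding.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

x = torch.randn(1, 4096, device=model_engine.device)  # stand-in for a real batch
loss = model_engine(x).sum()
model_engine.backward(loss)           # the engine handles gradient partitioning
model_engine.step()
```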

@tom-papatheodore (Collaborator)

Hey Zifan-

I'm not personally very familiar with all the different options for model-parallel distributed training, but you might check the ROCm blogs since they have a lot of LLM examples on AMD GPUs: https://rocm.blogs.amd.com/blog/category/applications-models.html

-Tom

OswaldHe (Author) commented Oct 7, 2024

Hi Tom,

Thank you. I found what I needed on the ROCm blog and will get back to you if it doesn't work.

Zifan

@tom-papatheodore (Collaborator)

Hey @OswaldHe-

I know you said you found what you needed, but I figured it wouldn't hurt to share this as well...

Here is a blog post that describes how AMD trained a small language model (SLM) on AMD GPUs with distributed training: https://www.amd.com/en/developer/resources/technical-articles/introducing-amd-first-slm-135m-model-fuels-ai-advancements.html

At the bottom, in the Call to Actions section, there is a link to the GitHub repository where you can reproduce the model yourself. I don't think you necessarily want to do that, but it should provide an example of using PyTorch FSDP for multi-node distributed training.

-Tom
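
For reference, here is a rough single-node sketch of the PyTorch FSDP approach mentioned above (it is not taken from the linked blog post). The model is a placeholder, and the script assumes the usual torchrun-provided environment variables; on ROCm, the "nccl" backend is backed by RCCL, and HIP devices are driven through the torch.cuda API.

```python
# Minimal FSDP sketch: parameters, gradients, and optimizer state are sharded across ranks.
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    dist.init_process_group(backend="nccl")   # RANK/WORLD_SIZE/MASTER_ADDR come from the launcher
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(              # placeholder for an actual LLM
        torch.nn.Linear(4096, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()

    model = FSDP(model)                        # default FULL_SHARD strategy
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    x = torch.randn(8, 4096, device="cuda")    # stand-in for a real batch
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Run with something like `torchrun --nproc_per_node=8 your_script.py`; multi-node jobs additionally need `--nnodes` and a rendezvous endpoint.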

@Alexis-BX

Hi @OswaldHe
Could you share how you got DeepSpeed working, please?
I had DeepSpeed training running on another MI250 cluster, but here I am getting launcher errors...
Thanks for the help!

OswaldHe (Author) commented Nov 8, 2024

Hi @Alexis-BX

I just installed Hugging Face Accelerate and DeepSpeed in a conda environment, following https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed.
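
For reference, a minimal sketch of driving the Accelerate + DeepSpeed combination from Python, instead of (or in addition to) the interactive `accelerate config` flow in the linked guide. The zero stage, model, and hyperparameters are placeholders rather than a tuned setup.

```python
# Minimal sketch of Hugging Face Accelerate's DeepSpeed integration driven from Python.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=1)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)

model = torch.nn.Linear(4096, 4096)             # placeholder for an actual LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
dataset = TensorDataset(torch.randn(64, 4096))  # stand-in for real training data
dataloader = DataLoader(dataset, batch_size=8)

# prepare() wraps model/optimizer/dataloader in the DeepSpeed engine; the dataloader's
# batch size is used to fill in DeepSpeed's train_micro_batch_size_per_gpu.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for (batch,) in dataloader:
    loss = model(batch).sum()
    accelerator.backward(loss)                  # use accelerator.backward, not loss.backward
    optimizer.step()
    optimizer.zero_grad()
```

Launched with `accelerate launch your_script.py`, this should behave like a hand-written DeepSpeed JSON config.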

@Alexis-BX

Hi
Thanks for the reply.
I already have both installed, thanks.
Which launcher do you use, please?
I usually use the pdsh launcher, but it does not seem to be installed here.
(Or if you just want to post your DeepSpeed command/config file, I can look through it, thanks!)
