[cherry-pick] [docs] fine tune llama with trainium (#48768) (#48854)
Introduce a new Ray Train example for AWS Trainium. 

![CleanShot 2024-11-16 at 12 48 57@2x](https://github.com/user-attachments/assets/8b7d12d8-846f-497f-ba25-fd8a613f9007)

Marked it as a community example since it's a collaboration with the AWS Neuron team.

![CleanShot 2024-11-16 at 12 48 37@2x](https://github.com/user-attachments/assets/589d8ff3-fcb6-4b90-865d-006bcb4815a3)

Docs screenshots

<img width="1142" alt="Screenshot 2024-11-20 at 11 19 39 AM"
src="https://github.com/user-attachments/assets/aa3dadf7-96b9-46cc-8b6d-44c3e3bc3e1e">
<img width="1161" alt="Screenshot 2024-11-20 at 11 19 47 AM"
src="https://github.com/user-attachments/assets/859508fd-e47e-4758-a4c7-f15a749ece82">
<img width="1149" alt="Screenshot 2024-11-20 at 11 19 54 AM"
src="https://github.com/user-attachments/assets/28858f36-8cca-4eaa-a8ec-a1f7dda899d0">


Signed-off-by: Saihajpreet Singh <[email protected]>
Co-authored-by: Saihajpreet Singh <[email protected]>
chris-ray-zhang and saihaj authored Nov 22, 2024
1 parent 1dfd52b commit a46e3f4
Showing 3 changed files with 115 additions and 1 deletion.
1 change: 1 addition & 0 deletions doc/source/custom_directives.py
@@ -481,6 +481,7 @@ def key(cls: type) -> str:
 class Framework(ExampleEnum):
     """Framework type for example metadata."""

+    AWSNEURON = "AWS Neuron"
     PYTORCH = "PyTorch"
     LIGHTNING = "Lightning"
     TRANSFORMERS = "Transformers"
12 changes: 11 additions & 1 deletion doc/source/train/examples.yml
@@ -119,7 +119,17 @@ examples:
   contributor: community
   link: examples/intel_gaudi/llama_pretrain

-- title: Fine-tune a Llama-2 text generation models with DeepSpeed and Hugging Face Accelerate
+- title: Fine-tune Llama3.1 with AWS Trainium
+  frameworks:
+    - pytorch
+    - aws neuron
+  skill_level: advanced
+  use_cases:
+    - natural language processing
+    - large language models
+  contributor: community
+  link: examples/aws-trainium/llama3
+- title: Fine-tune a Llama-2 text generation model with DeepSpeed and Hugging Face Accelerate
   frameworks:
     - accelerate
     - deepspeed
103 changes: 103 additions & 0 deletions doc/source/train/examples/aws-trainium/llama3.rst
@@ -0,0 +1,103 @@
:orphan:

Distributed fine-tuning of Llama 3.1 8B on AWS Trainium with Ray and PyTorch Lightning
======================================================================================


This example demonstrates how to fine-tune the `Llama 3.1 8B <https://huggingface.co/NousResearch/Meta-Llama-3.1-8B/>`__ model on `AWS
Trainium <https://aws.amazon.com/ai/machine-learning/trainium/>`__ instances using Ray Train, PyTorch Lightning, and the AWS Neuron SDK.

AWS Trainium is the machine learning (ML) chip that AWS built for deep
learning (DL) training of 100B+ parameter models. The `AWS Neuron
SDK <https://aws.amazon.com/machine-learning/neuron/>`__ helps
developers train models on Trainium accelerators.

Prepare the environment
-----------------------

See `Setup EKS cluster and tools <https://github.com/aws-neuron/aws-neuron-eks-samples/tree/master/llama3.1_8B_finetune_ray_ptl_neuron#setupeksclusterandtools>`__ for setting up an Amazon EKS cluster leveraging AWS Trainium instances.
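
Once the cluster is up, you can optionally confirm that the Trainium nodes joined it. This quick check assumes ``kubectl`` already points at the new EKS cluster:

::

# Show each node with its EC2 instance type; look for trn1 instances.
kubectl get nodes -L node.kubernetes.io/instance-type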

Create a Docker image
---------------------
When the EKS cluster is ready, build the Docker image containing the artifacts for fine-tuning the Llama 3.1 8B model and upload it to a new Amazon ECR repository:

1. Clone the repo.

::

git clone https://github.com/aws-neuron/aws-neuron-eks-samples.git

2. Go to the ``llama3.1_8B_finetune_ray_ptl_neuron`` directory.

::

cd aws-neuron-eks-samples/llama3.1_8B_finetune_ray_ptl_neuron

3. Make the build script executable and run it.

::

chmod +x 0-kuberay-trn1-llama3-finetune-build-image.sh
./0-kuberay-trn1-llama3-finetune-build-image.sh

4. Enter the AWS region your cluster is running in, for example: us-east-2.

5. Verify in the AWS console that the Amazon ECR service has the newly
created ``kuberay_trn1_llama3.1_pytorch2`` repository. You can also verify it from the CLI, as shown after the next step.

6. Update the ECR image ARN in the manifest file used for creating the Ray cluster.

Replace the <AWS_ACCOUNT_ID> and <REGION> placeholders in the ``1-llama3-finetune-trn1-create-raycluster.yaml`` file with actual values using the commands below, so that the manifest references the ECR image created above:

::

export AWS_ACCOUNT_ID=<enter_your_aws_account_id> # for ex: 111222333444
export REGION=<enter_your_aws_region> # for ex: us-east-2
sed -i "s/<AWS_ACCOUNT_ID>/$AWS_ACCOUNT_ID/g" 1-llama3-finetune-trn1-create-raycluster.yaml
sed -i "s/<REGION>/$REGION/g" 1-llama3-finetune-trn1-create-raycluster.yaml

Configure the Ray cluster
-------------------------

The ``llama3.1_8B_finetune_ray_ptl_neuron`` directory in the AWS Neuron samples repository simplifies the
Ray configuration: KubeRay provides a manifest that you can apply
to the cluster to set up the head and worker pods.

Run the following command to set up the Ray cluster:

::

kubectl apply -f 1-llama3-finetune-trn1-create-raycluster.yaml
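
Before moving on, you can confirm that the head and worker pods started; the exact pod names depend on the manifest, so treat this as a sketch:

::

# Wait for the Ray head and worker pods to reach the Running state.
kubectl get pods -w

# Inspect the RayCluster resource that the manifest created.
kubectl get raycluster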


Access the Ray Dashboard
------------------------
Port forward from the cluster to view the Ray Dashboard at
`http://localhost:8265 <http://localhost:8265/>`__. Run the
port-forward in the background with the following command:

::

kubectl port-forward service/kuberay-trn1-head-svc 8265:8265 &
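
With the port-forward running, the same endpoint also serves the Ray Job API, so you can optionally interact with the cluster from a local Ray CLI (installed with, for example, ``pip install "ray[default]"``):

::

# List jobs on the remote cluster through the forwarded port.
ray job list --address http://localhost:8265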

Launch Ray jobs
---------------

The Ray cluster is now ready to handle workloads. Initiate the data-preparation and fine-tuning Ray jobs:

1. Launch the Ray job that downloads the dolly-15k dataset and the Llama 3.1 8B model artifacts:

::

kubectl apply -f 2-llama3-finetune-trn1-rayjob-create-data.yaml

2. When the data-preparation job completes successfully, submit the fine-tuning job:

::

kubectl apply -f 3-llama3-finetune-trn1-rayjob-submit-finetuning-job.yaml

3. Monitor the jobs via the Ray Dashboard, or from the command line as shown below.
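
To follow progress from the command line instead, you can watch the RayJob resources; the pod name below is a placeholder to fill in from the ``kubectl get pods`` output:

::

# Watch both RayJob resources until they reach a terminal status.
kubectl get rayjobs -w

# Tail the logs of a specific job's submitter pod.
kubectl logs -f <rayjob-submitter-pod-name>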


For detailed information on each of the steps above, see the `README in the AWS Neuron samples repository <https://github.com/aws-neuron/aws-neuron-eks-samples/blob/master/llama3.1_8B_finetune_ray_ptl_neuron/README.md/>`__.
