AI for Science at Scale -- Part III

Recording: https://vimeo.com/983121540
Slides: https://www.olcf.ornl.gov/wp-content/uploads/AI4S-Tutorial-Part-III.pdf

In Part III of this training series, we demonstrate how to train LLMs with hundreds of billions of parameters from scratch. We use two branches of our external Megatron-DeepSpeed-ORNL repository for this purpose.

Training Very Large LLMs

Get Megatron-DeepSpeed Codebase Ported to Frontier

git clone https://github.com/sajal-vt/Megatron-DeepSpeed-ORNL.git
cd Megatron-DeepSpeed-ORNL/
git fetch
git switch FA2

Using the batch scripts mentioned below

Because of a permissions issue, the batch scripts will not run as shipped. Replace these two lines at the top of each batch script:

source /lustre/orion/world-shared/stf218/sajal/miniconda3-frontier/bin/activate
conda activate ...

with:

module load miniforge3
source activate ...
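
The replacement above can also be scripted rather than done by hand. The following is a minimal sketch, assuming the two lines appear at the start of a line exactly as shown (no leading whitespace); `patch_env_lines` is a hypothetical helper name, not something shipped in the repository. The environment name after `conda activate` is left untouched.

```shell
# Hypothetical helper: rewrite the two environment-activation lines
# of one batch script in place. A backup is kept as <script>.bak.
patch_env_lines() {
    script="$1"
    sed -i.bak \
        -e 's|^source /lustre/orion/world-shared/stf218/sajal/miniconda3-frontier/bin/activate.*$|module load miniforge3|' \
        -e 's|^conda activate |source activate |' \
        "$script"
}

# Example usage (assumes the scripts are in the current directory):
# for f in launch_gpt22b_bf16.slurm launch_gpt175b_bf16.slurm launch_gpt1T_bf16.slurm; do
#     patch_env_lines "$f"
# done
```

It is worth comparing each patched script against its `.bak` copy before submitting, since the sketch silently does nothing if the lines do not match.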

Training a Model with 22 Billion Parameters on 2 Nodes

sbatch -A your_project_ID --reservation=ai launch_gpt22b_bf16.slurm

Training a Model with 175 Billion Parameters on 16 Nodes

sbatch -A your_project_ID --reservation=ai launch_gpt175b_bf16.slurm

Training a Model with 1 Trillion Parameters on 128 Nodes

sbatch -A your_project_ID --reservation=ai launch_gpt1T_bf16.slurm
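
The three submissions above differ only in the script name and the node count. As a rough sketch, a dry-run helper can print the matching command for a given model size so it can be checked before submitting. `submit_cmd` and the size labels `22b`, `175b`, and `1T` are hypothetical names chosen for illustration; the script names and node counts are the ones listed in this README.

```shell
# Hypothetical dry-run helper: print (do not run) the sbatch command
# for a given project ID and model size.
submit_cmd() {
    project="$1"
    size="$2"
    case "$size" in
        22b)  script=launch_gpt22b_bf16.slurm;  nodes=2 ;;
        175b) script=launch_gpt175b_bf16.slurm; nodes=16 ;;
        1T)   script=launch_gpt1T_bf16.slurm;   nodes=128 ;;
        *)    echo "unknown model size: $size" >&2; return 1 ;;
    esac
    # Node count is informational only; this sketch does not pass it to sbatch.
    echo "# expects $nodes nodes"
    echo "sbatch -A $project --reservation=ai $script"
}

# Example usage:
# submit_cmd your_project_ID 175b
```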

Finding the Best Distributed Training Strategy using DeepHyper

Get frontier-sd Branch

git clone https://github.com/sajal-vt/Megatron-DeepSpeed-ORNL.git
cd Megatron-DeepSpeed-ORNL/
git fetch
git switch frontier-sd
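
Both sections clone the same repository into the same directory name, so a second `git clone` in the same parent directory fails if the first clone is still there. A minimal sketch of one way around this is shown below: clone only when the directory is missing, then fetch and switch. `checkout_branch` is a hypothetical helper, and the sketch assumes any local edits to the batch scripts do not conflict with the branch being switched to.

```shell
# Hypothetical helper: clone the repository only if it is not already present,
# then fetch and switch to the requested branch.
checkout_branch() {
    url="$1"
    dir="$2"
    branch="$3"
    if [ ! -d "$dir/.git" ]; then
        git clone "$url" "$dir" || return 1
    fi
    git -C "$dir" fetch || return 1
    git -C "$dir" switch "$branch"
}

# Example usage:
# checkout_branch https://github.com/sajal-vt/Megatron-DeepSpeed-ORNL.git Megatron-DeepSpeed-ORNL frontier-sd
```

Keeping a separate clone per branch is an equally reasonable choice if you want the FA2 and frontier-sd working trees side by side.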

Launch Hyperparameter Search using DeepHyper

sbatch -A your_project_ID --reservation=ai launch_dh.frontier
