- Recording: https://vimeo.com/983121540
- Slides: https://www.olcf.ornl.gov/wp-content/uploads/AI4S-Tutorial-Part-III.pdf
In Part 3 of this training series, we demonstrate how to train LLMs with hundreds of billions of parameters from scratch. We use two of our external repositories/branches for this purpose.
git clone https://github.com/sajal-vt/Megatron-DeepSpeed-ORNL.git
cd Megatron-DeepSpeed-ORNL/
git fetch
git switch FA2
Because of a permissions issue, you must modify the batch scripts before submitting them. Replace these two lines at the top of each batch script:
source /lustre/orion/world-shared/stf218/sajal/miniconda3-frontier/bin/activate
conda activate ...
with:
module load miniforge3
source activate ...
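The two-line replacement above can also be scripted rather than done by hand. The following is a minimal sketch, not part of the repository: `patch_activation` is a hypothetical helper name, and it assumes the two lines start at the beginning of a line exactly as quoted above. Whatever environment name follows `conda activate` in your script is left untouched.

```shell
# Hypothetical helper (not part of the repository): rewrite the two
# activation lines in a batch script, keeping a .bak copy of the original.
patch_activation() {
    # $1: path to the batch script to edit in place
    sed -i.bak \
        -e 's|^source /lustre/orion/world-shared/stf218/sajal/miniconda3-frontier/bin/activate.*$|module load miniforge3|' \
        -e 's|^conda activate |source activate |' \
        "$1"
}

# Apply to the GPT launch scripts that exist in the current directory.
for f in launch_gpt22b_bf16.slurm launch_gpt175b_bf16.slurm launch_gpt1T_bf16.slurm; do
    if [ -f "$f" ]; then
        patch_activation "$f"
    fi
done
```

Compare each patched script with its `.bak` copy (for example with `diff`) before submitting, to confirm that only the two activation lines changed.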
sbatch -A your_project_ID --reservation=ai launch_gpt22b_bf16.slurm
sbatch -A your_project_ID --reservation=ai launch_gpt175b_bf16.slurm
sbatch -A your_project_ID --reservation=ai launch_gpt1T_bf16.slurm
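If you want to submit all three model sizes with the same project ID, a small wrapper like the sketch below may be convenient. The `submit` function and the `PROJECT_ID` and `DRY_RUN` variables are assumptions introduced here for illustration. By default the wrapper only prints the commands, so you can check them first. The `--reservation=ai` flag is copied from the commands above and presumably only applies while that reservation is active.

```shell
# Illustrative wrapper; PROJECT_ID and DRY_RUN are hypothetical variables.
PROJECT_ID="${PROJECT_ID:-your_project_ID}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = actually submit

submit() {
    # $1: batch script to submit
    if [ "$DRY_RUN" = "1" ]; then
        echo sbatch -A "$PROJECT_ID" --reservation=ai "$1"
    else
        sbatch -A "$PROJECT_ID" --reservation=ai "$1"
    fi
}

for script in launch_gpt22b_bf16.slurm launch_gpt175b_bf16.slurm launch_gpt1T_bf16.slurm; do
    submit "$script"
done
```

Set `DRY_RUN=0` and `PROJECT_ID` to your own project once the printed commands look right.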
git clone https://github.com/sajal-vt/Megatron-DeepSpeed-ORNL.git
cd Megatron-DeepSpeed-ORNL/
git fetch
git switch frontier-sd
sbatch -A your_project_ID --reservation=ai launch_dh.frontier