
apollo-mini: add scale front for 60m and 130m
zhuhanqing committed Jan 3, 2025
1 parent b6d2186 commit 9c84cb9
Showing 3 changed files with 1 addition and 3 deletions.
README.md: 2 changes (1 addition, 1 deletion)
@@ -121,7 +121,7 @@ To stabilize training, we adopt the **Norm-Growth Limiter (NL)** from [Fira](htt

There are two ways to apply the Norm-Growth Limiter, depending on when it is applied relative to the heuristic scaling factor (`scale`):
1. **After Scaling**: NL is applied after the gradient is multiplied by the `scale`.
-   - Recommended for smaller models or when training involves fewer warmup steps.
+   - Recommended when training involves fewer warmup steps, e.g., LLaMA 60M and 130M with APOLLO-Mini.
- Enable this by setting `--scale_front`.
2. **Before Scaling**: NL is applied before the gradient is scaled.
- With sufficient warmup steps, both methods yield similar performance for large models.
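For intuition, here is a minimal PyTorch sketch of the two orderings described in the README excerpt above. This is not the repository's `apollo_adamw` implementation; the `norm_growth_limiter` helper, its `gamma` threshold, and the default value are illustrative assumptions loosely following a Fira-style limiter.

```python
import torch


def norm_growth_limiter(grad: torch.Tensor, prev_norm: float, gamma: float = 1.01):
    """Limit how fast the gradient norm may grow between steps.

    If the current norm exceeds gamma * prev_norm, shrink the gradient so its
    norm equals gamma * prev_norm. Returns the (possibly rescaled) gradient and
    the norm to carry into the next step. (Sketch only; gamma is illustrative.)
    """
    cur_norm = grad.norm()
    if prev_norm > 0 and cur_norm > gamma * prev_norm:
        grad = grad * (gamma * prev_norm / (cur_norm + 1e-8))
        cur_norm = gamma * prev_norm
    return grad, float(cur_norm)


def apply_scale_and_limiter(grad: torch.Tensor, scale: float, prev_norm: float,
                            scale_front: bool):
    """Illustrates the two orderings toggled by --scale_front.

    scale_front=True  : multiply by the heuristic `scale` first, then limit
                        norm growth ("After Scaling", option 1 above).
    scale_front=False : limit norm growth on the unscaled gradient, then
                        multiply by `scale` ("Before Scaling", option 2 above).
    """
    if scale_front:
        grad = grad * scale
        grad, prev_norm = norm_growth_limiter(grad, prev_norm)
    else:
        grad, prev_norm = norm_growth_limiter(grad, prev_norm)
        grad = grad * scale
    return grad, prev_norm
```

With `--scale_front` set (the `scale_front=True` branch), the limiter acts on the already-scaled gradient, matching option 1; without it, the limiter acts on the raw gradient before scaling, matching option 2.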
scripts/pretrain_c4/llama_130m_apollo.sh: 1 change (0 additions, 1 deletion)
@@ -14,7 +14,6 @@ torchrun --standalone --nproc_per_node 4 main_pretrain.py \
--warmup_steps 2000 \
--num_training_steps 20000 \
--optimizer apollo_adamw \
-    --scale_front \
--apollo_scale ${apollo_scale} \
--rank ${num_rank} \
--scale_type ${scale_type} \
scripts/pretrain_c4/llama_60m_apollo.sh: 1 change (0 additions, 1 deletion)
@@ -14,7 +14,6 @@ torchrun --standalone --nproc_per_node 1 main_pretrain.py \
--warmup_steps 1000 \
--num_training_steps 10000 \
--optimizer apollo_adamw \
-    --scale_front \
--apollo_scale ${apollo_scale} \
--rank ${num_rank} \
--scale_type ${scale_type} \
