
apollo-mini: add scale front for 60m and 130m
zhuhanqing committed Jan 3, 2025
1 parent b6d2186 commit 9c84cb9
Showing 3 changed files with 1 addition and 3 deletions.
README.md: 2 changes (1 addition, 1 deletion)
@@ -121,7 +121,7 @@ To stabilize training, we adopt the **Norm-Growth Limiter (NL)** from [Fira](htt

There are two ways to apply the Norm-Growth Limiter, depending on when it is applied relative to the heuristic scaling factor (`scale`):
1. **After Scaling**: NL is applied after the gradient is multiplied by the `scale`.
-   - Recommended for smaller models or when training involves fewer warmup steps.
+   - Recommended when training involves fewer warmup steps, e.g., LLaMA 60M and 130M with APOLLO-Mini.
- Enable this by setting `--scale_front`.
2. **Before Scaling**: NL is applied before the gradient is scaled.
- With sufficient warmup steps, both methods yield similar performance for large models.
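For intuition, here is a minimal PyTorch sketch of the two orderings described in the README excerpt above. This is not the repository's `apollo_adamw` implementation; the `norm_growth_limiter` helper, its `gamma` threshold, and the default value are illustrative assumptions loosely following a Fira-style limiter.

```python
import torch


def norm_growth_limiter(grad: torch.Tensor, prev_norm: float, gamma: float = 1.01):
    """Limit how fast the gradient norm may grow between steps.

    If the current norm exceeds gamma * prev_norm, shrink the gradient so its
    norm equals gamma * prev_norm. Returns the (possibly rescaled) gradient and
    the norm to carry into the next step. (Sketch only; gamma is illustrative.)
    """
    cur_norm = grad.norm()
    if prev_norm > 0 and cur_norm > gamma * prev_norm:
        grad = grad * (gamma * prev_norm / (cur_norm + 1e-8))
        cur_norm = gamma * prev_norm
    return grad, float(cur_norm)


def apply_scale_and_limiter(grad: torch.Tensor, scale: float, prev_norm: float,
                            scale_front: bool):
    """Illustrates the two orderings toggled by --scale_front.

    scale_front=True  : multiply by the heuristic `scale` first, then limit
                        norm growth ("After Scaling", option 1 above).
    scale_front=False : limit norm growth on the unscaled gradient, then
                        multiply by `scale` ("Before Scaling", option 2 above).
    """
    if scale_front:
        grad = grad * scale
        grad, prev_norm = norm_growth_limiter(grad, prev_norm)
    else:
        grad, prev_norm = norm_growth_limiter(grad, prev_norm)
        grad = grad * scale
    return grad, prev_norm
```

With `--scale_front` set (the `scale_front=True` branch), the limiter acts on the already-scaled gradient, matching option 1; without it, the limiter acts on the raw gradient before scaling, matching option 2.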
scripts/pretrain_c4/llama_130m_apollo.sh: 1 change (0 additions, 1 deletion)
@@ -14,7 +14,6 @@ torchrun --standalone --nproc_per_node 4 main_pretrain.py \
--warmup_steps 2000 \
--num_training_steps 20000 \
--optimizer apollo_adamw \
-    --scale_front \
--apollo_scale ${apollo_scale} \
--rank ${num_rank} \
--scale_type ${scale_type} \
scripts/pretrain_c4/llama_60m_apollo.sh: 1 change (0 additions, 1 deletion)
@@ -14,7 +14,6 @@ torchrun --standalone --nproc_per_node 1 main_pretrain.py \
--warmup_steps 1000 \
--num_training_steps 10000 \
--optimizer apollo_adamw \
-    --scale_front \
--apollo_scale ${apollo_scale} \
--rank ${num_rank} \
--scale_type ${scale_type} \
