Refactor tpu e2e test files #1186

Open · khatwanimohit wants to merge 2 commits into main from mohit/ckpt_reorg

Conversation

khatwanimohit (Collaborator) commented Jan 22, 2025

Description

This PR refactors all TPU e2e model test files.
The first bash script runs on a CPU and performs all checkpoint-related tasks:

  • Convert the parent model checkpoint to a MaxText-compatible Orbax checkpoint
  • Convert the scanned checkpoint to an unscanned one for efficient decoding
  • Convert the MaxText checkpoint to HF

The second bash script runs on a TPU and contains all model-related tests, such as pre-training, full fine-tuning, and decoding.

This PR removes the direct dependency of the model tests on checkpoint creation: each model test picks up the most recent checkpoint to run against (see the sketch below), while the checkpoint-creation tests run on a separate cadence.
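
For illustration, a minimal sketch of how a TPU-side test might discover the most recent checkpoint. The lookup below is an assumption (the PR text does not spell it out); it relies on RUN_ID being a sortable date stamp such as 2025-01-22-10-00 and on the bucket layout used in the scripts further down:

# Hypothetical: pick the newest RUN_ID under ${CKPT_BUCKET}/${MODEL}.
# A lexicographic sort is also chronological because RUN_ID is YYYY-MM-DD-HH-MM.
export RUN_ID=$(gcloud storage ls ${CKPT_BUCKET}/${MODEL}/ | sort | tail -n 1 | xargs basename)
export SCANNED_CHECKPOINT=${CKPT_BUCKET}/${MODEL}/${RUN_ID}/scanned/0/items
export UNSCANNED_CHECKPOINT=${CKPT_BUCKET}/${MODEL}/${RUN_ID}/unscanned/checkpoints/0/items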

The following model test files are updated in this PR:

  • Llama2-7B
  • Llama2-70B
  • Llama3.1-8B
  • Llama3.1-70B
  • Gemma-2B
  • Gemma-7B
  • Mistral-7B

FIXES: b/376935929

Tests

Please describe how you tested this change, and include any instructions and/or
commands to reproduce.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have added the necessary comments to my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@khatwanimohit force-pushed the mohit/ckpt_reorg branch 7 times, most recently from 4a3263c to 15d9d53 on January 23, 2025 18:20
@khatwanimohit marked this pull request as ready for review on January 23, 2025 18:29
@khatwanimohit force-pushed the mohit/ckpt_reorg branch 3 times, most recently from 83ef498 to 676bcaf on January 24, 2025 21:02
@@ -0,0 +1,35 @@
#!/bin/bash

# This file runs once a day on a CPU and has follows:
Collaborator:

nit: "has follows" and "as follows" in consecutive statements seem redundant

# `SCANNED_CHECKPOINT` is the path to the GCS bucket where we want to save our converted (Orbax) checkpoint. Non-Googlers please remember to point `SCANNED_CHECKPOINT` to a GCS bucket that you own
export SCANNED_CHECKPOINT=${CKPT_BUCKET}/${MODEL}/${RUN_ID}/scanned
export UNSCANNED_CHECKPOINT=${CKPT_BUCKET}/${MODEL}/${RUN_ID}
export HF_CHECKPOINT=${CKPT_BUCKET}/${MODEL}/${RUN_ID}/huggingface
Collaborator:

I think we should be doing Convert MaxText checkpoint to HF here
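
For illustration, a hedged sketch of the step being suggested. The converter script name and flags below follow the pattern of MaxText's llama2 test scripts (see the /tmp/hf_llama2 copy later in this PR) and are assumptions here; the converter varies per model family:

# Hypothetical: convert the MaxText (Orbax) checkpoint to HuggingFace format
# on CPU, then copy the result to ${HF_CHECKPOINT}. Script name, flags, and
# the /tmp/hf_model staging path are assumptions.
JAX_PLATFORMS=cpu python3 MaxText/llama_mistral_mixtral_orbax_to_hf.py MaxText/configs/base.yml base_output_directory=${BASE_OUTPUT_DIRECTORY} load_parameters_path=${SCANNED_CHECKPOINT} run_name=convert_to_hf model_size=${MODEL} hf_model_path=/tmp/hf_model
gcloud storage cp -r /tmp/hf_model ${HF_CHECKPOINT}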


# We also test whether the forward pass logits match the golden logits for Gemma-2b
- python3 MaxText/tests/forward_pass_logit_checker.py MaxText/configs/base.yml tokenizer_path=assets/tokenizer.gemma load_parameters_path=${UNSCANNED_CKPT_PATH} run_name=forward_pass_test_gemma2b per_device_batch_size=1 model_name=gemma-2b max_prefill_predict_length=4 max_target_length=4 dataset_type=synthetic scan_layers=false attention=dot_product --max_kl_div=0.01
+ python3 MaxText/tests/forward_pass_logit_checker.py MaxText/configs/base.yml tokenizer_path=assets/tokenizer.gemma load_parameters_path=${UNSCANNED_CKPT_PATH} run_name=forward_pass_test_gemma2b per_device_batch_size=1 model_name=${MODEL} max_prefill_predict_length=4 max_target_length=4 dataset_type=synthetic scan_layers=false attention=dot_product --max_kl_div=0.01
Collaborator:

We should put in the forward pass test for the HF_CHECKPOINT too
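
For illustration, a hedged sketch of what an HF-side forward-pass check could look like; this snippet is not part of the PR, and the local path and prompt are made up:

# Hypothetical: copy the converted HF checkpoint locally and run one forward
# pass with transformers; the logits could then be compared against goldens.
gcloud storage cp -r ${HF_CHECKPOINT} /tmp/hf_ckpt
python3 - <<'EOF'
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("/tmp/hf_ckpt")
model = AutoModelForCausalLM.from_pretrained("/tmp/hf_ckpt", torch_dtype=torch.float32)
with torch.no_grad():
    logits = model(**tok("I love to", return_tensors="pt")).logits
print(logits.shape)  # compare against golden logits offline
EOF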

# `SCANNED_CHECKPOINT` is the path to the GCS bucket where we want to save our converted (Orbax) checkpoint. Non-Googlers please remember to point `SCANNED_CHECKPOINT` to a GCS bucket that you own
export SCANNED_CHECKPOINT=${CKPT_BUCKET}/${MODEL}/${RUN_ID}/scanned
export UNSCANNED_CHECKPOINT=${CKPT_BUCKET}/${MODEL}/${RUN_ID}
export HF_CHECKPOINT=${CKPT_BUCKET}/${MODEL}/${RUN_ID}/huggingface
Collaborator:

Same

export ASYNC_CHECKPOINTING=false
export UNSCANNED_CHECKPOINT=${CKPT_BUCKET}/${MODEL}/${RUN_ID}/unscanned/checkpoints/0/items
export SCANNED_CHECKPOINT=${CKPT_BUCKET}/${MODEL}/${RUN_ID}/scanned/0/items
export HF_CHECKPOINT=${CKPT_BUCKET}/${MODEL}/${RUN_ID}/huggingface
Collaborator:

Same


gcloud storage cp -r /tmp/hf_llama2 ${HF_CHECKPOINT}

echo "All Checkpoints saved with RUN_ID=${RUN_ID}"
Collaborator:

nit: let's put a newline


# gcloud storage cp -r /tmp/hf_llama ${HF_CHECKPOINT}

echo "All Checkpoints saved with RUN_ID=${RUN_ID}"
Collaborator:

nit: let's put a newline

# We also test whether the forward pass logits match the golden logits for Llama3.1-8B
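# Float32 everywhere (dtype, activations, matmul precision) keeps numerical
# noise low, so the comparison against the golden logits can use the tight
# 1e-4 KL-divergence bound below.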
python3 MaxText/tests/forward_pass_logit_checker.py MaxText/configs/base.yml base_output_directory=${BASE_OUTPUT_DIRECTORY} tokenizer_path=assets/tokenizer_llama3.tiktoken load_parameters_path=${UNSCANNED_CHECKPOINT} run_name=forward_pass_test per_device_batch_size=1 model_name=${MODEL} max_prefill_predict_length=4 max_target_length=4 dataset_type=synthetic dtype=float32 activations_in_float32=true matmul_precision=float32 async_checkpointing=false scan_layers=false --max_kl_div=1e-4

# TODO(b/391634569): converting to HF checkpoint OOMs
Collaborator:

Similarly let's skip llama3.1-70b from the PR description then


# Generate unscanned ckpt for efficient decoding test
JAX_PLATFORMS=cpu python MaxText/generate_param_only_checkpoint.py MaxText/configs/base.yml async_checkpointing=false base_output_directory=${UNSCANNED_CHECKPOINT} load_parameters_path=${SCANNED_CHECKPOINT} run_name=unscanned model_name='mistral-7b' force_unroll=true

Collaborator:

no hf conversion?

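# Smoke-test decoding with the scanned checkpoint on a short fixed prompt.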
python3 MaxText/decode.py MaxText/configs/base.yml load_parameters_path=${SCANNED_CHECKPOINT} run_name=scanned_decoding per_device_batch_size=1 model_name=mistral-7b async_checkpointing=false tokenizer_path=assets/tokenizer.mistral-v1 max_prefill_predict_length=11 max_target_length=16 prompt="[INST] I love to [/INST]" attention=dot_product megablox=False sparse_matmul=False

# Test whether the forward pass logits match the golden logits - matmul implementation
python3 MaxText/tests/forward_pass_logit_checker.py MaxText/configs/base.yml base_output_directory=${BASE_OUTPUT_DIRECTORY} load_parameters_path=${SCANNED_CHECKPOINT} run_name=matmul_forward_pass_test per_device_batch_size=1 model_name=mistral-7b tokenizer_path=assets/tokenizer.mistral-v1 max_prefill_predict_length=11 max_target_length=11 dataset_type=synthetic dtype=float32 megablox=False sparse_matmul=False --atol=3 --rtol=1 --token_size=4
Collaborator:

let's put forward pass logit checking with hf too

# 2. Create MaxText compatible unscanned orbax checkpoint

set -ex
RUN_ID=$(date +%Y-%m-%d-%H-%M)
Collaborator:

can we use BASE_OUTPUT_PATH like here? It is much simpler to use this way for our nightly tests
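
For illustration, a minimal sketch of the pattern being suggested, following the style of other MaxText nightly scripts; the fallback bucket name below is an assumption:

# Hypothetical: honor a caller-supplied BASE_OUTPUT_PATH and fall back to a
# fresh dated run directory only when none is provided.
if [ -z "${BASE_OUTPUT_PATH}" ]; then
  export BASE_OUTPUT_PATH=gs://runner-maxtext-logs/$(date +%Y-%m-%d-%H-%M)/
fi
echo "Using BASE_OUTPUT_PATH = ${BASE_OUTPUT_PATH}"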
