Authors: Nam V. Nguyen, Thong T. Doan, Luong Tran, Van Nguyen, Quang Pham
Mixture of Experts (MoE) models play an important role in the development of more efficient and effective large language models (LLMs). Because of the enormous resource requirements, studying large-scale MoE algorithms remains inaccessible to many researchers. This work develops LibMoE, a comprehensive and modular framework that streamlines the research, training, and evaluation of MoE algorithms. Built upon three core principles, (i) modular design, (ii) efficient training, and (iii) comprehensive evaluation, LibMoE makes MoE in LLMs more accessible to a wide range of researchers by standardizing the training and evaluation pipelines. Using LibMoE, we extensively benchmarked five state-of-the-art MoE algorithms over three different LLMs and 11 datasets under the zero-shot setting. The results show that, despite their unique characteristics, all MoE algorithms perform roughly similarly when averaged across a wide range of tasks. With its modular design and extensive evaluation, we believe LibMoE will be invaluable for researchers making progress towards the next generation of MoE and LLMs.
Date | Release Notes |
---|---|
2024-12-30 | - Released LibMoE v1.1. - Reduced training time by 70%, from ~30h to ~9h. - Added more detailed metrics for MoE algorithms, including balancing loss, z-loss, training time per step, FLOPs, language loss, total loss, aux loss, and other customizable metrics. - Updated balance_loss_coef and router_z_loss_coef for better performance. |
2024-11-04 | - New feature: Metric analysis for MoE algorithms, as detailed in the LibMoE paper ✅ |
2024-11-01 | - Released LibMoE v1.0 preprint report: Read Here ✅ - LibMoE webpage: Visit Here ✅ - Publicly available checkpoints ✅ |
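The v1.1 notes mention a balancing loss, a z-loss, and the coefficients balance_loss_coef and router_z_loss_coef. As a rough illustration of what such terms usually compute, the sketch below uses the common textbook definitions (a Switch-Transformer-style balancing loss and an ST-MoE-style router z-loss) in plain Python. The function names and the way the terms are combined are assumptions made for illustration; LibMoE's actual implementation may differ.

```python
import math


def load_balance_loss(router_probs, expert_ids, num_experts):
    """Switch-Transformer-style balancing loss: num_experts * sum_i f_i * P_i.

    router_probs: per-token list of routing probabilities over experts.
    expert_ids:   per-token index of the expert the token was sent to (top-1).
    """
    num_tokens = len(router_probs)
    # f_i: fraction of tokens dispatched to expert i.
    frac_tokens = [0.0] * num_experts
    for expert in expert_ids:
        frac_tokens[expert] += 1.0 / num_tokens
    # P_i: mean routing probability assigned to expert i.
    mean_probs = [
        sum(probs[i] for probs in router_probs) / num_tokens
        for i in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_probs))


def router_z_loss(router_logits):
    """ST-MoE-style z-loss: mean over tokens of logsumexp(logits) squared."""
    total = 0.0
    for logits in router_logits:
        peak = max(logits)  # subtract the max for numerical stability
        lse = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += lse ** 2
    return total / len(router_logits)


def total_loss(language_loss, balance_loss, z_loss,
               balance_loss_coef, router_z_loss_coef):
    """Assumed combination: language loss plus weighted auxiliary terms."""
    return (language_loss
            + balance_loss_coef * balance_loss
            + router_z_loss_coef * z_loss)
```

With perfectly uniform routing the balancing term equals 1, its minimum, so larger values indicate that tokens concentrate on a few experts.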
We are making all of our experiment checkpoints publicly available to support the community's research on Mixture of Experts (MoE). By reusing our checkpoints from the Pre-Training and Pre-FineTuning stages, others can save time and computational resources in their own experiments.
Stage | Method | Siglip 224 + Phi3.5 | Siglip 224 + Phi3 | CLIP 336 + Phi3 |
---|---|---|---|---|
Pre-Training | | Link | Link | Link |
Pre-FineTuning | | Link | Link | Link |
VIT 665K | SMoE-R | Link | Link | Link |
VIT 665K | Cosine-R | Link | Link | Link |
VIT 665K | Sigmoid-R | Link | Link | Link |
VIT 665K | Hyper-R | Link | Link | Link |
VIT 665K | Perturbed Cosine-R | Link | Link | Link |
*VIT stands for Visual Instruction Tuning.
- Clone this repository:
  git clone https://github.com/Fsoft-AIC/LibMoE.git
  cd LibMoE
- Install dependencies:
  We used Python 3.9 with venv for all experiments. Python 3.9 or 3.10 under Anaconda should also work if you prefer it.
  - Using venv:
    python -m venv /path/to/new/virtual/moe
    source /path/to/new/virtual/moe/bin/activate
  - Using Anaconda:
    conda create -n moe python=3.9 -y
    conda activate moe
  Then, install the required packages:
  pip install --upgrade pip
  pip install -e .
  pip install -r ./requirements.txt
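- Install additional packages:
  Choose the FlashAttention wheel that matches your Torch, CUDA, and Python versions from the FlashAttention Releases.
  Example (as its file name indicates, this wheel targets Torch 2.1, CUDA 11.8, and CPython 3.11; pick the wheel whose cpXY tag matches your interpreter, e.g. cp39 for Python 3.9):
  pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.1cxx11abiFALSE-cp311-cp311-linux_x86_64.whl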
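Because a wheel only installs on the interpreter it was built for, it can help to check the tags in the file name before downloading. The sketch below is a minimal helper, not part of LibMoE: it splits a wheel file name into its standard Python, ABI, and platform tags and compares the Python tag with the running interpreter. The `+cu118torch2.1` pattern is read off the example file name above and is assumed to hold for other FlashAttention wheels; the function names are hypothetical.

```python
import re
import sys


def wheel_tags(wheel_name):
    """Split a wheel file name (or URL) into its trailing compatibility tags."""
    filename = wheel_name.rsplit("/", 1)[-1]
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    stem = filename[: -len(".whl")]
    # Wheel names end with <python tag>-<abi tag>-<platform tag>.
    name_version, python_tag, abi_tag, platform_tag = stem.rsplit("-", 3)
    tags = {"python": python_tag, "abi": abi_tag, "platform": platform_tag}
    # Assumed naming pattern for the CUDA/Torch build, e.g. "+cu118torch2.1".
    build = re.search(r"\+cu(\d+)torch(\d+\.\d+)", name_version)
    if build:
        tags["cuda"], tags["torch"] = build.group(1), build.group(2)
    return tags


def current_python_tag():
    """Tag of the running interpreter, assuming CPython (e.g. 'cp39')."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


def matches_current_python(wheel_name):
    """True if the wheel's Python tag equals the running interpreter's tag."""
    return wheel_tags(wheel_name)["python"] == current_python_tag()
```

For the example wheel above, `wheel_tags` reports the Python tag `cp311`, so `matches_current_python` returns False under Python 3.9 and a different wheel from the releases page is needed.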
For a detailed, step-by-step guide on setting up the dataset, please refer to the dataset guide.
For a detailed step-by-step guide on setting up a new MoE layer, please refer to the model guide.
After downloading the datasets and the corresponding JSON files, you can train the model using the following commands. Below is an example using the Phi3 configuration.
Option 1: Run Each Stage Separately
- Pre-train the MLP connector:
  bash scripts/train/phi3mini/clip/pretrain_phi3.sh
- Pre-finetune the whole model:
  bash scripts/train/phi3mini/clip/pft_phi3mini.sh
- Visual instruction tuning stage:
  bash scripts/train/phi3mini/clip/sft_phi3mini.sh
Option 2: Run All Stages
You can run all stages in sequence with the following command:
bash scripts/train/run_train_all.sh
Note:
- These scripts are designed for training the model on a single node with 4x A100 GPUs.
- You must set batch_size to the value specified in our scripts (/scripts/train/phi3mini/clip) for each stage, where batch_size = gradient_accumulation_steps * batch_size_current.
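If your hardware forces a smaller batch_size_current than the scripts use, the relation in the note above tells you how many gradient accumulation steps keep batch_size unchanged. The helper below is a small sketch of that arithmetic only; the function name is hypothetical, the numbers in the usage line are illustrative rather than taken from the scripts, and whether batch_size_current is counted per GPU or across all GPUs is not stated here, so check the scripts before relying on it.

```python
def gradient_accumulation_steps(batch_size, batch_size_current):
    """Solve batch_size = gradient_accumulation_steps * batch_size_current."""
    if batch_size <= 0 or batch_size_current <= 0:
        raise ValueError("batch sizes must be positive")
    steps, remainder = divmod(batch_size, batch_size_current)
    if remainder:
        # An inexact split would silently change the effective batch size.
        raise ValueError(
            f"batch_size={batch_size} is not a multiple of "
            f"batch_size_current={batch_size_current}"
        )
    return steps


# Illustrative values only: a target of 256 with 32 samples per step.
print(gradient_accumulation_steps(256, 32))
```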
Test Training the Model
We recommend running all stages with MAX_STEPS=2 to check for issues in each stage. This lets you identify and fix problems quickly before committing to a full run. After testing, set MAX_STEPS=-1 to train for all steps, and remember to delete the checkpoint folder that was created during testing.
#!/bin/bash
export TMPDIR=""
export TOOLKIT_DIR="" # Path to the toolkitmoe directory
export KEY_HF="" # Hugging Face API key
export ID_GPUS="0,1,2,3"
# Set to -1 to run all steps
export MAX_STEPS=2 # Select a suitable number of steps for testing each stage
echo "Starting pretrain stage"
bash ./scripts/train/phi3mini/pretrain_phi3.sh
echo "Starting pft stage"
bash ./scripts/train/phi3mini/pft_phi3mini.sh
echo "Starting sft stage"
bash ./scripts/train/phi3mini/sft_phi3mini.sh
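The same smoke test can be driven from Python when you want the run to stop as soon as one stage fails, which the plain shell script above does not do. This is a minimal sketch under that assumption: `run_stages` and `STAGES` are hypothetical names, the script paths are copied from the shell script above, and only MAX_STEPS is set here, so the other exported variables still have to be present in your environment.

```python
import os
import subprocess

# Stage names and commands, copied from the shell script above.
STAGES = [
    ("pretrain", ["bash", "./scripts/train/phi3mini/pretrain_phi3.sh"]),
    ("pft", ["bash", "./scripts/train/phi3mini/pft_phi3mini.sh"]),
    ("sft", ["bash", "./scripts/train/phi3mini/sft_phi3mini.sh"]),
]


def run_stages(stages, max_steps=2):
    """Run each stage in order with MAX_STEPS set; stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    env = dict(os.environ, MAX_STEPS=str(max_steps))
    completed = []
    for name, command in stages:
        print(f"Starting {name} stage")
        result = subprocess.run(command, env=env)
        if result.returncode != 0:
            print(f"{name} stage failed with exit code {result.returncode}")
            break
        completed.append(name)
    return completed
```

Calling `run_stages(STAGES, max_steps=2)` from the repository root performs the quick check; pass `max_steps=-1` for a full run once every stage passes.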
We evaluate on multiple benchmarks:
- AI2D
- ChartQA
- Text VQA
- GQA
- HallusionBenchmark
- MathVista Validation
- MMBenchEN
- MME
- MMMU Validation
- MMStar
- POPE
- SQA IMG Full
To run the evaluation, use the following command:
bash scripts/eval/run_eval.sh
*Note: For the MathVista Validation and HallusionBenchmark, GPT-4 is used for evaluation. You need to provide an API key to perform the evaluation.
Evaluation of LLaVA on MME
# --return_id_experts true returns the selected expert IDs.
# --layers_expert_selection 1 2 3 defines specific layers for expert selection;
# if no layer IDs are defined, all experts from all layers are selected by default.
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_v1.5_mme \
    --output_path ./logs/ \
    --return_id_experts true \
    --layers_expert_selection 1 2 3
Evaluation of LLaVA on multiple datasets
# --return_id_experts true returns the selected expert IDs.
# --layers_expert_selection 1 2 3 defines specific layers for expert selection;
# if no layer IDs are defined, all experts from all layers are selected by default.
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks mme,mmbench_en \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_v1.5_mme_mmbenchen \
    --output_path ./logs/ \
    --return_id_experts true \
    --layers_expert_selection 1 2 3
For other LLaVA variants, change conv_template in model_args. conv_template is an argument of the init function of the llava model in lmms_eval/models/llava.py. You can find the corresponding value in LLaVA's code, probably in the dict variable conv_templates in moe_model/conversation.py.
# --return_id_experts true returns the selected expert IDs.
# --layers_expert_selection 1 2 3 defines specific layers for expert selection;
# if no layer IDs are defined, all experts from all layers are selected by default.
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.6-mistral-7b,conv_template=mistral_instruct" \
    --tasks mme,mmbench_en \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_v1.5_mme_mmbenchen \
    --output_path ./logs/ \
    --return_id_experts true \
    --layers_expert_selection 1 2 3
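When switching between variants it is easy to mistype the comma-separated model_args value. The sketch below assembles that string from keyword arguments, following the `key=value,key=value` shape used in the command above. `build_model_args` is a hypothetical helper written for illustration and is not part of lmms_eval or LibMoE.

```python
def build_model_args(pretrained, **options):
    """Build the value passed to --model_args, e.g. 'pretrained=...,conv_template=...'."""
    pairs = [("pretrained", pretrained)] + list(options.items())
    parts = []
    for key, value in pairs:
        text = str(value)
        # Commas separate the arguments, so a value must not contain one.
        if "," in text:
            raise ValueError(f"value for {key!r} must not contain a comma")
        parts.append(f"{key}={text}")
    return ",".join(parts)


# The LLaVA v1.6 Mistral variant from the command above.
print(build_model_args("liuhaotian/llava-v1.6-mistral-7b",
                       conv_template="mistral_instruct"))
```

The resulting string can be passed as a single argument after `--model_args`, with no extra shell quoting when the command is launched from Python as an argument list.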
If you find this repository useful, please consider citing our paper:
@misc{nguyen2024libmoelibrarycomprehensivebenchmarking,
title={LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models},
author={Nam V. Nguyen and Thong T. Doan and Luong Tran and Van Nguyen and Quang Pham},
year={2024},
eprint={2411.00918},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.00918},
}