- Install Docker & NVIDIA Docker. Follow the installation link.
- Make sure all 8x NVIDIA H100 GPUs are visible.
- Check GPU status with the command:
nvidia-smi
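To quickly confirm that all eight GPUs are enumerated on the host, a compact query such as the following can be used (optional sanity check, not part of the original steps):
nvidia-smi --query-gpu=index,name --format=csv,noheader   # should list 8 H100 entries, indices 0-7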
export HOSTSPACE="/mnt/Scratch_space"
sudo docker run --runtime=nvidia --name=TensorRT_LLM_8xGPU_CUDA_12.6.0 --gpus=all --entrypoint /bin/bash \
--net=host --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --cap-add=SYS_PTRACE \
--cap-add=SYS_ADMIN --cap-add=DAC_READ_SEARCH --security-opt seccomp=unconfined -it \
-v ${HOSTSPACE}:/home/user -w /home/user nvidia/cuda:12.6.0-cudnn-devel-ubuntu22.04
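Once inside the container, it is worth re-checking GPU visibility before installing anything (optional check):
nvidia-smi -L   # should enumerate GPU 0..7, all NVIDIA H100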
chmod 1777 /tmp
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs vim
pip3 install numpy==1.26.4
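A quick way to confirm the pinned NumPy version took effect (optional, shown only as a sanity check):
python3 -c "import numpy; print(numpy.__version__)"   # expected: 1.26.4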
Unzip the "Mixtral_Inference.zip" archive to /home/user/Mixtral_Inference, or clone the repository instead:
git clone https://github.com/ajithAI/Mixtral_Inference.git
cd /home/user/Mixtral_Inference
git clone --recursive https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
pip install typing-extensions
pip install "cython<3"
pip install "pyyaml==5.4.1" --no-build-isolation
pip install -r requirements.txt
pip install --upgrade transformers
cd .. && chmod 777 *.sh
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
python3 -c "import tensorrt_llm"  # prints the installed TensorRT-LLM version, for example:
[TensorRT-LLM] TensorRT-LLM version: 0.13.0.dev2024081000
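TensorRT-LLM drives the 8-way tensor-parallel runs below through MPI, so it can help to confirm the Open MPI binaries installed earlier are on the PATH (optional check):
mpirun --version   # should report the Open MPI version pulled in by openmpi-bin / libopenmpi-dev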
apt-get install git-lfs
git lfs install
git clone https://huggingface.co/mistralai/Mixtral-8x7B-v0.1 /home/user/Mixtral-8x7B-v0.1
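The FP16 weights are large, so a rough way to confirm the Git LFS download completed is to check the directory size (the figure below is an approximation, not an exact value):
du -sh /home/user/Mixtral-8x7B-v0.1   # roughly 90-100 GB once the safetensors weights are fully pulled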
export DIR=/home/user/Mixtral_Inference && cd $DIR && mkdir $DIR/Checkpoints
cd TensorRT-LLM/examples
python3 quantization/quantize.py --model_dir /home/user/Mixtral-8x7B-v0.1 --dtype float16 \
--qformat fp8 --kv_cache_dtype fp8 --calib_size 512 --tp_size 8 \
--output_dir $DIR/Checkpoints/Mixtral_8x7B_v0.1_Checkpoint_FP8_8xGPU_CUDA_12.6_TRT_LLM_0.13
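If quantization succeeds, the checkpoint directory should contain one weight shard per tensor-parallel rank plus a config file; the file names below reflect what TensorRT-LLM typically emits and are listed here only for verification:
ls $DIR/Checkpoints/Mixtral_8x7B_v0.1_Checkpoint_FP8_8xGPU_CUDA_12.6_TRT_LLM_0.13   # expect config.json plus rank0..rank7 *.safetensors shards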
export DIR=/home/user/Mixtral_Inference && cd $DIR && mkdir $DIR/TRT_Engines
trtllm-build --checkpoint_dir $DIR/Checkpoints/Mixtral_8x7B_v0.1_Checkpoint_FP8_8xGPU_CUDA_12.6_TRT_LLM_0.13 \
--output_dir $DIR/TRT_Engines/Mixtral_8x7B_v0.1_TRT_Engine_FP8_8xGPU_MaxBatch_8192_MaxSeqLen_4096_CUDA_12.6_TRT_LLM_0.13_TP_8 \
--gemm_plugin auto --use_paged_context_fmha enable --use_fused_mlp enable --use_fp8_context_fmha enable --workers 8 \
--max_batch_size 8192 --max_input_len 2048 --max_seq_len 4096
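Likewise, after the build finishes the engine directory should hold one serialized engine per rank plus a config file (again, the expected names are typical TensorRT-LLM output, shown as an assumption for verification):
ls $DIR/TRT_Engines/Mixtral_8x7B_v0.1_TRT_Engine_FP8_8xGPU_MaxBatch_8192_MaxSeqLen_4096_CUDA_12.6_TRT_LLM_0.13_TP_8   # expect config.json plus rank0..rank7 *.engine files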
USAGE: <RUN_SCRIPT> <BATCH_SIZE> <INPUT_LENGTH> <OUTPUT_LENGTH> <WARMUP_ITER> <BENCHMARK_ITER> <OUTPUT_FILENAME>
./run_mixtral_inference.sh 64 2048 2048 25 75 Mixtral_TRT_Batch_64_Input_2048_Output_2048
./run_mixtral_inference.sh 96 2048 128 50 200 Mixtral_TRT_Batch_96_Input_2048_Output_128
./run_mixtral_inference.sh 1024 128 128 50 150 Mixtral_TRT_Batch_1024_Input_128_Output_128
./run_mixtral_inference.sh 1024 128 2048 5 25 Mixtral_TRT_Batch_1024_Input_128_Output_2048
./run_mixtral_inference.sh 64 2048 1 50 250 Mixtral_TRT_Batch_64_Input_2048_Output_1
./run_mixtral_inference.sh 96 2048 1 50 250 Mixtral_TRT_Batch_96_Input_2048_Output_1
./run_mixtral_inference.sh 1024 128 1 50 250 Mixtral_TRT_Batch_1024_Input_128_Output_1
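While a benchmark iteration is running, GPU power, utilization, clocks and memory can be watched from a second shell (optional; the dmon flags below are one reasonable choice):
nvidia-smi dmon -s pucm -d 5   # sample power, utilization, clocks and memory every 5 seconds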
cd ${HOSTSPACE}/Mixtral_Inference
bash ./docker_run_benchmark.sh
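If you need to reattach to the container started earlier (for example, to rerun a single benchmark by hand), the standard Docker commands apply; the container name matches the one given in the docker run step above:
sudo docker start TensorRT_LLM_8xGPU_CUDA_12.6.0              # only needed if the container has stopped
sudo docker exec -it TensorRT_LLM_8xGPU_CUDA_12.6.0 /bin/bash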
sudo systemctl stop nvidia-fabricmanager
sudo systemctl restart nvidia-fabricmanager
sudo systemctl status nvidia-fabricmanager
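On HGX H100 systems the GPUs communicate over NVSwitch, so after restarting the fabric manager it is worth confirming the service is active and the NVLink topology looks healthy (optional check):
systemctl is-active nvidia-fabricmanager   # should print: active
nvidia-smi topo -m                         # NVLink connectivity matrix across the 8 GPUs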