This repository contains the preliminary submission details for the Edge-Device Large Language Model Competition.
- Prof. Chetan Singh Thakur
- Madhuvanthi Srivatsav R
- Arun C R
- Sriya R G
- Dhruva Kashyap
- For pruning the Qwen and Llama models, we used *Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods*, which adopts depth pruning: entire layers or blocks are removed while the size of the remaining weights stays unchanged.
- For finetuning, as mentioned in the competition starter kit, we used the [C4 dataset](https://huggingface.co/datasets/c4) and the [Alpaca dataset](https://huggingface.co/datasets/silk-road/alpaca-data-gpt4-chinese).
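For intuition, depth pruning amounts to choosing which transformer blocks to keep. The helper below is a hypothetical illustration, not the code used in this submission, and it is simpler than the importance-based block selection in Shortened LLaMA: it drops evenly spaced middle blocks while keeping the first and last, which are typically the most sensitive to removal.

```python
def layers_to_keep(num_layers: int, prune_ratio: float) -> list[int]:
    """Return indices of transformer blocks to retain after depth pruning.

    Depth pruning removes whole blocks; this sketch drops `prune_ratio`
    of them, evenly spaced through the middle of the network, keeping
    the first and last blocks. (Shortened LLaMA instead ranks blocks by
    an importance score before removal.)
    """
    num_remove = round(num_layers * prune_ratio)
    if num_remove == 0:
        return list(range(num_layers))
    # Candidate blocks for removal: everything except the first and last.
    middle = list(range(1, num_layers - 1))
    step = len(middle) / num_remove
    removed = {middle[int(i * step)] for i in range(num_remove)}
    return [i for i in range(num_layers) if i not in removed]
```

In a real pipeline, the surviving indices would be used to rebuild the model's block list (e.g. an `nn.ModuleList`) before the retraining step.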
- Download the models in fp16 format from the Original_models folder, using the link provided in model_checkpoint.txt, into a local folder.
- Each downloaded folder's contents should look like:

```
├── config.json                        # Configuration file for the model
├── generation_config.json             # Generation-specific configuration
├── special_tokens_map.json            # Mapping of special tokens
├── tokenizer_config.json              # Configuration file for the tokenizer
├── tokenizer.json                     # Tokenizer data file
├── model.safetensors.index.json       # Index file for the model weights
├── model-00001-of-00002.safetensors   # Part 1 of model weights in Safetensors format
└── model-00002-of-00002.safetensors   # Part 2 of model weights in Safetensors format
```
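A quick way to verify that a download is complete is to check the folder against the listing above. The following is a hypothetical convenience script, not part of the starter kit; the required file names come directly from the tree shown:

```python
import os

# Files expected in each downloaded model folder (per the tree above).
REQUIRED_FILES = [
    "config.json",
    "generation_config.json",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "model.safetensors.index.json",
]

def check_model_folder(path: str) -> list[str]:
    """Return the list of required files missing from `path`.

    Sharded weights (model-XXXXX-of-YYYYY.safetensors) are checked
    separately, since the shard count can vary between models.
    """
    missing = [f for f in REQUIRED_FILES
               if not os.path.isfile(os.path.join(path, f))]
    shards = [f for f in os.listdir(path) if f.endswith(".safetensors")]
    if not shards:
        missing.append("model-*.safetensors")
    return missing
```

An empty return value means the folder matches the expected layout.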
Tasks on the downloaded models should be run using the following command:

```shell
# remove "-r latest" if reusing previous results is not intended
# use --hf-type base for phi-2
opencompass --datasets <name_of_the_dataset> --hf-type chat \
    --models path_to_the_downloaded_model_folder --debug \
    --model-kwargs device_map='auto' trust_remote_code=True \
    --batch-size 1 -r latest --max-out-len 1024 --max-num-workers 1
```
The exact commands used for each model:

```shell
# phi-2 (base type)
opencompass --datasets truthfulqa_gen commonsenseqa_7shot_cot_gen_734a22 gsm8k_gen humaneval_gen FewCLUE_chid_gen bbh_gen \
    --hf-type base --hf-path path_to_phi_model --tokenizer-path microsoft/phi-2 \
    --model-kwargs device_map='auto' trust_remote_code=True torch_dtype=torch.float16 \
    --max-num-workers 4 --max-out-len 1024 -r latest --batch-size 4

# Qwen2 (chat type)
opencompass --datasets truthfulqa_gen commonsenseqa_7shot_cot_gen_734a22 gsm8k_gen humaneval_gen FewCLUE_chid_gen bbh_gen \
    --hf-type chat --hf-path path_to_qwen_15_model --tokenizer-path Qwen/Qwen2-7B-Instruct \
    --model-kwargs device_map='auto' trust_remote_code=True torch_dtype=torch.float16 \
    --max-num-workers 4 --max-out-len 1024 -r latest --batch-size 4

# Llama 3.1 (chat type)
opencompass --datasets truthfulqa_gen commonsenseqa_7shot_cot_gen_734a22 gsm8k_gen humaneval_gen FewCLUE_chid_gen bbh_gen \
    --hf-type chat --hf-path path_to_llama_20_model --tokenizer-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --model-kwargs device_map='auto' trust_remote_code=True torch_dtype=torch.float16 \
    --max-num-workers 4 --max-out-len 1024 -r latest --batch-size 4
```
To check the peak memory usage during inference, we ran EvaluateThroughputAndMemory.py (provided in the starter kit):

```shell
python EvaluateThroughputAndMemory.py \
    --model_name <microsoft/phi-2 | meta-llama/Meta-Llama-3.1-8B-Instruct | Qwen/Qwen2-7B-Instruct> \
    --model_path path_to_the_downloaded_model
```
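We do not reproduce the starter-kit script here, but the core throughput figures it reports (inferences per second and seconds per inference, as in the results table) can be sketched as a simple timing loop. `run_inference` below is a hypothetical stand-in for a `model.generate()` call:

```python
import time

def measure_throughput(run_inference, num_iters: int = 10) -> tuple[float, float]:
    """Time `run_inference` and return (inferences/sec, sec/inference).

    `run_inference` is any zero-argument callable; in the real
    evaluation it would wrap a model.generate() call.
    """
    start = time.perf_counter()
    for _ in range(num_iters):
        run_inference()
    elapsed = time.perf_counter() - start
    latency = elapsed / num_iters   # seconds per inference
    return 1.0 / latency, latency   # throughput, latency
```

The two numbers are reciprocals, which is why the table reports both an inf/s figure and a per-inference time.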
- Submitted models can be found via the links provided in model_checkpoint.txt and converted_models.txt
- Original models (before MLC compilation) can be found via model_checkpoint.txt
- MLC-compiled models can be found via converted_models.txt
- Screenshots of the running app for phi-2 are provided in the screenshots folder of this repository
- app-release.apk can be found via the link in app_release.txt
## Results Summary
| Model  | CommonsenseQA_gen | FewCLUE_chid_gen | bbh_gen | HumanEval | GSM8K | TruthfulQA | Peak Memory | Throughput |
|--------|-------------------|------------------|---------|-----------|-------|------------|-------------|------------|
| Llama3 | 22.36             | 13.59            | 7.91    | 0.61      | 1.97  | 0.18       | 8.755 GB    | 71.61 inf/s (0.014 s/inf)  |
| Qwen   | 0                 | 0.45             | 1.88    | 0         | 0.99  | 0          | 7.184 GB    | 77.73 inf/s (0.0129 s/inf) |
| Phi    | 65.52             | 14.34            | 59.46   | 28.66     | 62.09 | 0.18       | 6.143 GB    | 13.53 inf/s (0.0739 s/inf) |
We initially encountered errors while compiling the Llama and Qwen models; refer to the MLC COMPILATION ERRORS folder in this repository. We were eventually able to compile all the models. However, Llama and Qwen fail at runtime due to out-of-memory issues on the target device, despite our adherence to the competition rules: as reported in the table above, these models use only around 9 GB of memory on the desktop.
- Results were generated with hf-type chat for the Llama and Qwen models and hf-type base for phi-2, with batch size 1 and max output length 1024.
- The OpenCompass benchmark results were obtained on a machine with the following specs:
  - CPU Configuration: 2P 96C AMD Genoa CPU
  - System RAM: 1.5 TB
  - System Storage: 48 TB
  - GPU Configuration: 4 x AMD MI210 (64 GB)
  - Operating System: Ubuntu 22.04 LTS
  - ROCm Version: 5.7
- The throughput and memory evaluation numbers were obtained on an Nvidia A100 GPU system with the following specs:
  - CPU Configuration: AMD EPYC 7742 64-Core Processor
  - System RAM: 512 GB
  - GPU Configuration: 4 x Nvidia A100 (40 GB)
  - Operating System: Ubuntu 22.04.4 LTS
  - CUDA Version: 12.4
- The link for the converted model file can be found in saved_models.txt
- Refer to Submission Commands.txt