⚡ A repository for evaluating AudioLLMs in various tasks 🚀 ⚡
⚡ AudioBench: A Universal Benchmark for Audio Large Language Models 🚀 ⚡
🌟 Come to View Our Live Leaderboard on Huggingface Space 🌟
- JAN 2025: AudioBench paper is accepted to NAACL 2025 Main Conference.
- JAN 2025: Support 10+ MNSC - Singlish Understanding datasets.
- DEC 2024: Support more datasets (35) and more models (2 cascade and 3 fusion models).
- SEP 2024: Add MuChoMusic dataset for music evaluation (multiple choice questions).
- AUG 2024: Support 6 speech translation datasets. Update the evaluation script for several MCQ evaluations.
- AUG 2024: Leaderboard is live. Check it out here.
- JUL 2024: We are working hard on the leaderboard and speech translation dataset. Stay tuned!
- JUL 2024: Support all INITIAL 26 datasets listed in AudioBench manuscript.
Installation with pip:
pip install -r requirements.txt
For model-as-judge evaluation, we serve the judgement model as a service via vllm on port 5000.
The example below hosts a Llama-3-70B-Instruct model as the judge and evaluates the Qwen2-Audio-7B-Instruct model.
# Step 1:
# Serve the judgement model using the vLLM framework (this example uses an int4-quantized version)
# This requires 1 x 80GB GPU
bash vllm_model_judge_llama_3_70b.sh
# Step 2:
# We perform model inference and obtain the evaluation results with the second GPU
GPU=2
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1 # -1 means evaluate on all test samples
MODEL_NAME=Qwen2-Audio-7B-Instruct
DATASET=cn_college_listen_mcq_test
METRICS=llama3_70b_judge_binary
bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
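Loading the judge model in Step 1 takes a while, so Step 2 can fail if it starts before the server is ready. The sketch below polls the server until it responds. It assumes that vllm_model_judge_llama_3_70b.sh starts vLLM's OpenAI-compatible server on localhost, which lists its served models at /v1/models; JUDGE_HOST, JUDGE_PORT, judge_url and wait_for_judge are hypothetical names introduced for this example and are not part of the repository.

```shell
#!/bin/sh
# Sketch: wait until the judge server answers before starting Step 2.
# Assumption: the Step 1 script starts vLLM's OpenAI-compatible server,
# which exposes GET /v1/models. Adjust the path if your setup differs.
JUDGE_HOST=${JUDGE_HOST:-localhost}   # hypothetical variable, for illustration
JUDGE_PORT=${JUDGE_PORT:-5000}        # port stated in this README

judge_url() {
    echo "http://${JUDGE_HOST}:${JUDGE_PORT}/v1/models"
}

wait_for_judge() {
    # $1 = maximum number of attempts, $2 = seconds to sleep between attempts
    attempts=$1
    delay=$2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl return non-zero on HTTP errors; -s silences progress output
        if curl -sf -o /dev/null --max-time 5 "$(judge_url)"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

A possible usage is `wait_for_judge 60 10 && bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES`, which retries for up to ten minutes before giving up.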
To evaluate on a different dataset, replace the DATASET and METRICS values. A full list of supported datasets can be found in SUPPORTED DATASETS.
DATASET=librispeech_test_clean
METRICS=wer
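Since only the dataset and metric change between runs, several evaluations can be queued in one script. The sketch below reuses the two dataset/metric pairs shown in this README and passes arguments to eval.sh in the same positional order as the example above. DRY_RUN and run_eval are hypothetical names introduced here, not part of the repository; by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: run eval.sh over several dataset/metric pairs.
# DRY_RUN is a hypothetical switch for this example: 1 prints the commands,
# any other value executes them.
GPU=2
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1
MODEL_NAME=Qwen2-Audio-7B-Instruct
DRY_RUN=${DRY_RUN:-1}

run_eval() {
    # $1 = dataset name, $2 = metric name
    set -- bash eval.sh "$1" "$MODEL_NAME" "$GPU" "$BATCH_SIZE" "$OVERWRITE" "$2" "$NUMBER_OF_SAMPLES"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        # Redirect stdin so eval.sh cannot consume the list read by the loop below
        "$@" </dev/null
    fi
}

# Each line pairs a dataset with the metric used for it in this README.
while read -r dataset metric; do
    run_eval "$dataset" "$metric"
done <<EOF
cn_college_listen_mcq_test llama3_70b_judge_binary
librispeech_test_clean wer
EOF
```

Running it with DRY_RUN=0 executes the evaluations one after another on the same GPU.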
Switching models is just as simple: replace the MODEL_NAME. A full list of supported models can be found in SUPPORTED MODELS.
MODEL_NAME=cascade_whisper_large_v3_llama_3_8b_instruct
To evaluate on new models, please refer to adding_new_model.
Adding a new dataset takes two simple steps:
- Add dataset loader and inference part. Example for cn_college_listen_mcq_test
- Edit dataset.py
If you find our work useful, please consider citing our paper!
@article{wang2024audiobench,
title={AudioBench: A Universal Benchmark for Audio Large Language Models},
author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
journal={NAACL},
year={2025}
}
- Llama3-S: When Llama Learns to Listen
- More to come...