AudioLLMs/AudioBench

🔥 AudioBench 🔥

⚡ A repository for evaluating AudioLLMs in various tasks 🚀 ⚡
⚡ AudioBench: A Universal Benchmark for Audio Large Language Models 🚀 ⚡
🌟 View our live leaderboard on Hugging Face Spaces 🌟

Change log

  • JAN 2025: The AudioBench paper was accepted to the NAACL 2025 Main Conference.
  • JAN 2025: Support 10+ MNSC - Singlish Understanding datasets.
  • DEC 2024: Support more datasets (35) and more models (2 cascade and 3 fusion models).
  • SEP 2024: Add MuChoMusic dataset for music evaluation (multiple choice questions).
  • AUG 2024: Support 6 speech translation datasets. Update the evaluation script for several MCQ evaluations.
  • AUG 2024: Leaderboard is live. Check it out here.
  • JUL 2024: We are working hard on the leaderboard and speech translation dataset. Stay tuned!
  • JUL 2024: Support all INITIAL 26 datasets listed in AudioBench manuscript.

🔧 Installation

Installation with pip:

pip install -r requirements.txt

For model-as-judge evaluation, we serve the judgement model as a service via vLLM on port 5000.
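Because the judge runs as a separate service, it can help to confirm that something is listening on port 5000 before starting an evaluation. The following is a minimal sketch, not part of AudioBench: `wait_for_port` is a hypothetical helper, the host `127.0.0.1` is an assumption (the text above only states the port), and the `/dev/tcp` redirection requires bash.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of AudioBench): block until a TCP port accepts
# connections, or give up after a timeout.
# Assumptions: the judge is served on the local machine, and the shell is bash
# (the /dev/tcp redirection is a bash feature).
wait_for_port() {
  local host="$1" port="$2" timeout_s="${3:-600}"
  local waited=0
  # The subshell opens a TCP connection and closes it again on exit.
  until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "nothing listening on ${host}:${port} after ${timeout_s}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "port ${port} on ${host} is accepting connections"
}

# Example usage (commented out so that sourcing this file has no side effects):
# wait_for_port 127.0.0.1 5000 600 && echo "judge port is up"
```

An open port only shows that the server process has started accepting connections; it does not guarantee that the model has finished loading.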

⏩ Quick Start

The example below hosts a Llama-3-70B-Instruct model as the judge and evaluates Qwen2-Audio-7B-Instruct on the cn_college_listen_mcq_test dataset.

# Step 1:
# Serve the judgement model with the vLLM framework (this example uses an int4-quantized version)
# This requires 1 x 80GB GPU
bash vllm_model_judge_llama_3_70b.sh

# Step 2:
# We perform model inference and obtain the evaluation results with the second GPU
GPU=2
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1 # -1 evaluates all test samples

MODEL_NAME=Qwen2-Audio-7B-Instruct

DATASET=cn_college_listen_mcq_test
METRICS=llama3_70b_judge_binary

bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES

How to Evaluate AudioBench Supported Datasets?

Replace the DATASET and METRICS values in the Quick Start script. A full list of supported datasets can be found in SUPPORTED DATASETS.

DATASET=librispeech_test_clean
METRICS=wer

How to Evaluate AudioBench Supported Models?

Replace the MODEL_NAME value in the Quick Start script. A full list of supported models can be found in SUPPORTED MODELS.

MODEL_NAME=cascade_whisper_large_v3_llama_3_8b_instruct
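The dataset and model substitutions described above can be combined into a small sweep. The sketch below is illustrative only: `run_sweep` and the `DRY_RUN` switch are hypothetical names introduced here, not AudioBench options, and the argument order simply mirrors the `eval.sh` call in the Quick Start. It uses only the models, datasets, and metrics named in this README.

```bash
#!/usr/bin/env bash
# Sketch: sweep the models and (dataset, metric) pairs named in this README.
# run_sweep and DRY_RUN are hypothetical helpers for this example; they are
# not part of AudioBench. Argument order follows the Quick Start eval.sh call.
GPU=2
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

MODELS="Qwen2-Audio-7B-Instruct cascade_whisper_large_v3_llama_3_8b_instruct"
PAIRS="cn_college_listen_mcq_test:llama3_70b_judge_binary librispeech_test_clean:wer"

run_sweep() {
  local model pair dataset metrics
  for model in $MODELS; do
    for pair in $PAIRS; do
      dataset="${pair%%:*}"   # part before the colon
      metrics="${pair#*:}"    # part after the colon
      if [ "$DRY_RUN" = "1" ]; then
        echo "bash eval.sh $dataset $model $GPU $BATCH_SIZE $OVERWRITE $metrics $NUMBER_OF_SAMPLES"
      else
        bash eval.sh "$dataset" "$model" "$GPU" "$BATCH_SIZE" "$OVERWRITE" "$metrics" "$NUMBER_OF_SAMPLES" || return 1
      fi
    done
  done
}

run_sweep
```

With the default `DRY_RUN=1` the script only prints the four commands it would run, which makes it easy to check the names before committing GPU time; setting `DRY_RUN=0` executes them in sequence and stops at the first failure.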

How to Evaluate Your Models?

To evaluate a new model, please refer to adding_new_model.

How to Evaluate on Your Dataset?

Two simple steps:

  1. Add a dataset loader and the inference part (see the example for cn_college_listen_mcq_test).
  2. Edit dataset.py.

📖 Citation

If you find our work useful, please consider citing our paper!

@article{wang2024audiobench,
  title={AudioBench: A Universal Benchmark for Audio Large Language Models},
  author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
  journal={NAACL},
  year={2025}
}

Researchers, companies or groups that are using AudioBench:
