Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework
Official Implementation
Paper | Project Page | Run Analysis Baseline
This repo contains the official implementation of our paper "Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework". You can find more details on our project page and in our paper.
Esteban Garces Arias, Hannah Blocher, Julian Rodemann, Meimingwei Li, Christian Heumann, Matthias Aßenmacher
Department of Statistics, LMU Munich, Munich Center for Machine Learning (MCML)
In this paper, we present novel ranking strategies within a multicriteria evaluation framework. Specifically, we employ benchmarking approaches based on partial orderings and present a new summary metric designed to balance existing automatic indicators, providing a more holistic evaluation of text generation quality. Furthermore, we discuss the alignment of these approaches with human judgments. Our experiments demonstrate that the proposed methods offer a robust way to compare decoding strategies, exhibit similarities with human preferences, and serve as valuable tools in guiding model selection for open-ended text generation tasks. Finally, we suggest future directions for improving evaluation methodologies in text generation.
This repository contains:
- 🪐 A simple R implementation of the extended Bradley-Terry model
- ⚡️ Faster metric computation for coherence, MAUVE, and diversity in Python
- 💥 A Colab notebook with a demo of the qstar analysis
Setup Environment 💻 [Back to Top]
To install all the dependencies for this repo, run the following commands:
pip install -r requirements.txt
SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True pip install simctg
We recommend creating a new conda environment for this repository:
conda create -n helmet python=3.11
conda activate helmet
pip install -r requirements.txt
SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True pip install simctg
Data Download 📚 [Back to Top]
To download the data, please run the following command:
bash download_data.sh
Run Analysis Baseline 🔥 [Back to Top]
We have open-sourced our pre-computed LLM inference results so that the analysis in our paper can be reproduced. Please refer to the Data Download section to download the data. To use your own inference results instead, follow the instructions below.
To compute coherence for your own inference results (a JSON file), run the following command:
python coherence_computation.py \
--opt_model_name $OPT_MODEL_NAME \
--test_path $TEST_PATH
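For orientation, coherence in the contrastive-search line of work this repository builds on is usually defined as the average log-likelihood of the generated continuation given its prompt, scored by an OPT model (presumably the model selected by `--opt_model_name`). The sketch below illustrates only that averaging step in plain Python. It assumes this definition applies here; the function names and the toy logits are hypothetical and are not taken from `coherence_computation.py`.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def coherence(step_logits, continuation_ids):
    """Average log-likelihood of the continuation tokens.

    step_logits[i] holds the scorer's logits for position i of the
    continuation (already conditioned on the prompt and earlier tokens);
    continuation_ids[i] is the token that was actually generated there.
    """
    if len(step_logits) != len(continuation_ids):
        raise ValueError("need exactly one logit vector per continuation token")
    if not continuation_ids:
        raise ValueError("continuation must contain at least one token")
    total = 0.0
    for logits, token_id in zip(step_logits, continuation_ids):
        total += log_softmax(logits)[token_id]
    return total / len(continuation_ids)


# Toy input: a 4-token vocabulary with uniform logits at both steps,
# so every token has probability 1/4 and the score is log(1/4).
score = coherence([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]], [2, 0])
print(score)  # about -1.386
```

Values closer to zero mean the scorer finds the continuation more likely given the prompt.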
To compute diversity for your own inference results (a JSON file), run the following command:
python diversity_computation.py \
--test_path $TEST_PATH
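As a rough guide to what a diversity score measures, a common definition in the decoding literature multiplies, for n = 2, 3, 4, the share of unique n-grams in the generated text. The sketch below follows that definition in plain Python. It is an assumption that `diversity_computation.py` uses the same formula; the whitespace tokenization and the function names are illustrative only.

```python
def ngram_repetition(tokens, n):
    """Fraction of n-grams that are repeats of an earlier n-gram (rep-n in [0, 1])."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Text shorter than n tokens: treat it as having no repetition.
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def diversity(tokens, ns=(2, 3, 4)):
    """Product over n of (1 - rep-n); 1.0 means no repeated n-grams at all."""
    score = 1.0
    for n in ns:
        score *= 1.0 - ngram_repetition(tokens, n)
    return score


# Whitespace tokenization is an assumption made for this sketch.
print(diversity("a b c d e f".split()))  # 1.0, every n-gram is unique
print(diversity(["a"] * 6))              # close to 0, heavily repetitive text
```

Degenerate, looping generations drive the score toward 0, while text without repeated n-grams scores 1.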
To run the qstar analysis, please run the following command:
Rscript qstar_metric.R
You may need to modify the following lines of qstar_metric.R to match your data paths:
# Line 14: set this to the path of your results_with_pareto_efficiency.csv
data_file <- "path/to/results_with_pareto_efficiency.csv"
# Line 15: set this to the path where 'ranking_qtext.csv' should be saved
candidate_stats_file <- "path/to/ranking_qtext.csv"
# Line 16: set this to the path where 'dominance_final_analysis.csv' should be saved
dominance_summary_file <- "path/to/dominance_final_analysis.csv"
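The file names above refer to Pareto efficiency and dominance. As background, the sketch below shows the generic notion of Pareto dominance between candidates scored on several metrics where higher is better. It is not the logic of `qstar_metric.R`, whose internals are not described here; the strategy names and scores are made-up toy values, and the helper names are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Names of candidates that no other candidate dominates, sorted alphabetically."""
    return sorted(
        name
        for name, s in scores.items()
        if not any(dominates(o, s) for other, o in scores.items() if other != name)
    )


def dominance_counts(scores):
    """For each candidate, how many other candidates it dominates."""
    return {
        name: sum(dominates(s, o) for other, o in scores.items() if other != name)
        for name, s in scores.items()
    }


# Toy (coherence-like, diversity-like, MAUVE-like) tuples, higher is better.
scores = {
    "strategy_a": (0.9, 0.5, 0.7),
    "strategy_b": (0.8, 0.4, 0.6),
    "strategy_c": (0.5, 0.9, 0.7),
}
print(pareto_front(scores))      # ['strategy_a', 'strategy_c']
print(dominance_counts(scores))  # strategy_a dominates one candidate, the others none
```

Because neither strategy_a nor strategy_c beats the other on every metric, dominance yields only a partial ordering, which is why a summary metric is useful for producing a single ranking.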
Contributions 🚀 [Back to Top]
This repository builds on the following repositories:
- Contrastive Search versus Contrastive Decoding
- Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation
We thank the authors for their open-sourced code.