(February 2024): We have updated the framework code. If you have written games using the initial release version, see this guide on how to update your game.
clembench: A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents
The cLLM (chat-optimized Large Language Model, "clem") framework tests such models' ability to engage in games – rule-constituted activities played using language. The framework is a systematic way of probing for the situated language understanding of language-using agents.
This repository contains the code for setting up the framework and implements a number of games that are further discussed in
Chalamalasetti, K., Götze, J., Hakimov, S., Madureira, B., Sadler, P., & Schlangen, D. (2023). clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents (arXiv:2305.13455). arXiv. https://doi.org/10.48550/arXiv.2305.13455
Evaluation results are available on the main project website, under leaderboard. The implemented games are:
- A Simple Word Game: taboo
- A Word-Guessing Game Based on Clues: wordle
- Drawing Instruction Giving and Following: image
- An ASCII Picture Reference Game: reference
- Scorekeeping: private and shared
This repository is tested on Python 3.8+.
We welcome contributions that extend the benchmark with your own games and models: simply open a pull request. You can find more information on how to use the benchmark in the links below.
- How to run the benchmark and evaluation locally
- How to run the benchmark and update the leaderboard workflow
- How to add a new model
- How to add and run your own game
- How to integrate with Slurk
In the model_registry.json file, openchat3.6-8b was added as an openai-compatible backend. The model cannot be run through the OpenAI API itself, but you can serve it locally with LM Studio by configuring key.json as follows:
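```json
"generic_openai_compatible": {
    "api_key": "not-needed",
    "base_url": "http://localhost:1234/v1"
}
```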
After that, enable developer mode in LM Studio to create a local server.
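As an optional sanity check (a minimal sketch, not part of the framework code), you can query the local server with the standard OpenAI Python client; the model name used below is only an assumption and must match the identifier LM Studio reports for the loaded model:

```python
# Minimal sketch: verify that the LM Studio server answers on the configured endpoint.
# Assumes the `openai` Python package (v1+) is installed; the model name is hypothetical
# and must match whatever LM Studio lists for the loaded model.
from openai import OpenAI

client = OpenAI(api_key="not-needed", base_url="http://localhost:1234/v1")

response = client.chat.completions.create(
    model="openchat-3.6-8b",  # replace with the model identifier shown in LM Studio
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```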