(February 2024): We have updated the framework code. If you have written games using the initial release version, see this guide on how to update your game.
clembench: A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents
The cLLM (chat-optimized Large Language Model, "clem") framework tests such models' ability to engage in games – rule-constituted activities played using language. The framework is a systematic way of probing for the situated language understanding of language-using agents.
This repository contains the code for setting up the framework and implements a number of games that are further discussed in
Chalamalasetti, K., Götze, J., Hakimov, S., Madureira, B., Sadler, P., & Schlangen, D. (2023). clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents (arXiv:2305.13455). arXiv. https://doi.org/10.48550/arXiv.2305.13455
Evaluation results are available on the main project website, under leaderboard. The implemented games are:
- A Simple Word Game: taboo
- A Word-Guessing Game Based on Clues: wordle
- Drawing Instruction Giving and Following: image
- An ASCII Picture Reference Game: reference
- Scorekeeping: private and shared
This repository is tested on Python 3.8+.
We welcome contributions that extend the benchmark with your own games and models: simply open a pull request. You can find more information on how to use the benchmark in the links below.
- How to run the benchmark and evaluation locally
- How to run the benchmark and update the leaderboard workflow
- How to add a new model
- How to add and run your own game
- How to integrate with Slurk
In the model_registry.json file, openchat3.6-8b was added as an openai-compatible backend. The model cannot be run through the OpenAI API itself, but you can serve it locally with LM Studio by configuring key.json as follows:
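```json
"generic_openai_compatible": {
    "api_key": "not-needed",
    "base_url": "http://localhost:1234/v1"
}
```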
After that, enable developer mode in LM Studio to create a local server.
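As an optional sanity check (a minimal sketch, not part of the framework code), you can query the local server with the standard OpenAI Python client; the model name used below is only an assumption and must match the identifier LM Studio reports for the loaded model:

```python
# Minimal sketch: verify that the LM Studio server answers on the configured endpoint.
# Assumes the `openai` Python package (v1+) is installed; the model name is hypothetical
# and must match whatever LM Studio lists for the loaded model.
from openai import OpenAI

client = OpenAI(api_key="not-needed", base_url="http://localhost:1234/v1")

response = client.chat.completions.create(
    model="openchat-3.6-8b",  # replace with the model identifier shown in LM Studio
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```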