Add CRAG benchmark #88

Merged
merged 20 commits into from
Sep 10, 2024
123 changes: 123 additions & 0 deletions evals/evaluation/agent_eval/crag_eval/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
# CRAG Benchmark for Agent QnA systems
## Overview
The [Comprehensive RAG (CRAG) benchmark](https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024) was introduced by Meta in 2024 as a KDD Cup challenge. CRAG has questions across five domains and eight question types, and provides a practical setup for evaluating RAG systems. In particular, CRAG includes questions whose answers change on time scales ranging from seconds to years; it considers entity popularity and covers not only head, but also torso and tail facts; and it contains simple-fact questions as well as seven types of complex questions, such as comparison, aggregation and set questions, to test the reasoning and synthesis capabilities of RAG solutions. CRAG also provides mock APIs for querying mock knowledge graphs, so developers can benchmark the API-calling capabilities of agents. Moreover, the dataset provides golden answers, which makes auto-evaluation with LLMs more robust. CRAG is therefore a realistic and comprehensive benchmark for agents.

## Getting started
1. Set up a work directory and clone this repo into it.
```
export WORKDIR=<your-work-directory>
cd $WORKDIR
git clone https://github.com/opea-project/GenAIEval.git
```
2. Build the Docker image
```
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/docker/
bash build_image.sh
```
3. Set environment variables for downloading models from Hugging Face
```
mkdir $WORKDIR/hf_cache
export HF_CACHE_DIR=$WORKDIR/hf_cache
export HF_HOME=$HF_CACHE_DIR
export HUGGINGFACEHUB_API_TOKEN=<your-hf-api-token>
```
4. Start the Docker container
This container will be used to preprocess the dataset and run the benchmark scripts.
```
bash launch_eval_container.sh
```
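The launch script reads `WORKDIR` and the container relies on the Hugging Face variables set in step 3, so it can help to confirm they are all set before starting the container. Below is a minimal sketch of such a check; the helper name `missing_vars` and the example values are hypothetical and not part of this repo.

```python
import os

# Variable names taken from the steps above.
REQUIRED_VARS = ("WORKDIR", "HF_HOME", "HUGGINGFACEHUB_API_TOKEN")


def missing_vars(env, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Illustrative environment (made-up values); the token is deliberately absent.
example_env = {"WORKDIR": "/tmp/work", "HF_HOME": "/tmp/work/hf_cache"}
missing = missing_vars(example_env)
print("Missing variables:", missing)
```

To check your real shell environment, call `missing_vars(os.environ)` instead of passing the example dictionary.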

## CRAG dataset
1. Download the original data and process it with the commands below.
You need to create an account on the Meta CRAG challenge [website](https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024). After logging in, go to this [link](https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/problems/meta-kdd-cup-24-crag-end-to-end-retrieval-augmented-generation/dataset_files) and download the `crag_task_3_dev_v4.tar.bz2` file. Then make a `datasets` directory in your work directory using the commands below.
```
cd $WORKDIR
mkdir datasets
```
Then put the `crag_task_3_dev_v4.tar.bz2` file in the `datasets` directory, and decompress it by running the command below.
```
cd $WORKDIR/datasets
tar -xf crag_task_3_dev_v4.tar.bz2
```
2. Preprocess the CRAG data
Data preprocessing directly affects the quality of the retrieval corpus and can therefore have a significant impact on the agent QnA system. Here, we provide one way of preprocessing the data, where we simply extract all the web search snippets as-is from the dataset, per domain. We also extract all the query-answer pairs along with other metadata, per domain. You can run the command below to use our method. The data processing will take some time to finish.
```
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/preprocess_data
bash run_data_preprocess.sh
```
**Note**: This is an example of data processing. You can develop and optimize your own data processing for this benchmark.
3. Sample queries for benchmark
The CRAG dataset has more than 4000 queries, and running all of them can be very expensive and time-consuming. You can sample a subset for the benchmark. Here we provide a script that samples up to 5 queries for each combination of `question_type` and dynamism (`static_or_dynamic`) in each domain. For example, we were able to get 92 queries from the music domain using the script.
```
bash run_sample_data.sh
```
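After sampling, you may want to confirm that no group exceeds the cap of 5 queries. The sketch below counts records per (`question_type`, `static_or_dynamic`) pair from JSONL lines. The field names come from the sampling script; the helper name `count_per_group` and the example records are made up for illustration.

```python
import json
from collections import Counter


def count_per_group(jsonl_lines):
    """Count queries per (question_type, static_or_dynamic) pair."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[(record["question_type"], record["static_or_dynamic"])] += 1
    return counts


# Tiny in-memory stand-in for a *_sampled.jsonl file (made-up records).
example_lines = [
    json.dumps({"query": "q1", "question_type": "simple", "static_or_dynamic": "static"}),
    json.dumps({"query": "q2", "question_type": "simple", "static_or_dynamic": "static"}),
    json.dumps({"query": "q3", "question_type": "comparison", "static_or_dynamic": "real-time"}),
]
group_counts = count_per_group(example_lines)
for group, n in sorted(group_counts.items()):
    print(group, n)
```

To run it on a real file, pass an open file handle, for example one opened on `$WORKDIR/datasets/crag_qas/crag_qa_music_sampled.jsonl`, and check that every count is at most 5.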

## Launch agent QnA system
Here we showcase a RAG agent from the GenAIExamples repo. Please refer to the README in the [AgentQnA example](https://github.com/opea-project/GenAIExamples/tree/main/AgentQnA) for more details. <br>
**Please note**: This is an example. You can build your own agent system using OPEA components, then expose it as an endpoint for this benchmark.<br>
To launch the agent in our AgentQnA example, open another terminal, then build the images and launch the agent system there.
1. Build images
```
export WORKDIR=<your-work-directory>
cd $WORKDIR
git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples/AgentQnA/tests/
bash 1_build_images.sh
```
2. Start retrieval tool
```
bash 2_start_retrieval_tool.sh
```
3. Ingest data into vector database and validate retrieval tool
```
# As an example, we will use the index_data.py script in the AgentQnA example.
# You can write your own script to ingest data.
# Here we ingest the docs of the music domain.
# We use the crag-eval docker container to run index_data.py.
# index_data.py is a client script: it sends data-indexing requests
# to the dataprep server that is part of the retrieval tool.
# So you need to switch back to the terminal where the crag-eval container is running.
cd $WORKDIR/GenAIExamples/AgentQnA/retrieval_tool/
python3 index_data.py --host_ip $host_ip --filedir ${WORKDIR}/datasets/crag_docs/ --filename crag_docs_music.jsonl
```
4. Launch and validate agent endpoint
```
# Go to the terminal where you launched the AgentQnA example
cd $WORKDIR/GenAIExamples/AgentQnA/tests/
bash 4_launch_and_validate_agent.sh
```
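Before ingesting a docs file in step 3, it can be worth checking that every record has the shape the preprocessing step writes: `query`, `domain` and a non-empty `doc`. The following is a small sketch of such a check. The helper name `find_bad_records` and the example records are hypothetical; only the three field names come from the preprocessing script.

```python
import json

# Keys written for each snippet by the preprocessing step.
REQUIRED_KEYS = ("query", "domain", "doc")


def find_bad_records(jsonl_lines, expected_domain):
    """Return (line_number, reason) for every record that looks malformed."""
    bad = []
    for lineno, line in enumerate(jsonl_lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append((lineno, "invalid JSON"))
            continue
        if not isinstance(record, dict):
            bad.append((lineno, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            bad.append((lineno, "missing keys: " + ", ".join(missing)))
        elif record["doc"] == "":
            bad.append((lineno, "empty doc"))
        elif record["domain"] != expected_domain:
            bad.append((lineno, "unexpected domain: " + str(record["domain"])))
    return bad


# Made-up records standing in for lines of crag_docs_music.jsonl.
example_lines = [
    json.dumps({"query": "q1", "domain": "music", "doc": "some snippet"}),
    json.dumps({"query": "q2", "domain": "music", "doc": ""}),
    json.dumps({"query": "q3", "domain": "music"}),
]
problems = find_bad_records(example_lines, "music")
print(problems)
```

An empty result means every record has the expected fields; anything else lists the offending line numbers so you can inspect them before indexing.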

## Run CRAG benchmark
Once your agent system is up and running, the next step is to generate answers with the agent. Change the variables in the script below and run it. By default, it runs a sampled set of queries from the music domain.
```
# Come back to the interactive crag-eval docker container
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/run_benchmark
bash run_generate_answer.sh
```

## Use LLM-as-judge to grade the answers
1. Launch the LLM endpoint with HF TGI: in another terminal, run the command below. By default, `meta-llama/Meta-Llama-3-70B-Instruct` is used as the LLM judge.
```
cd llm_judge
bash launch_llm_judge_endpoint.sh
```
2. Validate that the LLM endpoint is working properly.
```
export host_ip=$(hostname -I | awk '{print $1}')
curl ${host_ip}:8085/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
```
Then go back to the interactive crag-eval container and run the command below.
```
# Inside the crag-eval container
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/
python3 test_llm_endpoint.py
```
3. Grade answer correctness using the LLM judge. We use the `answer_correctness` metric from [ragas](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_correctness.py).
```
# Inside the crag-eval container
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/run_benchmark/
bash run_grading.sh
```
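To help interpret the scores, here is a rough sketch of how an answer-correctness score of this kind is typically combined. It assumes, based on our reading of the linked ragas source, that the LLM judge classifies statements as true positives, false positives and false negatives, that an F1-style factuality score is computed from those counts, and that it is averaged with a semantic-similarity score using weights of 0.75 and 0.25. These details are assumptions and may differ in the ragas version you have installed; the function names below are hypothetical and are not the ragas API.

```python
def f1_from_counts(tp, fp, fn):
    """F1-style factuality score from statement counts (0.0 when there are no true positives)."""
    if tp == 0:
        return 0.0
    return tp / (tp + 0.5 * (fp + fn))


def answer_correctness_score(tp, fp, fn, similarity, weights=(0.75, 0.25)):
    """Weighted average of factuality and semantic similarity.

    The default weights are an assumption; check the ragas version you use.
    """
    factuality = f1_from_counts(tp, fp, fn)
    w_fact, w_sim = weights
    return (w_fact * factuality + w_sim * similarity) / (w_fact + w_sim)


# Made-up example: 2 supported statements, 1 unsupported, 1 missing, similarity 0.8.
score = answer_correctness_score(tp=2, fp=1, fn=1, similarity=0.8)
print(round(score, 3))
```

Under these assumptions, an answer with no supported statements can still receive a small non-zero score from the similarity term, which is worth keeping in mind when reading low scores.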
24 changes: 24 additions & 0 deletions evals/evaluation/agent_eval/crag_eval/docker/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
FROM ubuntu:22.04

WORKDIR /home/user

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
    python3.11 \
    python3-pip \
    libpoppler-cpp-dev \
    wget \
    git \
    poppler-utils \
    libmkl-dev \
    curl

COPY requirements.txt /home/user/requirements.txt

RUN pip install -r requirements.txt

RUN cd /home/user/ && \
    git clone https://github.com/opea-project/GenAIEval.git

ENV PYTHONPATH=$PYTHONPATH:/home/user/GenAIEval/

WORKDIR /home/user
12 changes: 12 additions & 0 deletions evals/evaluation/agent_eval/crag_eval/docker/build_image.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

dockerfile=Dockerfile

docker build \
    -f ${dockerfile} . \
    -t crag-eval:latest \
    --network=host \
    --build-arg http_proxy=${http_proxy} \
    --build-arg https_proxy=${https_proxy} \
    --build-arg no_proxy=${no_proxy}
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

volume=$WORKDIR
host_ip=$(hostname -I | awk '{print $1}')

docker run -it -v $volume:/home/user/ -e WORKDIR=/home/user -e HF_HOME=/home/user/hf_cache -e host_ip=$host_ip -e http_proxy=$http_proxy -e https_proxy=$https_proxy crag-eval:latest
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
datasets
evaluate
jieba
langchain-community
langchain-huggingface
pandas
ragas
sentence_transformers
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import json
import os

import tqdm


def split_text(text, chunk_size=2000, chunk_overlap=400):
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(
        # Chunk size and overlap are measured in characters (length_function=len).
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        is_separator_regex=False,
        separators=["\n\n", "\n", ".", "!"],
    )
    return text_splitter.split_text(text)


def process_html_string(text):
    from bs4 import BeautifulSoup

    # print(text)
    soup = BeautifulSoup(text, features="html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()  # rip it out

    # get text
    text_content = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text_content.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    # drop blank lines
    final_text = "\n".join(chunk for chunk in chunks if chunk)
    # print(final_text)
    return final_text


def preprocess_data(input_file):
    snippet = []
    return_data = []
    n = 0
    with open(input_file, "r") as f:
        for line in f:
            data = json.loads(line)

            # search results snippets --> retrieval corpus docs
            docs = data["search_results"]

            for doc in docs:
                # chunks = split_text(doc['page_snippet'])
                # for chunk in chunks:
                #     snippet.append({
                #         "query": data['query'],
                #         "domain": data['domain'],
                #         "doc":chunk})
                snippet.append({"query": data["query"], "domain": data["domain"], "doc": doc["page_snippet"]})

            # qa pairs without search results
            output = {}
            for k, v in data.items():
                if k != "search_results":
                    output[k] = v
            return_data.append(output)

            # NOTE: as written, only the first 10 records of each input file are processed.
            n += 1
            if n == 10:
                break

    return snippet, return_data


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--filedir", type=str, default=None)
    parser.add_argument("--docout", type=str, default=None)
    parser.add_argument("--qaout", type=str, default=None)
    # parser.add_argument('--chunk_size', type=int, default=10000)
    # parser.add_argument('--chunk_overlap', type=int, default=0)

    args = parser.parse_args()

    if not os.path.exists(args.docout):
        os.makedirs(args.docout)

    if not os.path.exists(args.qaout):
        os.makedirs(args.qaout)

    data_files = os.listdir(args.filedir)

    qa_pairs = []
    docs = []
    for file in tqdm.tqdm(data_files):
        file = os.path.join(args.filedir, file)
        doc, data = preprocess_data(file)
        docs.extend(doc)
        qa_pairs.extend(data)

    # group by domain
    domains = ["finance", "music", "movie", "sports", "open"]

    for domain in domains:
        with open(os.path.join(args.docout, "crag_docs_" + domain + ".jsonl"), "w") as f:
            for doc in docs:
                if doc["doc"] != "" and doc["domain"] == domain:
                    f.write(json.dumps(doc) + "\n")

        with open(os.path.join(args.qaout, "crag_qa_" + domain + ".jsonl"), "w") as f:
            for d in qa_pairs:
                if d["domain"] == domain:
                    f.write(json.dumps(d) + "\n")
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FILEDIR=$WORKDIR/datasets/crag_task_3_dev_v4
DOCOUT=$WORKDIR/datasets/crag_docs/
QAOUT=$WORKDIR/datasets/crag_qas/

python3 process_data.py --filedir $FILEDIR --docout $DOCOUT --qaout $QAOUT
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FILEDIR=$WORKDIR/datasets/crag_qas

python3 sample_data.py --filedir $FILEDIR
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import json
import os

import pandas as pd
import tqdm


def sample_data(input_file, output_file):
    df = pd.read_json(input_file, lines=True, convert_dates=False)
    # group by `question_type` and `static_or_dynamic`
    df_grouped = df.groupby(["question_type", "static_or_dynamic"])
    # sample 5 rows from each group if there are more than 5 rows else return all rows
    df_sampled = df_grouped.apply(lambda x: x.sample(5) if len(x) > 5 else x)
    # save sampled data to output file
    df_sampled.to_json(output_file, orient="records", lines=True)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--filedir", type=str, default=None)

    args = parser.parse_args()

    data_files = os.listdir(args.filedir)
    for file in tqdm.tqdm(data_files):
        print(file)
        file = os.path.join(args.filedir, file)
        output_file = file.replace(".jsonl", "_sampled.jsonl")
        sample_data(file, output_file)