Ragaaf (RAG assessment annotation free) #157

Merged
merged 28 commits into from
Oct 15, 2024
Changes from 22 commits
669997a
minimized required fields/columns in user data
adkakne Sep 19, 2024
80e2160
add bench-target as the prefix of output folder (#133)
daisy-ycguo Sep 19, 2024
eb98d2e
remove examples. (#135)
lkk12014402 Sep 19, 2024
ad58bd8
minor naming correction to maintain consistency
adkakne Sep 20, 2024
c49ea84
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 20, 2024
50d4167
Add hyperlinks and paths validation. (#132)
ZePan110 Sep 20, 2024
8b9e830
Merge branch 'main' into main
adkakne Sep 20, 2024
ecbe0d1
added support for older version of ragas
adkakne Sep 24, 2024
2db5af5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 24, 2024
7e35d0b
Merge branch 'main' into main
adkakne Sep 24, 2024
6cb0d6a
merge with OPEA main
adkakne Sep 25, 2024
1047133
testing automatic validation of ragas metrics
adkakne Sep 25, 2024
1a9d3e8
removing summarization_score metric
adkakne Sep 26, 2024
76790f3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 26, 2024
8f0b59b
upgrading ragas from 0.1.16 to 0.1.19
adkakne Sep 26, 2024
82ef382
Merge remote-tracking branch 'origin/main'
adkakne Sep 26, 2024
a2bd899
Merge branch 'main' into main
lvliang-intel Oct 7, 2024
d5c1da4
Merge remote-tracking branch 'upstream/main'
adkakne Oct 11, 2024
1c0a0f4
adding annotation free RAG assessment
adkakne Oct 11, 2024
0cfcebb
improved README
adkakne Oct 11, 2024
8c86ae9
adding jsonlines to requirements
adkakne Oct 11, 2024
e070c79
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 11, 2024
052431d
fulfilled feature request - allow unit test case for RAGAAF
adkakne Oct 11, 2024
ccc864b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 11, 2024
045eed6
removing extra inputs from evaluation
adkakne Oct 11, 2024
348e947
Merge branch 'ragaaf' of https://github.com/adkakne/GenAIEval into ra…
adkakne Oct 11, 2024
64493a4
correcting class name for unit test
adkakne Oct 11, 2024
c58f377
test needs local ID
adkakne Oct 11, 2024
66 changes: 66 additions & 0 deletions evals/metrics/ragaaf/README.md
@@ -0,0 +1,66 @@
# RAGAAF (RAG assessment - Annotation Free)

We introduce RAGAAF, Intel's easy-to-use, flexible, open-source and annotation-free RAG evaluation tool that uses LLM-as-a-judge and benefits from Intel's Gaudi2 AI accelerator chips.

## Overview
### Data
RAGAAF is best suited for Long Form Question Answering (LFQA) datasets where you want to gauge the quality and factualness of the answer using an LLM as the judge. You can use benchmarking datasets or bring your own custom datasets. Make sure to set `field_map` so that AutoEval fields such as "question" map to the corresponding fields in your dataset, such as "query".
> Note : To use benchmarking datasets, set argument `data_mode=benchmarking`. Similarly, to use custom datasets, set `data_mode=local`.
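
As a rough illustration of what `field_map` does, the sketch below remaps one record of a hypothetical local JSONL file. It follows the same rule that `RAGDataset.load_data` in this PR applies: list-valued fields are joined with newlines. The record contents, the dataset field names (`query`, `response`, `passages`) and the `remap_record` helper are assumptions made for this example, not part of RAGAAF.

```python
import json

# Hypothetical line from a local JSONL dataset; the field names "query",
# "response" and "passages" are assumptions made for this example.
raw_line = (
    '{"query": "Who wrote Hamlet?", "response": "Shakespeare", '
    '"passages": ["Hamlet is a tragedy.", "It was written by William Shakespeare."]}'
)

# Keys are the fields the evaluator expects; values are your dataset's fields.
field_map = {"question": "query", "answer": "response", "context": "passages"}


def remap_record(record, field_map):
    """Rename fields according to field_map, joining list values with newlines."""
    example = {}
    for out_field, in_field in field_map.items():
        value = record[in_field]
        example[out_field] = "\n".join(value) if isinstance(value, list) else value
    return example


example = remap_record(json.loads(raw_line), field_map)
print(example["question"])
print(example["context"])
```

After remapping, every record exposes the same field names regardless of how the source dataset names them, which is what lets a single prompt template work across datasets.
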
### Model
AutoEval can run in 3 evaluation modes:
1. `evaluation_mode="endpoint"` uses a HuggingFace endpoint.
   - We recommend launching the HuggingFace endpoint on Gaudi AI accelerator machines for the best utilization and performance.
   - To launch an HF endpoint on Gaudi2, follow the 2-step instructions at [tgi-gaudi](https://github.com/huggingface/tgi-gaudi).
   - Pass your endpoint URL as the `model_name` argument.
2. `evaluation_mode="openai"` uses the OpenAI backend.
   - Set `openai_key` and pass your choice of model as the `model_name` argument.
3. `evaluation_mode="local"` uses your local hardware.
   - Set the `hf_token` argument and pass your preferred open-source model as the `model_name` argument.
   - A GPU is used if one is available; otherwise the model runs on CPU.
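
The argument requirements of the three modes can be summarized in a small pre-flight check. This is only a sketch: `REQUIRED_ARGS` and `check_mode_arguments` are hypothetical helpers written for this example and are not part of RAGAAF, which may validate its arguments differently.

```python
# Arguments each evaluation mode needs, as described in the list above.
# This table and the helper below are hypothetical, not part of RAGAAF.
REQUIRED_ARGS = {
    "endpoint": ["model_name"],  # model_name holds the endpoint URL
    "openai": ["model_name", "openai_key"],
    "local": ["model_name", "hf_token"],
}


def check_mode_arguments(evaluation_mode, **kwargs):
    """Return the names of required arguments that are missing or empty."""
    if evaluation_mode not in REQUIRED_ARGS:
        raise ValueError("unknown evaluation_mode: {}".format(evaluation_mode))
    return [name for name in REQUIRED_ARGS[evaluation_mode] if not kwargs.get(name)]


# hf_token was not supplied, so it is reported as missing.
print(check_mode_arguments("local", model_name="meta-llama/Llama-3.2-1B-Instruct"))
```
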
## Metrics
AutoEval provides 4 metrics - factualness, correctness, relevance and readability. You can also bring your own metrics and grading scales. Don't forget to add your metric to `evaluation_metrics` argument.
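
A custom metric could plausibly follow the shape of the built-in metric classes in `prompt_templates/` (a `name`, the `required_columns` it reads, and a `template` holding the definition and a 1 to 5 rubric). The `Conciseness` metric below is hypothetical, and registering it by adding the class to `NAME2METRIC` is an assumption based on how the built-in metrics are looked up in this PR; a plain dictionary stands in for the real registry here.

```python
class Conciseness:
    """Hypothetical custom metric, mirroring the built-in metric classes."""

    name = "conciseness"
    required_columns = ["answer"]
    template = """- Conciseness: Conciseness measures whether the answer avoids repetition and filler.
- Score 1: The answer is empty or consists almost entirely of repeated or irrelevant text.
- Score 2: Most of the answer is repetition or filler, with only a little useful content.
- Score 3: About half of the answer is unnecessary for addressing the question.
- Score 4: The answer is mostly to the point, with a small amount of unnecessary text.
- Score 5: Every sentence in the answer contributes to addressing the question."""


# Stand-in for the NAME2METRIC registry defined in prompt_templates/__init__.py.
NAME2METRIC = {}
NAME2METRIC[Conciseness.name] = Conciseness

# The new metric is then requested by name alongside the built-in ones.
evaluation_metrics = ["factualness", "conciseness"]

print(NAME2METRIC["conciseness"].template.splitlines()[0])
```
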
## Generation configuration
We provide recommended generation parameters after experimenting with different LLMs. If you'd like to edit them to your requirement, please set generation parameters in `GENERATION_CONFIG` in `run_eval.py`.

## Run evaluation
```python3
# import path follows the package layout added in this PR
from evals.metrics.ragaaf import AnnotationFreeEvaluate

# step 1 - choose your dataset: local or benchmarking
dataset = "explodinggradients/ragas-wikiqa"
data_mode = "benchmarking"
field_map = {"question": "question", "answer": "generated_with_rag", "context": "context"}

# step 2 - choose your favourite LLM and hardware

# evaluation_mode = "openai"
# model_name = "gpt-4o"
# openai_key = "<add your openai key>"

# evaluation_mode = "endpoint"
# model_name = f"http://{host_ip}:{port}"

evaluation_mode = "local"
model_name = "meta-llama/Llama-3.2-1B-Instruct"
hf_token = "<add your HF token>"

# step 3 - choose your metrics; you can also add custom metrics
evaluation_metrics = ["factualness", "relevance", "correctness", "readability"]

# step 4 - run evaluation
evaluator = AnnotationFreeEvaluate(
    dataset=dataset,
    data_mode=data_mode,
    field_map=field_map,
    evaluation_mode=evaluation_mode,
    model_name=model_name,
    evaluation_metrics=evaluation_metrics,
    # openai_key=openai_key,
    hf_token=hf_token,
    debug_mode=True,
)

responses = evaluator.measure()

for response in responses:
    print(response)
```
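
If you work with the raw text produced by the judge, the scores can be pulled out with a small parser. The sketch below assumes the judge follows the grading format requested by the prompt in `prompt_engineering.py` (lines of the form `Score for <metric>: <number>`); the sample output and the `extract_scores` helper are made up for illustration, and the actual structure returned by `evaluator.measure()` may differ.

```python
import re

# Matches lines such as "Score for factualness: 5" (optionally "[5]").
SCORE_PATTERN = re.compile(r"Score for (\w+):\s*\[?\s*([1-5])\b")


def extract_scores(judge_output, metrics):
    """Map each requested metric to its integer score, or None if absent."""
    found = {name.lower(): int(score) for name, score in SCORE_PATTERN.findall(judge_output)}
    return {metric: found.get(metric) for metric in metrics}


# Made-up judge output that follows the requested grading format.
sample_output = """{
Reasoning for factualness: The answer is fully supported by the context.
Score for factualness: 5

Reasoning for relevance: The answer addresses the question directly.
Score for relevance: 4
}"""

print(extract_scores(sample_output, ["factualness", "relevance", "readability"]))
# {'factualness': 5, 'relevance': 4, 'readability': None}
```

Returning `None` for a missing metric, rather than raising, makes it easy to spot responses where the judge ignored part of the grading format.
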
That's it! For troubleshooting, please submit an issue and we will get right on it.
10 changes: 10 additions & 0 deletions evals/metrics/ragaaf/__init__.py
@@ -0,0 +1,10 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

#

from .run_eval import AnnotationFreeEvaluate

# __all__ must list names as strings, not the objects themselves
__all__ = ["AnnotationFreeEvaluate"]
77 changes: 77 additions & 0 deletions evals/metrics/ragaaf/prompt_engineering.py
@@ -0,0 +1,77 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

from jinja2 import Template

from .prompt_templates import NAME2METRIC


class Prompt:
    """Class to customize prompt template using user-defined list of metrics."""

    def __init__(self, metrics, input_fields):
        self.metrics = metrics
        self.input_fields = input_fields
        self.template = self.load_prompt_template()

    def create_grading_format(self):
        """Build the instruction that tells the judge how to lay out its scores."""
        grading_format = (
            "You must ALWAYS provide every single one of the scores and reasonings in the following JSON format:"
        )
        grading_format += "\n" + "{" + "\n"
        content = []
        reasoning_prompt = "Reasoning for {}: [your one line step by step reasoning about the {} of the answer]"
        scoring_prompt = "Score for {}: [your score number for the {} of the answer]"
        for metric in self.metrics:
            reasoning = reasoning_prompt.format(metric, metric)
            score = scoring_prompt.format(metric, metric)
            content.append(reasoning + "\n" + score)
        grading_format += "\n\n".join(content)
        grading_format += "\n" + "}"
        return grading_format

    def create_closing_prompt(self):
        """Append one Jinja placeholder ({{field}}) per input field."""
        closing_prompt = ["Let's begin!"]
        for f in self.input_fields:
            closing_prompt.append("Provided {}:".format(f) + "\n" + "{{" + f + "}}")
        return "\n\n".join(closing_prompt)

    def load_prompt_template(self):
        """Concatenate opening prompt, metric rubrics, grading format and closing prompt."""
        content = []
        for metric_name in ["opening_prompt"] + self.metrics:
            metric_instance = NAME2METRIC[metric_name]
            content.append(metric_instance.template)
        content.append(self.create_grading_format())
        content.append(self.create_closing_prompt())
        return Template("\n\n".join(content))

    def render_prompt(self, **kwargs) -> str:
        text = self.template.render(**kwargs)
        return text


if __name__ == "__main__":

    """Here, we test implementation of Prompt class."""

    # step 0 - user input
    metrics = ["factualness", "relevance", "correctness", "readability"]
    input_fields = ["question", "answer", "context"]

    # step 1 - load prompt using Prompt class
    prompt = Prompt(metrics=metrics, input_fields=input_fields)

    example = {
        "question": "Who is the wife of Barack Obama",
        "context": "Michelle Obama, wife of Barack Obama (former President of the United States of America) is an attorney. Barack and Michelle Obama have 2 daughters - Malia and Sasha",
        "answer": "Michelle Obama",
        "ground_truth": "Wife of Barack Obama is Michelle Obama",
    }

    # step 2 - render prompt with given inputs
    rendered_prompt = prompt.render_prompt(
        question=example["question"], answer=example["answer"], context=example["context"]
    )

    print(rendered_prompt)
21 changes: 21 additions & 0 deletions evals/metrics/ragaaf/prompt_templates/__init__.py
@@ -0,0 +1,21 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

from .opening_prompt import OpeningPrompt

from .correctness import Correctness
from .factualness import Factualness
from .relevance import Relevance
from .readability import Readability

__all__ = ["opening_prompt", "correctness", "factualness", "relevance", "readability"]

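NAME2METRIC = {}


def snake2camel(s):
    """Convert a snake_case name to CamelCase, e.g. opening_prompt -> OpeningPrompt."""
    return "".join(x.capitalize() or "_" for x in s.split("_"))


# Look each class up by name in this module's namespace instead of using eval().
for name in __all__:
    NAME2METRIC[name] = globals()[snake2camel(name)]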
13 changes: 13 additions & 0 deletions evals/metrics/ragaaf/prompt_templates/correctness.py
@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


class Correctness:
    name = "correctness"
    required_columns = ["answer", "context", "question"]
    template = """- Correctness: correctness measures how accurately and comprehensively the answer resolves the problem posed in the question.
- Score 1: If the answer is an empty string or something like "I do not know the answer", the correctness score is 1.
- Score 2: If the answer only addresses a small part of the question correctly, or it is too short or missing many critical steps/aspects so that it does not fully address the problem described in the question, then the correctness score is 2.
- Score 3: The answer mostly addresses the question but one critical aspect/step is missing or is incorrect.
- Score 4: The answer mostly answers the question and covers all critical/main aspects of the question, but it’s missing important/necessary details about one or more aspects.
- Score 5: The answer correctly and completely addresses the query. It also covers important details about each step."""
13 changes: 13 additions & 0 deletions evals/metrics/ragaaf/prompt_templates/factualness.py
@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


class Factualness:
    name = "factualness"
    required_columns = ["answer", "context"]
    template = """- Factualness: Factualness assesses how much of the provided answer is contained within the provided context. A higher score indicates that a higher proportion of claims present in the answer are present in or can be derived from the provided context.
- Score 1: The answer is completely hallucinated, i.e. not contained in the context at all, or there is no answer.
- Score 2: Only a small part of the answer is contained in the context but most of it is imaginary/hallucinated, or the meaning is completely changed from what is represented in the context.
- Score 3: Only about half of the answer is contained in the context. The rest of the answer is hallucinated or imaginary.
- Score 4: Most of the claims in the answer can be inferred from the provided context, with very little information that is not directly supported by the provided context.
- Score 5: All of the claims in the answer are directly supported by the provided context, demonstrating high faithfulness to the provided context."""
21 changes: 21 additions & 0 deletions evals/metrics/ragaaf/prompt_templates/opening_prompt.py
@@ -0,0 +1,21 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


class OpeningPrompt:
    name = "opening_prompt"
    required_columns = []

    template = """Consider yourself a helpful, truthful and impartial judge.

Your task:
You will be given an input consisting of a question, an answer and a context. Your task is to act as an impartial judge and provide a numerical score between 1 and 5 for each of the following metrics for the given answer.

Important rules for you while completing this task:
1. You MUST ALWAYS provide a score for every metric mentioned below.
2. Make sure to fully understand the definition of every metric before completing your task. Every metric is provided with a grading scale and rubric. You MUST use this grading scale and rubric to determine your score.
3. Ensure that your scores and reasoning for each metric are independent of the others, e.g., the score for factualness should not impact the score for correctness and vice versa.
4. Base your grading decision only on the given inputs and do not speculate or hallucinate.
5. You must also provide reasoning for your score in a single sentence.

Your metric definitions along with grading scale and rubric:"""
13 changes: 13 additions & 0 deletions evals/metrics/ragaaf/prompt_templates/readability.py
@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


class Readability:
    name = "readability"
    required_columns = ["answer"]
    template = """- Readability: Readability measures clarity and lucidity of the answer. Readability is measured solely based on the answer and it does not consider the question or the context.
- Score 1: If the answer is empty, is something like "I do not know the answer", is completely unreadable, or no meaningful information can be extracted from it, then the score is 1.
- Score 2: The answer is slightly readable; there are irrelevant symbols, HTML tags or repeated words, but it can roughly form a meaningful sentence that covers some aspects of the answer.
- Score 3: The answer can be read but there are grammatical mistakes in it.
- Score 4: The answer is readable, but the readability and style can be improved to better appeal to the reader.
- Score 5: The answer is reader-friendly and well written."""
13 changes: 13 additions & 0 deletions evals/metrics/ragaaf/prompt_templates/relevance.py
@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


class Relevance:
    name = "relevance"
    required_columns = ["question", "answer"]
    template = """- Relevance: Relevance measures how well the answer relates to the question.
- Score 1: The answer doesn't mention anything about the question or is completely irrelevant to the question.
- Score 2: The answer only identifies the domain (e.g. cnvrg) mentioned in the question and provides information from the correct domain. But, the answer does not address the question itself and the point of the question is completely missed by it.
- Score 3: The answer correctly identifies the domain and essence of the question but the details in the answer are not relevant to the focus of the question.
- Score 4: The answer correctly identifies the domain mentioned in the question and the essence of the question, and stays consistent with both of them. But there is some part of the answer that is not relevant to the question, its topic or its essence. This irrelevant part damages the overall relevance of the answer.
- Score 5: The answer is completely relevant to the question and the details do not deviate from the essence of the question. There are no parts of the answer that are irrelevant or unnecessary for the given question."""
86 changes: 86 additions & 0 deletions evals/metrics/ragaaf/rag_dataset.py
@@ -0,0 +1,86 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

import jsonlines
from datasets import Dataset, load_dataset


class RAGDataset:
    """Dataset class to store data in HF datasets API format."""

    def __init__(self, dataset, field_map, mode):
        self.dataset = dataset
        self.field_map = field_map
        assert mode in ["local", "benchmarking"], "mode can be either local or benchmarking"
        self.mode = mode
        self.data = self.load_data()
        self.validate_dataset()

    def map_fields(self, obj):
        """Rename fields according to field_map; list values are joined with newlines."""
        ex = {}
        for out_field, in_field in self.field_map.items():
            if isinstance(obj[in_field], list):
                ex[out_field] = "\n".join(obj[in_field])
            else:
                ex[out_field] = obj[in_field]
        return ex

    def load_data(self):
        if self.mode == "local":
            # local mode reads a JSON Lines file from disk
            assert os.path.exists(self.dataset), "There is no such file - {}".format(self.dataset)
            with jsonlines.open(self.dataset) as reader:
                data = [self.map_fields(obj) for obj in reader]
        else:
            # benchmarking mode loads the "train" split of the named dataset
            data = [self.map_fields(obj) for obj in load_dataset(self.dataset)["train"]]
        return Dataset.from_list(data)

    def validate_dataset(self):
        for i, example in enumerate(self.data):
            for out_field in self.field_map:
                assert out_field in example, "Example {} does not have {} field".format(i + 1, out_field)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

    def __iter__(self):
        return iter(self.data)


if __name__ == "__main__":

    dataset_path = "../../benchmark/ragas/ground_truth.jsonl"
    field_map = {
        "question": "question",
        "ground_truth": "ground_truth",
        "context": "context",
    }

    ds = RAGDataset(dataset=dataset_path, field_map=field_map, mode="local")

    for i, ex in enumerate(ds):
        assert ex["question"] == ds[i]["question"], "index {} does not have correct query".format(i)

    dataset = "explodinggradients/ragas-wikiqa"
    field_map = {
        "question": "question",
        "answer": "generated_with_rag",
        "context": "context",
        "ground_truth": "correct_answer",
    }
    ds = RAGDataset(dataset=dataset, field_map=field_map, mode="benchmarking")

    for i, ex in enumerate(ds):
        assert ex["question"] == ds[i]["question"], "index {} does not have correct query".format(i)