Introduce `evaluate_experiment` method (#630)

* [OPIK-402]: handle project name and score definitions for evaluate()
* [OPIK-402]: linter
* [OPIK-402]: remove the default value for log_scores project_name
* [OPIK-402]: run linter
* Introduce an evaluate_existing method
* Add proper support for project_name
* Add proper support for project_name
* Update following code review
* Update following code review
* Update following code review

Co-authored-by: Sasha <[email protected]>
Showing 11 changed files with 313 additions and 5 deletions.
.../opik-documentation/documentation/docs/evaluation/update_existing_experiment.md (104 additions & 0 deletions)
@@ -0,0 +1,104 @@
---
sidebar_label: Update an existing experiment
---

# Update an existing experiment

Sometimes you may want to update an existing experiment with new scores, or update existing scores for an experiment. You can do this using the [`evaluate_experiment` function](https://www.comet.com/docs/opik/python-sdk-reference/evaluation/evaluate_existing.html).

This function will re-run the scoring metrics on the existing experiment items and update the scores:

```python
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import Hallucination

hallucination_metric = Hallucination()

# Replace "my-experiment" with the name of your experiment, which can be found in the Opik UI
evaluate_experiment(experiment_name="my-experiment", scoring_metrics=[hallucination_metric])
```
:::tip
The `evaluate_experiment` function can be used to update existing scores for an experiment. If you use a scoring metric with the same name as an existing score, the scores will be updated with the new values.
:::
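For instance, re-running `evaluate_experiment` with another `Hallucination` metric overwrites the hallucination scores computed in the first run instead of duplicating them. A minimal sketch, assuming the metric keeps its default name between runs:

```python
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import Hallucination

# Same metric name as the earlier run, so the existing scores on the
# experiment items are replaced with the newly computed values.
evaluate_experiment(
    experiment_name="my-experiment",
    scoring_metrics=[Hallucination()],
)
```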
## Example

### Create an experiment

Suppose you are building a chatbot and want to compute the hallucination scores for a set of example conversations. For this you would create a first experiment with the `evaluate` function:
```python
from opik import Opik, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination
from opik.integrations.openai import track_openai
import openai

# Define the task to evaluate
openai_client = track_openai(openai.OpenAI())

MODEL = "gpt-3.5-turbo"

@track
def your_llm_application(input: str) -> str:
    response = openai_client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content

# Define the evaluation task
def evaluation_task(x):
    return {
        "input": x["user_question"],
        "output": your_llm_application(x["user_question"]),
    }

# Create a simple dataset, reusing it if it already exists
client = Opik()
try:
    dataset = client.create_dataset(name="your-dataset-name")
    dataset.insert([
        {"input": {"user_question": "What is the capital of France?"}},
        {"input": {"user_question": "What is the capital of Germany?"}},
    ])
except Exception:
    dataset = client.get_dataset(name="your-dataset-name")

# Define the metrics
hallucination_metric = Hallucination()

evaluation = evaluate(
    experiment_name="My experiment",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
    experiment_config={
        "model": MODEL,
    },
)

experiment_name = evaluation.experiment_name
print(f"Experiment name: {experiment_name}")
```
:::tip
Learn more about the `evaluate` function in our [LLM evaluation guide](/evaluation/evaluate_your_llm).
:::
### Update the experiment

Once the first experiment is created, you realise that you also want to compute a moderation score for each example. You could re-run the experiment with the new scoring metrics, but this would mean re-generating all the LLM outputs. Instead, you can simply update the experiment with the new scoring metrics:
```python
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import Moderation

moderation_metric = Moderation()

evaluate_experiment(experiment_name=experiment_name, scoring_metrics=[moderation_metric])
```
apps/opik-documentation/python-sdk-docs/source/Objects/Experiment.rst (7 additions & 0 deletions)
@@ -0,0 +1,7 @@
Experiment
==========

.. autoclass:: opik.api_objects.experiment.experiment.Experiment
   :members:
   :inherited-members:
   :special-members: __init__
apps/opik-documentation/python-sdk-docs/source/Objects/ExperimentItem.rst (7 additions & 0 deletions)
@@ -0,0 +1,7 @@
ExperimentItem
==============

.. autoclass:: opik.api_objects.experiment.experiment_item.ExperimentItem
   :members:
   :inherited-members:
   :special-members: __init__
apps/opik-documentation/python-sdk-docs/source/evaluation/evaluate_experiment.rst (4 additions & 0 deletions)
@@ -0,0 +1,4 @@
evaluate_experiment
===================

.. autofunction:: opik.evaluation.evaluate_experiment
@@ -0,0 +1,20 @@
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import base_metric, score_result


class MyCustomMetric(base_metric.BaseMetric):
    def __init__(self, name: str):
        self.name = name

    def score(self, **ignored_kwargs):
        # Add your logic here

        return score_result.ScoreResult(
            value=10, name=self.name, reason="Optional reason for the score"
        )


evaluate_experiment(
    experiment_name="surprised_herb_1466",
    scoring_metrics=[MyCustomMetric(name="custom-metric")],
)
@@ -1,3 +1,3 @@
-from .evaluator import evaluate
+from .evaluator import evaluate, evaluate_experiment

-__all__ = ["evaluate"]
+__all__ = ["evaluate", "evaluate_experiment"]
@@ -0,0 +1,60 @@
from typing import List

from opik.api_objects import opik_client
from opik.evaluation import test_case

from opik.rest_api.experiments.client import ExperimentPublic


def get_experiment_by_name(
    client: opik_client.Opik, experiment_name: str
) -> ExperimentPublic:
    experiments = client._rest_client.experiments.find_experiments(name=experiment_name)

    if len(experiments.content) == 0:
        raise ValueError(f"Experiment with name {experiment_name} not found")
    elif len(experiments.content) > 1:
        raise ValueError(f"Found multiple experiments with name {experiment_name}")

    experiment = experiments.content[0]
    return experiment


def get_trace_project_name(client: opik_client.Opik, trace_id: str) -> str:
    # We first need to get the project_id for the trace
    trace = client.get_trace_content(id=trace_id)
    project_id = trace.project_id

    # Then we can get the project name
    project_metadata = client.get_project(id=project_id)
    return project_metadata.name


def get_experiment_test_cases(
    client: opik_client.Opik, experiment_id: str, dataset_id: str
) -> List[test_case.TestCase]:
    # Page through the experiment's dataset items until an empty page is returned
    test_cases = []
    page = 1

    while True:
        experiment_items_page = (
            client._rest_client.datasets.find_dataset_items_with_experiment_items(
                id=dataset_id, experiment_ids=f'["{experiment_id}"]', page=page
            )
        )
        if len(experiment_items_page.content) == 0:
            break

        for item in experiment_items_page.content:
            experiment_item = item.experiment_items[0]
            test_cases += [
                test_case.TestCase(
                    trace_id=experiment_item.trace_id,
                    dataset_item_id=experiment_item.dataset_item_id,
                    task_output=experiment_item.output,
                )
            ]

        page += 1

    return test_cases
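For orientation, here is a hedged sketch of how these helpers could fit together when re-scoring an experiment. The `id` and `dataset_id` attributes on the returned `ExperimentPublic` object are assumptions about the REST model, and this is not the actual `evaluate_experiment` implementation:

```python
# Hedged sketch only: attribute names on ExperimentPublic (id, dataset_id)
# are assumptions, and the real evaluate_experiment flow may differ.
from opik.api_objects import opik_client

client = opik_client.Opik()
experiment = get_experiment_by_name(client, "my-experiment")
test_cases = get_experiment_test_cases(
    client, experiment_id=experiment.id, dataset_id=experiment.dataset_id
)
if test_cases:
    # Scores would be logged back to the project that owns the original traces.
    project_name = get_trace_project_name(client, trace_id=test_cases[0].trace_id)
```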