Introduce evaluate_experiment method (#630)
* [OPIK-402]: handle project name and score definitions for evaluate();

* [OPIK-402]: linter;

* [OPIK-402]: remove the default value for log_scores project_name

* [OPIK-402]: run linter;

* Introduce an evaluate_existing method

* Add proper support for project_name

* Add proper support for project_name

* Update following code review

* Update following code review

* Update following code review

---------

Co-authored-by: Sasha <[email protected]>
jverre and Sasha authored Nov 15, 2024
1 parent 734e9fd commit f5cb060
Showing 11 changed files with 313 additions and 5 deletions.
@@ -0,0 +1,104 @@
---
sidebar_label: Update an existing experiment
---

# Update an existing experiment

Sometimes you may want to update an existing experiment with new scores, or update existing scores for an experiment. You can do this using the [`evaluate_experiment` function](https://www.comet.com/docs/opik/python-sdk-reference/evaluation/evaluate_existing.html).

This function will re-run the scoring metrics on the existing experiment items and update the scores:

```python
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import Hallucination

hallucination_metric = Hallucination()

# Replace "my-experiment" with the name of your experiment which can be found in the Opik UI
evaluate_experiment(experiment_name="my-experiment", scoring_metrics=[hallucination_metric])
```

:::tip
The `evaluate_experiment` function can be used to update existing scores for an experiment. If you use a scoring metric with the same name as an existing score, the scores will be updated with the new values.
:::
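
As an illustration, here is a minimal sketch of that behaviour using a custom metric built on `base_metric.BaseMetric` and `score_result.ScoreResult` from `opik.evaluation.metrics`; the experiment name and the `hallucination_metric` score name are placeholders for a score that already exists on your experiment:

```python
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import base_metric, score_result


class FixedScoreMetric(base_metric.BaseMetric):
    # Toy metric that always returns the same value, used only for illustration.
    def __init__(self, name: str):
        self.name = name

    def score(self, **ignored_kwargs):
        return score_result.ScoreResult(
            value=1.0, name=self.name, reason="Placeholder score"
        )


# Because the metric name matches an existing score, re-running the scoring
# updates that score's values instead of creating a new one.
evaluate_experiment(
    experiment_name="my-experiment",
    scoring_metrics=[FixedScoreMetric(name="hallucination_metric")],
)
```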

## Example

### Create an experiment

Suppose you are building a chatbot and want to compute the hallucination scores for a set of example conversations. To do this, you would first create an experiment with the `evaluate` function:

```python
from opik import Opik, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination
from opik.integrations.openai import track_openai
import openai

# Define the task to evaluate
openai_client = track_openai(openai.OpenAI())

MODEL = "gpt-3.5-turbo"

@track
def your_llm_application(input: str) -> str:
    response = openai_client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": input}],
    )

    return response.choices[0].message.content

# Define the evaluation task
def evaluation_task(x):
    return {
        "input": x['input']['user_question'],
        "output": your_llm_application(x['input']['user_question'])
    }

# Create a simple dataset
client = Opik()
try:
    dataset = client.create_dataset(name="your-dataset-name")
    dataset.insert([
        {"input": {"user_question": "What is the capital of France?"}},
        {"input": {"user_question": "What is the capital of Germany?"}},
    ])
except Exception:
    dataset = client.get_dataset(name="your-dataset-name")

# Define the metrics
hallucination_metric = Hallucination()


evaluation = evaluate(
    experiment_name="My experiment",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
    experiment_config={
        "model": MODEL
    }
)

experiment_name = evaluation.experiment_name
print(f"Experiment name: {experiment_name}")
```

:::tip
Learn more about the `evaluate` function in our [LLM evaluation guide](/evaluation/evaluate_your_llm).
:::

### Update the experiment

Once the first experiment is created, you realise that you also want to compute a moderation score for each example. You could re-run the experiment with the new scoring metric, but that would mean re-computing all of the LLM outputs. Instead, you can simply update the existing experiment with the new scoring metric:

```python
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import Moderation

moderation_metric = Moderation()

evaluate_experiment(experiment_name=experiment_name, scoring_metrics=[moderation_metric])
```
1 change: 1 addition & 0 deletions apps/opik-documentation/documentation/sidebars.ts
@@ -62,6 +62,7 @@ const sidebars: SidebarsConfig = {
"evaluation/concepts",
"evaluation/manage_datasets",
"evaluation/evaluate_your_llm",
"evaluation/update_existing_experiment",
{
type: "category",
label: "Metrics",
@@ -0,0 +1,7 @@
Experiment
==========

.. autoclass:: opik.api_objects.experiment.experiment.Experiment
   :members:
   :inherited-members:
   :special-members: __init__
@@ -0,0 +1,7 @@
ExperimentItem
==============

.. autoclass:: opik.api_objects.experiment.experiment_item.ExperimentItem
   :members:
   :inherited-members:
   :special-members: __init__
@@ -0,0 +1,4 @@
evaluate_experiment
===================

.. autofunction:: opik.evaluation.evaluate_experiment
3 changes: 3 additions & 0 deletions apps/opik-documentation/python-sdk-docs/source/index.rst
@@ -176,6 +176,7 @@ You can learn more about the `opik` python SDK in the following sections:

evaluation/Dataset
evaluation/evaluate
evaluation/evaluate_experiment
evaluation/metrics/index

.. toctree::
@@ -201,6 +202,8 @@ You can learn more about the `opik` python SDK in the following sections:
Objects/SpanData.rst
Objects/FeedbackScoreDict.rst
Objects/UsageDict.rst
Objects/Experiment.rst
Objects/ExperimentItem.rst
Objects/Prompt.rst

.. toctree::
20 changes: 20 additions & 0 deletions sdks/python/examples/evaluate_existing_experiment.py
@@ -0,0 +1,20 @@
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import base_metric, score_result


class MyCustomMetric(base_metric.BaseMetric):
    def __init__(self, name: str):
        self.name = name

    def score(self, **ignored_kwargs):
        # Add your logic here

        return score_result.ScoreResult(
            value=10, name=self.name, reason="Optional reason for the score"
        )


evaluate_experiment(
    experiment_name="surprised_herb_1466",
    scoring_metrics=[MyCustomMetric(name="custom-metric")],
)
4 changes: 2 additions & 2 deletions sdks/python/src/opik/evaluation/__init__.py
@@ -1,3 +1,3 @@
from .evaluator import evaluate
from .evaluator import evaluate, evaluate_experiment

__all__ = ["evaluate"]
__all__ = ["evaluate", "evaluate_experiment"]
70 changes: 67 additions & 3 deletions sdks/python/src/opik/evaluation/evaluator.py
@@ -1,14 +1,13 @@
import time
from typing import List, Dict, Any, Optional
import time

from .types import LLMTask
from .metrics import base_metric
from .. import Prompt
from ..api_objects.dataset import dataset
from ..api_objects.experiment import experiment_item
from ..api_objects import opik_client

from . import tasks_scorer, scores_logger, report, evaluation_result
from . import tasks_scorer, scores_logger, report, evaluation_result, utils


def evaluate(
@@ -107,3 +106,68 @@ def evaluate(
        test_results=test_results,
    )
    return evaluation_result_


def evaluate_experiment(
    experiment_name: str,
    scoring_metrics: List[base_metric.BaseMetric],
    scoring_threads: int = 16,
    verbose: int = 1,
) -> evaluation_result.EvaluationResult:
    """Update an existing experiment with new evaluation metrics.

    Args:
        experiment_name: The name of the experiment to update.
        scoring_metrics: List of metrics to calculate during evaluation.
            Each metric has a `score(...)` method whose arguments are taken
            from the `task` output; check the signature of the `score` method
            of the metrics you use to find out which keys are mandatory in the
            `task`-returned dictionary.
        scoring_threads: number of thread workers used to run the scoring metrics.
        verbose: an integer value that controls evaluation output logs such as the summary and the tqdm progress bar.
    """
    start_time = time.time()

    client = opik_client.get_client_cached()

    experiment = utils.get_experiment_by_name(
        client=client, experiment_name=experiment_name
    )

    test_cases = utils.get_experiment_test_cases(
        client=client, experiment_id=experiment.id, dataset_id=experiment.dataset_id
    )

    test_results = tasks_scorer.score(
        test_cases=test_cases,
        scoring_metrics=scoring_metrics,
        workers=scoring_threads,
        verbose=verbose,
    )

    first_trace_id = test_results[0].test_case.trace_id
    project_name = utils.get_trace_project_name(client=client, trace_id=first_trace_id)

    # Log scores - Needs to be updated to use the project name
    scores_logger.log_scores(
        client=client, test_results=test_results, project_name=project_name
    )
    total_time = time.time() - start_time

    if verbose == 1:
        report.display_experiment_results(
            experiment.dataset_name, total_time, test_results
        )

    report.display_experiment_link(experiment.dataset_name, experiment.id)

    evaluation_result_ = evaluation_result.EvaluationResult(
        experiment_id=experiment.id,
        experiment_name=experiment.name,
        test_results=test_results,
    )

    return evaluation_result_
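
For reference, a minimal usage sketch of the function added above, exercising the optional `scoring_threads` and `verbose` arguments; the experiment name and metric choice here are placeholders:

```python
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import Hallucination

# Re-score an existing experiment with 8 scoring workers and without the
# summary / progress-bar output ("my-experiment" is a placeholder name).
result = evaluate_experiment(
    experiment_name="my-experiment",
    scoring_metrics=[Hallucination()],
    scoring_threads=8,
    verbose=0,
)

# The returned EvaluationResult exposes the same fields as the one returned by `evaluate`.
print(result.experiment_id, result.experiment_name, len(result.test_results))
```
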
38 changes: 38 additions & 0 deletions sdks/python/src/opik/evaluation/tasks_scorer.py
@@ -145,3 +145,41 @@ def run(
    ]

    return test_cases


def score(
    test_cases: List[test_case.TestCase],
    scoring_metrics: List[base_metric.BaseMetric],
    workers: int,
    verbose: int,
) -> List[test_result.TestResult]:
    if workers == 1:
        test_results = [
            _score_test_case(test_case_=test_case_, scoring_metrics=scoring_metrics)
            for test_case_ in tqdm.tqdm(
                test_cases,
                disable=(verbose < 1),
                desc="Scoring",
                total=len(test_cases),
            )
        ]
    else:
        with futures.ThreadPoolExecutor(max_workers=workers) as pool:
            test_case_futures = [
                pool.submit(_score_test_case, test_case_, scoring_metrics)
                for test_case_ in test_cases
            ]

            test_results = [
                test_case_future.result()
                for test_case_future in tqdm.tqdm(
                    futures.as_completed(
                        test_case_futures,
                    ),
                    disable=(verbose < 1),
                    desc="Evaluation",
                    total=len(test_case_futures),
                )
            ]

    return test_results
60 changes: 60 additions & 0 deletions sdks/python/src/opik/evaluation/utils.py
@@ -0,0 +1,60 @@
from typing import List

from opik.api_objects import opik_client
from opik.evaluation import test_case

from opik.rest_api.experiments.client import ExperimentPublic


def get_experiment_by_name(
    client: opik_client.Opik, experiment_name: str
) -> ExperimentPublic:
    experiments = client._rest_client.experiments.find_experiments(name=experiment_name)

    if len(experiments.content) == 0:
        raise ValueError(f"Experiment with name {experiment_name} not found")
    elif len(experiments.content) > 1:
        raise ValueError(f"Found multiple experiments with name {experiment_name}")

    experiment = experiments.content[0]
    return experiment


def get_trace_project_name(client: opik_client.Opik, trace_id: str) -> str:
    # We first need to get the project_id for the trace
    traces = client.get_trace_content(id=trace_id)
    project_id = traces.project_id

    # Then we can get the project name
    project_metadata = client.get_project(id=project_id)
    return project_metadata.name


def get_experiment_test_cases(
    client: opik_client.Opik, experiment_id: str, dataset_id: str
) -> List[test_case.TestCase]:
    test_cases = []
    page = 1

    while True:
        experiment_items_page = (
            client._rest_client.datasets.find_dataset_items_with_experiment_items(
                id=dataset_id, experiment_ids=f'["{experiment_id}"]', page=page
            )
        )
        if len(experiment_items_page.content) == 0:
            break

        for item in experiment_items_page.content:
            experiment_item = item.experiment_items[0]
            test_cases += [
                test_case.TestCase(
                    trace_id=experiment_item.trace_id,
                    dataset_item_id=experiment_item.dataset_item_id,
                    task_output=experiment_item.output,
                )
            ]

        page += 1

    return test_cases
