Introduce evaluate_experiment method (#630)
* [OPIK-402]: handle project name and score definitions for evaluate();

* [OPIK-402]: linter;

* [OPIK-402]: remove the default value for log_scores project_name

* [OPIK-402]: run linter;

* Introduce an evaluate_existing method

* Add proper support for project_name

* Add proper support for project_name

* Update following code review

* Update following code review

* Update following code review

---------

Co-authored-by: Sasha <[email protected]>
jverre and Sasha authored Nov 15, 2024
1 parent 734e9fd commit f5cb060
Showing 11 changed files with 313 additions and 5 deletions.
@@ -0,0 +1,104 @@
---
sidebar_label: Update an existing experiment
---

# Update an existing experiment

Sometimes you may want to update an existing experiment with new scores, or update existing scores for an experiment. You can do this using the [`evaluate_experiment` function](https://www.comet.com/docs/opik/python-sdk-reference/evaluation/evaluate_existing.html).

This function will re-run the scoring metrics on the existing experiment items and update the scores:

```python
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import Hallucination

hallucination_metric = Hallucination()

# Replace "my-experiment" with the name of your experiment which can be found in the Opik UI
evaluate_experiment(experiment_name="my-experiment", scoring_metrics=[hallucination_metric])
```

:::tip
The `evaluate_experiment` function can be used to update existing scores for an experiment. If you use a scoring metric with the same name as an existing score, the scores will be updated with the new values.
:::
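
As an illustration, here is a minimal sketch of that behaviour using a custom metric built on `base_metric.BaseMetric` and `score_result.ScoreResult` from `opik.evaluation.metrics`; the experiment name and the `hallucination_metric` score name are placeholders for a score that already exists on your experiment:

```python
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import base_metric, score_result


class FixedScoreMetric(base_metric.BaseMetric):
    # Toy metric that always returns the same value, used only for illustration.
    def __init__(self, name: str):
        self.name = name

    def score(self, **ignored_kwargs):
        return score_result.ScoreResult(
            value=1.0, name=self.name, reason="Placeholder score"
        )


# Because the metric name matches an existing score, re-running the scoring
# updates that score's values instead of creating a new one.
evaluate_experiment(
    experiment_name="my-experiment",
    scoring_metrics=[FixedScoreMetric(name="hallucination_metric")],
)
```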

## Example

### Create an experiment

Suppose you are building a chatbot and want to compute the hallucination scores for a set of example conversations. To do this, you would first create an experiment with the `evaluate` function:

```python
from opik import Opik, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination
from opik.integrations.openai import track_openai
import openai

# Define the task to evaluate
openai_client = track_openai(openai.OpenAI())

MODEL = "gpt-3.5-turbo"

@track
def your_llm_application(input: str) -> str:
    response = openai_client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": input}],
    )

    return response.choices[0].message.content

# Define the evaluation task
def evaluation_task(x):
    return {
        "input": x['input']['user_question'],
        "output": your_llm_application(x['input']['user_question'])
    }

# Create a simple dataset
client = Opik()
try:
    dataset = client.create_dataset(name="your-dataset-name")
    dataset.insert([
        {"input": {"user_question": "What is the capital of France?"}},
        {"input": {"user_question": "What is the capital of Germany?"}},
    ])
except Exception:
    dataset = client.get_dataset(name="your-dataset-name")

# Define the metrics
hallucination_metric = Hallucination()


evaluation = evaluate(
    experiment_name="My experiment",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
    experiment_config={
        "model": MODEL
    }
)

experiment_name = evaluation.experiment_name
print(f"Experiment name: {experiment_name}")
```

:::tip
Learn more about the `evaluate` function in our [LLM evaluation guide](/evaluation/evaluate_your_llm).
:::

### Update the experiment

Once the first experiment is created, you realise that you also want to compute a moderation score for each example. You could re-run the experiment with the new scoring metric, but that would mean re-computing all of the LLM outputs. Instead, you can simply update the existing experiment with the new scoring metric:

```python
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import Moderation

moderation_metric = Moderation()

evaluate_experiment(experiment_name=experiment_name, scoring_metrics=[moderation_metric])
```
1 change: 1 addition & 0 deletions apps/opik-documentation/documentation/sidebars.ts
@@ -62,6 +62,7 @@ const sidebars: SidebarsConfig = {
"evaluation/concepts",
"evaluation/manage_datasets",
"evaluation/evaluate_your_llm",
"evaluation/update_existing_experiment",
{
type: "category",
label: "Metrics",
@@ -0,0 +1,7 @@
Experiment
==========

.. autoclass:: opik.api_objects.experiment.experiment.Experiment
   :members:
   :inherited-members:
   :special-members: __init__
@@ -0,0 +1,7 @@
ExperimentItem
==============

.. autoclass:: opik.api_objects.experiment.experiment_item.ExperimentItem
   :members:
   :inherited-members:
   :special-members: __init__
@@ -0,0 +1,4 @@
evaluate_experiment
===================

.. autofunction:: opik.evaluation.evaluate_experiment
3 changes: 3 additions & 0 deletions apps/opik-documentation/python-sdk-docs/source/index.rst
@@ -176,6 +176,7 @@ You can learn more about the `opik` python SDK in the following sections:

evaluation/Dataset
evaluation/evaluate
evaluation/evaluate_experiment
evaluation/metrics/index

.. toctree::
@@ -201,6 +202,8 @@ You can learn more about the `opik` python SDK in the following sections:
Objects/SpanData.rst
Objects/FeedbackScoreDict.rst
Objects/UsageDict.rst
Objects/Experiment.rst
Objects/ExperimentItem.rst
Objects/Prompt.rst

.. toctree::
20 changes: 20 additions & 0 deletions sdks/python/examples/evaluate_existing_experiment.py
@@ -0,0 +1,20 @@
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import base_metric, score_result


class MyCustomMetric(base_metric.BaseMetric):
    def __init__(self, name: str):
        self.name = name

    def score(self, **ignored_kwargs):
        # Add your logic here

        return score_result.ScoreResult(
            value=10, name=self.name, reason="Optional reason for the score"
        )


evaluate_experiment(
    experiment_name="surprised_herb_1466",
    scoring_metrics=[MyCustomMetric(name="custom-metric")],
)
4 changes: 2 additions & 2 deletions sdks/python/src/opik/evaluation/__init__.py
@@ -1,3 +1,3 @@
from .evaluator import evaluate
from .evaluator import evaluate, evaluate_experiment

__all__ = ["evaluate"]
__all__ = ["evaluate", "evaluate_experiment"]
70 changes: 67 additions & 3 deletions sdks/python/src/opik/evaluation/evaluator.py
@@ -1,14 +1,13 @@
import time
from typing import List, Dict, Any, Optional
import time

from .types import LLMTask
from .metrics import base_metric
from .. import Prompt
from ..api_objects.dataset import dataset
from ..api_objects.experiment import experiment_item
from ..api_objects import opik_client

from . import tasks_scorer, scores_logger, report, evaluation_result
from . import tasks_scorer, scores_logger, report, evaluation_result, utils


def evaluate(
@@ -107,3 +106,68 @@ def evaluate(
        test_results=test_results,
    )
    return evaluation_result_


def evaluate_experiment(
    experiment_name: str,
    scoring_metrics: List[base_metric.BaseMetric],
    scoring_threads: int = 16,
    verbose: int = 1,
) -> evaluation_result.EvaluationResult:
    """Update an existing experiment with new evaluation metrics.

    Args:
        experiment_name: The name of the experiment to update.
        scoring_metrics: List of metrics to calculate during evaluation.
            Each metric has a `score(...)` method whose arguments are taken
            from the `task` output; check the signature of the `score` method
            of the metrics you use to find out which keys are mandatory in the
            `task`-returned dictionary.
        scoring_threads: number of thread workers used to run the scoring metrics.
        verbose: an integer value that controls evaluation output logs such as the summary and the tqdm progress bar.
    """
    start_time = time.time()

    client = opik_client.get_client_cached()

    experiment = utils.get_experiment_by_name(
        client=client, experiment_name=experiment_name
    )

    test_cases = utils.get_experiment_test_cases(
        client=client, experiment_id=experiment.id, dataset_id=experiment.dataset_id
    )

    test_results = tasks_scorer.score(
        test_cases=test_cases,
        scoring_metrics=scoring_metrics,
        workers=scoring_threads,
        verbose=verbose,
    )

    first_trace_id = test_results[0].test_case.trace_id
    project_name = utils.get_trace_project_name(client=client, trace_id=first_trace_id)

    # Log scores - Needs to be updated to use the project name
    scores_logger.log_scores(
        client=client, test_results=test_results, project_name=project_name
    )
    total_time = time.time() - start_time

    if verbose == 1:
        report.display_experiment_results(
            experiment.dataset_name, total_time, test_results
        )

    report.display_experiment_link(experiment.dataset_name, experiment.id)

    evaluation_result_ = evaluation_result.EvaluationResult(
        experiment_id=experiment.id,
        experiment_name=experiment.name,
        test_results=test_results,
    )

    return evaluation_result_
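
For reference, a minimal usage sketch of the function added above, exercising the optional `scoring_threads` and `verbose` arguments; the experiment name and metric choice here are placeholders:

```python
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import Hallucination

# Re-score an existing experiment with 8 scoring workers and without the
# summary / progress-bar output ("my-experiment" is a placeholder name).
result = evaluate_experiment(
    experiment_name="my-experiment",
    scoring_metrics=[Hallucination()],
    scoring_threads=8,
    verbose=0,
)

# The returned EvaluationResult exposes the same fields as the one returned by `evaluate`.
print(result.experiment_id, result.experiment_name, len(result.test_results))
```
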
38 changes: 38 additions & 0 deletions sdks/python/src/opik/evaluation/tasks_scorer.py
@@ -145,3 +145,41 @@ def run(
    ]

    return test_cases


def score(
    test_cases: List[test_case.TestCase],
    scoring_metrics: List[base_metric.BaseMetric],
    workers: int,
    verbose: int,
) -> List[test_result.TestResult]:
    if workers == 1:
        test_results = [
            _score_test_case(test_case_=test_case_, scoring_metrics=scoring_metrics)
            for test_case_ in tqdm.tqdm(
                test_cases,
                disable=(verbose < 1),
                desc="Scoring",
                total=len(test_cases),
            )
        ]
    else:
        with futures.ThreadPoolExecutor(max_workers=workers) as pool:
            test_case_futures = [
                pool.submit(_score_test_case, test_case_, scoring_metrics)
                for test_case_ in test_cases
            ]

            test_results = [
                test_case_future.result()
                for test_case_future in tqdm.tqdm(
                    futures.as_completed(
                        test_case_futures,
                    ),
                    disable=(verbose < 1),
                    desc="Evaluation",
                    total=len(test_case_futures),
                )
            ]

    return test_results
60 changes: 60 additions & 0 deletions sdks/python/src/opik/evaluation/utils.py
@@ -0,0 +1,60 @@
from typing import List

from opik.api_objects import opik_client
from opik.evaluation import test_case

from opik.rest_api.experiments.client import ExperimentPublic


def get_experiment_by_name(
    client: opik_client.Opik, experiment_name: str
) -> ExperimentPublic:
    experiments = client._rest_client.experiments.find_experiments(name=experiment_name)

    if len(experiments.content) == 0:
        raise ValueError(f"Experiment with name {experiment_name} not found")
    elif len(experiments.content) > 1:
        raise ValueError(f"Found multiple experiments with name {experiment_name}")

    experiment = experiments.content[0]
    return experiment


def get_trace_project_name(client: opik_client.Opik, trace_id: str) -> str:
    # We first need to get the project_id for the trace
    traces = client.get_trace_content(id=trace_id)
    project_id = traces.project_id

    # Then we can get the project name
    project_metadata = client.get_project(id=project_id)
    return project_metadata.name


def get_experiment_test_cases(
    client: opik_client.Opik, experiment_id: str, dataset_id: str
) -> List[test_case.TestCase]:
    test_cases = []
    page = 1

    while True:
        experiment_items_page = (
            client._rest_client.datasets.find_dataset_items_with_experiment_items(
                id=dataset_id, experiment_ids=f'["{experiment_id}"]', page=page
            )
        )
        if len(experiment_items_page.content) == 0:
            break

        for item in experiment_items_page.content:
            experiment_item = item.experiment_items[0]
            test_cases += [
                test_case.TestCase(
                    trace_id=experiment_item.trace_id,
                    dataset_item_id=experiment_item.dataset_item_id,
                    task_output=experiment_item.output,
                )
            ]

        page += 1

    return test_cases
