diff --git a/docs/evaluation/tutorials/backtesting.mdx b/docs/evaluation/tutorials/backtesting.mdx index e2d194a2..d94f8030 100644 --- a/docs/evaluation/tutorials/backtesting.mdx +++ b/docs/evaluation/tutorials/backtesting.mdx @@ -2,7 +2,7 @@ sidebar_position: 5 --- -# Backtesting +# Run backtests on a new version of an agent Deploying your application is just the beginning of a continuous improvement process. After you deploy to production, you'll want to refine your system by enhancing prompts, language models, tools, and architectures. diff --git a/docs/evaluation/tutorials/evaluation.mdx b/docs/evaluation/tutorials/evaluation.mdx index 31f7e573..055c55f9 100644 --- a/docs/evaluation/tutorials/evaluation.mdx +++ b/docs/evaluation/tutorials/evaluation.mdx @@ -2,15 +2,13 @@ sidebar_position: 2 --- -# Evaluate your LLM application +# Evaluate a chatbot -It can be hard to measure the performance of your application with respect to criteria important you or your users. -However, doing so is crucial, especially as you iterate on your application. -In this guide we will go over how to test and evaluate your application. -This allows you to measure how well your application is performing over a **fixed** set of data. +In this guide we will set up evaluations for a chatbot. +These allow you to measure how well your application is performing over a **fixed** set of data. Being able to get this insight quickly and reliably will allow you to iterate with confidence. -At a high level, in this tutorial we will go over how to: +At a high level, in this tutorial we will: - _Create an initial golden dataset to measure performance_ - _Define metrics to use to measure performance_ @@ -23,6 +21,22 @@ For more information on the evaluation workflows LangSmith supports, check out t Lots to cover, let's dive in! +## Setup + +First install the required dependencies for this tutorial. +We happen to use OpenAI, but LangSmith can be used with any model: + +```bash +pip install -U langsmith openai +``` + +And set environment variables to enable LangSmith tracing: +```bash +export LANGSMITH_TRACING="true" +export LANGSMITH_API_KEY="" +export OPENAI_API_KEY="" +``` + ## Create a dataset The first step when getting ready to test and evaluate your application is to define the datapoints you want to evaluate. @@ -109,44 +123,40 @@ This **LLM-as-a-judge** is relatively common for cases that are too complex to m We can define our own prompt and LLM to use for evaluation here: ```python -from langchain_anthropic import ChatAnthropic -from langchain_core.prompts.prompt import PromptTemplate -from langsmith.evaluation import LangChainStringEvaluator +import openai +from langsmith import wrappers + +openai_client = wrappers.wrap_openai(openai.OpenAI()) -_PROMPT_TEMPLATE = """You are an expert professor specialized in grading students' answers to questions. -You are grading the following question: -{query} +eval_instructions = "You are an expert professor specialized in grading students' answers to questions." 
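+
+# Note: `correctness` below is a plain function used as a custom evaluator. LangSmith
+# passes it the example `inputs`, the application's `outputs`, and the dataset's
+# `reference_outputs` by matching on argument names, and it records the returned
+# boolean as feedback keyed by the function name.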
+ +def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool: + user_content = f"""You are grading the following question: +{inputs['question']} Here is the real answer: -{answer} +{reference_outputs['answer']} You are grading the following predicted answer: -{result} +{outputs['response']} Respond with CORRECT or INCORRECT: Grade: """ - -PROMPT = PromptTemplate( - input_variables=["query", "answer", "result"], template=_PROMPT_TEMPLATE -) -eval_llm = ChatAnthropic(temperature=0.0) - -qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm, "prompt": PROMPT}) + response = openai_client.chat.completions.create( + model="gpt-4o-mini", + temperature=0, + messages=[ + {"role": "system", "content": eval_instructions}, + {"role": "user", "content": user_content}, + ], + ).choices[0].message.content + return response == "CORRECT" ``` -:::note -This example assumes you have the `ANTHROPIC_API_KEY` environment variable set. You can just as easily run this example with OpenAI by replacing `ChatAnthropic` with `ChatOpenAI` from `langchain_openai`. -::: - For evaluating the length of the response, this is a lot easier! We can just define a simple function that checks whether the actual output is less than 2x the length of the expected result. ```python -from langsmith.schemas import Run, Example - -def evaluate_length(run: Run, example: Example) -> dict: - prediction = run.outputs.get("output") or "" - required = example.outputs.get("answer") or "" - score = int(len(prediction) < 2 * len(required)) - return {"key":"length", "score": score} +def concision(outputs: dict, reference_outputs: dict) -> bool: + return int(len(outputs["response"]) < 2 * len(reference_outputs["answer"])) ``` ## Run Evaluations @@ -157,23 +167,15 @@ We will build a simple application that just has a system message with instructi We will build this using the OpenAI SDK directly: ```python -import openai - -openai_client = openai.Client() +default_instructions = "Respond to the users question in a short, concise manner (one short sentence)." -def my_app(question): +def my_app(question: str, model: str = "gpt-4o-mini", instructions: str = default_instructions) -> str: return openai_client.chat.completions.create( - model="gpt-4o-mini", + model=model, temperature=0, messages=[ - { - "role": "system", - "content": "Respond to the users question in a short, concise manner (one short sentence)." - }, - { - "role": "user", - "content": question, - } + {"role": "system", "content": instructions}, + {"role": "user", "content": question}, ], ).choices[0].message.content ``` @@ -182,36 +184,21 @@ Before running this through LangSmith evaluations, we need to define a simple wr and then also maps the output of the function to the output key we expect. ```python -def langsmith_app(inputs): - output = my_app(inputs["question"]) - return {"output": output} +def ls_target(inputs: str) -> dict: + return {"response": my_app(inputs["question"])} ``` Great! -Now we're ready to run evaluation. +Now we're ready to run an evaluation. Let's do it! 
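+Note that `client` in the snippet below is the same `langsmith.Client` instance used earlier to create the dataset. If you are running this part of the tutorial on its own, create one first (a minimal sketch; it picks up `LANGSMITH_API_KEY` from the environment):
+
+```python
+from langsmith import Client
+
+client = Client()
+```
+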
```python -from langsmith import evaluate - -experiment_results = evaluate( - langsmith_app, # Your AI system +experiment_results = client.evaluate( + ls_target, # Your AI system data=dataset_name, # The data to predict and grade over - evaluators=[evaluate_length, qa_evaluator], # The evaluators to score the results - experiment_prefix="openai-3.5", # A prefix for your experiment names to easily identify them + evaluators=[concision, correctness], # The evaluators to score the results + experiment_prefix="openai-4o-mini", # A prefix for your experiment names to easily identify them ) - -# Note: If your system is async, you can use the asynchronous `aevaluate` function -# import asyncio -# from langsmith import aevaluate -# -# experiment_results = asyncio.run(aevaluate( -# my_async_langsmith_app, # Your AI system -# data=dataset_name, # The data to predict and grade over -# evaluators=[evaluate_length, qa_evaluator], # The evaluators to score the results -# experiment_prefix="openai-3.5", # A prefix for your experiment names to easily identify them -# )) - ``` This will output a URL. If we click on it, we should see results of our evaluation! @@ -225,76 +212,36 @@ If we go back to the dataset page and select the `Experiments` tab, we can now s Let's now try it out with a different model! Let's try `gpt-4-turbo` ```python -import openai - -openai_client = openai.Client() - -def my_app_1(question): - return openai_client.chat.completions.create( - model="gpt-4-turbo", - temperature=0, - messages=[ - { - "role": "system", - "content": "Respond to the users question in a short, concise manner (one short sentence)." - }, - { - "role": "user", - "content": question, - } - ], - ).choices[0].message.content - - -def langsmith_app_1(inputs): - output = my_app_1(inputs["question"]) - return {"output": output} - -from langsmith import evaluate - -experiment_results = evaluate( - langsmith_app_1, # Your AI system - data=dataset_name, # The data to predict and grade over - evaluators=[evaluate_length, qa_evaluator], # The evaluators to score the results - experiment_prefix="openai-4", # A prefix for your experiment names to easily identify them +def ls_target_v2(inputs: str) -> dict: + return {"response": my_app(inputs["question"], model="gpt-4-turbo")} + +experiment_results = client.evaluate( + ls_target_v2, + data=dataset_name, + evaluators=[concision, correctness], + experiment_prefix="openai-4-turbo", ) ``` And now let's use GPT-4 but also update the prompt to be a bit more strict in requiring the answer to be short. ```python -import openai +instructions_v3 = "Respond to the users question in a short, concise manner (one short sentence). Do NOT use more than ten words." -openai_client = openai.Client() - -def my_app_2(question): - return openai_client.chat.completions.create( +def ls_target_v3(inputs: str) -> dict: + response = my_app( + inputs["question"], model="gpt-4-turbo", - temperature=0, - messages=[ - { - "role": "system", - "content": "Respond to the users question in a short, concise manner (one short sentence). Do NOT use more than ten words." 
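+        # instructions_v3, defined above, swaps in the stricter ten-word prompt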
- }, - { - "role": "user", - "content": question, - } - ], - ).choices[0].message.content - - -def langsmith_app_2(inputs): - output = my_app_2(inputs["question"]) - return {"output": output} + instructions=instructions_v3 + ) + return {"response": response} -from langsmith import evaluate -experiment_results = evaluate( - langsmith_app_2, # Your AI system - data=dataset_name, # The data to predict and grade over - evaluators=[evaluate_length, qa_evaluator], # The evaluators to score the results - experiment_prefix="strict-openai-4", # A prefix for your experiment names to easily identify them +experiment_results = client.evaluate( + ls_target_v3, + data=dataset_name, + evaluators=[concision, correctness], + experiment_prefix="strict-openai-4-turbo", ) ``` @@ -313,15 +260,12 @@ If we do that, we can see a high level view of the metrics for each run: -Great! So we can tell that GPT-4 is better than GPT-3.5 at knowing who companies are, and we can see that the strict prompt helped a lot with the length. +Great! So we can tell that `gpt-4-turbo` is better than `gpt-4o-mini` at knowing who companies are, and we can see that the strict prompt helped a lot with the length. But what if we want to explore in more detail? -In order to do that, we can select all the runs we want to compare (in this case all three) and open them up in a comparison view: - -![](./static/testing_tutorial_open_compare.png) - +In order to do that, we can select all the runs we want to compare (in this case all three) and open them up in a comparison view. We immediately see all three tests side by side. Some of the cells are color coded - this is showing a regression of _a certain metric_ compared to _a certain baseline_. -We automatically choose defaults for the baseline and metric, but you can change those yourself (outlined in blue below). -You can also choose which columns and which metrics you see by using the `Display` control (outlined in yellow below). -You can also automatically filter to only see the runs that have improvements/regressions by clicking on the icons at the top (outlined in red below). +We automatically choose defaults for the baseline and metric, but you can change those yourself. +You can also choose which columns and which metrics you see by using the `Display` control. +You can also automatically filter to only see the runs that have improvements/regressions by clicking on the icons at the top. ![](./static/testing_tutorial_compare_runs.png) @@ -341,14 +285,14 @@ we could set that up with a test like: -def test_length_score() -> None: - """Test that the length score is at least 80%.""" - experiment_results = evaluate( - langsmith_app, # Your AI system +def test_concision_score() -> None: + """Test that the concision score is at least 80%.""" + experiment_results = client.evaluate( + ls_target, # Your AI system data=dataset_name, # The data to predict and grade over - evaluators=[evaluate_length, qa_evaluator], # The evaluators to score the results + evaluators=[concision, correctness], # The evaluators to score the results ) # This will be cleaned up in the next release: feedback = client.list_feedback( run_ids=[r.id for r in client.list_runs(project_name=experiment_results.experiment_name)], - feedback_key="length" + feedback_key="concision" ) scores = [f.score for f in feedback] assert sum(scores) / len(scores) >= 0.8, "Aggregate score should be at least .8" @@ -377,3 +321,115 @@ For information on this, check out the [how-to guides](../../evaluation/how_to_g Additionally, there are other ways to evaluate data besides in this "offline" manner (e.g. you can evaluate production data). For more information on online evaluation, check out [this guide](../../observability/how_to_guides/monitoring/online_evaluations). + +## Reference code + +
+Click to see a consolidated code snippet +```python +import openai +from langsmith import Client, wrappers + +# Application code +openai_client = wrappers.wrap_openai(openai.OpenAI()) + +default_instructions = "Respond to the users question in a short, concise manner (one short sentence)." + +def my_app(question: str, model: str = "gpt-4o-mini", instructions: str = default_instructions) -> str: + return openai_client.chat.completions.create( + model=model, + temperature=0, + messages=[ + {"role": "system", "content": instructions}, + {"role": "user", "content": question}, + ], + ).choices[0].message.content + +client = Client() + +# Define dataset: these are your test cases +dataset_name = "QA Example Dataset" +dataset = client.create_dataset(dataset_name) +client.create_examples( + inputs=[ + {"question": "What is LangChain?"}, + {"question": "What is LangSmith?"}, + {"question": "What is OpenAI?"}, + {"question": "What is Google?"}, + {"question": "What is Mistral?"}, + ], + outputs=[ + {"answer": "A framework for building LLM applications"}, + {"answer": "A platform for observing and evaluating LLM applications"}, + {"answer": "A company that creates Large Language Models"}, + {"answer": "A technology company known for search"}, + {"answer": "A company that creates Large Language Models"}, + ], + dataset_id=dataset.id, +) + +# Define evaluators +eval_instructions = "You are an expert professor specialized in grading students' answers to questions." + +def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool: + user_content = f"""You are grading the following question: +{inputs['question']} +Here is the real answer: +{reference_outputs['answer']} +You are grading the following predicted answer: +{outputs['response']} +Respond with CORRECT or INCORRECT: +Grade: +""" + response = openai_client.chat.completions.create( + model="gpt-4o-mini", + temperature=0, + messages=[ + {"role": "system", "content": eval_instructions}, + {"role": "user", "content": user_content}, + ], + ).choices[0].message.content + return response == "CORRECT" + +def concision(outputs: dict, reference_outputs: dict) -> bool: + return int(len(outputs["response"]) < 2 * len(reference_outputs["answer"])) + +# Run evaluations +def ls_target(inputs: str) -> dict: + return {"response": my_app(inputs["question"])} + +experiment_results_v1 = client.evaluate( + ls_target, # Your AI system + data=dataset_name, # The data to predict and grade over + evaluators=[concision, correctness], # The evaluators to score the results + experiment_prefix="openai-4o-mini", # A prefix for your experiment names to easily identify them +) + +def ls_target_v2(inputs: str) -> dict: + return {"response": my_app(inputs["question"], model="gpt-4-turbo")} + +experiment_results_v2 = client.evaluate( + ls_target_v2, + data=dataset_name, + evaluators=[concision, correctness], + experiment_prefix="openai-4-turbo", +) + +instructions_v3 = "Respond to the users question in a short, concise manner (one short sentence). Do NOT use more than ten words." + +def ls_target_v3(inputs: str) -> dict: + response = my_app( + inputs["question"], + model="gpt-4-turbo", + instructions=instructions_v3 + ) + return {"response": response} + +experiment_results_v3 = client.evaluate( + ls_target_v3, + data=dataset_name, + evaluators=[concision, correctness], + experiment_prefix="strict-openai-4-turbo", +) +``` +
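+
+Beyond the LangSmith UI, you can also inspect results locally: recent versions of the `langsmith` SDK let you convert the returned experiment results to a DataFrame (a minimal sketch, assuming `pandas` is installed):
+
+```python
+# Optional: look at the latest experiment locally. Alongside the example inputs and
+# the application outputs, each evaluator contributes its own feedback column.
+df = experiment_results_v3.to_pandas()
+print(df.head())
+```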
\ No newline at end of file diff --git a/docs/evaluation/tutorials/index.mdx b/docs/evaluation/tutorials/index.mdx index 9e313600..428a9b07 100644 --- a/docs/evaluation/tutorials/index.mdx +++ b/docs/evaluation/tutorials/index.mdx @@ -2,7 +2,7 @@ New to LangSmith or to LLM app development in general? Read this material to quickly get up and running. -- [Evaluate your LLM application](./tutorials/evaluation) -- [RAG Evaluations](./tutorials/rag) -- [Backtesting](./tutorials/backtesting) +- [Evaluate a chatbot](./tutorials/evaluation) +- [Evaluate a RAG application](./tutorials/rag) - [Evaluate an agent](./tutorials/agents) +- [Run backtests on a new version of an agent](./tutorials/backtesting) diff --git a/docs/evaluation/tutorials/rag.mdx b/docs/evaluation/tutorials/rag.mdx index 123d6b23..30b7359a 100644 --- a/docs/evaluation/tutorials/rag.mdx +++ b/docs/evaluation/tutorials/rag.mdx @@ -8,7 +8,7 @@ import { typescript, } from "@site/src/components/InstructionsWithCode"; -# RAG Evaluations +# Evaluate a RAG application :::info Key concepts [RAG evaluation](/evaluation/concepts#retrieval-augmented-generation-rag) | [Evaluators](/evaluation/concepts#evaluators) | [LLM-as-judge evaluators](/evaluation/concepts#llm-as-judge) diff --git a/docs/evaluation/tutorials/static/testing_tutorial_compare_metrics.png b/docs/evaluation/tutorials/static/testing_tutorial_compare_metrics.png index c7cf4c9a..e2e687ec 100644 Binary files a/docs/evaluation/tutorials/static/testing_tutorial_compare_metrics.png and b/docs/evaluation/tutorials/static/testing_tutorial_compare_metrics.png differ diff --git a/docs/evaluation/tutorials/static/testing_tutorial_compare_runs.png b/docs/evaluation/tutorials/static/testing_tutorial_compare_runs.png index 3e127f39..d5df035c 100644 Binary files a/docs/evaluation/tutorials/static/testing_tutorial_compare_runs.png and b/docs/evaluation/tutorials/static/testing_tutorial_compare_runs.png differ diff --git a/docs/evaluation/tutorials/static/testing_tutorial_one_run.png b/docs/evaluation/tutorials/static/testing_tutorial_one_run.png index edf39fb3..5e6c2511 100644 Binary files a/docs/evaluation/tutorials/static/testing_tutorial_one_run.png and b/docs/evaluation/tutorials/static/testing_tutorial_one_run.png differ diff --git a/docs/evaluation/tutorials/static/testing_tutorial_open_compare.png b/docs/evaluation/tutorials/static/testing_tutorial_open_compare.png deleted file mode 100644 index be2c1702..00000000 Binary files a/docs/evaluation/tutorials/static/testing_tutorial_open_compare.png and /dev/null differ diff --git a/docs/evaluation/tutorials/static/testing_tutorial_over_time.png b/docs/evaluation/tutorials/static/testing_tutorial_over_time.png index 7fb62d40..759e056d 100644 Binary files a/docs/evaluation/tutorials/static/testing_tutorial_over_time.png and b/docs/evaluation/tutorials/static/testing_tutorial_over_time.png differ diff --git a/docs/evaluation/tutorials/static/testing_tutorial_run.png b/docs/evaluation/tutorials/static/testing_tutorial_run.png index 32f29fb9..506d8d85 100644 Binary files a/docs/evaluation/tutorials/static/testing_tutorial_run.png and b/docs/evaluation/tutorials/static/testing_tutorial_run.png differ diff --git a/docs/evaluation/tutorials/static/testing_tutorial_side_panel.png b/docs/evaluation/tutorials/static/testing_tutorial_side_panel.png index 41a94106..19753f70 100644 Binary files a/docs/evaluation/tutorials/static/testing_tutorial_side_panel.png and b/docs/evaluation/tutorials/static/testing_tutorial_side_panel.png differ diff --git 
a/docs/evaluation/tutorials/static/testing_tutorial_three_runs.png b/docs/evaluation/tutorials/static/testing_tutorial_three_runs.png index d8cf2afe..972ac459 100644 Binary files a/docs/evaluation/tutorials/static/testing_tutorial_three_runs.png and b/docs/evaluation/tutorials/static/testing_tutorial_three_runs.png differ