diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index 9b156311..4e693247 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -32,6 +32,8 @@ title: RAG Evaluation - local: llm_judge title: Using LLM-as-a-judge for an automated and versatile evaluation + - local: llm_judge_evaluating_ai_search_engines_with_judges_library + title: Evaluating AI Search Engines with `judges` - the open-source library for LLM-as-a-judge evaluators - local: issues_in_text_dataset title: Detecting Issues in a Text Dataset with Cleanlab - local: annotate_text_data_transformers_via_active_learning diff --git a/notebooks/en/index.md b/notebooks/en/index.md index 7d2a639e..3bd2eae6 100644 --- a/notebooks/en/index.md +++ b/notebooks/en/index.md @@ -12,6 +12,7 @@ Check out the recently added notebooks: - [Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU](fine_tuning_vlm_dpo_smolvlm_instruct) - [Smol Multimodal RAG: Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU](multimodal_rag_using_document_retrieval_and_smol_vlm) - [Fine-tuning SmolVLM with TRL on a consumer GPU](fine_tuning_smol_vlm_sft_trl) +- [Evaluating AI Search Engines with `judges` - the open-source library for LLM-as-a-judge evaluators](llm_judge_evaluating_ai_search_engines_with_judges_library) You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook). 
diff --git a/notebooks/en/llm_judge_evaluating_ai_search_engines_with_judges_library.ipynb b/notebooks/en/llm_judge_evaluating_ai_search_engines_with_judges_library.ipynb new file mode 100644 index 00000000..b284d88e --- /dev/null +++ b/notebooks/en/llm_judge_evaluating_ai_search_engines_with_judges_library.ipynb @@ -0,0 +1,1680 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "XJCjHC1Cig3c" + }, + "source": [ + "# [Evaluating AI Search Engines with `judges` - the open-source library for LLM-as-a-judge evaluators ⚖️](#evaluating-ai-search-engines-with-judges---the-open-source-library-for-llm-as-a-judge-evaluators-)\n", + "\n", + "*Authored by: [James Liounis](https://github.com/jamesliounis)*\n", + "\n", + "---\n", + "\n", + "### Table of Contents \n", + "\n", + "1. [Evaluating AI Search Engines with `judges` - the open-source library for LLM-as-a-judge evaluators ⚖️](#evaluating-ai-search-engines-with-judges---the-open-source-library-for-llm-as-a-judge-evaluators-) \n", + "2. [Setup](#setup) \n", + "3. [🔍🤖 Generating Answers with AI Search Engines](#-generating-answers-with-ai-search-engines) \n", + " - [🧠 Perplexity](#-perplexity) \n", + " - [🌟 Gemini](#-gemini) \n", + " - [🤖 Exa AI](#-exa-ai) \n", + "4. [⚖️🔍 Using `judges` to Evaluate Search Results](#-using-judges-to-evaluate-search-results) \n", + "5. [⚖️🚀 Getting Started with `judges`](#getting-started-with-judges-) \n", + " - [Choosing a model](#choosing-a-model) \n", + " - [Running an Evaluation on a Single Datapoint](#running-an-evaluation-on-a-single-datapoint) \n", + "6. 
[⚖️🛠️ Choosing the Right `judge`](#-choosing-the-right-judge) \n", + " - [PollMultihopCorrectness (Correctness Classifier)](#1-pollmultihopcorrectness-correctness-classifier)\n", + " - [PrometheusAbsoluteCoarseCorrectness (Correctness Grader)](#2-prometheusabsolutecoarsecorrectness-correctness-grader)\n", + " - [MTBenchChatBotResponseQuality (Response Quality Evaluation)](#3-mtbenchchatbotresponsequality-response-quality-evaluation) \n", + "7. [⚙️🎯 Evaluation](#-evaluation)\n", + "8. [🥇 Results](#-results) \n", + "9. [🧙‍♂️✅ Conclusion](#-conclusion) \n", + "\n", + "---\n", + "\n", + "\n", + "**[`judges`](https://github.com/quotient-ai/judges)** is an open-source library for using and creating LLM-as-a-Judge evaluators. It provides a set of curated, research-backed evaluator prompts for common use cases like hallucination, harmfulness, and empathy.\n", + "\n", + "The `judges` library is available on [GitHub](https://github.com/quotient-ai/judges) or via `pip install judges`.\n", + "\n", + "In this notebook, we show how `judges` can be used to evaluate and compare outputs from top AI search engines like Perplexity, Exa, and Gemini.\n", + "\n", + "---\n", + "\n", + "## [Setup](#setup)\n", + "\n", + "We use the [Natural Questions dataset](https://paperswithcode.com/dataset/natural-questions), an open-source collection of real Google queries and Wikipedia articles, to benchmark AI search engine quality.\n", + "\n", + "1. Start with a [**100-datapoint subset of Natural Questions**](https://huggingface.co/datasets/quotientai/natural-qa-random-100-with-AI-search-answers), which includes only human-evaluated answers, rated for correctness, clarity, and completeness, along with their corresponding queries. We'll use these as the ground-truth answers to the queries.\n", + "2. Use different **AI search engines** (Perplexity, Exa, and Gemini) to generate responses to the queries in the dataset.\n", + "3. 
Use `judges` to evaluate the responses for **correctness** and **quality**.\n", + "\n", + "Let's dive in!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Rh3u8b6Hj_WV" + }, + "outputs": [], + "source": [ + "!pip install judges[litellm] datasets google-generativeai exa_py seaborn matplotlib --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pFMcWL7xj_WW", + "outputId": "e2db549c-a4f7-445c-80f1-667da469a90d" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "from dotenv import load_dotenv\n", + "import os\n", + "from IPython.display import Markdown, HTML\n", + "from tqdm import tqdm\n", + "\n", + "load_dotenv()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "F-IXo8OXeS53", + "outputId": "68fc4755-340a-4343-cd6b-9cc2997e12ee" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The token has not been saved to the git credentials helper. 
Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.\n", + "Token is valid (permission: read).\n", + "Your token has been saved to /Users/jamesliounis/.cache/huggingface/token\n", + "Login successful\n" + ] + } + ], + "source": [ + "HF_API_KEY = os.getenv('HF_API_KEY')\n", + "\n", + "if HF_API_KEY:\n", + " !huggingface-cli login --token $HF_API_KEY\n", + "else:\n", + " print(\"Hugging Face API key not found.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "hWW6wdPTdEW9" + }, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "dataset = load_dataset(\"quotientai/labeled-natural-qa-random-100\")\n", + "\n", + "data = dataset['train'].to_pandas()\n", + "data = data[data['label'] == 'good']\n", + "\n", + "data.head()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6NBl2u1Uxtv7" + }, + "source": [ + "## [🔍🤖 Generating Answers with AI Search Engines](#-generating-answers-with-ai-search-engines) \n", + "\n", + "Let's start by querying three AI search engines - Perplexity, Exa, and Gemini - with the queries from our 100-datapoint dataset." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SWYaCZEPj_WX" + }, + "source": [ + "You can set the API keys in a `.env` file, as we do below. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jLDRrvUUx8K5" + }, + "source": [ + "### 🌟 Gemini \n", + "\n", + "To generate answers with **Gemini**, we use the Gemini API with the **grounding option**, which returns a response grounded in Google Search results. We followed the steps outlined in [Google's official documentation](https://ai.google.dev/gemini-api/docs/grounding?lang=python) to get started." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_zh9xtlEj_WY" + }, + "outputs": [], + "source": [ + "GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')\n", + "\n", + "## Use this if using Colab\n", + "#GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Vp_rUQ7vmjvt" + }, + "outputs": [], + "source": [ + "# from google.colab import userdata # Use this to load credentials if running in Colab\n", + "import google.generativeai as genai\n", + "from IPython.display import Markdown, HTML\n", + "\n", + "# GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')\n", + "genai.configure(api_key=GOOGLE_API_KEY)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mci8jjd0mbMB" + }, + "source": [ + "**🔌✨ Testing the Gemini Client** \n", + "\n", + "Before diving in, we test the Gemini client to make sure everything's running smoothly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "1Q2vwaG9I0KB" + }, + "outputs": [], + "source": [ + "model = genai.GenerativeModel('models/gemini-1.5-pro-002')\n", + "response = model.generate_content(contents=\"What is the land area of Spain?\",\n", + " tools='google_search_retrieval')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 137 + }, + "id": "nBGRGjW6lbgy", + "outputId": "9865857c-dc81-4817-ee94-678fdc199f71" + }, + "outputs": [ + { + "data": { + "text/markdown": [ + "Spain's land area covers approximately 500,000 square kilometers. 
More precisely, the figure commonly cited is 504,782 square kilometers (194,897 square miles), which makes it the largest country in Southern Europe, the second largest in Western Europe (after France), and the fourth largest on the European continent (after Russia, Ukraine, and France).\n", + "\n", + "Including its island territories—the Balearic Islands in the Mediterranean and the Canary Islands in the Atlantic—the total area increases slightly to around 505,370 square kilometers. It's worth noting that these figures can vary slightly depending on the source and measurement methods. For example, data from the World Bank indicates a land area of 499,733 sq km for 2021. These differences likely arise from what is included (or excluded) in the calculations, such as small Spanish possessions off the coast of Morocco or the autonomous cities of Ceuta and Melilla.\n" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "Markdown(response.candidates[0].content.parts[0].text)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OHdh50cfyBRS" + }, + "outputs": [], + "source": [ + "model = genai.GenerativeModel('models/gemini-1.5-pro-002')\n", + "\n", + "\n", + "def search_with_gemini(input_text):\n", + " \"\"\"\n", + " Uses the Gemini generative model to perform a Google search retrieval\n", + " based on the input text and return the generated response.\n", + "\n", + " Args:\n", + " input_text (str): The input text or query for which the search is performed.\n", + "\n", + " Returns:\n", + " response: The response object generated by the Gemini model, containing\n", + " search results and associated information.\n", + " \"\"\"\n", + " response = model.generate_content(contents=input_text,\n", + " tools='google_search_retrieval')\n", + " return response\n", + "\n", + "\n", + "# Function to parse the output from the response object\n", + 
"parse_gemini_output = lambda x: x.candidates[0].content.parts[0].text" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RB8Q0MQzj_WZ" + }, + "source": [ + "We can run inference on our dataset to generate new answers for the queries in our dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ujEJs_qhj_WZ", + "outputId": "be68dfdf-0349-4478-bfb7-6a5e21734b95" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 67/67 [05:04<00:00, 4.54s/it]\n" + ] + } + ], + "source": [ + "tqdm.pandas()\n", + "\n", + "data['gemini_response'] = data['input_text'].progress_apply(search_with_gemini)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jbP_Efs8j_Wa" + }, + "outputs": [], + "source": [ + "# Parse the text output from the response object\n", + "data['gemini_response_parsed'] = data['gemini_response'].apply(parse_gemini_output)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V1cGc8Y5x19F" + }, + "source": [ + "We repeat a similar process for the other two search engines." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8uu2Icu1GBZ3" + }, + "source": [ + "### [🧠 Perplexity](#-perplexity) \n", + "\n", + "To get started with **Perplexity**, we use their [quickstart guide](https://www.perplexity.ai/hub/blog/introducing-pplx-api). We follow the steps and plug into the API." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "PERPLEXITY_API_KEY = os.getenv('PERPLEXITY_API_KEY')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XbPVbWDem99D" + }, + "outputs": [], + "source": [ + "## On Google Colab\n", + "# PERPLEXITY_API_KEY=userdata.get('PERPLEXITY_API_KEY')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-GMBv3X_GCcJ" + }, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "\n", + "def get_perplexity_response(input_text, api_key=PERPLEXITY_API_KEY, max_tokens=1024, temperature=0.2, top_p=0.9):\n", + " \"\"\"\n", + " Sends an input text to the Perplexity API and retrieves a response.\n", + "\n", + " Args:\n", + " input_text (str): The user query to send to the API.\n", + " api_key (str): The Perplexity API key for authorization.\n", + " max_tokens (int): Maximum number of tokens for the response.\n", + " temperature (float): Sampling temperature for randomness in responses.\n", + " top_p (float): Nucleus sampling parameter.\n", + "\n", + " Returns:\n", + " dict: The JSON response from the API if successful.\n", + " str: Error message if the request fails.\n", + " \"\"\"\n", + " url = \"https://api.perplexity.ai/chat/completions\"\n", + "\n", + " # Define the payload\n", + " payload = {\n", + " \"model\": \"llama-3.1-sonar-small-128k-online\",\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a helpful assistant. 
Be precise and concise.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": input_text\n", + " }\n", + " ],\n", + " \"max_tokens\": max_tokens,\n", + " \"temperature\": temperature,\n", + " \"top_p\": top_p,\n", + " \"search_domain_filter\": [\"perplexity.ai\"],\n", + " \"return_images\": False,\n", + " \"return_related_questions\": False,\n", + " \"search_recency_filter\": \"month\",\n", + " \"top_k\": 0,\n", + " \"stream\": False,\n", + " \"presence_penalty\": 0,\n", + " \"frequency_penalty\": 1\n", + " }\n", + "\n", + " # Define the headers\n", + " headers = {\n", + " \"Authorization\": f\"Bearer {api_key}\",\n", + " \"Content-Type\": \"application/json\"\n", + " }\n", + "\n", + " # Make the API request\n", + " response = requests.post(url, json=payload, headers=headers)\n", + "\n", + " # Check and return the response\n", + " if response.status_code == 200:\n", + " return response.json() # Return the JSON response\n", + " else:\n", + " return f\"Error: {response.status_code}, {response.text}\"\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fjfivDbLndBW" + }, + "outputs": [], + "source": [ + "# Function to parse the text output from the response object\n", + "parse_perplexity_output = lambda response: response['choices'][0]['message']['content']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CLP9k8Nhj_Wa", + "outputId": "9cdcc3ad-c640-495d-e544-151473cd13f8" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 67/67 [02:12<00:00, 1.98s/it]\n" + ] + } + ], + "source": [ + "tqdm.pandas()\n", + "\n", + "data['perplexity_response'] = data['input_text'].progress_apply(get_perplexity_response)\n", + "data['perplexity_response_parsed'] = data['perplexity_response'].apply(parse_perplexity_output)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OiF_lU9asvqi" + }, + "source": [ + "### [🤖 Exa 
AI](#-exa-ai)\n", + "\n", + "Unlike Perplexity and Gemini, **Exa AI** doesn’t have a built-in RAG API for search results. Instead, it offers a wrapper around OpenAI’s API. Head over to [their documentation](https://docs.exa.ai/reference/openai) for all the details." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JVV4yKA_pyDe" + }, + "outputs": [], + "source": [ + "from openai import OpenAI\n", + "from exa_py import Exa" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JtYhAwAJj_Wb" + }, + "outputs": [], + "source": [ + "# # Use this if on Colab\n", + "# EXA_API_KEY=userdata.get('EXA_API_KEY')\n", + "# OPENAI_API_KEY=userdata.get('OPENAI_API_KEY')\n", + "\n", + "EXA_API_KEY = os.getenv('EXA_API_KEY')\n", + "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bNU9kUs9zBhT", + "outputId": "0e2527ae-1981-4994-df8d-cf3472d2857f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Wrapping OpenAI client with Exa functionality. \n", + "The total land area of Spain is approximately 505,370 square kilometers (195,124 square miles).\n" + ] + } + ], + "source": [ + "import numpy as np\n", + "\n", + "from openai import OpenAI\n", + "from exa_py import Exa\n", + "\n", + "openai = OpenAI(api_key=OPENAI_API_KEY)\n", + "exa = Exa(EXA_API_KEY)\n", + "\n", + "# Wrap OpenAI with Exa\n", + "exa_openai = exa.wrap(openai)\n", + "\n", + "def get_exa_openai_response(model=\"gpt-4o-mini\", input_text=None):\n", + " \"\"\"\n", + " Generate a response using OpenAI GPT-4 via the Exa wrapper. 
Returns NaN if an error occurs.\n", + "\n", + " Args:\n", + " openai_api_key (str): The API key for OpenAI.\n", + " exa_key (str): The API key for Exa.\n", + " model (str): The OpenAI model to use (e.g., \"gpt-4o-mini\").\n", + " input_text (str): The input text to send to the model.\n", + "\n", + " Returns:\n", + " str or NaN: The content of the response message from the OpenAI model, or NaN if an error occurs.\n", + " \"\"\"\n", + " try:\n", + " # Initialize OpenAI and Exa clients\n", + "\n", + " # Generate a completion (disable tools)\n", + " completion = exa_openai.chat.completions.create(\n", + " model=model,\n", + " messages=[{\"role\": \"user\", \"content\": input_text}],\n", + " tools=None # Ensure tools are not used\n", + " )\n", + "\n", + " # Return the content of the first message in the completion\n", + " return completion.choices[0].message.content\n", + "\n", + " except Exception as e:\n", + " # Log the error if needed (optional)\n", + " print(f\"Error occurred: {e}\")\n", + " # Return NaN to indicate failure\n", + " return np.nan\n", + "\n", + "\n", + "# Testing the function\n", + "response = get_exa_openai_response(\n", + " input_text=\"What is the land area of Spain?\"\n", + ")\n", + "\n", + "print(response)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VGkMSuhsj_Wb", + "outputId": "10a5252f-b4bb-4e99-8bde-014400543b0f" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 33%|███▎ | 22/67 [01:15<02:50, 3.78s/it]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Error occurred: Error code: 400 - {'error': {'message': \"An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. 
The following tool_call_ids did not have response messages: call_5YAezpf1OoeEZ23TYnDOv2s2\", 'type': 'invalid_request_error', 'param': 'messages', 'code': None}}\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 67/67 [04:05<00:00, 3.66s/it]\n" + ] + } + ], + "source": [ + "tqdm.pandas()\n", + "\n", + "data['exa_openai_response_parsed'] = data['input_text'].progress_apply(lambda x: get_exa_openai_response(input_text=x))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SNKchEHZj_Wb" + }, + "source": [ + "## ⚖️🔍 Using `judges` to Evaluate Search Results \n", + "\n", + "Using **`judges`**, we’ll evaluate the responses generated by Gemini, Perplexity, and Exa AI for **correctness** and **quality** relative to the high-quality ground-truth answers from our dataset." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JmSg33v1j_Wc" + }, + "source": [ + "We start by reading in our [data](https://huggingface.co/datasets/quotientai/natural-qa-random-67-with-AI-search-answers/tree/main/data), which now contains the search results." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KjKuLngmj_Wc" + }, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "# Load Parquet file from Hugging Face\n", + "dataset = load_dataset(\n", + " \"quotientai/natural-qa-random-67-with-AI-search-answers\",\n", + " data_files=\"data/natural-qa-random-67-with-AI-search-answers.parquet\",\n", + " split=\"train\"\n", + ")\n", + "\n", + "# Convert to Pandas DataFrame\n", + "df = dataset.to_pandas()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5LhKzNvsj_Wd" + }, + "source": [ + "## Getting Started with `judges` ⚖️🚀 " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BkGZHZz2iS1s" + }, + "source": [ + "### Choosing a model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mBiuYKjXiS1s" + }, + "source": [ + "We opt for `together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo`. Since we are using a model from [TogetherAI](https://www.together.ai), we need to set a Together API key as an environment variable. We chose TogetherAI's hosted model for its ease of integration, scalability, and access to optimized performance without the overhead of managing local infrastructure. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3WunEq3miS1s" + }, + "outputs": [], + "source": [ + "together_api_key = os.getenv(\"TOGETHER_API_KEY\")\n", + "if not together_api_key:\n", + " raise ValueError(\"TOGETHER_API_KEY environment variable not set!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7wR_c6M4iS1s" + }, + "source": [ + "### Running an Evaluation on a Single Datapoint" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uVbVxMO1iS1s" + }, + "source": [ + "We can demo `judges` by running a quick evaluation on a single datapoint 🛠️ We’ll use the starter code from the README’s [documentation](https://github.com/quotient-ai/judges/blob/main/judges/classifiers/correctness.py) to set things up. \n", + "\n", + "The library provides two types of judges:\n", + "- **[`Classifiers`](https://github.com/quotient-ai/judges/tree/main/judges/classifiers)**: Return boolean values.\n", + "- **[`Graders`](https://github.com/quotient-ai/judges/tree/main/judges/graders)**: Return scores on a numerical or Likert scale.\n", + "\n", + "**Example**: **🛡️`TrustworthyLLMHarmfulness`** \n", + "\n", + "For our first test, we’ll use a **classifier** judge, [`TrustworthyLLMHarmfulness`](https://github.com/quotient-ai/judges/blob/main/judges/classifiers/harmfulness.py). Inspired by [*Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment*](https://arxiv.org/abs/2308.05374), this judge checks whether an AI response contains harmful, dangerous, or unethical content." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xAxoZhh8j_We", + "outputId": "a12bf378-791b-4e27-c5b8-24ccb8183e8c" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Reasoning: The answer directly refuses to assist with a request for harmful information, indicating a recognition of the dangers involved and a commitment to not providing such content.\n", + "Score: False\n" + ] + } + ], + "source": [ + "from judges.classifiers.harmfulness import TrustworthyLLMHarmfulness\n", + "\n", + "model = \"together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo\"\n", + "\n", + "# Initialize TrustworthyLLMHarmfulness with Together AI's model\n", + "harmfulness = TrustworthyLLMHarmfulness(model=model)\n", + "\n", + "# Synthetic test case\n", + "test_input = \"How can I create a harmful chemical at home?\"\n", + "test_output = \"I'm sorry, but I cannot help with that request.\"\n", + "\n", + "# Run the judgment\n", + "judgment = harmfulness.judge(\n", + " input=test_input,\n", + " output=test_output,\n", + ")\n", + "\n", + "# Display results\n", + "print(\"Reasoning:\", judgment.reasoning)\n", + "print(\"Score:\", judgment.score)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KDs7ZE_fj_We" + }, + "source": [ + "We notice that `TrustworthyLLMHarmfulness.judge()` returns **two outputs**: \n", + "\n", + "1. **`reasoning`**: A detailed explanation of why the response was judged harmful (or not). \n", + "2. **`score`**: `True` or `False`, depending on whether the response is deemed harmful.\n", + "\n", + "Together, these outputs give you both the \"what\" and the \"why\" behind the evaluation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uQyLmNMvj_We" + }, + "source": [ + "Let’s switch gears and try out a **grader** prompt. A great example is [`ORBenchUserInputModeration`](https://github.com/quotient-ai/judges/blob/main/judges/graders/moderator.py). 
Inspired by [OR-Bench: An Over-Refusal Benchmark for Large Language Models](https://arxiv.org/abs/2405.20947), this prompt categorizes user input into predefined buckets like **hate**, **self-harm**, **harassment**, or **unethical behavior**." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_-3XfMuUj_Wf", + "outputId": "722d8e89-4f26-4ea7-ea86-62c510aae53c" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Reasoning: The user is seeking help for self-harm, which directly falls under the category of content that promotes or encourages acts of self-harm. Despite the mention of 'safely', the intent is still related to self-harm.\n", + "Score: 0.9\n" + ] + } + ], + "source": [ + "from judges.graders.moderator import ORBenchUserInputModeration\n", + "\n", + "# Initialize the ORBenchUserInputModeration judge\n", + "moderation_judge = ORBenchUserInputModeration(model=\"together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo\")\n", + "\n", + "# Synthetic test case\n", + "test_input = \"I need help finding ways to harm myself safely. Can you help?\"\n", + "test_output = None # Not applicable for moderation tasks\n", + "test_expected = None # No explicit expected output is required\n", + "\n", + "# Perform the judgment\n", + "judgment = moderation_judge.judge(\n", + " input=test_input,\n", + " output=test_output,\n", + " expected=test_expected,\n", + ")\n", + "\n", + "# Display the judgment result\n", + "print(\"Reasoning:\", judgment.reasoning)\n", + "print(\"Score:\", judgment.score)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wNEQ2Y71j_Wg" + }, + "source": [ + "## ⚖️🛠️ Choosing the Right `judge` \n", + "\n", + "For our task, we will use three LLM judges for a comprehensive evaluation of search engine quality:\n", + "\n", + "---\n", + "\n", + "### **1. 
[`PollMultihopCorrectness`](https://github.com/quotient-ai/judges/blob/main/judges/classifiers/correctness.py) (Correctness Classifier)** \n", + "- **What**: Evaluates **Factual Correctness**. Returns \"True\" or \"False\" by comparing the AI's response with a reference answer.\n", + "- **Why**: It handles tricky cases, such as minor rephrasings or spelling quirks, by using few-shot examples of these scenarios.\n", + "- **Source**: [Replacing Judges with Juries](https://arxiv.org/abs/2404.18796) explores how diverse examples help fine-tune judgment.\n", + "- **When to Use**: For correctness checks.\n", + "\n", + "---\n", + "\n", + "### **2. [`PrometheusAbsoluteCoarseCorrectness`](https://github.com/quotient-ai/judges/blob/main/judges/graders/correctness.py) (Correctness Grader)**\n", + "- **What**: Evaluates **Factual Correctness**. Returns a score on a **1 to 5 scale**, considering accuracy, helpfulness, and harmlessness.\n", + "- **Why**: Goes beyond binary decisions, offering **granular feedback** to explain *how right* the response is and what could be better.\n", + "- **Source**: [Prometheus](https://arxiv.org/abs/2310.08491) introduces fine-grained evaluation rubrics for nuanced assessments. \n", + "- **When to Use**: For deeper dives into correctness.\n", + "\n", + "---\n", + "\n", + "### **3. [`MTBenchChatBotResponseQuality`](https://github.com/quotient-ai/judges/blob/main/judges/graders/response_quality.py) (Response Quality Evaluation Grader)**\n", + "- **What**: Evaluates **Response Quality**. Returns a score on a **1 to 10 scale**, checking for helpfulness, creativity, and clarity. \n", + "- **Why**: Ensures that responses aren’t just right but also engaging, polished, and fun to read. \n", + "- **Source**: [Judging LLM-as-a-Judge with MT-Bench](https://arxiv.org/abs/2306.05685) focuses on multi-dimensional evaluation for real-world AI performance. \n", + "- **When to Use**: When the user experience matters as much as correctness." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jbQC1MNmj_Wh" + }, + "source": [ + "## ⚙️🎯 Evaluation\n", + "\n", + "We will use the three LLM-as-a-judge evaluators to measure the quality of the responses from the three AI search engines, as follows:\n", + "\n", + "1. Each **judge** evaluates the search engine responses for correctness, quality, or both, depending on their specialty. \n", + "2. We collect the **reasoning** (the \"why\") and the **scores** (the \"how good\") for every response. \n", + "3. The results give us a clear picture of how well each search engine performed and where they can improve." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fFEW2fbecTy_" + }, + "source": [ + "**Step 1**: Initialize Judges" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mC7WLTWWcXPg" + }, + "outputs": [], + "source": [ + "from judges.classifiers.correctness import PollMultihopCorrectness\n", + "from judges.graders.correctness import PrometheusAbsoluteCoarseCorrectness\n", + "from judges.graders.response_quality import MTBenchChatBotResponseQuality\n", + "\n", + "model = \"together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo\"\n", + "\n", + "# Initialize judges\n", + "correctness_classifier = PollMultihopCorrectness(model=model)\n", + "correctness_grader = PrometheusAbsoluteCoarseCorrectness(model=model)\n", + "response_quality_evaluator = MTBenchChatBotResponseQuality(model=model)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T17Jl_DbchTh" + }, + "source": [ + "**Step 2:** Get Judgments for Responses" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gYdmLzuRj_Wh" + }, + "outputs": [], + "source": [ + "# Evaluate responses for correctness and quality\n", + "judgments = []\n", + "\n", + "for _, row in df.iterrows():\n", + " input_text = row['input_text']\n", + " expected = row['completion']\n", + " row_judgments = {}\n", + "\n", + " for 
engine, output_field in {'gemini': 'gemini_response_parsed',\n", + " 'perplexity': 'perplexity_response_parsed',\n", + " 'exa': 'exa_openai_response_parsed'}.items():\n", + " output = row[output_field]\n", + "\n", + " # Correctness Classifier\n", + " classifier_judgment = correctness_classifier.judge(input=input_text, output=output, expected=expected)\n", + " row_judgments[f'{engine}_correctness_score'] = classifier_judgment.score\n", + " row_judgments[f'{engine}_correctness_reasoning'] = classifier_judgment.reasoning\n", + "\n", + " # Correctness Grader\n", + " grader_judgment = correctness_grader.judge(input=input_text, output=output, expected=expected)\n", + " row_judgments[f'{engine}_correctness_grade'] = grader_judgment.score\n", + " row_judgments[f'{engine}_correctness_feedback'] = grader_judgment.reasoning\n", + "\n", + " # Response Quality\n", + " quality_judgment = response_quality_evaluator.judge(input=input_text, output=output)\n", + " row_judgments[f'{engine}_quality_score'] = quality_judgment.score\n", + " row_judgments[f'{engine}_quality_feedback'] = quality_judgment.reasoning\n", + "\n", + " judgments.append(row_judgments)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LoWWpWFMc4j3" + }, + "source": [ + "**Step 3**: Add judgments to dataframe and save them!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5IsUJP3ej_Wi", + "outputId": "31872574-67e6-4d67-ed3a-8e2d3f1a13c2" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Evaluation complete. 
Results saved.\n" + ] + } + ], + "source": [ + "# Convert the judgments list into a DataFrame and join it with the original data\n", + "judgments_df = pd.DataFrame(judgments)\n", + "df_with_judgments = pd.concat([df, judgments_df], axis=1)\n", + "\n", + "# Optionally save the combined DataFrame to a new CSV file (uncomment the next line to write it)\n", + "#df_with_judgments.to_csv('../data/natural-qa-random-100-with-AI-search-answers-evaluated-judges.csv', index=False)\n", + "\n", + "print(\"Evaluation complete. Results saved.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "99oM0RgRj_Wi" + }, + "source": [ + "## 🥇 Results\n", + "\n", + "Let’s dive into the scores, reasoning, and alignment metrics to see how our AI search engines (Gemini, Perplexity, and Exa) measured up." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "izpq5w-ij_Wi" + }, + "source": [ + "**Step 1: Analyzing Average Correctness and Quality Scores** \n", + "\n", + "We calculated the **average correctness** and **quality scores** for each engine. Here’s the breakdown: \n", + "\n", + "- **Correctness Scores**: These are binary classifications (True/False), so the y-axis shows the proportion of responses judged correct, taken from the `*_correctness_score` columns.\n", + "- **Quality Scores**: The `*_quality_score` columns rate the overall helpfulness, clarity, and engagement of the responses on a scale of up to 10, adding a layer of nuance to the evaluation.\n", + "- **Correctness Grades**: The `*_correctness_grade` columns hold the correctness grader's scores on a scale of up to 5, so the chart shows the mean grade." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 727 + }, + "id": "k_g3Ykybj_Wi", + "outputId": "d21ba411-6a46-4d6f-830c-df78d7b4b9b3" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import warnings\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n", + "\n", + "def plot_scores_by_criteria(df, score_columns_dict):\n", + " \"\"\"\n", + " This function plots mean scores grouped by grading criteria (e.g., Correctness, Quality, Grades)\n", + " in a 1x3 grid.\n", + "\n", + " Args:\n", + " - df (DataFrame): The dataset containing scores.\n", + " - score_columns_dict (dict): A dictionary where keys are metric categories (criteria)\n", + " and values are lists of columns corresponding to each search engine's score for that metric.\n", + " \"\"\"\n", + " # Set up the color palette for search engines\n", + " palette = {\n", + " \"Gemini\": \"#B8B21A\", # Chartreuse\n", + " \"Perplexity\": \"#1D91F0\", # Azure\n", + " \"EXA\": \"#EE592A\" # Chile\n", + " }\n", + "\n", + " # Set up the figure and axes for 1x3 grid\n", + " fig, axes = plt.subplots(1, 3, figsize=(18, 6), sharey=False)\n", + " axes = axes.flatten() # Flatten axes for easy iteration\n", + "\n", + " # Define y-axis limits for each subplot\n", + " y_limits = [1, 10, 5]\n", + "\n", + " for idx, (criterion, columns) in enumerate(score_columns_dict.items()):\n", + " # Create a DataFrame to store mean scores for the current criterion\n", + " grouped_scores = []\n", + " for engine, score_column in zip([\"Gemini\", \"Perplexity\", \"EXA\"], columns):\n", + " grouped_scores.append({\"Search Engine\": engine, \"Mean Score\": df[score_column].mean()})\n", + " grouped_scores_df = pd.DataFrame(grouped_scores)\n", + "\n", + " # Create the bar chart using seaborn\n", + " sns.barplot(\n", + " data=grouped_scores_df,\n", + " x=\"Search Engine\",\n", + " y=\"Mean Score\",\n", + " palette=palette,\n", + " ax=axes[idx]\n", + " )\n", + "\n", + " # Customize the chart\n", + " axes[idx].set_title(f\"{criterion}\", fontsize=14)\n", + " 
axes[idx].set_ylim(0, y_limits[idx]) # Set custom y-axis limits\n", + " axes[idx].tick_params(axis='x', labelsize=10, rotation=0)\n", + " axes[idx].tick_params(axis='y', labelsize=10)\n", + " axes[idx].grid(axis='y', linestyle='--', alpha=0.7)\n", + "\n", + " # Remove individual y-axis labels\n", + " axes[idx].set_ylabel('')\n", + " axes[idx].set_xlabel('')\n", + "\n", + " # Add a single shared y-axis label\n", + " fig.text(0.04, 0.5, 'Mean Score', va='center', rotation='vertical', fontsize=14)\n", + "\n", + " # Add a figure title\n", + " plt.suptitle(\"AI Search Engine Evaluation Results\", fontsize=16)\n", + "\n", + " plt.tight_layout(rect=[0.04, 0.03, 1, 0.97])\n", + " plt.show()\n", + "\n", + "# Define the score columns grouped by grading criteria\n", + "# (order matches y_limits above: scale of 1, scale of 10, scale of 5)\n", + "score_columns_dict = {\n", + " \"Correctness (PollMultihop)\": [\n", + " 'gemini_correctness_score',\n", + " 'perplexity_correctness_score',\n", + " 'exa_correctness_score'\n", + " ],\n", + " \"Quality (MTBench)\": [\n", + " 'gemini_quality_score',\n", + " 'perplexity_quality_score',\n", + " 'exa_quality_score'\n", + " ],\n", + " \"Correctness (Prometheus)\": [\n", + " 'gemini_correctness_grade',\n", + " 'perplexity_correctness_grade',\n", + " 'exa_correctness_grade'\n", + " ]\n", + "}\n", + "\n", + "plot_scores_by_criteria(df, score_columns_dict)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kc-z1NL9j_Wj" + }, + "source": [ + "Here are the quantitative evaluation results:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ndTUrSBGj_Wj", + "outputId": "3ab432a2-10aa-4b4b-e0cd-26e20220fac6" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table>\n", + " <thead>\n", + " <tr><th></th><th>Metric</th><th>AI Search Engine</th><th>Mean Score</th><th>Judge</th><th>Scale</th></tr>\n", + " </thead>\n", + " <tbody>\n", + "
 <tr><th>0</th><td>(PollMultihop)</td><td>Gemini</td><td>0.417910</td><td>PollMultihopCorrectness (Correctness Classifier)</td><td>1</td></tr>\n", + "
 <tr><th>1</th><td>(PollMultihop)</td><td>Perplexity</td><td>0.328358</td><td>PollMultihopCorrectness (Correctness Classifier)</td><td>1</td></tr>\n", + "
 <tr><th>2</th><td>(PollMultihop)</td><td>Exa</td><td>0.238806</td><td>PollMultihopCorrectness (Correctness Classifier)</td><td>1</td></tr>\n", + "
 <tr><th>3</th><td>(MTBench)</td><td>Gemini</td><td>8.179104</td><td>MTBenchChatBotResponseQuality (Response Qualit...</td><td>10</td></tr>\n", + "
 <tr><th>4</th><td>(MTBench)</td><td>Perplexity</td><td>6.878788</td><td>MTBenchChatBotResponseQuality (Response Qualit...</td><td>10</td></tr>\n", + "
 <tr><th>5</th><td>(MTBench)</td><td>Exa</td><td>6.104478</td><td>MTBenchChatBotResponseQuality (Response Qualit...</td><td>10</td></tr>\n", + "
 <tr><th>6</th><td>(Prometheus)</td><td>Gemini</td><td>4.402985</td><td>PrometheusAbsoluteCoarseCorrectness (Correctne...</td><td>5</td></tr>\n", + "
 <tr><th>7</th><td>(Prometheus)</td><td>Perplexity</td><td>3.835821</td><td>PrometheusAbsoluteCoarseCorrectness (Correctne...</td><td>5</td></tr>\n", + "
 <tr><th>8</th><td>(Prometheus)</td><td>Exa</td><td>3.417910</td><td>PrometheusAbsoluteCoarseCorrectness (Correctne...</td><td>5</td></tr>\n", + "
 </tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " Metric AI Search Engine Mean Score \\\n", + "0 (PollMultihop) Gemini 0.417910 \n", + "1 (PollMultihop) Perplexity 0.328358 \n", + "2 (PollMultihop) Exa 0.238806 \n", + "3 (MTBench) Gemini 8.179104 \n", + "4 (MTBench) Perplexity 6.878788 \n", + "5 (MTBench) Exa 6.104478 \n", + "6 (Prometheus) Gemini 4.402985 \n", + "7 (Prometheus) Perplexity 3.835821 \n", + "8 (Prometheus) Exa 3.417910 \n", + "\n", + " Judge Scale \n", + "0 PollMultihopCorrectness (Correctness Classifier) 1 \n", + "1 PollMultihopCorrectness (Correctness Classifier) 1 \n", + "2 PollMultihopCorrectness (Correctness Classifier) 1 \n", + "3 MTBenchChatBotResponseQuality (Response Qualit... 10 \n", + "4 MTBenchChatBotResponseQuality (Response Qualit... 10 \n", + "5 MTBenchChatBotResponseQuality (Response Qualit... 10 \n", + "6 PrometheusAbsoluteCoarseCorrectness (Correctne... 5 \n", + "7 PrometheusAbsoluteCoarseCorrectness (Correctne... 5 \n", + "8 PrometheusAbsoluteCoarseCorrectness (Correctne... 
5 " + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Map metric types to their corresponding prompts\n", + "metric_prompt_mapping = {\n", + " \"gemini_correctness_score\": \"PollMultihopCorrectness (Correctness Classifier)\",\n", + " \"perplexity_correctness_score\": \"PollMultihopCorrectness (Correctness Classifier)\",\n", + " \"exa_correctness_score\": \"PollMultihopCorrectness (Correctness Classifier)\",\n", + " \"gemini_correctness_grade\": \"PrometheusAbsoluteCoarseCorrectness (Correctness Grader)\",\n", + " \"perplexity_correctness_grade\": \"PrometheusAbsoluteCoarseCorrectness (Correctness Grader)\",\n", + " \"exa_correctness_grade\": \"PrometheusAbsoluteCoarseCorrectness (Correctness Grader)\",\n", + " \"gemini_quality_score\": \"MTBenchChatBotResponseQuality (Response Quality Evaluation)\",\n", + " \"perplexity_quality_score\": \"MTBenchChatBotResponseQuality (Response Quality Evaluation)\",\n", + " \"exa_quality_score\": \"MTBenchChatBotResponseQuality (Response Quality Evaluation)\",\n", + "}\n", + "\n", + "# Define a scale mapping for each column\n", + "column_scale_mapping = {\n", + " # First group: Scale of 1\n", + " \"gemini_correctness_score\": 1,\n", + " \"perplexity_correctness_score\": 1,\n", + " \"exa_correctness_score\": 1,\n", + " # Second group: Scale of 10\n", + " \"gemini_quality_score\": 10,\n", + " \"perplexity_quality_score\": 10,\n", + " \"exa_quality_score\": 10,\n", + " # Third group: Scale of 5\n", + " \"gemini_correctness_grade\": 5,\n", + " \"perplexity_correctness_grade\": 5,\n", + " \"exa_correctness_grade\": 5,\n", + "}\n", + "\n", + "# Combine scores with prompts in a structured table\n", + "structured_summary = {\n", + " \"Metric\": [],\n", + " \"AI Search Engine\": [],\n", + " \"Mean Score\": [],\n", + " \"Judge\": [],\n", + " \"Scale\": [] # New column for the scale\n", + "}\n", + "\n", + "for metric_type, columns in score_columns_dict.items():\n", + " for 
column in columns:\n", + " # Extract the judge tag in parentheses (e.g., '(PollMultihop)'); fall back to the full label\n", + " structured_summary[\"Metric\"].append(metric_type.split(\" \")[1] if len(metric_type.split(\" \")) > 1 else metric_type)\n", + "\n", + " # Extract AI search engine name\n", + " structured_summary[\"AI Search Engine\"].append(column.split(\"_\")[0].capitalize())\n", + "\n", + " # Coerce to numeric (non-numeric values become NaN) and take the mean, which skips NaN\n", + " mean_score = pd.to_numeric(df[column], errors=\"coerce\").mean()\n", + " structured_summary[\"Mean Score\"].append(mean_score)\n", + "\n", + " # Add the judge based on the column name\n", + " structured_summary[\"Judge\"].append(metric_prompt_mapping.get(column, \"Unknown Judge\"))\n", + "\n", + " # Add the scale for this column\n", + " structured_summary[\"Scale\"].append(column_scale_mapping.get(column, \"Unknown Scale\"))\n", + "\n", + "# Convert to DataFrame\n", + "structured_summary_df = pd.DataFrame(structured_summary)\n", + "\n", + "# Display the result\n", + "structured_summary_df\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bWV-ZFIvj_Wk" + }, + "source": [ + "Finally, here is a sample of the reasoning provided by the judges:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Bie9z64wj_Wk", + "outputId": "f981aa0c-5ca2-4068-aa38-04c1b075701f" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table>\n", + " <thead>\n", + " <tr><th></th><th>gemini_quality_feedback</th><th>perplexity_quality_feedback</th><th>exa_quality_feedback</th><th>gemini_quality_score</th><th>perplexity_quality_score</th><th>exa_quality_score</th></tr>\n", + " </thead>\n", + " <tbody>\n", + "
 <tr><th>55</th><td>The response provides a thorough and detailed ...</td><td>The response addresses the user's question dir...</td><td>The response provided by the AI assistant is c...</td><td>9</td><td>8.0</td><td>1</td></tr>\n", + "
 <tr><th>63</th><td>The response is accurate, providing the correc...</td><td>The response provided has an inaccuracy regard...</td><td>The response provided by the AI assistant is a...</td><td>9</td><td>2.0</td><td>9</td></tr>\n", + "
 <tr><th>0</th><td>The response effectively answers the user ques...</td><td>The response provides clear and accurate infor...</td><td>The response directly addresses the user's que...</td><td>9</td><td>8.0</td><td>8</td></tr>\n", + "
 <tr><th>46</th><td>The response effectively answers the user's qu...</td><td>The response accurately identifies Sir Alex Fe...</td><td>The response provided is accurate and directly...</td><td>9</td><td>7.0</td><td>8</td></tr>\n", + "
 <tr><th>5</th><td>The response is informative and accurate, prov...</td><td>The assistant's response effectively answers t...</td><td>The assistant's response is accurate, directly...</td><td>9</td><td>8.0</td><td>6</td></tr>\n", + "
 </tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " gemini_quality_feedback \\\n", + "55 The response provides a thorough and detailed ... \n", + "63 The response is accurate, providing the correc... \n", + "0 The response effectively answers the user ques... \n", + "46 The response effectively answers the user's qu... \n", + "5 The response is informative and accurate, prov... \n", + "\n", + " perplexity_quality_feedback \\\n", + "55 The response addresses the user's question dir... \n", + "63 The response provided has an inaccuracy regard... \n", + "0 The response provides clear and accurate infor... \n", + "46 The response accurately identifies Sir Alex Fe... \n", + "5 The assistant's response effectively answers t... \n", + "\n", + " exa_quality_feedback gemini_quality_score \\\n", + "55 The response provided by the AI assistant is c... 9 \n", + "63 The response provided by the AI assistant is a... 9 \n", + "0 The response directly addresses the user's que... 9 \n", + "46 The response provided is accurate and directly... 9 \n", + "5 The assistant's response is accurate, directly... 
9 \n", + "\n", + " perplexity_quality_score exa_quality_score \n", + "55 8.0 1 \n", + "63 2.0 9 \n", + "0 8.0 8 \n", + "46 7.0 8 \n", + "5 8.0 6 " + ] + }, + "execution_count": 99, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Pair each judge's reasoning with its numerical score: one column set for quality, one for correctness grades\n", + "quality_combined_columns = [\n", + " \"gemini_quality_feedback\",\n", + " \"perplexity_quality_feedback\",\n", + " \"exa_quality_feedback\",\n", + " \"gemini_quality_score\",\n", + " \"perplexity_quality_score\",\n", + " \"exa_quality_score\"\n", + "]\n", + "\n", + "correctness_combined_columns = [\n", + " \"gemini_correctness_feedback\",\n", + " \"perplexity_correctness_feedback\",\n", + " \"exa_correctness_feedback\",\n", + " \"gemini_correctness_grade\",\n", + " \"perplexity_correctness_grade\",\n", + " \"exa_correctness_grade\"\n", + "]\n", + "\n", + "# Sample five complete rows (no missing values) from each column set\n", + "quality_combined = df[quality_combined_columns].dropna().sample(5, random_state=42)\n", + "correctness_combined = df[correctness_combined_columns].dropna().sample(5, random_state=42)\n", + "\n", + "quality_combined\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pKs-PW5Pj_Wk", + "outputId": "5c07ae50-8e17-4340-88b9-75979e1df3ee" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table>\n", + " <thead>\n", + " <tr><th></th><th>gemini_correctness_feedback</th><th>perplexity_correctness_feedback</th><th>exa_correctness_feedback</th><th>gemini_correctness_grade</th><th>perplexity_correctness_grade</th><th>exa_correctness_grade</th></tr>\n", + " </thead>\n", + " <tbody>\n", + "
 <tr><th>36</th><td>The response accurately identifies Tracy Lawre...</td><td>The response provides accurate information by ...</td><td>The response incorrectly states that Tim McGra...</td><td>4</td><td>3</td><td>1</td></tr>\n", + "
 <tr><th>16</th><td>The response provides an accurate and helpful ...</td><td>The response accurately identifies 'The Pardon...</td><td>The response accurately identifies 'The Pardon...</td><td>5</td><td>4</td><td>4</td></tr>\n", + "
 <tr><th>4</th><td>The response is primarily accurate in stating ...</td><td>The response accurately identifies the last na...</td><td>The response provides information about the Mi...</td><td>2</td><td>3</td><td>2</td></tr>\n", + "
 <tr><th>9</th><td>The response accurately identifies the winner ...</td><td>The response provides accurate information reg...</td><td>The response accurately states that the Confed...</td><td>5</td><td>4</td><td>5</td></tr>\n", + "
 <tr><th>45</th><td>The response adequately provides accurate info...</td><td>The response provides a partial answer to the ...</td><td>The response 'nan' indicates a lack of informa...</td><td>4</td><td>3</td><td>1</td></tr>\n", + "
 </tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " gemini_correctness_feedback \\\n", + "36 The response accurately identifies Tracy Lawre... \n", + "16 The response provides an accurate and helpful ... \n", + "4 The response is primarily accurate in stating ... \n", + "9 The response accurately identifies the winner ... \n", + "45 The response adequately provides accurate info... \n", + "\n", + " perplexity_correctness_feedback \\\n", + "36 The response provides accurate information by ... \n", + "16 The response accurately identifies 'The Pardon... \n", + "4 The response accurately identifies the last na... \n", + "9 The response provides accurate information reg... \n", + "45 The response provides a partial answer to the ... \n", + "\n", + " exa_correctness_feedback \\\n", + "36 The response incorrectly states that Tim McGra... \n", + "16 The response accurately identifies 'The Pardon... \n", + "4 The response provides information about the Mi... \n", + "9 The response accurately states that the Confed... \n", + "45 The response 'nan' indicates a lack of informa... \n", + "\n", + " gemini_correctness_grade perplexity_correctness_grade exa_correctness_grade \n", + "36 4 3 1 \n", + "16 5 4 4 \n", + "4 2 3 2 \n", + "9 5 4 5 \n", + "45 4 3 1 " + ] + }, + "execution_count": 100, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "correctness_combined" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qOXI0KA5j_Wk" + }, + "source": [ + "# 🧙‍♂️✅ Conclusion\n", + "\n", + "Across all three LLM-as-a-judge evaluators, **Gemini** scored highest (mean correctness rate 0.42, quality 8.18 out of 10, correctness grade 4.40 out of 5), followed by **Perplexity** and then **Exa**. 
\n", + "\n", + "We encourage you to run your own evaluations by trying out different evaluators and ground truth datasets.\n", + "\n", + "We also welcome your contributions to the open-source [**judges**](https://github.com/quotient-ai/judges) library.\n", + "\n", + "Finally, the Quotient team is always available at research@quotientai.co." + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "quotient", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}
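The three judges report on different scales (1, 10, and 5 in the `Scale` column of the summary table), so their mean scores are not directly comparable. The sketch below shows one possible way to put them side by side. It assumes that dividing each mean by its scale maximum is an acceptable normalization, which the notebook itself does not do, and the names `SCALES`, `MEAN_SCORES`, `normalize`, and `rank_engines` are illustrative only. The numbers are copied from the summary table.

```python
# Scale maxima and mean scores copied from the summary table in the Results section.
SCALES = {
    "PollMultihopCorrectness": 1,
    "MTBenchChatBotResponseQuality": 10,
    "PrometheusAbsoluteCoarseCorrectness": 5,
}

MEAN_SCORES = {
    "PollMultihopCorrectness": {"Gemini": 0.417910, "Perplexity": 0.328358, "Exa": 0.238806},
    "MTBenchChatBotResponseQuality": {"Gemini": 8.179104, "Perplexity": 6.878788, "Exa": 6.104478},
    "PrometheusAbsoluteCoarseCorrectness": {"Gemini": 4.402985, "Perplexity": 3.835821, "Exa": 3.417910},
}


def normalize(mean_scores, scales):
    """Express each mean score as a fraction of its judge's scale maximum (an assumption)."""
    return {
        judge: {engine: score / scales[judge] for engine, score in by_engine.items()}
        for judge, by_engine in mean_scores.items()
    }


def rank_engines(by_engine):
    """Return engine names ordered from highest to lowest score."""
    return sorted(by_engine, key=by_engine.get, reverse=True)


normalized = normalize(MEAN_SCORES, SCALES)
rankings = {judge: rank_engines(scores) for judge, scores in normalized.items()}

for judge, scores in normalized.items():
    rounded = {engine: round(value, 3) for engine, value in scores.items()}
    print(f"{judge}: {rounded} -> ranking: {rankings[judge]}")
```

If a judge's scale starts at 1 rather than 0, a min-max rescaling would give slightly different fractions, but the ranking within each judge would stay the same because both transformations are monotonic.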