
Add notebook: Evaluating AI search engines with the judges library #257

Conversation

@jamesliounis (Author)


Description

This notebook showcases how to use judges—an open-source library for LLM-as-a-Judge evaluators—to assess and compare outputs from AI search engines like Gemini, Perplexity, and EXA.

What is judges?

judges is an open-source library that provides research-backed, ready-to-use LLM-based evaluators for assessing outputs across dimensions such as correctness, quality, and harmfulness. It supports two evaluator types:

  • Classifiers (binary evaluations, such as True/False).
  • Graders (scored evaluations on numerical scales).

The library also integrates with litellm, providing access to most open- and closed-source models and providers.
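To make the two evaluator types concrete, here is a rough sketch. The class names appear in this PR and the judges README, but the module paths and the `judge()` signature are assumptions to verify against the library's docs; the live calls are gated behind an opt-in flag since they require an LLM API key, and the `summarize` helper is a hypothetical illustration of aggregating judgments.

```python
import os

# Sketch of the two evaluator types in `judges`. Class names come from
# this PR and the library README; module paths and the judge() signature
# are assumptions -- verify against the judges documentation.

def summarize(judgments):
    """Hypothetical helper: average judgment scores, counting a
    classifier's True/False verdicts as 1/0."""
    scores = [float(j.score) for j in judgments]
    return sum(scores) / len(scores) if scores else 0.0

# Live judge calls need an LLM API key, so they are gated behind an
# opt-in flag here instead of running unconditionally.
if os.environ.get("JUDGES_DEMO") == "1":
    # Classifier: binary True/False evaluation of correctness.
    from judges.classifiers.correctness import PollMultihopCorrectness
    correctness = PollMultihopCorrectness(model="gpt-4o-mini")
    verdict = correctness.judge(
        input="What is the capital of France?",
        output="Paris is the capital of France.",
        expected="Paris",
    )
    print(verdict.score, verdict.reasoning)

    # Grader: scored evaluation on a numerical scale.
    from judges.graders.response_quality import MTBenchChatBotResponseQuality
    quality = MTBenchChatBotResponseQuality(model="gpt-4o-mini")
    grade = quality.judge(
        input="What is the capital of France?",
        output="Paris is the capital of France.",
    )
    print(grade.score, grade.reasoning)
```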

What This Notebook Does

  1. Demonstrates how to use judges with litellm to evaluate AI search engine responses.
  2. Uses Llama 3.3 (together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo) as the LLM evaluator to assess:
    • Correctness (factual accuracy).
    • Quality (clarity, helpfulness).
  3. Provides a step-by-step workflow to evaluate outputs generated by search engines.
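Because judges routes model calls through litellm, pointing the evaluator at Llama 3.3 on Together AI is, in sketch form, just a matter of the model string. The module path below is an assumption to check against the judges docs, and the per-engine averaging helper is a hypothetical illustration of comparing search-engine outputs after judging.

```python
import os

# judges routes LLM calls through litellm, so switching the evaluator
# to Llama 3.3 hosted on Together AI is just a different model string.
MODEL = "together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo"

def mean_score_by_engine(results):
    """Hypothetical helper: average judge scores per search engine,
    given (engine_name, score) pairs."""
    totals = {}
    for engine, score in results:
        s, n = totals.get(engine, (0.0, 0))
        totals[engine] = (s + float(score), n + 1)
    return {engine: s / n for engine, (s, n) in totals.items()}

# The live call is gated on the provider key (litellm reads
# TOGETHERAI_API_KEY for Together AI); the module path is an assumption.
if os.environ.get("TOGETHERAI_API_KEY"):
    from judges.graders.response_quality import MTBenchChatBotResponseQuality
    judge = MTBenchChatBotResponseQuality(model=MODEL)
    grade = judge.judge(
        input="Who wrote 'War and Peace'?",
        output="Leo Tolstoy wrote 'War and Peace'.",
    )
    print(grade.score)
```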

Open-Source Tools & Resources


Why This Notebook?

This notebook provides a practical example of using judges with an open-source model (Llama 3.3) to evaluate real-world AI outputs. It highlights the library's flexibility, its ease of integration with litellm, and its usefulness for benchmarking AI systems in a transparent, reproducible manner.


@jamesliounis (Author)

@merveenoyan @stevhliu

@stevhliu (Member), Dec 17, 2024

"judges is an open-source library to use and create LLM-as-a-Judge evaluators. It provides a set of curated, research-backed..."

"...a collection of real-world Google queries...as our benchmark for comparing..."

"...which only includes human evaluated answers and their corresponding queries for correctness, clarity, and completeness."



@jamesliounis (Author)

Done!

@stevhliu (Member), Dec 17, 2024

I'd move the table of contents to the very top



@jamesliounis (Author)

Done!

@stevhliu (Member), Dec 17, 2024

"...or from Google Colab secrets, in which case, uncomment the relevant code examples below."
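A minimal sketch of the pattern this comment describes: read the key from an environment variable, with the Colab-secrets variant left commented out. TOGETHERAI_API_KEY is the variable litellm expects for Together AI; swap in the name for whichever provider you use.

```python
import os

# Read the provider API key from the environment. litellm expects
# TOGETHERAI_API_KEY for Together AI; adjust for other providers.
api_key = os.environ.get("TOGETHERAI_API_KEY")

# On Google Colab, uncomment these lines to read from Colab secrets instead:
# from google.colab import userdata
# api_key = userdata.get("TOGETHERAI_API_KEY")

if api_key is None:
    print("Set TOGETHERAI_API_KEY before running the evaluation cells.")
```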



@stevhliu (Member), Dec 17, 2024

Maybe add this to the Perplexity section?



@stevhliu (Member), Dec 17, 2024

Remove the last sentence

"...reading in our data that now contains the search results."



@jamesliounis (Author)

Done!

@stevhliu (Member), Dec 17, 2024

Maybe clarify MTBenchChatBotResponseQuality is also a "grader" type of judge (not really clear right now). It can say something like "Response Quality Evaluation Grader"



@jamesliounis (Author)

Done!

@stevhliu (Member) left a comment

Very cool library! 👏

Remember to add your notebook to the toctree, and modify index.md to include your notebook in the latest notebooks section (remove one of the older notebooks and replace it with yours).

@jamesliounis (Author)

Hey Stephen! Thanks for the prompt feedback. I have incorporated your comments, added nb to toctree and to index.md.

@stevhliu (Member) left a comment

Cool, thanks! Once @merveenoyan has had a chance to review, we can merge :)

- [Multimodal Retrieval-Augmented Generation (RAG) with Document Retrieval (ColPali) and Vision Language Models (VLMs)](multimodal_rag_using_document_retrieval_and_vlms)
- [Fine-Tuning a Vision Language Model (Qwen2-VL-7B) with the Hugging Face Ecosystem (TRL)](fine_tuning_vlm_trl)
- [Multi-agent RAG System 🤖🤝🤖](multiagent_rag_system)
- [Multimodal RAG with ColQwen2, Reranker, and Quantized VLMs on Consumer GPUs](multimodal_rag_using_document_retrieval_and_reranker_and_vlms)
- [Fine-tuning SmolVLM with TRL on a consumer GPU](fine_tuning_smol_vlm_sft_trl)
- [Smol Multimodal RAG: Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU](multimodal_rag_using_document_retrieval_and_smol_vlm)
@stevhliu (Member)

Sorry I wasn't clear, we should keep the most recent ones (towards the bottom) and remove the one on top (Multimodal Retrieval-Augmented Generation (RAG) with Document Retrieval (ColPali) and Vision Language Models (VLMs))

@jamesliounis (Author)

Modified it. Thanks for pointing that out.

@jamesliounis (Author)

@merveenoyan Happy New Year! Have you had the chance to take a look?

@merveenoyan (Collaborator), Jan 9, 2025

We use the Natural Questions dataset -- a collection of real-world Google queries and corresponding Wikipedia articles -- as our benchmark for comparing the quality of different AI search engines, as follows:

this sentence is a bit too long and hard to follow, can we simplify it?

nit: open-source* (there's an s at the end)




How about this?

We use the Natural Questions dataset as our benchmark for comparing the quality of different AI search engines. Natural Questions is a collection of real-world Google queries and corresponding Wikipedia articles. We'll walk through the following process:

@jamesliounis (Author)

Done!

@merveenoyan (Collaborator), Jan 9, 2025

it seems we already do this below, so no need to mention




Noted 👍 !

@jamesliounis (Author)

Removed it.

@merveenoyan (Collaborator), Jan 9, 2025

can you explain why you picked this instead of local serving? we often do local serving with open-source models in open-source cookbook




We picked this instead of local serving because it's a bit more lightweight and users don't need to have a machine available with local serving set up to get started. We'd love to add support for that in judges though in the future!

@jamesliounis (Author)

Added a sentence to explain.

@merveenoyan (Collaborator) left a comment

Left very minor nits, we can merge afterwards!
Sorry for the delay, I was off!

@freddiev4 commented Jan 9, 2025

> Left very minor nits, we can merge afterwards! Sorry for the delay, I was off!

@merveenoyan 👋🏼 thanks for reviewing! I'm going to shepherd this PR the rest of the way from our team. Will respond to your comments above and make updates + open a new PR if that's ok.

@freddiev4

@stevhliu @merveenoyan #270!
