
Add notebook: Evaluating AI search engines with the judges library #257

Conversation

@jamesliounis (Author)


Description

This notebook showcases how to use judges—an open-source library for LLM-as-a-Judge evaluators—to assess and compare outputs from AI search engines like Gemini, Perplexity, and EXA.

What is judges?

judges is an open-source library that provides research-backed, ready-to-use LLM-based evaluators for assessing outputs across dimensions such as correctness, quality, and harmfulness. It supports two evaluator types:

  • Classifiers (binary evaluations, such as True/False).
  • Graders (scored evaluations on numerical scales).

The library also integrates with litellm, providing access to most open- and closed-source models and providers.
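To make the two evaluator types concrete, here is a rough sketch. The class names appear in this PR and the judges README, but the module paths and the `judge()` signature are assumptions to verify against the library's docs; the live calls are gated behind an opt-in flag since they require an LLM API key, and the `summarize` helper is a hypothetical illustration of aggregating judgments.

```python
import os

# Sketch of the two evaluator types in `judges`. Class names come from
# this PR and the library README; module paths and the judge() signature
# are assumptions -- verify against the judges documentation.

def summarize(judgments):
    """Hypothetical helper: average judgment scores, counting a
    classifier's True/False verdicts as 1/0."""
    scores = [float(j.score) for j in judgments]
    return sum(scores) / len(scores) if scores else 0.0

# Live judge calls need an LLM API key, so they are gated behind an
# opt-in flag here instead of running unconditionally.
if os.environ.get("JUDGES_DEMO") == "1":
    # Classifier: binary True/False evaluation of correctness.
    from judges.classifiers.correctness import PollMultihopCorrectness
    correctness = PollMultihopCorrectness(model="gpt-4o-mini")
    verdict = correctness.judge(
        input="What is the capital of France?",
        output="Paris is the capital of France.",
        expected="Paris",
    )
    print(verdict.score, verdict.reasoning)

    # Grader: scored evaluation on a numerical scale.
    from judges.graders.response_quality import MTBenchChatBotResponseQuality
    quality = MTBenchChatBotResponseQuality(model="gpt-4o-mini")
    grade = quality.judge(
        input="What is the capital of France?",
        output="Paris is the capital of France.",
    )
    print(grade.score, grade.reasoning)
```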

What This Notebook Does

  1. Demonstrates how to use judges with litellm to evaluate AI search engine responses.
  2. Uses Llama 3.3 (together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo) as the LLM evaluator to assess:
    • Correctness (factual accuracy).
    • Quality (clarity, helpfulness).
  3. Provides a step-by-step workflow to evaluate outputs generated by search engines.
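Because judges routes model calls through litellm, pointing the evaluator at Llama 3.3 on Together AI is, in sketch form, just a matter of the model string. The module path below is an assumption to check against the judges docs, and the per-engine averaging helper is a hypothetical illustration of comparing search-engine outputs after judging.

```python
import os

# judges routes LLM calls through litellm, so switching the evaluator
# to Llama 3.3 hosted on Together AI is just a different model string.
MODEL = "together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo"

def mean_score_by_engine(results):
    """Hypothetical helper: average judge scores per search engine,
    given (engine_name, score) pairs."""
    totals = {}
    for engine, score in results:
        s, n = totals.get(engine, (0.0, 0))
        totals[engine] = (s + float(score), n + 1)
    return {engine: s / n for engine, (s, n) in totals.items()}

# The live call is gated on the provider key (litellm reads
# TOGETHERAI_API_KEY for Together AI); the module path is an assumption.
if os.environ.get("TOGETHERAI_API_KEY"):
    from judges.graders.response_quality import MTBenchChatBotResponseQuality
    judge = MTBenchChatBotResponseQuality(model=MODEL)
    grade = judge.judge(
        input="Who wrote 'War and Peace'?",
        output="Leo Tolstoy wrote 'War and Peace'.",
    )
    print(grade.score)
```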

Open-Source Tools & Resources


Why This Notebook?

This notebook provides a practical example of using judges with an open-source model (Llama 3.3) to evaluate real-world AI outputs. It highlights the library's flexibility, its ease of integration with litellm, and its usefulness for benchmarking AI systems in a transparent, reproducible manner.


@jamesliounis (Author)

@merveenoyan @stevhliu

@stevhliu (Member), Dec 17, 2024

"judges is an open-source library to use and create LLM-as-a-Judge evaluators. It provides a set of curated, research-backed..."

"...a collection of real-world Google queries...as our benchmark for comparing..."

"...which only includes human evaluated answers and their corresponding queries for correctness, clarity, and completeness."



@jamesliounis (Author)

Done!

@stevhliu (Member), Dec 17, 2024

I'd move the table of contents to the very top



@jamesliounis (Author)

Done!

@stevhliu (Member), Dec 17, 2024

"...or from Google Colab secrets, in which case, uncomment the relevant code examples below."
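A minimal sketch of the pattern this comment describes: read the key from an environment variable, with the Colab-secrets variant left commented out. TOGETHERAI_API_KEY is the variable litellm expects for Together AI; swap in the name for whichever provider you use.

```python
import os

# Read the provider API key from the environment. litellm expects
# TOGETHERAI_API_KEY for Together AI; adjust for other providers.
api_key = os.environ.get("TOGETHERAI_API_KEY")

# On Google Colab, uncomment these lines to read from Colab secrets instead:
# from google.colab import userdata
# api_key = userdata.get("TOGETHERAI_API_KEY")

if api_key is None:
    print("Set TOGETHERAI_API_KEY before running the evaluation cells.")
```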



@stevhliu (Member), Dec 17, 2024

Maybe add this to the Perplexity section?



@stevhliu (Member), Dec 17, 2024

Remove the last sentence

"...reading in our data that now contains the search results."



@jamesliounis (Author)

Done!

@stevhliu (Member), Dec 17, 2024

Maybe clarify MTBenchChatBotResponseQuality is also a "grader" type of judge (not really clear right now). It can say something like "Response Quality Evaluation Grader"



@jamesliounis (Author)

Done!

@stevhliu (Member) left a comment

Very cool library! 👏

Remember to add your notebook to the toctree, and modify index.md to include your notebook in the latest notebooks section (remove one of the older notebooks and replace it with yours).

@jamesliounis (Author)

Hey Stephen! Thanks for the prompt feedback. I have incorporated your comments, added nb to toctree and to index.md.

@stevhliu (Member) left a comment

Cool, thanks! Once @merveenoyan has had a chance to review, we can merge :)

- [Multimodal Retrieval-Augmented Generation (RAG) with Document Retrieval (ColPali) and Vision Language Models (VLMs)](multimodal_rag_using_document_retrieval_and_vlms)
- [Fine-Tuning a Vision Language Model (Qwen2-VL-7B) with the Hugging Face Ecosystem (TRL)](fine_tuning_vlm_trl)
- [Multi-agent RAG System 🤖🤝🤖](multiagent_rag_system)
- [Multimodal RAG with ColQwen2, Reranker, and Quantized VLMs on Consumer GPUs](multimodal_rag_using_document_retrieval_and_reranker_and_vlms)
- [Fine-tuning SmolVLM with TRL on a consumer GPU](fine_tuning_smol_vlm_sft_trl)
- [Smol Multimodal RAG: Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU](multimodal_rag_using_document_retrieval_and_smol_vlm)
@stevhliu (Member)

Sorry I wasn't clear, we should keep the most recent ones (towards the bottom) and remove the one on top (Multimodal Retrieval-Augmented Generation (RAG) with Document Retrieval (ColPali) and Vision Language Models (VLMs))

@jamesliounis (Author)

Modified it. Thanks for pointing that out.

@jamesliounis (Author)

@merveenoyan Happy New Year! Have you had the chance to take a look?

@merveenoyan (Collaborator), Jan 9, 2025

We use the Natural Questions dataset -- a collection of real-world Google queries and corresponding Wikipedia articles -- as our benchmark for comparing the quality of different AI search engines, as follows:

this sentence is a bit too long and hard to follow, can we simplify it?

nit: open-source* (there's an s at the end)




How about this?

We use the Natural Questions dataset as our benchmark for comparing the quality of different AI search engines. Natural Questions is a collection of real-world Google queries and corresponding Wikipedia articles. We'll walk through the following process:

@jamesliounis (Author)

Done!

@merveenoyan (Collaborator), Jan 9, 2025

it seems we already do this below, so no need to mention




Noted 👍 !

@jamesliounis (Author)

Removed it.

@merveenoyan (Collaborator), Jan 9, 2025

can you explain why you picked this instead of local serving? we often do local serving with open-source models in open-source cookbook




We picked this instead of local serving because it's a bit more lightweight and users don't need to have a machine available with local serving set up to get started. We'd love to add support for that in judges though in the future!

@jamesliounis (Author)

Added a sentence to explain.

@merveenoyan (Collaborator) left a comment

Left very minor nits, we can merge afterwards!
Sorry for the delay, I was off!

@freddiev4 commented Jan 9, 2025

> Left very minor nits, we can merge afterwards! Sorry for the delay, I was off!

@merveenoyan 👋🏼 thanks for reviewing! I'm going to shepherd this PR the rest of the way from our team. Will respond to your comments above and make updates + open a new PR if that's ok.

@freddiev4

@stevhliu @merveenoyan #270!
