Checking data with an LLM

You can use the llm module to run a quick check over your data and extract relevant information.

The following approach uses an OpenAI large language model (LLM) to do this check for you. The default model is gpt-4o-mini, which offers a good trade-off between accuracy and cost. Avoid processing sensitive data.

Our initial, very limited tests show that this method achieves approximately 75-85% agreement with a human reviewer on real-world relevance checks. The accuracy will likely depend on the ambiguity of the scope.

Tutorial

Step 1: Import the dependencies

from discovery_utils.utils.llm.batch_check import LLMProcessor
import pandas as pd

Step 2: Define your check

For example, here we define a simple classifier that determines whether a text is about heat pumps and also extracts which technology is mentioned in the text.

system_message = "Determine whether this text is about heat pumps"

fields = [
    {"name": "is_relevant", "type": "str", "description": "A one-word answer: 'yes' or 'no'."},
    {"name": "technology", "type": "str", "description": "What technology is mentioned in the text?"},
]
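
With these fields, each processed text should yield a record containing an is_relevant and a technology value. As an illustrative sketch only (the exact record layout may differ), the result for one text could look like:

{"id": "id1", "is_relevant": "yes", "technology": "heat pump"}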

Step 3: Run the check

This will be our test data

test_data = {
    "id2": "One type of low carbon heating is a heat pump",
    "id1": "Heat pumps are efficient",
    "id3": "Hydrogen boilers is another type of heating technology",
}

Initialise the LLMProcessor and run it

processor = LLMProcessor(
    output_path="output.jsonl", # path to the output file
    system_message=system_message, # system message
    session_name="heat_pump_test", # used for tracking usage on LangFuse
    output_fields=fields, # your output data fields
)

processor.run(test_data)
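
Once the run finishes, the results are written to output.jsonl. Assuming each line of that file is a JSON record containing your output fields, you can load the results with pandas (imported in Step 1), for example:

# Load the JSONL output into a DataFrame (one row per processed text)
results = pd.read_json("output.jsonl", lines=True)
print(results.head())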

You can adjust the batch_size and sleep_time parameters of the run() function to optimise the processing time (e.g., you may slightly increase the batch_size), as sketched below.
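
For example, a larger batch_size and a shorter sleep_time can speed up processing; the values below are illustrative rather than recommended defaults.

# Process more texts per batch and pause less between batches (illustrative values)
processor.run(test_data, batch_size=20, sleep_time=1)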

Using scope configs

If you're using a scope config file (as you would when using the discovery_utils.utils.search module), you can create a system message directly from the config file

from discovery_utils.utils.llm.batch_check import generate_system_message

system_message = generate_system_message("config.yaml")
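
The generated system message can then be passed to LLMProcessor in the same way as the hand-written one above; the output path and session name below are just placeholders, and the output fields and test data are reused from the earlier steps.

processor = LLMProcessor(
    output_path="scope_check_output.jsonl",  # placeholder output file
    system_message=system_message,  # system message generated from the scope config
    session_name="scope_check",  # placeholder session name for LangFuse tracking
    output_fields=fields,  # reuse the output fields defined earlier
)

processor.run(test_data)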