# Checking data with LLM
You can use the `llm` module to run a quick check over your data and extract relevant information.
The following approach uses an OpenAI large language model (LLM) to do this check for you. The default model is `gpt-4o-mini`, which offers a good trade-off between accuracy and cost. Avoid processing sensitive data.
Our initial, very limited tests show that this method achieves approximately 75-85% agreement with a human reviewer on real-world relevance checks. The accuracy will likely depend on the ambiguity of the scope.
## Step 1: Import the dependencies
```python
from discovery_utils.utils.llm.batch_check import LLMProcessor
import pandas as pd
```
## Step 2: Define your check
For example, here we define a simple classifier that determines whether a text is about heat pumps and also extracts the technology mentioned in the text.
```python
system_message = "Determine whether this text is about heat pumps"
fields = [
    {"name": "is_relevant", "type": "str", "description": "A one-word answer: 'yes' or 'no'."},
    {"name": "technology", "type": "str", "description": "What technology is mentioned in the text?"},
]
```
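To make the field definitions concrete, here is a sketch of the kind of record they would produce for a single input. The exact output schema is an assumption for illustration, not taken from the module's documentation.

```python
# Hypothetical record for one input text (schema assumed for
# illustration; the actual output format may differ):
example_record = {"is_relevant": "yes", "technology": "heat pump"}
```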
## Step 3: Run the check
This will be our test data:

```python
test_data = {
    "id1": "Heat pumps are efficient",
    "id2": "One type of low carbon heating is a heat pump",
    "id3": "Hydrogen boilers are another type of heating technology",
}
```
Initialise the `LLMProcessor` and run it:
```python
processor = LLMProcessor(
    output_path="output.jsonl",  # path to the output file
    system_message=system_message,  # system message
    session_name="heat_pump_test",  # used for tracking usage on LangFuse
    output_fields=fields,  # your output data fields
)
processor.run(test_data)
```
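When the run completes, the results are written to `output.jsonl`. Assuming each line of that file is one JSON record, you can load it back into a DataFrame with pandas (imported in Step 1) for inspection:

```python
# Load the JSON Lines output into a DataFrame for inspection.
results = pd.read_json("output.jsonl", lines=True)
print(results.head())
```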
You can adjust the `batch_size` and `sleep_time` parameters of the `run()` function to optimise processing time (e.g. you may slightly increase `batch_size`).
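For example (the values here are illustrative, not recommended defaults):

```python
# Process inputs in larger batches, pausing between batches.
# batch_size and sleep_time are the run() parameters mentioned above;
# the specific values are illustrative.
processor.run(test_data, batch_size=10, sleep_time=2)
```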
If you're using a scope config file (as you would when using the `discovery_utils.utils.search` module), you can create a system message directly from the config file:
```python
from discovery_utils.utils.llm.batch_check import generate_system_message

system_message = generate_system_message("config.yaml")
```
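The generated message can then be passed to the processor in place of a hand-written one, mirroring the example above:

```python
# Use the config-derived system message with the processor;
# the other arguments mirror the earlier example.
processor = LLMProcessor(
    output_path="output.jsonl",
    system_message=system_message,
    session_name="heat_pump_test",
    output_fields=fields,
)
processor.run(test_data)
```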