
An assessment of Zero-Shot Open Book Question Answering using Large Language Models

BobMerkus/ADS-LLM-QA

Thesis ADS-LLM-QA

This repository contains the implementation of the Thesis for the Advanced Data Science (ADS) Master at the University of Amsterdam (UvA). The thesis is titled: "Question Answering for Legal Documents using Open-Domain Question Answering". The thesis is written by: `J. van der Heijden` and supervised by: `Dr. M. de Rijke` and `Dr. M. Cochez`.

The text above was generated by GitHub Copilot and is a good example of a generative Large Language Model (LLM) that is hallucinating. The goal of my thesis is to investigate the use of LLMs in an Open-Domain Question Answering (ODQA) setting.

Recommended Installation

We use the Windows Subsystem for Linux (WSL 2).

  1. Setup Python Environment
    • Conda: conda env create -f environment.yml
    • pip: pip install -r requirements.txt
  2. Setup the Haystack Backend (ElasticSearch/FAISS). Start the ElasticSearch document store with a single node using launch_es() from the haystack.utils module.
    • Warning: Running an ElasticSearch instance on WSL is not recommended; run ElasticSearch on a separate machine instead (e.g., a virtual machine or a cloud instance). Running locally with multiple nodes requires a lot of memory.
    • If you decide to run locally anyway, a local cluster (with 3 nodes) can be started using docker compose up -d in a terminal; make sure a valid certificate is present in ./data/ca.crt after setup.
    • When encountering the error max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144], run: sysctl -w vm.max_map_count=262144.
  3. Create document store, write the documents (and generate embeddings if using a Neural retriever) + evaluation labels.
  4. Run inference for multiple models (Extractive + Generative). Models required for the analysis are not included in this repository, but are downloaded automatically through the Huggingface transformers library.

Optionally:

  • Setup OPEN_AI_KEY in ./data/OPEN_AI_KEY.txt if you want to use Open AI models via the API.
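The key lookup can be sketched as follows; the helper name and the environment-variable fallback are assumptions for illustration, only the ./data/OPEN_AI_KEY.txt path comes from this repository:

```python
import os
from pathlib import Path
from typing import Optional

def load_openai_key(path: str = "./data/OPEN_AI_KEY.txt") -> Optional[str]:
    """Return the OpenAI API key from the key file, falling back to the
    OPENAI_API_KEY environment variable; None if neither is available."""
    key_file = Path(path)
    if key_file.is_file():
        key = key_file.read_text().strip()
        if key:
            return key
    return os.environ.get("OPENAI_API_KEY")
```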

Main

main.py runs the actual thesis analysis; runtime settings can be set in main.yml. It evaluates multiple Haystack pipelines against the AWS dataset and contains the following functions:

  • document_import: Combines utils.aws.aws_docs_documents, utils.kubernetes.kubernetes_documents & utils.stackexchange.stackexchange_documents. The functions convert datasets to the haystack.Document format.
  • execute_pipeline: Executes a pipeline with a specific retriever + model (reader/generator) and saves the results to disk.
  • evaluate_answers: Evaluates a list of answers against a list of labels.
  • evaluate_pipeline: Evaluates a pipeline with a specific retriever + model and saves the results to disk.
  • main: Main function that runs the analysis. It imports the configuration from main.yml and optionally overwrites some arguments using the command line (run python main.py -h for information). It starts by importing the documents + evaluation labels and then runs the evaluate_pipeline function for each pipeline.
  • import_results: Reads the contents of all runtime results (configuration, pipeline output, evaluation metrics). Can be used for further analysis of the results.
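As an illustration of the kind of comparison evaluate_answers performs, here is a minimal sketch of the standard SQuAD-style exact-match and token-F1 metrics; the function names are hypothetical and the repository's actual evaluation code may differ:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace
    (SQuAD-style answer normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> bool:
    """True if prediction and gold answer are identical after normalization."""
    return normalize(pred) == normalize(gold)

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```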

Models

FARMReader vs. TransformerReader is discussed in [https://docs.haystack.deepset.ai/docs/reader#deeper-dive-farm-vs-transformers] and in [deepset-ai/haystack#248 (comment)]. We use the TransformerReader for extractive models and PromptNode + PromptTemplate (LFQA) for generative models.
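The practical difference between the two model types: an extractive reader returns a span (character offsets) from a retrieved document, while a generative model produces free-form text. A toy illustration of the extractive case, not the Haystack API:

```python
from typing import List, Optional, Tuple

def extract_span(context: str, candidates: List[str]) -> Optional[Tuple[int, int]]:
    """Toy 'reader': return (start, end) character offsets of the first
    candidate answer found in the context, mimicking the span-based output
    of an extractive model. Returns None if no candidate occurs."""
    for answer in candidates:
        start = context.find(answer)
        if start != -1:
            return start, start + len(answer)
    return None
```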

Utils module

Custom module containing utilities.

Natural Language Processing

The module utils.nlp.py contains general natural language processing utilities.

Haystack Pre Processing

The file utils.haystack_pre_processing.py contains 2 functions for processing file formats to the haystack.Document format, which is needed for writing to the document store.

  • soup_to_documents: Convert a bs4.BeautifulSoup object to a list of haystack.Document. Attempts to extract tables and convert them to key:value pairs in comma-delimited fashion (instead of the default parsing with newlines).
  • markdown_to_documents: Convert a markdown.markdown object to a list of haystack.Document. Wraps the soup_to_documents function above.
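The repository uses BeautifulSoup for this; as a rough illustration of the table-to-key:value idea using only the standard library (not the actual soup_to_documents code):

```python
from html.parser import HTMLParser

class TableFlattener(HTMLParser):
    """Collect table cell text row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell, self._in_cell = [], [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell, self._cell = True, []
        elif tag == "tr":
            self._row = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

def table_to_kv(html: str) -> str:
    """Pair the header row with each body row as comma-delimited
    'header:value' pairs (rows separated by '; ')."""
    parser = TableFlattener()
    parser.feed(html)
    header, *body = parser.rows
    return "; ".join(
        ", ".join(f"{h}:{v}" for h, v in zip(header, row)) for row in body
    )
```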

Amazon Web Services (AWS)

The module utils.aws.py contains the AWS dataset as published on [https://github.com/siagholami/aws-documentation].

  • aws_docs_import_qa: Import the original QA dataset from the AWS dataset
  • aws_docs_files: Import the file paths + file names from the AWS dataset
  • aws_docs_documents: Import the AWS dataset using the haystack.Document format. This contains the raw text + meta data
  • aws_docs_labels: Import the AWS dataset using the haystack.Label format. This contains the question + answer + document and is used for evaluation.
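Conceptually, each evaluation label ties a question to its gold answer and the document it comes from. A minimal stand-in for haystack.Label might look like the following; the field names and the example QA pair are illustrative, not taken from the dataset:

```python
from dataclasses import dataclass

@dataclass
class EvalLabel:
    """Minimal stand-in for a haystack.Label: one gold question/answer
    pair plus the id of the document the answer comes from."""
    question: str
    answer: str
    document_id: str

# Hypothetical example label in the shape used for evaluation.
labels = [
    EvalLabel(
        question="What is the maximum size of an S3 object?",
        answer="5 TB",
        document_id="s3-faq",
    )
]
```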

Webscraper

Custom webscraper for obtaining data from web pages in utils.webscraper.py:

  • get_proxies_from_file: import proxies from .txt file
  • get: wrapper for requests.get that uses proxies + local clone
  • webcrawl: webcrawler that uses the get function to recursively webscrape a specific (sub) domain. Was used for Kubernetes documentation + blog
  • parse_pdf: parse a pdf to raw text
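A sketch of what get_proxies_from_file likely does (one proxy per line, skipping blanks and comments; the comment convention is an assumption):

```python
from pathlib import Path
from typing import List

def get_proxies(path: str) -> List[str]:
    """Read one proxy address per line from a .txt file, skipping
    blank lines and lines starting with '#'."""
    proxies = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            proxies.append(line)
    return proxies
```

For round-robin rotation across requests, the list can be wrapped in itertools.cycle and advanced with next() on each call.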

Kubernetes

The module utils.kubernetes.py is used for importing Kubernetes-related data.

  • kubernetes_documents converts a local .json file to a list of haystack.Document
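The conversion can be sketched as follows; the JSON schema (records with "url" and "text" keys) and the Doc stand-in for haystack.Document are assumptions for illustration:

```python
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class Doc:
    """Minimal stand-in for haystack.Document."""
    content: str
    meta: dict = field(default_factory=dict)

def json_to_documents(path: str) -> List[Doc]:
    """Load a local .json file shaped like [{"url": ..., "text": ...}, ...]
    into Doc objects, keeping the source URL as metadata."""
    with open(path) as fh:
        records = json.load(fh)
    return [Doc(content=r["text"], meta={"url": r.get("url", "")}) for r in records]
```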

User Interface

Gradio

Gradio application of the chatbot in app_gradio.py. Run with python app_gradio.py.
