This repository contains the implementation of the Thesis for the Advanced Data Science (ADS) Master at the University of Amsterdam (UvA). The thesis is titled: "Question Answering for Legal Documents using Open-Domain Question Answering". The thesis is written by: `J. van der Heijden` and supervised by: `Dr. M. de Rijke` and `Dr. M. Cochez`.
The text above was generated by GitHub Copilot and is a good example of a generative Large Language Model (LLM) that is hallucinating. The goal of my thesis is to investigate the use of LLMs in an Open-Domain Question Answering (ODQA) setting.
We use the Windows Subsystem for Linux (WSL 2).
- Setup Python Environment
  - Conda: `conda env create -f environment.yml`
  - pip: `pip install -r requirements.txt`
- Setup Haystack Backend (ElasticSearch/FAISS). Start the ElasticSearch document store with a single node using `launch_es()` from the `haystack.utils` module. Warning: running an ElasticSearch instance on WSL is not recommended; run ElasticSearch on a separate machine instead (e.g., a virtual machine or a cloud instance). Running locally (with multiple nodes) requires a lot of memory. If you decide to do so anyway, a local cluster (with 3 nodes) can be run using `docker compose up -d` in a terminal; make sure a valid certificate is contained in `./data/ca.crt` after setup. When encountering the error `max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]`, run `sysctl -w vm.max_map_count=262144`.
- Create the document store, write the documents (and generate embeddings if using a neural retriever) + the evaluation labels (see the sketch after this list).
- Run inference for multiple models (Extractive + Generative). Models required for the analysis are not included in this repository, but are downloaded automatically through the Hugging Face `transformers` library.
Optionally:
- Setup `OPEN_AI_KEY` in `./data/OPEN_AI_KEY.txt` if you want to use OpenAI models via the API.
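A minimal sketch of the document-store step, assuming the Haystack v1 API (`launch_es`, `ElasticsearchDocumentStore`); the index name and example document below are placeholders, the real documents and labels come from `document_import` in `main.py`:

```python
from haystack import Document
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.utils import launch_es

# Start a single-node ElasticSearch instance (requires Docker) and wait for it to come up.
launch_es()

# Connect to the local node; the index name is illustrative.
document_store = ElasticsearchDocumentStore(host="localhost", index="documents")

# Write documents; in the actual analysis these come from document_import().
document_store.write_documents([
    Document(content="Amazon S3 is an object storage service.", meta={"source": "aws-docs"}),
])
```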
`main.py` is run for the actual Thesis analysis, while runtime settings can be set with `main.yml`. It evaluates multiple Haystack pipelines against the AWS dataset. It contains the following functions:

- `document_import`: Combines `utils.aws.aws_docs_documents`, `utils.kubernetes.kubernetes_documents` & `utils.stackexchange.stackexchange_documents`. The functions convert the datasets to the `haystack.Document` format.
- `execute_pipeline`: Executes a pipeline with a specific retriever + model (reader/generator) and saves the results to disk.
- `evaluate_answers`: Evaluates a list of answers against a list of labels.
- `evaluate_pipeline`: Evaluates a pipeline with a specific retriever + model and saves the results to disk.
- `main`: Main function that runs the analysis. It imports the configuration from `main.yml` and optionally overwrites some arguments using the command line (run `python main.py -h` for information). It starts by importing the documents + evaluation labels and then runs the `evaluate_pipeline` function for each pipeline (an illustrative sketch of the config + CLI override pattern follows below).
- `import_results`: Reads the contents of all runtime results (configuration, pipeline output, evaluation metrics). Can be used for further analysis of the results.
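The config/CLI mechanism is roughly the standard YAML-plus-argparse pattern; the keys shown here (`retriever`, `top_k`) are hypothetical and not the actual settings in `main.yml`:

```python
import argparse
import yaml

def load_config(path="main.yml"):
    # Read the runtime settings from the YAML file.
    with open(path) as f:
        return yaml.safe_load(f)

def parse_args(config):
    # Let the command line override selected YAML settings (keys are hypothetical).
    parser = argparse.ArgumentParser(description="Run the thesis analysis.")
    parser.add_argument("--retriever", default=config.get("retriever", "bm25"))
    parser.add_argument("--top_k", type=int, default=config.get("top_k", 5))
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args(load_config())
    print(vars(args))
```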
`FARMReader` or `TransformersReader`: discussed in https://docs.haystack.deepset.ai/docs/reader#deeper-dive-farm-vs-transformers and deepset-ai/haystack#248 (comment). We use the `TransformersReader` for extractive models and `PromptNode` + `PromptTemplate` (LFQA) for generative models.
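A sketch of the two pipeline flavours under the Haystack v1 API; the model names, prompt text, and `top_k` values below are placeholders, and the `PromptTemplate(prompt=...)` signature assumes a recent 1.x release (older releases use `prompt_text=`):

```python
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever, PromptNode, PromptTemplate, TransformersReader
from haystack.pipelines import ExtractiveQAPipeline

document_store = ElasticsearchDocumentStore(host="localhost", index="documents")
retriever = BM25Retriever(document_store=document_store)

# Extractive: the reader selects answer spans from the retrieved documents.
reader = TransformersReader(model_name_or_path="deepset/roberta-base-squad2")
extractive_pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
prediction = extractive_pipeline.run(
    query="What is Amazon S3?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 3}},
)

# Generative (LFQA): a PromptNode writes a long-form answer from the retrieved documents.
lfqa_template = PromptTemplate(
    prompt="Synthesize an answer to the question using the documents.\n"
    "Documents: {join(documents)}\nQuestion: {query}\nAnswer:"
)
generator = PromptNode(model_name_or_path="google/flan-t5-base",
                       default_prompt_template=lfqa_template)
answer = generator.prompt(lfqa_template, query="What is Amazon S3?",
                          documents=prediction["documents"])
```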
Custom module containing utilities.
The module `utils.nlp.py`:

- `hash_md5`: Hash a string using the MD5 algorithm.
- `match`: Regex matching for URLs, phone numbers, etc.
- `normalize_answer`: Normalize an answer for NLP evaluation. Sourced from the SQuAD v2 script: https://github.com/white127/SQUAD-2.0-bidaf/blob/master/evaluate-v2.0.py
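For reference, the SQuAD-style normalization that `normalize_answer` is based on (lower-casing, punctuation and article removal, whitespace collapsing) looks roughly like this:

```python
import re
import string

def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace (SQuAD v2 style)."""
    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        return "".join(ch for ch in text if ch not in set(string.punctuation))

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))
```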
The file `utils.haystack_pre_processing.py` contains 2 functions for processing file formats to the `haystack.Document` format, which is needed for writing to the document store.

- `soup_to_documents`: Convert a `bs4.BeautifulSoup` object to a list of `haystack.Document`. Attempts to extract tables and convert them to key:value pairs in comma-delimited fashion (instead of default parsing with newlines).
- `markdown_to_documents`: Convert a `markdown.markdown` object to a list of `haystack.Document`. Wraps around the above `soup_to_documents` function.
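A simplified sketch of the conversion path these helpers follow (Markdown rendered to HTML, parsed with BeautifulSoup, wrapped in a `haystack.Document`); the repository's table handling is more elaborate than this:

```python
from bs4 import BeautifulSoup
from haystack import Document
from markdown import markdown

def markdown_to_document_sketch(md_text: str, source: str) -> Document:
    # Render Markdown to HTML, then strip the tags to obtain plain text.
    html = markdown(md_text)
    soup = BeautifulSoup(html, "html.parser")
    return Document(content=soup.get_text(separator="\n"), meta={"source": source})

doc = markdown_to_document_sketch("# Amazon S3\nObject storage service.", source="aws-docs")
```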
Custom module containing the AWS dataset in `utils.aws.py`, as published on https://github.com/siagholami/aws-documentation.

- `aws_docs_import_qa`: Import the original QA dataset from the AWS dataset.
- `aws_docs_files`: Import the file paths + file names from the AWS dataset.
- `aws_docs_documents`: Import the AWS dataset using the `haystack.Document` format. This contains the raw text + meta data.
- `aws_docs_labels`: Import the AWS dataset using the `haystack.Label` format. This contains the question + answer + document and is used for evaluation.
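To illustrate the `haystack.Label` format that `aws_docs_labels` produces, a made-up QA pair (the field values are invented; only the structure matters):

```python
from haystack import Answer, Document, Label

doc = Document(content="Amazon S3 stores data as objects within buckets.",
               meta={"source": "aws-docs"})

# A gold label links a question, its answer, and the source document for evaluation.
label = Label(
    query="Where does Amazon S3 store data?",
    answer=Answer(answer="as objects within buckets"),
    document=doc,
    is_correct_answer=True,
    is_correct_document=True,
    origin="gold-label",
)
```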
Custom webscraper for obtaining data from web pages in `utils.webscraper.py`:

- `get_proxies_from_file`: Import proxies from a .txt file.
- `get`: Wrapper for `requests.get` that uses proxies + a local clone.
- `webcrawl`: Webcrawler that uses the `get` function to recursively webscrape a specific (sub)domain. Was used for the Kubernetes documentation + blog.
- `parse_pdf`: Parse a PDF to raw text.
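Not the repository's implementation, but a sketch of the proxied-request pattern the `get` wrapper follows; the proxy-file format is an assumption:

```python
import random
import requests

def get_proxies_from_file(path="proxies.txt"):
    # One proxy URL per line, e.g. "http://host:port" (format assumed).
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def get(url, proxies=None, timeout=10):
    # Route the request through a randomly chosen proxy, if any are given.
    proxy = random.choice(proxies) if proxies else None
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy} if proxy else None,
        timeout=timeout,
    )

response = get("https://kubernetes.io/docs/home/")
```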
The module `utils.kubernetes.py` is used for importing Kubernetes-related data.

- `kubernetes_documents`: Converts a local .json file to a list of `haystack.Document`.
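A sketch of what such a conversion could look like; the file path and JSON layout are assumptions, not the actual format used in the repository:

```python
import json
from haystack import Document

def kubernetes_documents_sketch(path="data/kubernetes.json"):
    # Assumed layout: a list of {"text": ..., "url": ...} records.
    with open(path) as f:
        records = json.load(f)
    return [Document(content=r["text"], meta={"url": r.get("url")}) for r in records]
```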
Chatbot application (Gradio) in `app_gradio.py`. Run with `python app_gradio.py`.