This repository contains the implementation of the Thesis for the Advanced Data Science (ADS) Master at the University of Amsterdam (UvA). The thesis is titled: "Question Answering for Legal Documents using Open-Domain Question Answering". The thesis is written by: `J. van der Heijden` and supervised by: `Dr. M. de Rijke` and `Dr. M. Cochez`.
The text above was generated by GitHub Copilot and is a good example of a generative Large Language Model (LLM) that is hallucinating. The goal of my thesis is to investigate the use of LLMs in an Open-Domain Question Answering (ODQA) setting.
We use the Windows Subsystem for Linux (WSL 2).
- Setup Python Environment
  - Conda: `conda env create -f environment.yml`
  - pip: `pip install -r requirements.txt`
- Setup Haystack Backend (ElasticSearch/FAISS). Start the ElasticSearch document store with a single node using `launch_es()` from the `haystack.utils` module. Warning: running an ElasticSearch instance on WSL is not recommended; run ElasticSearch on a separate machine instead (e.g., a virtual machine or a cloud instance). Running locally (with multiple nodes) requires a lot of memory. If you decide to do so anyway, a local cluster (with 3 nodes) can be run using `docker compose up -d` in a terminal; make sure a valid certificate is contained in `./data/ca.crt` after setup. When encountering the error `max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]`, run `sysctl -w vm.max_map_count=262144`.
- Create the document store, write the documents (and generate embeddings if using a neural retriever) + the evaluation labels (see the sketch after this list).
- Run inference for multiple models (Extractive + Generative). Models required for the analysis are not included in this repository, but are downloaded automatically through the Hugging Face `transformers` library.
Optionally:
- Setup `OPEN_AI_KEY` in `./data/OPEN_AI_KEY.txt` if you want to use OpenAI models via the API.
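A minimal sketch of the document-store step, assuming the Haystack v1 API (`launch_es`, `ElasticsearchDocumentStore`); the index name and example document below are placeholders, the real documents and labels come from `document_import` in `main.py`:

```python
from haystack import Document
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.utils import launch_es

# Start a single-node ElasticSearch instance (requires Docker) and wait for it to come up.
launch_es()

# Connect to the local node; the index name is illustrative.
document_store = ElasticsearchDocumentStore(host="localhost", index="documents")

# Write documents; in the actual analysis these come from document_import().
document_store.write_documents([
    Document(content="Amazon S3 is an object storage service.", meta={"source": "aws-docs"}),
])
```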
`main.py` is run for the actual Thesis analysis, while runtime settings can be set with `main.yml`. It evaluates multiple Haystack pipelines against the AWS dataset. It contains the following functions:

- `document_import`: Combines `utils.aws.aws_docs_documents`, `utils.kubernetes.kubernetes_documents` & `utils.stackexchange.stackexchange_documents`. The functions convert the datasets to the `haystack.Document` format.
- `execute_pipeline`: Executes a pipeline with a specific retriever + model (reader/generator) and saves the results to disk.
- `evaluate_answers`: Evaluates a list of answers against a list of labels.
- `evaluate_pipeline`: Evaluates a pipeline with a specific retriever + model and saves the results to disk.
- `main`: Main function that runs the analysis. It imports the configuration from `main.yml` and optionally overwrites some arguments using the command line (run `python main.py -h` for information). It starts by importing the documents + evaluation labels and then runs the `evaluate_pipeline` function for each pipeline (an illustrative sketch of the config + CLI override pattern follows below).
- `import_results`: Reads the contents of all runtime results (configuration, pipeline output, evaluation metrics). Can be used for further analysis of the results.
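The config/CLI mechanism is roughly the standard YAML-plus-argparse pattern; the keys shown here (`retriever`, `top_k`) are hypothetical and not the actual settings in `main.yml`:

```python
import argparse
import yaml

def load_config(path="main.yml"):
    # Read the runtime settings from the YAML file.
    with open(path) as f:
        return yaml.safe_load(f)

def parse_args(config):
    # Let the command line override selected YAML settings (keys are hypothetical).
    parser = argparse.ArgumentParser(description="Run the thesis analysis.")
    parser.add_argument("--retriever", default=config.get("retriever", "bm25"))
    parser.add_argument("--top_k", type=int, default=config.get("top_k", 5))
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args(load_config())
    print(vars(args))
```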
`FARMReader` or `TransformersReader`: discussed in https://docs.haystack.deepset.ai/docs/reader#deeper-dive-farm-vs-transformers and deepset-ai/haystack#248 (comment). We use the `TransformersReader` for extractive models and `PromptNode` + `PromptTemplate` (LFQA) for generative models.
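A sketch of the two pipeline flavours under the Haystack v1 API; the model names, prompt text, and `top_k` values below are placeholders, and the `PromptTemplate(prompt=...)` signature assumes a recent 1.x release (older releases use `prompt_text=`):

```python
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever, PromptNode, PromptTemplate, TransformersReader
from haystack.pipelines import ExtractiveQAPipeline

document_store = ElasticsearchDocumentStore(host="localhost", index="documents")
retriever = BM25Retriever(document_store=document_store)

# Extractive: the reader selects answer spans from the retrieved documents.
reader = TransformersReader(model_name_or_path="deepset/roberta-base-squad2")
extractive_pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
prediction = extractive_pipeline.run(
    query="What is Amazon S3?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 3}},
)

# Generative (LFQA): a PromptNode writes a long-form answer from the retrieved documents.
lfqa_template = PromptTemplate(
    prompt="Synthesize an answer to the question using the documents.\n"
    "Documents: {join(documents)}\nQuestion: {query}\nAnswer:"
)
generator = PromptNode(model_name_or_path="google/flan-t5-base",
                       default_prompt_template=lfqa_template)
answer = generator.prompt(lfqa_template, query="What is Amazon S3?",
                          documents=prediction["documents"])
```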
Custom module containing utilities.
The module `utils.nlp.py`:

- `hash_md5`: Hash a string using the MD5 algorithm.
- `match`: Regex matching for URLs, phone numbers, etc.
- `normalize_answer`: Normalize an answer for NLP evaluation. Sourced from the SQuAD v2 script: https://github.com/white127/SQUAD-2.0-bidaf/blob/master/evaluate-v2.0.py
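For reference, the SQuAD-style normalization that `normalize_answer` is based on (lower-casing, punctuation and article removal, whitespace collapsing) looks roughly like this:

```python
import re
import string

def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace (SQuAD v2 style)."""
    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        return "".join(ch for ch in text if ch not in set(string.punctuation))

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))
```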
The file `utils.haystack_pre_processing.py` contains 2 functions for processing file formats to the `haystack.Document` format, which is needed for writing to the document store.

- `soup_to_documents`: Convert a `bs4.BeautifulSoup` object to a list of `haystack.Document`. Attempts to extract tables and convert them to key:value pairs in comma-delimited fashion (instead of default parsing with newlines).
- `markdown_to_documents`: Convert a `markdown.markdown` object to a list of `haystack.Document`. Wraps around the above `soup_to_documents` function.
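A simplified sketch of the conversion path these helpers follow (Markdown rendered to HTML, parsed with BeautifulSoup, wrapped in a `haystack.Document`); the repository's table handling is more elaborate than this:

```python
from bs4 import BeautifulSoup
from haystack import Document
from markdown import markdown

def markdown_to_document_sketch(md_text: str, source: str) -> Document:
    # Render Markdown to HTML, then strip the tags to obtain plain text.
    html = markdown(md_text)
    soup = BeautifulSoup(html, "html.parser")
    return Document(content=soup.get_text(separator="\n"), meta={"source": source})

doc = markdown_to_document_sketch("# Amazon S3\nObject storage service.", source="aws-docs")
```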
Custom module containing the AWS dataset in `utils.aws.py`, as published on https://github.com/siagholami/aws-documentation.

- `aws_docs_import_qa`: Import the original QA dataset from the AWS dataset.
- `aws_docs_files`: Import the file paths + file names from the AWS dataset.
- `aws_docs_documents`: Import the AWS dataset using the `haystack.Document` format. This contains the raw text + meta data.
- `aws_docs_labels`: Import the AWS dataset using the `haystack.Label` format. This contains the question + answer + document and is used for evaluation.
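To illustrate the `haystack.Label` format that `aws_docs_labels` produces, a made-up QA pair (the field values are invented; only the structure matters):

```python
from haystack import Answer, Document, Label

doc = Document(content="Amazon S3 stores data as objects within buckets.",
               meta={"source": "aws-docs"})

# A gold label links a question, its answer, and the source document for evaluation.
label = Label(
    query="Where does Amazon S3 store data?",
    answer=Answer(answer="as objects within buckets"),
    document=doc,
    is_correct_answer=True,
    is_correct_document=True,
    origin="gold-label",
)
```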
Custom webscraper for obtaining data from web pages in `utils.webscraper.py`:

- `get_proxies_from_file`: Import proxies from a .txt file.
- `get`: Wrapper for `requests.get` that uses proxies + a local clone.
- `webcrawl`: Webcrawler that uses the `get` function to recursively webscrape a specific (sub)domain. Was used for the Kubernetes documentation + blog.
- `parse_pdf`: Parse a PDF to raw text.
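Not the repository's implementation, but a sketch of the proxied-request pattern the `get` wrapper follows; the proxy-file format is an assumption:

```python
import random
import requests

def get_proxies_from_file(path="proxies.txt"):
    # One proxy URL per line, e.g. "http://host:port" (format assumed).
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def get(url, proxies=None, timeout=10):
    # Route the request through a randomly chosen proxy, if any are given.
    proxy = random.choice(proxies) if proxies else None
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy} if proxy else None,
        timeout=timeout,
    )

response = get("https://kubernetes.io/docs/home/")
```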
The module `utils.kubernetes.py` is used for importing Kubernetes-related data.

- `kubernetes_documents`: Converts a local .json file to a list of `haystack.Document`.
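A sketch of what such a conversion could look like; the file path and JSON layout are assumptions, not the actual format used in the repository:

```python
import json
from haystack import Document

def kubernetes_documents_sketch(path="data/kubernetes.json"):
    # Assumed layout: a list of {"text": ..., "url": ...} records.
    with open(path) as f:
        records = json.load(f)
    return [Document(content=r["text"], meta={"url": r.get("url")}) for r in records]
```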
Chatbot application (Gradio) in `app_gradio.py`. Run with `python app_gradio.py`.