Created Python Code to set up the entire pipeline for benchmarking faiss-index with DuckDB and PGVector #1984

Open · wants to merge 29 commits into base: master

Commits (29)
95e7b02
added benchmarking files and instructions
Jun 14, 2024
557f560
adopted hnsw index and fixed incorrect sql queries
Jun 14, 2024
dd128a6
uncommented code
Jun 18, 2024
2206d78
Merge pull request #1 from SeanSong25/fixing_pgvector_benchmark
SeanSong25 Jun 18, 2024
22f9f85
modified the benchmark files to record performance
Jul 3, 2024
fd4db15
Merge branch 'castorini:master' into master
SeanSong25 Jul 3, 2024
6ba014d
removed unnecessary commenting characters
Jul 3, 2024
67d86ca
faiss_to_pgvector file created
Jul 22, 2024
a1e3ed0
updated git ignore to ignore venv files
songxiaoyansean Jul 22, 2024
b44f5dc
Merge branch 'performance_benchmark' of https://github.com/SeanSong25…
songxiaoyansean Jul 22, 2024
3b3fb28
Merge pull request #2 from SeanSong25/performance_benchmark
SeanSong25 Jul 31, 2024
bc0669c
refactored benchmark scripts, and added vector extraction tool
songxiaoyansean Jul 31, 2024
b077548
basic set up for benchmarking
Sep 11, 2024
4af3d03
updated doc, included instruction on running full msmarco dataset
Sep 12, 2024
82e3527
updated instruction
Sep 12, 2024
dff3006
Merge pull request #4 from SeanSong25/performance_benchmark
SeanSong25 Sep 12, 2024
744e4ec
discarded unneeded files
Sep 12, 2024
894f67b
addressed comments by reorganizing files and adding .gitkeep
Sep 14, 2024
587ccef
Merge pull request #5 from SeanSong25/performance_benchmark
SeanSong25 Sep 14, 2024
410ad2e
added git keep content
Sep 14, 2024
92d9444
Merge pull request #6 from SeanSong25/performance_benchmark
SeanSong25 Sep 14, 2024
0970e93
filename change
Sep 14, 2024
773f650
Merge pull request #7 from SeanSong25/performance_benchmark
SeanSong25 Sep 14, 2024
168b084
modified instruction file to reflect file name changes
Sep 14, 2024
36a3490
deleted unneeded files, and cleanup naming
Sep 14, 2024
3ae4b18
changed relative file paths for benchmarks
Sep 14, 2024
f51ee6a
cleaned up code
Sep 22, 2024
df29a31
Updated the instructions doc to contain set by step guide
Sep 29, 2024
f455679
added instructions on where to find the result
Sep 29, 2024
12 changes: 12 additions & 0 deletions .gitignore
@@ -8,6 +8,9 @@ collections/*
indexes/*
.vscode/
venv/
*.duckdb
*trec_dot_product*
*msmarco_benchmark_results*

# build directories from `python3 setup.py sdist bdist_wheel`
build/
@@ -19,3 +22,12 @@ runs/

# logs should also be ignored
logs/

# tools
tools/

# binaries should also be ignored
bin/*
lib*
pyvenv*
share*
175 changes: 175 additions & 0 deletions docs/experiment-vectordbs.md
@@ -0,0 +1,175 @@
# Overview
We are going to run benchmarks for MSMarco and NFCorpus using DuckDB and PGVector on HNSW indexes.

# MSMarco

## Data Prep
Similar to the onboarding docs, we must first download and set up the collections and indexes if they are not already available. This time, however, we only need to index the queries; the document index itself will be downloaded and extracted by the faiss_index_extractor. First, we need to download the MS MARCO dataset.

```bash
mkdir collections/msmarco-passage

wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage

# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage

tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage
```

Next, we need to convert the MS MARCO tsv queries into Pyserini's jsonl files (which have one json object per line):

```bash
python tools/scripts/msmarco/convert_collection_to_jsonl.py \
--collection-path collections/msmarco-passage/queries.dev.small.tsv \
--output-folder collections/msmarco-passage/queries_jsonl
```
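Each line of the resulting jsonl file is a standalone JSON object with an `id` and a `contents` field, roughly like the following (the id and query text here are made up for illustration):

```json
{"id": "42", "contents": "what is the capital of france"}
```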

Now, we need to convert the jsonl queries into a faiss index.

```bash
python -m pyserini.encode \
input --corpus collections/msmarco-passage/queries_jsonl \
output --embeddings collections/msmarco-passage/queries_faiss \
--to-faiss \
encoder --encoder BAAI/bge-base-en-v1.5 --l2-norm \
--device cpu \
--pooling mean \
--batch 32
```
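If you want to sanity-check the encoded queries before running a benchmark, the `--to-faiss` output can be loaded directly with faiss. The snippet below is an optional check and assumes the default pyserini output layout of an `index` file plus a `docid` file in the output folder:

```python
import os
import faiss  # pip install faiss-cpu

QUERIES_DIR = 'collections/msmarco-passage/queries_faiss'

# Load the FAISS index produced by pyserini.encode --to-faiss.
index = faiss.read_index(os.path.join(QUERIES_DIR, 'index'))
print('queries:', index.ntotal, 'dimension:', index.d)

# The docid file maps row position -> query id, one id per line.
with open(os.path.join(QUERIES_DIR, 'docid')) as f:
    docids = [line.strip() for line in f]
print('first query id:', docids[0])
```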

Now, after the data is prepared, we can run the benchmark on DuckDB.

# Database setup
First, activate a Conda environment, for example your pyserini environment.

```bash
conda activate pyserini
```
## DuckDB
DuckDB is relatively easy to set up: it is an embedded, in-process database that can run entirely in memory, so all you need to do is install it from the command line:
```
pip install duckdb
```
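Before kicking off the full benchmark, you can quickly confirm that the `vss` extension loads and can build an HNSW index. The toy session below (3-dimensional vectors, table and index names chosen purely for illustration) is only a sanity check, not part of the benchmark pipeline:

```python
import duckdb

con = duckdb.connect()      # in-memory database
con.execute("INSTALL vss")  # vector similarity search extension
con.execute("LOAD vss")

con.execute("CREATE TABLE docs (id VARCHAR, vector FLOAT[3])")
con.execute("INSERT INTO docs VALUES ('a', [1.0, 0.0, 0.0]::FLOAT[3]), ('b', [0.0, 1.0, 0.0]::FLOAT[3])")
con.execute("CREATE INDEX ip_idx ON docs USING HNSW(vector) WITH (metric = 'ip')")

# Top-1 inner-product search against a toy query vector.
print(con.execute(
    "SELECT id, array_inner_product(vector, [1.0, 0.0, 0.0]::FLOAT[3]) AS score "
    "FROM docs ORDER BY score DESC LIMIT 1"
).fetchall())
```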
Then run the following command to execute the benchmark.

```
$ python3 vectordb_benchmark/run_benchmark.py \
--index_name='msmarco-v1-passage.bge-base-en-v1.5' \
--table_name='msmarco' \
--metric='ip' \
--query_index_path='collections/msmarco-passage/queries_faiss' \
--db_type='duckdb' \
--db_config_file='./scripts/vectordb_benchmark/duckdb_db_config.txt'
```
The `db_config_file` argument points to a text file specifying how much memory DuckDB is allowed to allocate; you can modify it if you want, and by default the memory limit is 100GB. The entire process may take over a day to complete, depending on your hardware setup. The script downloads the index, extracts the embedded vectors from it, builds the table in DuckDB, and runs the benchmark. Alternatively, you can run `./scripts/vectordb_benchmark/benchmark_msmarco_duckdb.sh` to do the same thing.
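The repository ships a default `duckdb_db_config.txt`; if you need to write your own, it is essentially just the memory limit expressed as a key/value line, along these lines (this is an illustrative guess at the format, so check the file in `scripts/vectordb_benchmark` for the exact keys):

```
memory_limit: 100GB
```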

## PGVector
With the DuckDB experiment finished, we can run the same experiment on PGVector. PGVector is a PostgreSQL extension, so you will need to install both PostgreSQL and PGVector for this experiment.

### Install PostgreSQL
```bash
conda install -c conda-forge postgresql
```

### Install PGVector
To manually install pgvector, first install the necessary build tools (gcc and make) using Conda:
```bash
conda install -c conda-forge gcc_linux-64 make
```
Then, you can clone the pgvector repository, and make and install the extension.

```bash
git clone https://github.com/pgvector/pgvector.git
cd pgvector
make PG_CONFIG=$(which pg_config)
make install
```

After the installation, verify that the pgvector.control file and library were installed correctly:
```bash
ls $(pg_config --sharedir)/extension/pgvector.control
ls $(pg_config --pkglibdir)/vector.so
```
If both files are present, the installation was successful.

### Start Database Server
With the installations done, you can initialize the database, create a user and database for your experiment, enable the vector extension, and start your PostgreSQL server. The script `vectordb_benchmark/init_and_start_postgres.sh` does all of this for you: it initializes the database, creates a database called `main_database` and a user called `main_user`, and enables the vector extension. You can simply run:
```bash
./init_and_start_postgres.sh ~/pgdata
```
and your PostgreSQL server will be up and running on port 5432. The only argument is the directory for the database data; you can change it if you want.
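For reference, a rough sketch of the steps the script automates is shown below; the actual `init_and_start_postgres.sh` in the repository is authoritative and may differ in details:

```bash
PGDATA_DIR=$1   # e.g. ~/pgdata

initdb -D "$PGDATA_DIR"                                   # create a fresh data directory
pg_ctl -D "$PGDATA_DIR" -l "$PGDATA_DIR/logfile" start    # start the server on the default port 5432

createuser main_user                                      # user referenced in the db config
createdb main_database --owner=main_user                  # database for the experiment
psql -d main_database -c "CREATE EXTENSION IF NOT EXISTS vector;"   # enable pgvector
```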

### Run the Benchmark
Now that the PGVector extension is installed and enabled in PostgreSQL, you can run the benchmark. First, make sure the `pgvector_db_config.txt` file contains the correct database configuration. By default it is:

```
dbname: main_db
user: main_user
password: 123456
host: localhost
port: 5432
```
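These are ordinary PostgreSQL connection parameters. As a quick connectivity check (independent of the benchmark script, which may read the file differently), you can parse the file and open a connection with, for example, psycopg2:

```python
import psycopg2  # pip install psycopg2-binary

def load_db_config(path):
    """Parse the simple 'key: value' config format shown above."""
    config = {}
    with open(path) as f:
        for line in f:
            if ':' in line:
                key, value = line.split(':', 1)
                config[key.strip()] = value.strip()
    return config

cfg = load_db_config('./scripts/vectordb_benchmark/pgvector_db_config.txt')
conn = psycopg2.connect(
    dbname=cfg['dbname'], user=cfg['user'], password=cfg['password'],
    host=cfg['host'], port=cfg['port'],
)
print('connected to', conn.get_dsn_parameters()['dbname'])
conn.close()
```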

Then, you can run the benchmark by running the following command.

```
$ python3 vectordb_benchmark/run_benchmark.py \
--index_name='msmarco-v1-passage.bge-base-en-v1.5' \
--table_name='msmarco' \
--metric='ip' \
--query_index_path='collections/msmarco-passage/queries_faiss' \
--db_type='pgvector' \
--db_config_file='./scripts/vectordb_benchmark/pgvector_db_config.txt'
```
Alternatively, run the script `./scripts/vectordb_benchmark/benchmark_msmarco_pgvector.sh`.

Note that after a run, your PostgreSQL instance will still contain the table data; the current behaviour is to drop the table and index, if they exist, when the benchmark starts. Later, we will add an option to skip table creation and index building, so that the benchmark can be run multiple times without re-creating the table and index every time.
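For context, the pgvector side of the experiment boils down to SQL along the following lines. This is an illustration with toy 3-dimensional vectors and made-up table and index names, not the exact statements issued by `run_benchmark.py` (the real tables hold 768-dimensional bge-base-en-v1.5 vectors):

```sql
DROP TABLE IF EXISTS msmarco_demo;
CREATE TABLE msmarco_demo (id TEXT, content TEXT, vector vector(3));
INSERT INTO msmarco_demo VALUES ('d1', 'example passage', '[1, 0, 0]');

-- HNSW index for inner-product search; vector_l2_ops and vector_cosine_ops cover the other metrics.
CREATE INDEX msmarco_demo_ip_idx ON msmarco_demo USING hnsw (vector vector_ip_ops);

-- <#> is negative inner product, so ascending order returns the largest inner products first.
SELECT id, -(vector <#> '[1, 0, 0]') AS score
FROM msmarco_demo
ORDER BY vector <#> '[1, 0, 0]'
LIMIT 10;
```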

# Results
To view the output of the benchmark, check the `msmarco_benchmark_results.txt` file in the `scripts/vectordb_benchmark` folder. It contains the total time, as well as the mean, variance, minimum, and maximum time to run a single query against the HNSW index built in the vector database, plus the ndcg@10 result and the verbose output of the TREC evaluation tool. The raw TREC evaluation output is in the file `trec_dot_product_output.txt` in the top-level directory.

# NFCorpus

## Data Prep
Similar to the onboarding docs, we must first download the NFCorpus Dataset.

```bash
wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip -P collections
unzip collections/nfcorpus.zip -d collections
```

## 1. Encode the Corpus
Create a directory for document embeddings and encode the corpus using the specified encoder.

```bash
mkdir indexes/faiss-nfcorpus
mkdir indexes/faiss-nfcorpus/documents
python -m pyserini.encode \
input --corpus collections/nfcorpus/corpus.jsonl \
output --embeddings indexes/faiss-nfcorpus/documents \
encoder --encoder BAAI/bge-base-en-v1.5 --l2-norm \
--device cpu \
--pooling mean \
--batch 32
```

## 2. Encode the Queries
Create a directory for query embeddings and encode the queries using the specified encoder.

```bash
mkdir indexes/faiss-nfcorpus/queries
python -m pyserini.encode \
input --corpus collections/nfcorpus/queries.jsonl \
output --embeddings indexes/faiss-nfcorpus/queries \
encoder --encoder BAAI/bge-base-en-v1.5 --l2-norm \
--device cpu \
--pooling mean \
--batch 32
```

## 3. Run Benchmarks

```bash
python3 ./scripts/vectordb_benchmark/benchmark_nfcorpus_duckdb.py
python3 ./scripts/vectordb_benchmark/benchmark_nfcorpus_pgvector.py
```
Empty file modified scripts/msmarco-passage/encode_queries.py
100644 → 100755
Empty file.
7 changes: 7 additions & 0 deletions scripts/vectordb_benchmark/benchmark_msmarco_duckdb.sh
@@ -0,0 +1,7 @@
python3 ./run_benchmark.py \
--index_name='msmarco-v1-passage.bge-base-en-v1.5' \
--table_name='msmarco' \
--metric='ip' \
--query_index_path='../../collections/msmarco-passage/queries_faiss' \
--db_type='duckdb' \
--db_config_file='duckdb_db_config.txt'
7 changes: 7 additions & 0 deletions scripts/vectordb_benchmark/benchmark_msmarco_pgvector.sh
@@ -0,0 +1,7 @@
python3 ./run_benchmark.py \
--index_name='msmarco-v1-passage.bge-base-en-v1.5' \
--table_name='msmarco' \
--metric='ip' \
--query_index_path='../../collections/msmarco-passage/queries_faiss' \
--db_type='pgvector' \
--db_config_file='pgvector_db_config.txt'
134 changes: 134 additions & 0 deletions scripts/vectordb_benchmark/benchmark_nfcorpus_duckdb.py
@@ -0,0 +1,134 @@
import json
import duckdb
import numpy as np
import subprocess
import time

# Paths to embedding, query, and output files
DOCUMENT_JSONL_FILE_PATH = '../../indexes/faiss-nfcorpus/documents/embeddings.jsonl'
QUERY_JSONL_FILE_PATH = '../../indexes/faiss-nfcorpus/queries/embeddings.jsonl'
TREC_DOT_PRODUCT_OUTPUT_FILE_PATH = '../../runs/.run-faiss-nfcorpus-result_dot_product.txt'
TREC_COSINE_OUTPUT_FILE_PATH = '../../runs/.run-faiss-nfcorpus-result_cosine.txt'
TREC_L2SQ_OUTPUT_FILE_PATH = '../../runs/.run-faiss-nfcorpus-result_l2sq.txt'
K = 10  # Number of nearest neighbors to retrieve
RUN_ID = "DuckDBHNSW"  # Identifier for the run

def get_vector_size(jsonl_file_path):
    """Determines the size of the vector, assuming all vectors have the same dimension."""
    with open(jsonl_file_path, 'r') as file:
        for line in file:
            data = json.loads(line)
            vector = data.get('vector', [])
            return len(vector)
    return 0

def insert_data_into_table(con, doc_id, content, vector, table):
    """Inserts a single document into the DuckDB table."""
    con.execute(f"INSERT INTO {table} (id, content, vector) VALUES (?, ?, ?)", (doc_id, content, vector))

def setup_database():
    """Sets up the DuckDB database and inserts document data."""
    con = duckdb.connect(database=':memory:')
    con.execute("INSTALL vss")
    con.execute("LOAD vss")
    con.execute("PRAGMA temp_directory='/tmp/duckdb_temp'")
    con.execute("PRAGMA memory_limit='4GB'")

    vector_size = get_vector_size(DOCUMENT_JSONL_FILE_PATH)
    print(f"Vector size: {vector_size}")

    # Create documents table
    con.execute(f"""
        CREATE TABLE documents (
            id STRING,
            content STRING,
            vector FLOAT[{vector_size}]
        )
    """)

    # Insert data from JSONL file
    with open(DOCUMENT_JSONL_FILE_PATH, 'r') as file:
        for line in file:
            data = json.loads(line)
            insert_data_into_table(con, data['id'], data['contents'], data['vector'], 'documents')

    # Create HNSW indices with different metrics and print the time taken for each build
    start_time = time.time()
    con.execute("CREATE INDEX l2sq_idx ON documents USING HNSW(vector) WITH (metric = 'l2sq')")
    print('building l2sq index: ', time.time() - start_time)
    start_time = time.time()
    con.execute("CREATE INDEX cos_idx ON documents USING HNSW(vector) WITH (metric = 'cosine')")
    print('building cosine index: ', time.time() - start_time)
    start_time = time.time()
    con.execute("CREATE INDEX ip_idx ON documents USING HNSW(vector) WITH (metric = 'ip')")
    print('building ip index: ', time.time() - start_time)

    return con

def run_trec_eval(trec_output_file_path):
    """Runs TREC evaluation and prints ndcg@10."""
    command = [
        "python", "-m", "pyserini.eval.trec_eval",
        "-c", "-m", "ndcg_cut.10",
        "../../collections/nfcorpus/qrels/test.qrels",
        trec_output_file_path
    ]
    print("ndcg@10 for ", trec_output_file_path)
    subprocess.run(command)

def run_benchmark(con, trec_output_file_path, metric):
    """Runs the benchmark and writes results in TREC format."""
    query_times = []
    with open(trec_output_file_path, 'w') as trec_file:
        with open(QUERY_JSONL_FILE_PATH, 'r') as query_file:
            for line in query_file:
                data = json.loads(line)
                query_id = data['id']
                vector = data['vector']

                # Select the scoring function and sort order for the metric:
                # distances are better when smaller, similarities when larger.
                if metric == 'l2sq':
                    evaluation_metric = 'array_distance'
                    order = 'ASC'
                elif metric == 'cosine':
                    evaluation_metric = 'array_cosine_similarity'
                    order = 'DESC'
                elif metric == 'ip':
                    evaluation_metric = 'array_inner_product'
                    order = 'DESC'
                else:
                    raise ValueError(f"Unknown metric: {metric}")

                sql_query = (
                    f"SELECT id, {evaluation_metric}(vector, ?::FLOAT[{len(vector)}]) as score "
                    f"FROM documents ORDER BY score {order} LIMIT ?"
                )
                # Time the execution
                start_time = time.time()
                results = con.execute(sql_query, (vector, K)).fetchall()
                end_time = time.time()

                # Calculate the time for this query and add it to the list
                query_time = end_time - start_time
                query_times.append(query_time)

                # Write results in TREC format; trec_eval ranks by descending score,
                # so negate l2sq distances to keep the closest documents ranked first.
                for rank, (doc_id, score) in enumerate(results, start=1):
                    if metric == 'l2sq':
                        score = -score
                    trec_file.write(f"{query_id} Q0 {doc_id} {rank} {score} {RUN_ID}\n")

    print(f"TREC results written to {trec_output_file_path}")
    run_trec_eval(trec_output_file_path)

    # Aggregate statistics
    total_time = sum(query_times)
    mean_time = np.mean(query_times)
    variance_time = np.var(query_times)
    min_time = min(query_times)
    max_time = max(query_times)
    return total_time, mean_time, variance_time, min_time, max_time

if __name__ == "__main__":
    con = setup_database()

    # Running the benchmarks
    print('l2sq: ', run_benchmark(con, TREC_L2SQ_OUTPUT_FILE_PATH, 'l2sq'))
    print('cosine: ', run_benchmark(con, TREC_COSINE_OUTPUT_FILE_PATH, 'cosine'))
    print('ip: ', run_benchmark(con, TREC_DOT_PRODUCT_OUTPUT_FILE_PATH, 'ip'))

    # second run
    print("second run")
    print('l2sq: ', run_benchmark(con, TREC_L2SQ_OUTPUT_FILE_PATH, 'l2sq'))
    print('cosine: ', run_benchmark(con, TREC_COSINE_OUTPUT_FILE_PATH, 'cosine'))
    print('ip: ', run_benchmark(con, TREC_DOT_PRODUCT_OUTPUT_FILE_PATH, 'ip'))