Created Python Code to set up the entire pipeline for benchmarking faiss-index with DuckDB and PGVector #1984

Open · wants to merge 29 commits into base: master

Commits (29)
95e7b02
added benchmarking files and instructions
Jun 14, 2024
557f560
adopted hnsw index and fixed incorrect sql queries
Jun 14, 2024
dd128a6
uncommented code
Jun 18, 2024
2206d78
Merge pull request #1 from SeanSong25/fixing_pgvector_benchmark
SeanSong25 Jun 18, 2024
22f9f85
modified the benchmark files to record performance
Jul 3, 2024
fd4db15
Merge branch 'castorini:master' into master
SeanSong25 Jul 3, 2024
6ba014d
removed unnecessary commenting characters
Jul 3, 2024
67d86ca
faiss_to_pgvector file created
Jul 22, 2024
a1e3ed0
updated git ignore to ignore venv files
songxiaoyansean Jul 22, 2024
b44f5dc
Merge branch 'performance_benchmark' of https://github.com/SeanSong25…
songxiaoyansean Jul 22, 2024
3b3fb28
Merge pull request #2 from SeanSong25/performance_benchmark
SeanSong25 Jul 31, 2024
bc0669c
refactored benchmark scripts, and added vector extraction tool
songxiaoyansean Jul 31, 2024
b077548
basic set up for benchmarking
Sep 11, 2024
4af3d03
updated doc, included instruction on running full msmarco dataset
Sep 12, 2024
82e3527
updated instruction
Sep 12, 2024
dff3006
Merge pull request #4 from SeanSong25/performance_benchmark
SeanSong25 Sep 12, 2024
744e4ec
discarded unneeded files
Sep 12, 2024
894f67b
addressed comments by reorganizing files and adding .gitkeep
Sep 14, 2024
587ccef
Merge pull request #5 from SeanSong25/performance_benchmark
SeanSong25 Sep 14, 2024
410ad2e
added git keep content
Sep 14, 2024
92d9444
Merge pull request #6 from SeanSong25/performance_benchmark
SeanSong25 Sep 14, 2024
0970e93
filename change
Sep 14, 2024
773f650
Merge pull request #7 from SeanSong25/performance_benchmark
SeanSong25 Sep 14, 2024
168b084
modified instruction file to reflect file name changes
Sep 14, 2024
36a3490
deleted unneeded files, and cleanup naming
Sep 14, 2024
3ae4b18
changed relative file paths for benchmarks
Sep 14, 2024
f51ee6a
cleaned up code
Sep 22, 2024
df29a31
Updated the instructions doc to contain set by step guide
Sep 29, 2024
f455679
added instructions on where to find the result
Sep 29, 2024
12 changes: 12 additions & 0 deletions .gitignore
@@ -8,6 +8,9 @@ collections/*
indexes/*
.vscode/
venv/
*.duckdb
*trec_dot_product*
*msmarco_benchmark_results*

# build directories from `python3 setup.py sdist bdist_wheel`
build/
@@ -19,3 +22,12 @@ runs/

# logs should also be ignored
logs/

# tools
tools/

# binaries should also be ignored
bin/*
lib*
pyvenv*
share*
175 changes: 175 additions & 0 deletions docs/experiment-vectordbs.md
@@ -0,0 +1,175 @@
# Overview
We are going to run benchmarks for MSMarco and NFCorpus using DuckDB and PGVector on HNSW indexes.

# MSMarco

## Data Prep
Similar to the onboarding docs, we must first download and set up the collections and indexes if they are not already available. This time, however, we only need to index the queries; the document index itself will be downloaded and extracted by the faiss_index_extractor. First, we need to download the MS MARCO dataset.

```bash
mkdir collections/msmarco-passage

wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage

# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage

tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage
```

Next, we need to convert the MS MARCO tsv queries into Pyserini's jsonl files (which have one json object per line):

```bash
python tools/scripts/msmarco/convert_collection_to_jsonl.py \
--collection-path collections/msmarco-passage/queries.dev.small.tsv \
--output-folder collections/msmarco-passage/queries_jsonl
```
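Each line of the resulting jsonl file is a standalone JSON object with an `id` and a `contents` field, roughly like the following (the id and query text here are made up for illustration):

```json
{"id": "42", "contents": "what is the capital of france"}
```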

Now, we need to convert the jsonl queries into a faiss index.

```bash
python -m pyserini.encode \
input --corpus collections/msmarco-passage/queries_jsonl \
output --embeddings collections/msmarco-passage/queries_faiss \
--to-faiss \
encoder --encoder BAAI/bge-base-en-v1.5 --l2-norm \
--device cpu \
--pooling mean \
--batch 32
```
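If you want to sanity-check the encoded queries before running a benchmark, the `--to-faiss` output can be loaded directly with faiss. The snippet below is an optional check and assumes the default pyserini output layout of an `index` file plus a `docid` file in the output folder:

```python
import os
import faiss  # pip install faiss-cpu

QUERIES_DIR = 'collections/msmarco-passage/queries_faiss'

# Load the FAISS index produced by pyserini.encode --to-faiss.
index = faiss.read_index(os.path.join(QUERIES_DIR, 'index'))
print('queries:', index.ntotal, 'dimension:', index.d)

# The docid file maps row position -> query id, one id per line.
with open(os.path.join(QUERIES_DIR, 'docid')) as f:
    docids = [line.strip() for line in f]
print('first query id:', docids[0])
```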

Now, after the data is prepared, we can run the benchmark on DuckDB.

# Database setup
First, activate a Conda environment, for example your pyserini environment.

```bash
conda activate pyserini
```
## DuckDB
DuckDB is relatively easy to set up: it is an embedded, in-process database that can run entirely in memory, so all you need to do is install it from the command line:
```
pip install duckdb
```
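Before kicking off the full benchmark, you can quickly confirm that the `vss` extension loads and can build an HNSW index. The toy session below (3-dimensional vectors, table and index names chosen purely for illustration) is only a sanity check, not part of the benchmark pipeline:

```python
import duckdb

con = duckdb.connect()      # in-memory database
con.execute("INSTALL vss")  # vector similarity search extension
con.execute("LOAD vss")

con.execute("CREATE TABLE docs (id VARCHAR, vector FLOAT[3])")
con.execute("INSERT INTO docs VALUES ('a', [1.0, 0.0, 0.0]::FLOAT[3]), ('b', [0.0, 1.0, 0.0]::FLOAT[3])")
con.execute("CREATE INDEX ip_idx ON docs USING HNSW(vector) WITH (metric = 'ip')")

# Top-1 inner-product search against a toy query vector.
print(con.execute(
    "SELECT id, array_inner_product(vector, [1.0, 0.0, 0.0]::FLOAT[3]) AS score "
    "FROM docs ORDER BY score DESC LIMIT 1"
).fetchall())
```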
Then run the following command to execute the benchmark.

```
$ python3 vectordb_benchmark/run_benchmark.py \
--index_name='msmarco-v1-passage.bge-base-en-v1.5' \
--table_name='msmarco' \
--metric='ip' \
--query_index_path='collections/msmarco-passage/queries_faiss' \
--db_type='duckdb' \
--db_config_file='./scripts/vectordb_benchmark/duckdb_db_config.txt'
```
The `db_config_file` argument points to a text file specifying how much memory DuckDB is allowed to allocate; you can modify it if you want, and by default the memory limit is 100GB. The entire process may take over a day to complete, depending on your hardware setup. The script downloads the index, extracts the embedded vectors from it, builds the table in DuckDB, and runs the benchmark. Alternatively, you can run `./scripts/vectordb_benchmark/benchmark_msmarco_duckdb.sh` to do the same thing.
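The repository ships a default `duckdb_db_config.txt`; if you need to write your own, it is essentially just the memory limit expressed as a key/value line, along these lines (this is an illustrative guess at the format, so check the file in `scripts/vectordb_benchmark` for the exact keys):

```
memory_limit: 100GB
```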

## PGVector
With the DuckDB experiment finished, we can run the same experiment on PGVector. PGVector is a PostgreSQL extension, so you will need to install both PostgreSQL and PGVector for this experiment.

### Install PostgreSQL
```bash
conda install -c conda-forge postgresql
```

### Install PGVector
To manually install pgvector, first install the necessary build tools (gcc and make) using Conda:
```bash
conda install -c conda-forge gcc_linux-64 make
```
Then, you can clone the pgvector repository, and make and install the extension.

```bash
git clone https://github.com/pgvector/pgvector.git
cd pgvector
make PG_CONFIG=$(which pg_config)
make install
```

After the installation, verify that the pgvector.control file and library were installed correctly:
```bash
ls $(pg_config --sharedir)/extension/pgvector.control
ls $(pg_config --pkglibdir)/vector.so
```
If both files are present, the installation was successful.

### Start Database Server
With the installations done, you can initialize the database, create a user and database for your experiment, enable the vector extension, and start your PostgreSQL server. The script `vectordb_benchmark/init_and_start_postgres.sh` does all of this for you: it initializes the database, creates a database called `main_database` and a user called `main_user`, and enables the vector extension. You can simply run:
```bash
./init_and_start_postgres.sh ~/pgdata
```
and your PostgreSQL server will be up and running on port 5432. The only argument is the directory for the database data; you can change it if you want.
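For reference, a rough sketch of the steps the script automates is shown below; the actual `init_and_start_postgres.sh` in the repository is authoritative and may differ in details:

```bash
PGDATA_DIR=$1   # e.g. ~/pgdata

initdb -D "$PGDATA_DIR"                                   # create a fresh data directory
pg_ctl -D "$PGDATA_DIR" -l "$PGDATA_DIR/logfile" start    # start the server on the default port 5432

createuser main_user                                      # user referenced in the db config
createdb main_database --owner=main_user                  # database for the experiment
psql -d main_database -c "CREATE EXTENSION IF NOT EXISTS vector;"   # enable pgvector
```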

### Run the Benchmark
Now that the PGVector extension is installed and enabled in PostgreSQL, you can run the benchmark. First, make sure the `pgvector_db_config.txt` file contains the correct database configuration. By default it is:

```
dbname: main_db
user: main_user
password: 123456
host: localhost
port: 5432
```
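These are ordinary PostgreSQL connection parameters. As a quick connectivity check (independent of the benchmark script, which may read the file differently), you can parse the file and open a connection with, for example, psycopg2:

```python
import psycopg2  # pip install psycopg2-binary

def load_db_config(path):
    """Parse the simple 'key: value' config format shown above."""
    config = {}
    with open(path) as f:
        for line in f:
            if ':' in line:
                key, value = line.split(':', 1)
                config[key.strip()] = value.strip()
    return config

cfg = load_db_config('./scripts/vectordb_benchmark/pgvector_db_config.txt')
conn = psycopg2.connect(
    dbname=cfg['dbname'], user=cfg['user'], password=cfg['password'],
    host=cfg['host'], port=cfg['port'],
)
print('connected to', conn.get_dsn_parameters()['dbname'])
conn.close()
```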

Then, you can run the benchmark by running the following command.

```
$ python3 vectordb_benchmark/run_benchmark.py \
--index_name='msmarco-v1-passage.bge-base-en-v1.5' \
--table_name='msmarco' \
--metric='ip' \
--query_index_path='collections/msmarco-passage/queries_faiss' \
--db_type='pgvector' \
--db_config_file='./scripts/vectordb_benchmark/pgvector_db_config.txt'
```
Alternatively, run the script `./scripts/vectordb_benchmark/benchmark_msmarco_pgvector.sh`.

Note that after a run, your PostgreSQL instance will still contain the table data; the current behaviour is to drop the table and index, if they exist, when the benchmark starts. Later, we will add an option to skip table creation and index building, so that the benchmark can be run multiple times without re-creating the table and index every time.
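For context, the pgvector side of the experiment boils down to SQL along the following lines. This is an illustration with toy 3-dimensional vectors and made-up table and index names, not the exact statements issued by `run_benchmark.py` (the real tables hold 768-dimensional bge-base-en-v1.5 vectors):

```sql
DROP TABLE IF EXISTS msmarco_demo;
CREATE TABLE msmarco_demo (id TEXT, content TEXT, vector vector(3));
INSERT INTO msmarco_demo VALUES ('d1', 'example passage', '[1, 0, 0]');

-- HNSW index for inner-product search; vector_l2_ops and vector_cosine_ops cover the other metrics.
CREATE INDEX msmarco_demo_ip_idx ON msmarco_demo USING hnsw (vector vector_ip_ops);

-- <#> is negative inner product, so ascending order returns the largest inner products first.
SELECT id, -(vector <#> '[1, 0, 0]') AS score
FROM msmarco_demo
ORDER BY vector <#> '[1, 0, 0]'
LIMIT 10;
```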

# Results
To view the output of the benchmark, check the `msmarco_benchmark_results.txt` file in the `scripts/vectordb_benchmark` folder. It contains the total time, as well as the mean, variance, minimum, and maximum time to run a single query against the HNSW index built in the vector database, plus the ndcg@10 result and the verbose output of the TREC evaluation tool. The raw TREC evaluation output is in the file `trec_dot_product_output.txt` in the top-level directory.

# NFCorpus

## Data Prep
Similar to the onboarding docs, we must first download the NFCorpus Dataset.

```bash
wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip -P collections
unzip collections/nfcorpus.zip -d collections
```

## 1. Encode the Corpus
Create a directory for document embeddings and encode the corpus using the specified encoder.

```bash
mkdir indexes/faiss-nfcorpus
mkdir indexes/faiss-nfcorpus/documents
python -m pyserini.encode \
input --corpus collections/nfcorpus/corpus.jsonl \
output --embeddings indexes/faiss-nfcorpus/documents \
encoder --encoder BAAI/bge-base-en-v1.5 --l2-norm \
--device cpu \
--pooling mean \
--batch 32
```

## 2. Encode the Queries
Create a directory for query embeddings and encode the queries using the specified encoder.

```bash
mkdir indexes/faiss-nfcorpus/queries
python -m pyserini.encode \
input --corpus collections/nfcorpus/queries.jsonl \
output --embeddings indexes/faiss-nfcorpus/queries \
encoder --encoder BAAI/bge-base-en-v1.5 --l2-norm \
--device cpu \
--pooling mean \
--batch 32
```

## 3. Run Benchmarks

```bash
python3 ./scripts/vectordb_benchmark/benchmark_nfcorpus_duckdb.py
python3 ./scripts/vectordb_benchmark/benchmark_nfcorpus_pgvector.py
```
Empty file modified scripts/msmarco-passage/encode_queries.py
100644 → 100755
Empty file.
7 changes: 7 additions & 0 deletions scripts/vectordb_benchmark/benchmark_msmarco_duckdb.sh
@@ -0,0 +1,7 @@
python3 ./run_benchmark.py \
--index_name='msmarco-v1-passage.bge-base-en-v1.5' \
--table_name='msmarco' \
--metric='ip' \
--query_index_path='../../collections/msmarco-passage/queries_faiss' \
--db_type='duckdb' \
--db_config_file='duckdb_db_config.txt'
7 changes: 7 additions & 0 deletions scripts/vectordb_benchmark/benchmark_msmarco_pgvector.sh
@@ -0,0 +1,7 @@
python3 ./run_benchmark.py \
--index_name='msmarco-v1-passage.bge-base-en-v1.5' \
--table_name='msmarco' \
--metric='ip' \
--query_index_path='../../collections/msmarco-passage/queries_faiss' \
--db_type='pgvector' \
--db_config_file='pgvector_db_config.txt'
134 changes: 134 additions & 0 deletions scripts/vectordb_benchmark/benchmark_nfcorpus_duckdb.py
@@ -0,0 +1,134 @@
import json
import duckdb
import numpy as np
import subprocess
import time

# Paths to embedding, query, and output files
DOCUMENT_JSONL_FILE_PATH = '../../indexes/faiss-nfcorpus/documents/embeddings.jsonl'
QUERY_JSONL_FILE_PATH = '../../indexes/faiss-nfcorpus/queries/embeddings.jsonl'
TREC_DOT_PRODUCT_OUTPUT_FILE_PATH = '../../runs/.run-faiss-nfcorpus-result_dot_product.txt'
TREC_COSINE_OUTPUT_FILE_PATH = '../../runs/.run-faiss-nfcorpus-result_cosine.txt'
TREC_L2SQ_OUTPUT_FILE_PATH = '../../runs/.run-faiss-nfcorpus-result_l2sq.txt'
K = 10  # Number of nearest neighbors to retrieve
RUN_ID = "DuckDBHNSW"  # Identifier for the run

def get_vector_size(jsonl_file_path):
    """Determines the size of the vector, assuming all vectors have the same dimension."""
    with open(jsonl_file_path, 'r') as file:
        for line in file:
            data = json.loads(line)
            vector = data.get('vector', [])
            return len(vector)
    return 0

def insert_data_into_table(con, doc_id, content, vector, table):
    """Inserts a single document into the DuckDB table."""
    con.execute(f"INSERT INTO {table} (id, content, vector) VALUES (?, ?, ?)", (doc_id, content, vector))

def setup_database():
    """Sets up the DuckDB database and inserts document data."""
    con = duckdb.connect(database=':memory:')
    con.execute("INSTALL vss")
    con.execute("LOAD vss")
    con.execute("PRAGMA temp_directory='/tmp/duckdb_temp'")
    con.execute("PRAGMA memory_limit='4GB'")

    vector_size = get_vector_size(DOCUMENT_JSONL_FILE_PATH)
    print(f"Vector size: {vector_size}")

    # Create documents table
    con.execute(f"""
        CREATE TABLE documents (
            id STRING,
            content STRING,
            vector FLOAT[{vector_size}]
        )
    """)

    # Insert data from JSONL file
    with open(DOCUMENT_JSONL_FILE_PATH, 'r') as file:
        for line in file:
            data = json.loads(line)
            insert_data_into_table(con, data['id'], data['contents'], data['vector'], 'documents')

    # Create HNSW indices with different metrics and print the time taken for each build
    start_time = time.time()
    con.execute("CREATE INDEX l2sq_idx ON documents USING HNSW(vector) WITH (metric = 'l2sq')")
    print('building l2sq index: ', time.time() - start_time)
    start_time = time.time()
    con.execute("CREATE INDEX cos_idx ON documents USING HNSW(vector) WITH (metric = 'cosine')")
    print('building cosine index: ', time.time() - start_time)
    start_time = time.time()
    con.execute("CREATE INDEX ip_idx ON documents USING HNSW(vector) WITH (metric = 'ip')")
    print('building ip index: ', time.time() - start_time)

    return con

def run_trec_eval(trec_output_file_path):
    """Runs TREC evaluation and prints ndcg@10."""
    command = [
        "python", "-m", "pyserini.eval.trec_eval",
        "-c", "-m", "ndcg_cut.10",
        "../../collections/nfcorpus/qrels/test.qrels",
        trec_output_file_path
    ]
    print("ndcg@10 for ", trec_output_file_path)
    subprocess.run(command)

def run_benchmark(con, trec_output_file_path, metric):
    """Runs the benchmark and writes results in TREC format."""
    query_times = []
    with open(trec_output_file_path, 'w') as trec_file:
        with open(QUERY_JSONL_FILE_PATH, 'r') as query_file:
            for line in query_file:
                data = json.loads(line)
                query_id = data['id']
                vector = data['vector']

                # Select the scoring function and sort order for the metric:
                # distances are better when smaller, similarities when larger.
                if metric == 'l2sq':
                    evaluation_metric = 'array_distance'
                    order = 'ASC'
                elif metric == 'cosine':
                    evaluation_metric = 'array_cosine_similarity'
                    order = 'DESC'
                elif metric == 'ip':
                    evaluation_metric = 'array_inner_product'
                    order = 'DESC'
                else:
                    raise ValueError(f"Unknown metric: {metric}")

                sql_query = (
                    f"SELECT id, {evaluation_metric}(vector, ?::FLOAT[{len(vector)}]) as score "
                    f"FROM documents ORDER BY score {order} LIMIT ?"
                )
                # Time the execution
                start_time = time.time()
                results = con.execute(sql_query, (vector, K)).fetchall()
                end_time = time.time()

                # Calculate the time for this query and add it to the list
                query_time = end_time - start_time
                query_times.append(query_time)

                # Write results in TREC format; trec_eval ranks by descending score,
                # so negate l2sq distances to keep the closest documents ranked first.
                for rank, (doc_id, score) in enumerate(results, start=1):
                    if metric == 'l2sq':
                        score = -score
                    trec_file.write(f"{query_id} Q0 {doc_id} {rank} {score} {RUN_ID}\n")

    print(f"TREC results written to {trec_output_file_path}")
    run_trec_eval(trec_output_file_path)

    # Aggregate statistics
    total_time = sum(query_times)
    mean_time = np.mean(query_times)
    variance_time = np.var(query_times)
    min_time = min(query_times)
    max_time = max(query_times)
    return total_time, mean_time, variance_time, min_time, max_time

if __name__ == "__main__":
    con = setup_database()

    # Running the benchmarks
    print('l2sq: ', run_benchmark(con, TREC_L2SQ_OUTPUT_FILE_PATH, 'l2sq'))
    print('cosine: ', run_benchmark(con, TREC_COSINE_OUTPUT_FILE_PATH, 'cosine'))
    print('ip: ', run_benchmark(con, TREC_DOT_PRODUCT_OUTPUT_FILE_PATH, 'ip'))

    # second run
    print("second run")
    print('l2sq: ', run_benchmark(con, TREC_L2SQ_OUTPUT_FILE_PATH, 'l2sq'))
    print('cosine: ', run_benchmark(con, TREC_COSINE_OUTPUT_FILE_PATH, 'cosine'))
    print('ip: ', run_benchmark(con, TREC_DOT_PRODUCT_OUTPUT_FILE_PATH, 'ip'))