Out of memory when creating customized dense index from a large collection #1238
-
Hi, I am trying to create a dense index from a large document collection (roughly 40 GB). I made minor modifications to the sample code from "Guide to indexing and searching English documents" on the Pyserini main page:
python -m pyserini.encode input --corpus /home/dataset_2021/jsonls \
--fields text \
output --embeddings /home/dataset_2021/indexes/2021_den_0701 \
--to-faiss \
encoder --encoder castorini/tct_colbert-v2-hnp-msmarco \
--fields text \
--batch 16
However, every time I run the code, the process is killed automatically when the progress bar reaches about 4%, because my machine runs out of memory (64 GB of RAM in total). Is there any way to reduce the memory usage, or am I doing something wrong? Thanks!
Replies: 2 comments
-
I am not sure whether building the dense index really requires this much memory. Since 64 GB only gets through 4% of the process, a linear extrapolation suggests I would need about 64*25 = 1600 GB of RAM to build the dense index for my 40 GB document collection.
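As a rough cross-check on that extrapolation, the size of the raw embeddings can be estimated as number of vectors times dimensions times 4 bytes. The sketch below assumes float32 vectors with 768 dimensions (what a BERT-base encoder such as TCT-ColBERT would be expected to produce) and a hypothetical 10 million passages, since the actual passage count of the collection is not stated in this thread. `embedding_gib` is an illustrative helper, not part of Pyserini.

```shell
#!/bin/sh
# Rough size of the raw embeddings alone: vectors * dims * 4 bytes (float32).
# 768 dims and the 10M passage count are assumptions, not figures from this thread.
set -eu

# Usage: embedding_gib <num_vectors> <dims>; prints the size in GiB.
embedding_gib() {
  LC_ALL=C awk -v n="$1" -v d="$2" \
    'BEGIN { printf "%.1f\n", n * d * 4 / (1024 * 1024 * 1024) }'
}

embedding_gib 10000000 768   # prints 28.6
```

This only counts the vectors themselves. Peak usage during encoding can be higher, so the number is a lower bound to compare against the 64 GB available, not a prediction.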
-
Hi @dayuyang1999! Thanks for your interest in our framework, and sorry for keeping you waiting so long.
When encoding requires a large amount of memory, we usually shard the indexing process by adding
--shard-id $current_id --shard-num $total
under the input argument. Then perform retrieval on each sub-index independently and finally merge the retrieval results. (You can also merge the sub-indexes, but that would also require a large amount of memory.)
An example of sharded indexing can be found at https://github.com/castorini/pyserini#dense-indexes. The retrieval command stays the same (just change the path to point to each sub-index).
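The sharding and merging steps described above could be scripted roughly as follows. This is a minimal sketch, not an official Pyserini recipe: the corpus path, output prefix, and encoder are copied from the question, while `SHARD_NUM=4`, the `shard-N` output directories, the `RUN` switch, and the `merge_runs` helper are illustrative assumptions. The merge step also assumes that each sub-index was searched separately and produced a TREC-format run file (`qid Q0 docid rank score tag`), and it relies on the general-numeric `g` sort flag as found in GNU sort.

```shell
#!/bin/sh
# Sketch only: paths and encoder come from the question above; SHARD_NUM,
# the shard-N directory layout, RUN, and merge_runs are illustrative assumptions.
set -eu

CORPUS=/home/dataset_2021/jsonls
OUT_PREFIX=/home/dataset_2021/indexes/2021_den_0701
SHARD_NUM=${SHARD_NUM:-4}   # hypothetical; pick it so one shard fits in RAM
RUN=${RUN:-0}               # 0 = print the commands, 1 = execute them

# Build (and optionally run) the encode command for a single shard.
encode_shard() {
  shard_id=$1
  set -- python -m pyserini.encode \
    input --corpus "$CORPUS" --fields text \
          --shard-id "$shard_id" --shard-num "$SHARD_NUM" \
    output --embeddings "$OUT_PREFIX/shard-$shard_id" --to-faiss \
    encoder --encoder castorini/tct_colbert-v2-hnp-msmarco --fields text --batch 16
  if [ "$RUN" = 1 ]; then "$@"; else echo "$*"; fi
}

# Merge per-shard run files (assumed TREC format: qid Q0 docid rank score tag)
# into one ranking, keeping the k highest-scoring hits per query.
merge_runs() {
  k=$1
  shift
  cat "$@" \
    | LC_ALL=C sort -k1,1 -k5,5gr \
    | awk -v k="$k" '$1 != q { q = $1; r = 0 }
                     r < k   { r++; print $1, $2, $3, r, $5, $6 }'
}

# Encode the shards one after another so only one is in memory at a time.
i=0
while [ "$i" -lt "$SHARD_NUM" ]; do
  encode_shard "$i"
  i=$((i + 1))
done
```

By default the script only prints the four encode commands, and setting `RUN=1` executes them. Once every sub-index has been searched, something like `merge_runs 1000 run.shard-*.txt > run.merged.txt` (file names hypothetical) would produce the combined ranking. Merging by score should be reasonable here because all shards use the same encoder and similarity function, so their scores are on the same scale.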