Out of memory when creating customized dense index from a large collection #1238

dayuyang1999 · 2022-07-06T17:36:19Z

Hi,

I am trying to create a dense index from a large document collection (roughly 40GB).

I made minor modifications to the sample code from "Guide to indexing and searching English documents" on the Pyserini main page:

python -m pyserini.encode input   --corpus /home/dataset_2021/jsonls \
                                  --fields text \
                          output  --embeddings /home/dataset_2021/indexes/2021_den_0701 \
                                  --to-faiss \
                          encoder --encoder castorini/tct_colbert-v2-hnp-msmarco \
                                  --fields text \
                                  --batch 16

However, every time I run the command, the process is killed automatically when the progress bar reaches about 4%, because my machine runs out of memory (64 GB of RAM in total).

Is there any way to reduce the memory usage, or am I doing something wrong?

Thanks!

Answered by crystina-z

Jul 21, 2022

Hi @dayuyang1999! Thanks for your interest in our framework. Sorry for keeping you waiting so long.

When encoding requires a large amount of memory, we usually shard the indexing process by adding --shard-id $current_id --shard-num $total under the input argument. We then perform retrieval on each sub-index independently and finally merge the retrieval results. (You can also merge the sub-indexes, but that would also require a large amount of memory.)
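
As a rough illustration of the sharding step, the sketch below builds one encode command per shard and runs them one after another. It reuses the flags and paths from the original question plus the --shard-id and --shard-num flags mentioned above. The shard count of 4, the `_shardNN` output naming, and the helper names are assumptions made for the example, not something Pyserini prescribes.

```python
import shlex
import subprocess
import sys


def build_encode_command(shard_id, shard_num, corpus, output_prefix, encoder, batch=16):
    """Return the argv list that encodes one shard of the corpus."""
    if not 0 <= shard_id < shard_num:
        raise ValueError("shard_id must satisfy 0 <= shard_id < shard_num")
    return [
        sys.executable, "-m", "pyserini.encode",
        "input", "--corpus", corpus,
        "--fields", "text",
        "--shard-id", str(shard_id),
        "--shard-num", str(shard_num),
        # Hypothetical naming scheme: one output directory per shard.
        "output", "--embeddings", f"{output_prefix}_shard{shard_id:02d}",
        "--to-faiss",
        "encoder", "--encoder", encoder,
        "--fields", "text",
        "--batch", str(batch),
    ]


def encode_all_shards(shard_num, dry_run=True, **kwargs):
    """Build the per-shard commands; print them (dry run) or run them sequentially."""
    commands = [build_encode_command(i, shard_num, **kwargs) for i in range(shard_num)]
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=True stops at the first shard that fails.
            subprocess.run(cmd, check=True)
    return commands


# Dry run with an assumed shard count of 4; pass dry_run=False to actually encode.
encode_all_shards(
    shard_num=4,
    dry_run=True,
    corpus="/home/dataset_2021/jsonls",
    output_prefix="/home/dataset_2021/indexes/2021_den_0701",
    encoder="castorini/tct_colbert-v2-hnp-msmarco",
)
```

Running the shards sequentially means only one encoding process is alive at a time; the right number of shards depends on the collection and the machine, so it would need to be tuned.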

An example of sharded indexing can be found at https://github.com/castorini/pyserini#dense-indexes. The retrieval command stays the same (just change the path to each sub-index).
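
Merging the per-shard retrieval results can be sketched in plain Python. The example assumes that, for a given query, each sub-index has already produced a list of (docid, score) pairs, and that scores are comparable across shards because every shard was encoded with the same model. The function name and the sample hits are made up for illustration.

```python
import heapq


def merge_shard_hits(per_shard_hits, k=10):
    """Merge per-shard hit lists for one query into a global top-k.

    per_shard_hits: list with one entry per sub-index, each a list of
    (docid, score) pairs where a higher score means a better match.
    """
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    # nlargest returns the k best hits sorted by descending score.
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])


# Made-up hits from two sub-indexes for the same query.
shard0_hits = [("doc3", 71.2), ("doc7", 69.8)]
shard1_hits = [("doc12", 70.5), ("doc15", 64.1)]

top3 = merge_shard_hits([shard0_hits, shard1_hits], k=3)
print(top3)  # [('doc3', 71.2), ('doc12', 70.5), ('doc7', 69.8)]
```

For the merged top-k to be exact, each sub-index has to return at least k hits per query (or all of its hits if it has fewer).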

dayuyang1999 · 2022-07-21T21:58:40Z

dayuyang1999
Jul 21, 2022
Author

I am not sure whether building the dense index really requires such a massive amount of memory.

Since 64 GB of RAM only gets through 4% of the process, it seems that I would need 64 × 25 = 1600 GB of RAM to build the dense index for my 40 GB document collection.
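
The extrapolation above can be written out next to a back-of-envelope size for the raw vectors alone. This is only a sketch under stated assumptions: memory is assumed to grow linearly with progress, each vector is assumed to be a 768-dimensional float32 embedding (the BERT-base output size), and the count of 10 million passages is a placeholder, since the thread does not say how many passages the collection contains.

```python
def extrapolated_peak_gb(ram_gb, percent_done):
    """Linear extrapolation: RAM exhausted at percent_done percent of the run."""
    return ram_gb * 100 / percent_done


def flat_vectors_gib(num_vectors, dim=768, bytes_per_value=4):
    """Size in GiB of num_vectors uncompressed float32 vectors (dim is an assumption)."""
    return num_vectors * dim * bytes_per_value / 1024**3


# Figures from the post: 64 GB of RAM exhausted at about 4% progress.
print(extrapolated_peak_gb(64, 4))  # 1600.0

# Placeholder passage count; substitute the real number for the collection.
print(round(flat_vectors_gib(10_000_000), 1))
```

Comparing the two numbers for the real passage count gives a sense of how much of the extrapolated total the stored vectors themselves could account for.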
