Too large minhashLSH index #207
There is an option to use Redis as the storage layer, which is also in-memory, but you can use a cluster. Currently there is no support for an on-disk index. What's your usage scenario? Is it possible to pre-partition the billion documents in some way to avoid a single LSH index over all the documents?
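A minimal sketch of what the Redis-backed setup looks like (the `storage_config` layout follows my reading of the datasketch storage docs; the host/port are placeholders for your own deployment):

```python
from datasketch import MinHash, MinHashLSH

# The LSH hash tables live in Redis instead of the Python process, so a
# Redis cluster can hold an index that doesn't fit on one machine.
# Host/port below are placeholders.
lsh = MinHashLSH(
    threshold=0.5,
    num_perm=128,
    storage_config={
        "type": "redis",
        "redis": {"host": "localhost", "port": 6379},
    },
)

m = MinHash(num_perm=128)
for token in "an example document".split():
    m.update(token.encode("utf8"))

lsh.insert("doc-0", m)
print(lsh.query(m))  # -> ["doc-0"]
```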
Thank you for your suggestion. I think a distributed database is the way to go. I have tried Redis on a single machine; it takes more memory than the naive in-memory index, maybe due to the extra persistence. My usage scenario is to perform global deduplication over billions of documents. If I pre-partition the data, it is not global deduplication anymore. But I guess I could use some heuristic to cluster the data and then perform an approximate global deduplication. What do you think?
People have been using the LSH hash keys for partitioning/bucketizing documents for deduplication. See: https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/
They used this library. You can read their paper about how they did it. The key is that if you select the right hyperparameters/threshold setting, documents in the same LSH hash bucket have a high probability of being duplicates. So you can think of some batch algorithm rather than storing everything in memory. To do this you probably want to use the …
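A rough sketch of that batch idea (this is not the exact procedure from the paper; the banding scheme, the use of `hashvalues`, and the grouping step are my own assumptions): compute each MinHash once, derive one key per band, and group documents that share a key, so only a (band key → doc ids) mapping has to be kept, and even that can be spilled to disk or a shuffle step.

```python
import hashlib
from collections import defaultdict
from datasketch import MinHash

NUM_PERM = 128
BANDS = 32                   # assumed banding: 32 bands x 4 rows
ROWS = NUM_PERM // BANDS

def band_keys(m):
    """Yield one bucket key per band from the raw MinHash values."""
    hv = m.hashvalues
    for b in range(BANDS):
        chunk = hv[b * ROWS:(b + 1) * ROWS].tobytes()
        yield b, hashlib.sha1(chunk).hexdigest()[:16]

docs = {  # toy corpus
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumps over the lazy cat",
    "d3": "completely different text here",
}

buckets = defaultdict(list)  # (band, key) -> candidate doc ids
for doc_id, text in docs.items():
    m = MinHash(num_perm=NUM_PERM)
    for tok in text.split():
        m.update(tok.encode("utf8"))
    for band, key in band_keys(m):
        buckets[(band, key)].append(doc_id)

# Documents sharing any bucket are duplicate *candidates*; verify them with
# the Jaccard estimate (or an exact comparison) before dropping anything.
candidates = {tuple(sorted(ids)) for ids in buckets.values() if len(ids) > 1}
print(candidates)
```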
I'm trying to do something similar to what is explained in the paper. However, they say that they calculate MinHashes of the vectorized documents, but the …
@ZJaume for one-hot encoded vectors, you want to hash the indices of the dimensions with value 1. MinHash is a lossy compression of a set.
Do you mean doing something like this?

```python
from sklearn.feature_extraction.text import HashingVectorizer
from datasketch import MinHash

text = ['This is an example text']

vectorizer = HashingVectorizer()
msh = MinHash()

# Sparse document-term matrix; hash the column indices of the non-zero entries.
X = vectorizer.fit_transform(text)
for i in X[0].indices:
    msh.update(str(i).encode())
```
Yep. Just make sure … You may want to test different ways of binary encoding your indices to minimize latency on this critical path. Also …
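For example, a quick comparison of two encodings of the same indices (my own micro-benchmark sketch; absolute numbers will depend on your machine and datasketch version):

```python
import timeit
from datasketch import MinHash

indices = list(range(100_000))

def ascii_encode(m, idxs):
    for i in idxs:
        m.update(str(i).encode())          # variable-length ASCII digits

def fixed_width_encode(m, idxs):
    for i in idxs:
        m.update(i.to_bytes(4, "little"))  # fixed 4-byte little-endian

for fn in (ascii_encode, fixed_width_encode):
    t = timeit.timeit(lambda: fn(MinHash(num_perm=128), indices), number=3)
    print(f"{fn.__name__}: {t:.2f}s")
```

Whichever encoding you pick, it just has to be applied consistently, since MinHash only sees the bytes. If your installed datasketch version has `MinHash.update_batch`, feeding all encoded indices in one call may also reduce per-call overhead; worth checking against your version.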
Can you share with me how to do that?
I made some modifications that let you build an index that only stores one of the buckets. It is in my fork: https://github.com/ZJaume/datasketch/tree/partitions
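I haven't read the fork's code, so this is only my own sketch of the band-at-a-time idea it suggests: run one pass per band, so at any moment only that band's hash table is in memory, and the MinHashes themselves can be streamed from disk on each pass.

```python
from collections import defaultdict

NUM_PERM, BANDS = 128, 32
ROWS = NUM_PERM // BANDS

def candidates_for_band(band, minhash_stream):
    """Bucket documents by a single band; only this band's table is in memory.

    `minhash_stream` yields (doc_id, MinHash) pairs, e.g. read from disk.
    """
    table = defaultdict(list)
    for doc_id, m in minhash_stream:
        key = m.hashvalues[band * ROWS:(band + 1) * ROWS].tobytes()
        table[key].append(doc_id)
    return [ids for ids in table.values() if len(ids) > 1]

# The union of candidate groups over all BANDS passes approximates the result
# of one big LSH index, at the cost of re-reading the MinHashes BANDS times.
```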
Hi, we encountered a similar scenario: there are hundreds of millions of text documents to process. So far we have tried a Spark cluster, but it didn't work well. Has anyone come up with a better plan in the meantime?
Hi, I have a question about large-scale LSH indexes. If I have billions of documents, I suppose even 1 TB of RAM is not enough for an in-memory LSH index. Is there any recommended way to use datasketch for this scenario? Thank you.
I also opened issue #206 because, for a small subset on my local machine (a 6 GB pickle file containing pre-computed MinHashes), inserting into MinHashLSH with a threshold of 0.5 takes 31 GB of RAM. With LeanMinHash it takes 26 GB. A simple extrapolation then suggests that for 600 GB of pre-computed MinHashes, the indexing process will take about 3 TB of RAM. This is just too much. Maybe mapping it to disk could be a viable solution. Looking forward to any suggestions, thank you.
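For what it's worth, the extrapolation above is just a linear scale-up of the observed blow-up factor (assuming index memory grows roughly linearly with the number of MinHashes):

```python
# Observed: a 6 GB pickle of pre-computed MinHashes -> ~31 GB RAM once
# inserted into MinHashLSH at threshold 0.5 (~26 GB with LeanMinHash).
overhead = 31 / 6                 # ~5.2x blow-up from pickle size to index size
projected_gb = 600 * overhead     # ~3100 GB for the full 600 GB of MinHashes
print(f"~{overhead:.1f}x overhead, ~{projected_gb / 1000:.1f} TB projected")
```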