Too large minhashLSH index #207
There is an option to use Redis as the storage layer, which is also in-memory, but you can use a cluster. Currently there is no support for an on-disk index. What's your usage scenario? Is it possible to pre-partition the billion documents in some way to avoid a single LSH index over all the documents?
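A minimal sketch of what the Redis-backed setup looks like (the `storage_config` layout follows my reading of the datasketch storage docs; the host/port are placeholders for your own deployment):

```python
from datasketch import MinHash, MinHashLSH

# The LSH hash tables live in Redis instead of the Python process, so a
# Redis cluster can hold an index that doesn't fit on one machine.
# Host/port below are placeholders.
lsh = MinHashLSH(
    threshold=0.5,
    num_perm=128,
    storage_config={
        "type": "redis",
        "redis": {"host": "localhost", "port": 6379},
    },
)

m = MinHash(num_perm=128)
for token in "an example document".split():
    m.update(token.encode("utf8"))

lsh.insert("doc-0", m)
print(lsh.query(m))  # -> ["doc-0"]
```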
Thank you for your suggestion. I think a distributed database is the way to go. I have tried Redis on a single machine; it takes more memory than the naive in-memory index, maybe due to the extra persistence. My usage scenario is to perform global deduplication over billions of documents. If I pre-partition the data, it is not global deduplication anymore. But I guess I could use some heuristic to cluster the data and then perform an approximate global deduplication. What do you think?
People have been using the LSH hash keys for partitioning/bucketizing documents for deduplication. See: https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/
They used this library. You can read their paper about how they did it. The key is that if you select the right hyperparameters/threshold setting, documents in the same LSH hash bucket have a high probability of being duplicates. So you can think of some batch algorithm rather than storing everything in memory. To do this you probably want to use the …
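A rough sketch of that batch idea (this is not the exact procedure from the paper; the banding scheme, the use of `hashvalues`, and the grouping step are my own assumptions): compute each MinHash once, derive one key per band, and group documents that share a key, so only a (band key → doc ids) mapping has to be kept, and even that can be spilled to disk or a shuffle step.

```python
import hashlib
from collections import defaultdict
from datasketch import MinHash

NUM_PERM = 128
BANDS = 32                   # assumed banding: 32 bands x 4 rows
ROWS = NUM_PERM // BANDS

def band_keys(m):
    """Yield one bucket key per band from the raw MinHash values."""
    hv = m.hashvalues
    for b in range(BANDS):
        chunk = hv[b * ROWS:(b + 1) * ROWS].tobytes()
        yield b, hashlib.sha1(chunk).hexdigest()[:16]

docs = {  # toy corpus
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumps over the lazy cat",
    "d3": "completely different text here",
}

buckets = defaultdict(list)  # (band, key) -> candidate doc ids
for doc_id, text in docs.items():
    m = MinHash(num_perm=NUM_PERM)
    for tok in text.split():
        m.update(tok.encode("utf8"))
    for band, key in band_keys(m):
        buckets[(band, key)].append(doc_id)

# Documents sharing any bucket are duplicate *candidates*; verify them with
# the Jaccard estimate (or an exact comparison) before dropping anything.
candidates = {tuple(sorted(ids)) for ids in buckets.values() if len(ids) > 1}
print(candidates)
```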
I'm trying to do something similar to what is explained in the paper. However, they say that they calculate MinHashes of the vectorized documents, but the …
@ZJaume for one-hot encoded vectors, you want to hash the indices of the dimensions with value 1. MinHash is a lossy compression of a set.
Do you mean doing something like this?

```python
from sklearn.feature_extraction.text import HashingVectorizer
from datasketch import MinHash

text = ['This is an example text']

vectorizer = HashingVectorizer()
msh = MinHash()

# Sparse document-term matrix; hash the column indices of the non-zero entries.
X = vectorizer.fit_transform(text)
for i in X[0].indices:
    msh.update(str(i).encode())
```
Yep. Just make sure … You may want to test different ways of binary encoding your indices to minimize latency on this critical path. Also …
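For example, a quick comparison of two encodings of the same indices (my own micro-benchmark sketch; absolute numbers will depend on your machine and datasketch version):

```python
import timeit
from datasketch import MinHash

indices = list(range(100_000))

def ascii_encode(m, idxs):
    for i in idxs:
        m.update(str(i).encode())          # variable-length ASCII digits

def fixed_width_encode(m, idxs):
    for i in idxs:
        m.update(i.to_bytes(4, "little"))  # fixed 4-byte little-endian

for fn in (ascii_encode, fixed_width_encode):
    t = timeit.timeit(lambda: fn(MinHash(num_perm=128), indices), number=3)
    print(f"{fn.__name__}: {t:.2f}s")
```

Whichever encoding you pick, it just has to be applied consistently, since MinHash only sees the bytes. If your installed datasketch version has `MinHash.update_batch`, feeding all encoded indices in one call may also reduce per-call overhead; worth checking against your version.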
Can you share with me how to do that?
I made some modifications that let you build an index that only stores one of the buckets. It is in my fork: https://github.com/ZJaume/datasketch/tree/partitions
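I haven't read the fork's code, so this is only my own sketch of the band-at-a-time idea it suggests: run one pass per band, so at any moment only that band's hash table is in memory, and the MinHashes themselves can be streamed from disk on each pass.

```python
from collections import defaultdict

NUM_PERM, BANDS = 128, 32
ROWS = NUM_PERM // BANDS

def candidates_for_band(band, minhash_stream):
    """Bucket documents by a single band; only this band's table is in memory.

    `minhash_stream` yields (doc_id, MinHash) pairs, e.g. read from disk.
    """
    table = defaultdict(list)
    for doc_id, m in minhash_stream:
        key = m.hashvalues[band * ROWS:(band + 1) * ROWS].tobytes()
        table[key].append(doc_id)
    return [ids for ids in table.values() if len(ids) > 1]

# The union of candidate groups over all BANDS passes approximates the result
# of one big LSH index, at the cost of re-reading the MinHashes BANDS times.
```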
Hi, we encountered a similar scenario: there are hundreds of millions of text documents to process. So far we have tried a Spark cluster, but it didn't work well. Has anyone come up with a better plan in the meantime?
Hi, I have a question about large-scale LSH indexes. If I have billions of documents, I suppose even 1 TB of RAM is not enough for an in-memory LSH index. Is there any recommended way to use datasketch for this scenario? Thank you.
I also opened issue #206 because, for a small subset on my local machine (a 6 GB pickle file containing pre-computed MinHashes), inserting into MinHashLSH with a threshold of 0.5 takes 31 GB of RAM. With LeanMinHash it takes 26 GB. A simple extrapolation then suggests that for 600 GB of pre-computed MinHashes, the indexing process will take about 3 TB of RAM. This is just too much. Maybe mapping it to disk could be a viable solution. Looking forward to any suggestions, thank you.
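For what it's worth, the extrapolation above is just a linear scale-up of the observed blow-up factor (assuming index memory grows roughly linearly with the number of MinHashes):

```python
# Observed: a 6 GB pickle of pre-computed MinHashes -> ~31 GB RAM once
# inserted into MinHashLSH at threshold 0.5 (~26 GB with LeanMinHash).
overhead = 31 / 6                 # ~5.2x blow-up from pickle size to index size
projected_gb = 600 * overhead     # ~3100 GB for the full 600 GB of MinHashes
print(f"~{overhead:.1f}x overhead, ~{projected_gb / 1000:.1f} TB projected")
```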