Obtain hashvalues by key from MinHashLSH #186
MinHashLSH is designed for looking up keys given hashvalues (i.e., a MinHash), but it does not natively support the reverse lookup. I think a simple dictionary mapping key -> hashvalues would be helpful.
@ekzhu can you give an example of that code? I spent some time on it and couldn't work out the right way to do it. I'd appreciate your help.
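For reference, the key -> hashvalues dictionary suggested above could look roughly like the sketch below. KeyedHashvalues and demo are hypothetical names, not part of datasketch. The wrapper only assumes that the index has insert(key, minhash) and that the MinHash object exposes a hashvalues array, which is how datasketch's MinHashLSH and MinHash behave as far as I know. The mapping lives in process memory, so it is not persisted alongside a Cassandra-backed index.

```python
class KeyedHashvalues:
    """Hypothetical wrapper: an LSH index plus a key -> hashvalues dictionary."""

    def __init__(self, lsh):
        self.lsh = lsh      # any index object with insert(key, minhash)
        self._by_key = {}   # key -> tuple of ints (the full MinHash signature)

    def insert(self, key, minhash):
        # Insert into the index first, so a rejected key is not recorded.
        self.lsh.insert(key, minhash)
        # Copy the values as plain ints so later changes to the MinHash
        # object do not alter what was recorded for this key.
        self._by_key[key] = tuple(int(v) for v in minhash.hashvalues)

    def hashvalues(self, key):
        """Return the stored hash values for key (raises KeyError if unknown)."""
        return self._by_key[key]


def demo():
    # Assumes datasketch is installed; imported here so that the class
    # above has no hard dependency on it.
    from datasketch import MinHash, MinHashLSH

    index = KeyedHashvalues(MinHashLSH(threshold=0.5, num_perm=128))
    m = MinHash(num_perm=128)
    for token in ("minhash", "lsh", "reverse", "lookup"):
        m.update(token.encode("utf8"))
    index.insert("doc-1", m)
    return index.hashvalues("doc-1")  # tuple of 128 ints
```

Because the dictionary stores the full signature, it returns all num_perm values regardless of how many of them the index itself ends up using.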
@ekzhu I wrote code that recovers the original MinHash values of a document from MinHashLSH (Cassandra storage), but I ran into a frustrating (maybe expected) issue: not all items from the array of MinHash permutations are stored in the LSH index, regardless of whether Cassandra storage is used. The conclusions are as follows:
The script used for testing the above behavior is here. The root cause lies in the _optimal_param function of the MinHashLSH class. Thanks, and sorry if this is the expected behavior of the LSH implementation (I didn't have a chance to dive deep into it).
@ekzhu your advice would be highly appreciated. PING
Thanks for the interesting benchmark. Yes, your observation is correct. LSH requires a fixed band size, so if the optimizer returns a band size that doesn't divide the number of permutation functions evenly, some minimum hash values will be lost. If we were to constrain the optimization space to band sizes that are integer divisors of num_perm, we would get a less accurate index on average.
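To make the loss concrete, here is a small sketch of the arithmetic. It assumes bands are taken as consecutive slices of r values starting at the beginning of the hashvalues array, which is my understanding of how the index lays them out; band_ranges and unused_hashvalues are hypothetical helper names.

```python
def band_ranges(b, r):
    """(start, end) slice bounds for b consecutive bands of r hash values each."""
    return [(i * r, (i + 1) * r) for i in range(b)]


def unused_hashvalues(num_perm, b, r):
    """Number of trailing hash values that fall outside every band."""
    if b * r > num_perm:
        raise ValueError("b * r must not exceed num_perm")
    return num_perm - b * r


print(band_ranges(12, 10)[-1])          # (110, 120)
print(unused_hashvalues(128, 12, 10))   # 8
print(unused_hashvalues(128, 8, 16))    # 0
```

Under that assumption, a 12 x 10 layout with num_perm = 128 never stores positions 120 to 127, while 8 x 16 covers the whole array.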
@ekzhu understood. So if I use 8 bands with 16 items each, the accuracy will be a little lower in this case, right?
If your num_perm = 128 and you use 8 x 16, you will be using all the hashvalues, but your accuracy may not be better than with 12 x 10. This depends on the threshold you use and the type of data you have. Since we cannot predict what data you are going to put in the index, the best-effort optimization is performed with the threshold only.
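The trade-off can be illustrated with a threshold-only search over (b, r). This is a sketch of that kind of optimization, not the library's exact code: it minimizes an equally weighted sum of the false-positive and false-negative areas of the banding curve 1 - (1 - s^r)^b, using a simple midpoint-rule integral. All function names and the 0.5 weights are assumptions made for illustration.

```python
def s_curve(s, b, r):
    """Probability that two sets with Jaccard similarity s collide in at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


def integrate(f, lo, hi, steps=100):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


def weighted_error(threshold, b, r, fp_weight=0.5, fn_weight=0.5):
    # False positives: pairs below the threshold that still collide.
    fp = integrate(lambda s: s_curve(s, b, r), 0.0, threshold)
    # False negatives: pairs above the threshold that never collide.
    fn = integrate(lambda s: 1.0 - s_curve(s, b, r), threshold, 1.0)
    return fp_weight * fp + fn_weight * fn


def best_params(threshold, num_perm, divisors_only=False):
    """Return (error, b, r) minimizing the weighted error with b * r <= num_perm."""
    best = None
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            if divisors_only and b * r != num_perm:
                continue
            err = weighted_error(threshold, b, r)
            if best is None or err < best[0]:
                best = (err, b, r)
    return best


free = best_params(0.5, 128)                        # any b * r <= 128
exact = best_params(0.5, 128, divisors_only=True)   # only b * r == 128
print("unconstrained:", free)
print("divisors only:", exact)
```

The divisors-only search is a subset of the unconstrained one, so its best error can never be lower. How much accuracy is actually given up depends on the threshold, which matches the point above.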
@ekzhu understood, thx
A similar discussion regarding _optimal_param: #200
@pavelnemirovsky True. One option to consider is refactoring the hyper-parameter optimization out of MinHashLSH so that users can choose which objective function they would like to use.
Guys, I am desperately looking for a way to obtain the hash values as an array of ints for a given key. Any direction? Thanks in advance,
P