
Process-safe, no mem bloat, implementation of LSH #231

Open
gordicaleksa opened this issue Nov 19, 2023 · 1 comment


gordicaleksa commented Nov 19, 2023

Thanks a lot for your work :) amazing job!

Are there any plans to create an implementation that can be parallelized across multiple threads (processes in Python)?

More context:
I have a large file with millions of lines of text, and each of those lines is indexed into the LSH because I'm trying to remove duplicate lines.

Once I've inserted all of those lines into the LSH, I'd love to be able to parallelize the deduplication process operating on the same LSH object.

In each of the workers I query whether there are candidate sentences, and if there are and their keys differ from the current line's, I remove the current line from the LSH.

How can I do that in parallel using your lib?

Note: I tried adding a multiprocessing lock to your lib (a quick implementation). That works, but it bloats my memory, as the index is copied into each of the processes, so 70 GB quickly turns into 1 TB. Shared memory still requires me to unpickle the object in each process, which again leads to memory bloat.

Is there a way to use Cassandra or Redis to achieve this? I'd just need to synchronize process access to the database. Any hints here? :)
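For reference, datasketch's MinHashLSH does accept a storage_config for keeping the index in an external store such as Redis, which would let worker processes share one index instead of each holding a copy. A minimal, untested sketch; the host and port are assumptions for a local Redis instance:

```python
# Hedged sketch: datasketch's MinHashLSH takes a storage_config dict
# selecting an external storage backend. "localhost"/6379 below are
# assumptions for a default local Redis server, not values from this thread.
storage_config = {
    "type": "redis",
    "redis": {"host": "localhost", "port": 6379},
}

# Usage would look roughly like this (requires a running Redis server):
# from datasketch import MinHashLSH
# lsh = MinHashLSH(threshold=0.8, num_perm=128, storage_config=storage_config)
```

With the index living in Redis, each process only holds a connection, not a copy of the 70 GB structure, though access still needs to be synchronized if workers also delete keys.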

@gordicaleksa gordicaleksa changed the title Thread-safe implementation of LSH Process-safe, no mem bloat, implementation of LSH Nov 21, 2023
ekzhu (Owner) commented Nov 29, 2023

Currently there is no built-in support for parallelization. In the past (before LLMs) I used Celery to parallelize my data processing jobs, and it worked great for me. You can check out how I used it in the findopendata project, which performs data sketching on public datasets.

Of course, that approach requires quite a bit of work and is much more complex than Python's multiprocessing module, so using multiprocessing is a great first step. You can parallelize a specific data processing task, e.g., text deduplication, by first framing it as a multi-stage workflow:

  1. Create MinHashes -- easily parallelized across multiple workers. This is likely the most time-consuming stage, as you are dealing with the raw data.
  2. Join by LSH bands. A simple way to go about it is to use datasketch's MinHashLSH: once you have indexed all the MinHashes, you can iterate over the buckets of each band to find groups of near-duplicate candidates.

     for hashtable in lsh.hashtables:
         for key in hashtable.keys():
             near_duplicate_candidate_IDs = hashtable.get(key)
             # do something with these, e.g., save ordered pairs of
             # candidates into a set for future lookup
  3. Refine the near-duplicate candidates. Because LSH is approximate, you may get false positives, so it is important to verify that each near-duplicate pair was correctly identified. You can use Jaccard similarity or a similar measure to check each pair of texts, and this step parallelizes easily. You can then use the refined output to create a lookup table that maps each text to its near-duplicates.
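The refinement stage above can be sketched in plain Python; the shingle size, threshold, and helper names are my own assumptions for illustration, not part of datasketch:

```python
def shingles(text, k=3):
    # Character k-shingles of a string; a stand-in for whatever
    # tokenization was used to build the MinHashes upstream.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    # Exact Jaccard similarity of two shingle sets.
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def refine(candidate_pairs, texts, threshold=0.8):
    # Drop LSH false positives by checking the exact Jaccard
    # similarity of every candidate pair of texts.
    sh = {key: shingles(t) for key, t in texts.items()}
    return {(x, y) for x, y in candidate_pairs
            if jaccard(sh[x], sh[y]) >= threshold}
```

Since each candidate pair is checked independently, each worker can take a slice of the pairs, so this step parallelizes trivially with no shared state.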

Now for each piece of text, you can look up whether a near duplicate exists. You can probably do something clever here by taking advantage of the order in which the pieces of text are inspected.
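One way to exploit that order (a hypothetical sketch, assuming the near-duplicate lookup table from the previous step): scan the texts in a fixed order and keep only the first member of each near-duplicate group.

```python
def dedupe_in_order(keys, near_dupes):
    # keys: text IDs in scan order.
    # near_dupes: maps each ID to the set of IDs it near-duplicates
    # (the refined lookup table; names here are illustrative).
    kept = set()
    out = []
    for k in keys:
        if near_dupes.get(k, set()) & kept:
            continue  # a near-duplicate of k was already kept
        kept.add(k)
        out.append(k)
    return out
```

Because only membership in `kept` matters, the scan is a single cheap pass; the expensive MinHash and refinement stages stay fully parallel.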
 
Hope this helps.

If you like it, please consider submitting a PR for your specific working scenario as a workflow in the datasketch library. I can create a separate page documenting the different workflows built using datasketch.
