Process-safe, no-mem-bloat implementation of LSH #231
Currently there is no built-in support for parallelization. In the past (before LLMs) I used Celery to parallelize my data processing jobs. It worked great for me. You can check out how I used it in the findopendata project, which performs data sketching on public datasets. Of course, this requires quite a bit of work and is much more complex than Python's multiprocessing module. So if you want to use multiprocessing, that is a great first step. I think there is a way to parallelize a specific data processing task, e.g., text deduplication, by first framing it as a multi-stage workflow:
```python
for hashtable in lsh.hashtables:
    for key in hashtable.keys():
        near_duplicate_candidate_ids = hashtable.get(key)
        # Do something with these, e.g., save ordered pairs of
        # candidates into a set for future lookup.
```
Now, for each piece of text, you can look up whether a near duplicate exists. You can probably do something clever here by taking advantage of the order in which the pieces of text are inspected. If you like this approach, please consider submitting a PR that adds your specific working scenario as a workflow in the datasketch library. I can create a separate page to document different workflows built using datasketch.
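For concreteness, here is a minimal end-to-end sketch of that multi-stage workflow. The corpus, threshold, and whitespace tokenization are illustrative assumptions, not part of the library:

```python
from itertools import combinations
from datasketch import MinHash, MinHashLSH

# Illustrative corpus; in practice this would be the millions of
# lines from the input file.
lines = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumped over the lazy dog",
    "an entirely different sentence",
]

# Stage 0: index every line, keyed by line number.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for i, line in enumerate(lines):
    m = MinHash(num_perm=128)
    for token in line.split():
        m.update(token.encode("utf8"))
    lsh.insert(i, m)

# Stage 1: scan each hashtable bucket once and record ordered
# candidate pairs (smaller line number first) in a set.
candidate_pairs = set()
for hashtable in lsh.hashtables:
    for key in hashtable.keys():
        bucket = sorted(hashtable.get(key))
        candidate_pairs.update(combinations(bucket, 2))

# Stage 2: drop a line if it is the second member of any pair,
# i.e. an earlier line is already its near-duplicate candidate.
duplicates = {b for _, b in candidate_pairs}
deduplicated = [line for i, line in enumerate(lines) if i not in duplicates]
```

Stage 1 is the parallelizable part: each worker can scan a disjoint subset of hashtables (or key ranges), and the resulting pair sets can simply be unioned afterwards.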
Thanks a lot for your work :) amazing job!
Are there any plans to create an implementation that can be parallelized across multiple threads (processes in Python)?
More context:
- I have a large file with millions of lines of text; each line is indexed into the LSH, as I'm trying to remove duplicate lines.
- Once I insert all of those lines into the LSH, I'd love to be able to parallelize the deduplication process operating on the same LSH object.
- In each of the workers, I query whether there are candidate sentences, and if there are and their keys differ from the current line's, I remove the current line from the LSH.
I want to do that in parallel using your lib. How? A serial sketch of the loop I have in mind follows.
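To make the loop concrete, here is a serial sketch of what I'd like each worker to do. The names `lsh` (the populated index) and `minhashes` (mapping line number to its precomputed MinHash) are placeholders:

```python
# Serial version of the per-worker loop; `lsh` and `minhashes`
# are assumed to have been built during the insertion phase.
kept = []
for key, m in minhashes.items():
    candidates = lsh.query(m)
    # Keep the line only if its sole candidate is itself;
    # otherwise remove it so later queries no longer see it.
    if any(c != key for c in candidates):
        lsh.remove(key)
    else:
        kept.append(key)
```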
Note: I tried adding a multiprocessing lock to your lib (a quick implementation). That works, but it bloats my memory because the index is copied into each process, so 70 GB quickly turns into 1 TB. Shared memory still requires me to unpickle the object in each process, which again leads to memory bloat.
Is there a way to use Cassandra or Redis to achieve this? I'd just need to synchronize process access to the database. Any hints here? :)
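For reference, datasketch's documentation describes a Redis storage layer, enabled through the `storage_config` parameter of `MinHashLSH`, which keeps the hash tables in the Redis server rather than in each worker's address space. A minimal sketch, assuming a Redis server on localhost:6379 and a made-up index name:

```python
from datasketch import MinHashLSH

# Every worker process constructs an LSH handle with the same
# storage_config and basename to attach to the same shared index.
lsh = MinHashLSH(
    threshold=0.9,
    num_perm=128,
    storage_config={
        "type": "redis",
        "basename": b"line_dedup_lsh",  # hypothetical shared name
        "redis": {"host": "localhost", "port": 6379},
    },
)
```

Since each Redis command executes atomically, concurrent queries from multiple processes should not need an application-level lock, though removals racing with queries may still call for coordination.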