
Process-safe, no mem bloat, implementation of LSH #231

Open
gordicaleksa opened this issue Nov 19, 2023 · 1 comment


gordicaleksa commented Nov 19, 2023

Thanks a lot for your work :) amazing job!

Are there any plans to create an implementation that can be parallelized across multiple threads (processes in Python)?

More context:
I have a large file with millions of lines of text, and each of those lines is indexed into the LSH because I'm trying to remove duplicate lines.

Once I've inserted all of those lines into the LSH, I'd love to be able to parallelize the deduplication process operating on the same LSH object.

In each of the workers I query whether there are candidate sentences, and if there are and their keys differ from the current line's, I remove the current line from the LSH.

How can I do that in parallel using your lib?

Note: I tried adding a multiprocessing lock to your lib (a quick implementation). That works, but it bloats my memory, as the index is copied into each of the processes, so 70 GB quickly turns into 1 TB. Shared memory still requires me to unpickle the object in each process, which again leads to memory bloat.

Is there a way to use Cassandra or Redis to achieve this? I'd just need to synchronize process access to the database. Any hints here? :)
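For reference, datasketch's MinHashLSH does accept a storage_config for keeping the index in an external store such as Redis, which would let worker processes share one index instead of each holding a copy. A minimal, untested sketch; the host and port are assumptions for a local Redis instance:

```python
# Hedged sketch: datasketch's MinHashLSH takes a storage_config dict
# selecting an external storage backend. "localhost"/6379 below are
# assumptions for a default local Redis server, not values from this thread.
storage_config = {
    "type": "redis",
    "redis": {"host": "localhost", "port": 6379},
}

# Usage would look roughly like this (requires a running Redis server):
# from datasketch import MinHashLSH
# lsh = MinHashLSH(threshold=0.8, num_perm=128, storage_config=storage_config)
```

With the index living in Redis, each process only holds a connection, not a copy of the 70 GB structure, though access still needs to be synchronized if workers also delete keys.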

@gordicaleksa gordicaleksa changed the title Thread-safe implementation of LSH Process-safe, no mem bloat, implementation of LSH Nov 21, 2023
ekzhu (Owner) commented Nov 29, 2023

Currently there is no built-in support for parallelization. In the past (before LLMs) I used Celery to parallelize my data processing jobs, and it worked great for me. You can check out how I used it in the findopendata project, which performs data sketching on public datasets.

Of course, that approach requires quite a bit of work and is much more complex than Python's multiprocessing module, so using multiprocessing is a great first step. You can parallelize a specific data processing task, e.g., text deduplication, by first framing it as a multi-stage workflow:

  1. Create MinHashes -- easily parallelized across multiple workers. This is likely the most time-consuming stage, as you are dealing with the raw data.
  2. Join by LSH bands. A simple way to go about it is to use datasketch's MinHashLSH: once you have indexed all the MinHashes, you can iterate over the buckets of each band to find groups of near-duplicate candidates.

     for hashtable in lsh.hashtables:
         for key in hashtable.keys():
             near_duplicate_candidate_IDs = hashtable.get(key)
             # do something with these, e.g., save ordered pairs of
             # candidates into a set for future lookup
  3. Refine the near-duplicate candidates. Because LSH is approximate, you may get false positives, so it is important to verify that each near-duplicate pair was correctly identified. You can use Jaccard similarity or a similar measure to check each pair of texts, and this step parallelizes easily. You can then use the refined output to create a lookup table that maps each text to its near-duplicates.
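The refinement stage above can be sketched in plain Python; the shingle size, threshold, and helper names are my own assumptions for illustration, not part of datasketch:

```python
def shingles(text, k=3):
    # Character k-shingles of a string; a stand-in for whatever
    # tokenization was used to build the MinHashes upstream.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    # Exact Jaccard similarity of two shingle sets.
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def refine(candidate_pairs, texts, threshold=0.8):
    # Drop LSH false positives by checking the exact Jaccard
    # similarity of every candidate pair of texts.
    sh = {key: shingles(t) for key, t in texts.items()}
    return {(x, y) for x, y in candidate_pairs
            if jaccard(sh[x], sh[y]) >= threshold}
```

Since each candidate pair is checked independently, each worker can take a slice of the pairs, so this step parallelizes trivially with no shared state.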

Now for each piece of text, you can look up whether a near duplicate exists. You can probably do something clever here by taking advantage of the order in which the pieces of text are inspected.
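One way to exploit that order (a hypothetical sketch, assuming the near-duplicate lookup table from the previous step): scan the texts in a fixed order and keep only the first member of each near-duplicate group.

```python
def dedupe_in_order(keys, near_dupes):
    # keys: text IDs in scan order.
    # near_dupes: maps each ID to the set of IDs it near-duplicates
    # (the refined lookup table; names here are illustrative).
    kept = set()
    out = []
    for k in keys:
        if near_dupes.get(k, set()) & kept:
            continue  # a near-duplicate of k was already kept
        kept.add(k)
        out.append(k)
    return out
```

Because only membership in `kept` matters, the scan is a single cheap pass; the expensive MinHash and refinement stages stay fully parallel.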
 
Hope this helps.

If you like it, please consider submitting a PR for your specific working scenario as a workflow in the datasketch library. I can create a separate page documenting the different workflows built using datasketch.
