I get this error when running a simple pipeline with MinhashBuildIndex:
  File "/tmp/ray/session_2025-01-28_00-11-53_869524_544/runtime_resources/pip/423edced06de87e59e89d62d04d06ae3a96700c1/virtualenv/lib/python3.11/site-packages/datatrove/executor/base.py", line 109, in _run_for_rank
    raise e
  File "/tmp/ray/session_2025-01-28_00-11-53_869524_544/runtime_resources/pip/423edced06de87e59e89d62d04d06ae3a96700c1/virtualenv/lib/python3.11/site-packages/datatrove/executor/base.py", line 90, in _run_for_rank
    pipelined_data = pipeline_step(pipelined_data, rank, self.world_size)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-01-28_00-11-53_869524_544/runtime_resources/pip/423edced06de87e59e89d62d04d06ae3a96700c1/virtualenv/lib/python3.11/site-packages/datatrove/pipeline/base.py", line 119, in __call__
    return self.run(data, rank, world_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-01-28_00-11-53_869524_544/runtime_resources/pip/423edced06de87e59e89d62d04d06ae3a96700c1/virtualenv/lib/python3.11/site-packages/datatrove/pipeline/dedup/minhash.py", line 679, in run
    pq = [next(sig_reader) for sig_reader in sig_readers]
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-01-28_00-11-53_869524_544/runtime_resources/pip/423edced06de87e59e89d62d04d06ae3a96700c1/virtualenv/lib/python3.11/site-packages/datatrove/pipeline/dedup/minhash.py", line 679, in <listcomp>
    pq = [next(sig_reader) for sig_reader in sig_readers]
          ^^^^^^^^^^^^^^^^
StopIteration
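For anyone trying to reproduce this outside of datatrove: the failing line calls `next()` on every signature reader without a default, so my guess (an assumption, I have not confirmed it against the datatrove source) is that at least one reader yields no records at all, for example because a signature file is empty. Here is a minimal pure-Python sketch of that failure mode and one possible guard. `read_sigs`, `first_records`, and `first_records_safe` are hypothetical names for illustration only, not datatrove APIs.

```python
from typing import Iterator, List


def read_sigs(records: List[int]) -> Iterator[int]:
    """Hypothetical stand-in for a signature reader: yields records one by one."""
    yield from records


def first_records(readers: List[Iterator[int]]) -> List[int]:
    # Mirrors the pattern in the traceback: next() without a default
    # raises StopIteration as soon as one reader is empty.
    return [next(reader) for reader in readers]


def first_records_safe(readers: List[Iterator[int]]) -> List[int]:
    # Passing a default to next() avoids the exception;
    # readers that yield nothing are simply skipped.
    heads = (next(reader, None) for reader in readers)
    return [head for head in heads if head is not None]
```

With one empty reader in the list, `first_records` raises `StopIteration` just like the traceback, while `first_records_safe` returns the first record of each non-empty reader. Whether skipping empty readers is the right behavior for the index builder is a separate question.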
Though actually, it seems like index creation is more efficient as a byproduct of MinhashDedupBuckets? Building the index with MinhashBuildIndex and building it via MinhashDedupBuckets should be equivalent, but MinhashDedupBuckets is far more parallelized and thus could be faster, even though it also has to identify the duplicates.