Some files on Solr are actually empty (often 14B), which can cause problems with Dask.
It would be useful to have some general-purpose functions for sanity-checking the contents of S3 buckets and similar storage.
This was discussed with @e-maud who already has some functions/code snippets that can be repurposed for everybody.
*jsonl.bz2 files without content should be exactly 14 bytes, which is the size of a valid bz2 stream with no payload. When you open such a file with a bz2 reader (the bz2 module or smart_open), it behaves just as if you had opened an empty file with the normal open.
Note that a really empty file is not a valid bz2 file. So seeing exactly 14 bytes is a good sign.
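To make the 14-byte observation concrete, here is a minimal sketch using only the standard bz2 module. The helper name `is_empty_bz2` is hypothetical and is not part of the existing codebase; it operates on bytes already in memory, so fetching the object from S3 is left out.

```python
import bz2

# Compressing zero bytes still produces a stream header and footer.
empty_archive = bz2.compress(b"")


def is_empty_bz2(data: bytes) -> bool:
    """Return True if `data` is a valid bz2 stream that holds no payload.

    Hypothetical helper for illustration only.
    """
    # An empty bz2 stream is 14 bytes; anything else cannot be one.
    if len(data) != 14:
        return False
    try:
        return bz2.decompress(data) == b""
    except (OSError, ValueError):
        # 14 bytes that are not a valid bz2 stream.
        return False


print(len(empty_archive))           # size of an empty bz2 stream
print(is_empty_bz2(empty_archive))  # a valid, empty archive
print(is_empty_bz2(b""))            # a zero-byte file is not a valid archive
```

A size check alone is cheap enough to run over an object listing, while the decompression step guards against a 14-byte file that happens to contain something else.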
The main issue is probably for processing steps that somehow need to create some values for empty files (e.g. manifest returning a 0 on that newspaper/year). But the code should be robust to deal with such situations anyway.
A function remove_corrupted_files has been created and added to compute_manifest.py. This function should eventually be moved to utils.py and extended to perform a fuller, more detailed check.
In particular, based on Simon's response, it is to be expected that some archives will be empty.
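One possible shape for such a fuller check is sketched below. It is not the existing remove_corrupted_files implementation: the function name `classify_bz2_file`, its return labels, and the assumption that the file is available on the local filesystem are all illustrative. The idea is to distinguish valid but empty archives (expected) from archives that cannot be read at all.

```python
import bz2
import os


def classify_bz2_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Classify a local .bz2 file as "ok", "empty", or "corrupted".

    Hypothetical helper for illustration only.
    """
    # A zero-byte file is not a valid bz2 archive.
    if os.path.getsize(path) == 0:
        return "corrupted"
    has_content = False
    try:
        with bz2.open(path, "rb") as fh:
            # Read the whole stream in chunks so truncation is detected
            # without loading the entire file into memory.
            while fh.read(chunk_size):
                has_content = True
    except (OSError, EOFError):
        # OSError: invalid bz2 data; EOFError: stream ended prematurely.
        return "corrupted"
    return "ok" if has_content else "empty"
```

Downstream steps such as the manifest computation could then skip or count "empty" archives explicitly and only drop the ones reported as "corrupted".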