Manifests: Add aggregators for all missing data stages #8

piconti · 2024-10-08T14:42:53Z

We now have generated data for several data stages for which we cannot yet compute manifests.
This issue lists all the data stages that need a new aggregating function, each of which should be added to compute_manifest.py as an option.

The types are the following:

text-reuse (based on the text-reuse passages)
news-agencies (same schema as entities, they will have exactly the same keys and same aggregations, just a different manifest name)
topics (based on s3://42-processed-data-final/topics/)
article embeddings (based on s3://42-processed-data-final/embeddings/articles)
page embeddings (based on s3://42-processed-data-final/embeddings/pages)
linguistic processing (based on s3://42-processed-data-final/lingproc)
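
To illustrate how a stage such as news-agencies could reuse the entities aggregation under a different manifest name, here is a minimal sketch. It is not the actual content of compute_manifest.py: the names `AGGREGATORS`, `aggregate_entities` and `aggregate`, as well as the `n_mentions` key, are hypothetical and only serve the example.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Mapping

# Hypothetical record type: one mapping of numeric fields per content item.
Record = Mapping[str, int]
Aggregator = Callable[[Iterable[Record]], Dict[str, int]]


def aggregate_entities(records: Iterable[Record]) -> Dict[str, int]:
    """Hypothetical aggregation: count items and sum a per-item mention count."""
    stats: Counter = Counter()
    for record in records:
        stats["content_items"] += 1
        # "n_mentions" is an assumed key, not taken from the real schema.
        stats["mentions"] += record.get("n_mentions", 0)
    return dict(stats)


# Stage name -> aggregator. news-agencies points at the same function as
# entities, so only the stage (and thus manifest) name differs.
AGGREGATORS: Dict[str, Aggregator] = {
    "entities": aggregate_entities,
    "news-agencies": aggregate_entities,
}


def aggregate(stage: str, records: Iterable[Record]) -> Dict[str, int]:
    """Look up the aggregator registered for a stage and apply it."""
    try:
        aggregator = AGGREGATORS[stage]
    except KeyError:
        raise ValueError(f"no aggregator registered for data stage {stage!r}") from None
    return aggregator(records)
```

With a layout like this, supporting a new stage would amount to writing one function and adding one entry to the mapping.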

Note that data stages which are part of data processing can have multiple versions, all stemming from the same input data and generated at the same time, only with different models or parameters. To prevent confusion inside the impresso-data-release repository, the full s3 partition inside the bucket will therefore also be used as the path within the repo.
E.g.:
Topics have three different types of outputs, for French, English and German, each in its own s3 partition (e.g. s3://42-processed-data-final/topics/tm-de-all-v2024.08.29/). The relative path within the git repo for this generated manifest would then be: data-processing/topics/tm-de-all-v2024.08.29/topics_v*-*-*.json.

Optionally, it will be possible to define this relative path explicitly (for example, to make it simpler). It will then be necessary to pay attention to the value used for this git relative path and to make sure it stays consistent from one run to the next. By default, it is the s3 partition.
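
The path rule described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `manifest_repo_dir` and its parameters are hypothetical, and the `data-processing` prefix is taken from the example path given earlier rather than from the actual code.

```python
from typing import Optional
from urllib.parse import urlparse


def manifest_repo_dir(
    s3_partition: str,
    stage_root: str = "data-processing",  # assumed prefix, from the example above
    override: Optional[str] = None,
) -> str:
    """Return the directory of a manifest relative to the git repo root.

    Defaults to the s3 partition (the path inside the bucket); an explicit
    override, if given, is used as-is.
    """
    if override is not None:
        # The caller is responsible for reusing the same override every time.
        return override.strip("/")

    parsed = urlparse(s3_partition)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3 URI: {s3_partition!r}")

    # parsed.netloc is the bucket name; parsed.path is the partition inside it.
    partition = parsed.path.strip("/")
    return f"{stage_root}/{partition}"


print(manifest_repo_dir("s3://42-processed-data-final/topics/tm-de-all-v2024.08.29/"))
```

Keeping the default tied to the s3 partition means that two versions of the same stage cannot end up sharing a directory by accident; the override trades that guarantee for a shorter path.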


piconti · 2024-10-09T09:09:30Z

Embeddings are not yet ready to be versioned, and neither is linguistic processing (only some titles are present on S3).
As a result, I'll merge the current branch, which contains some bugfixes and new aggregators, and will repeat the process when including the embeddings aggregators.

piconti · 2025-01-15T10:46:03Z

Aggregators for new data stages are added iteratively as the data becomes ready.
Once the Polar night release is fully done and all manifests have been generated, this issue can be closed.

Currently, the stages for which aggregating functions were added are:

Text-reuse
Topics
news agencies
image embeddings
linguistic processing

Still missing stages are:

stages relating to embeddings (word/doc/entities, etc.)
stages relating to solr ingestion

piconti self-assigned this Oct 8, 2024

piconti mentioned this issue Oct 9, 2024

Add aggregators #9

Merged

e-maud changed the title ~~Add aggregators for all missing data stages~~ Manifests: Add aggregators for all missing data stages Jan 16, 2025
