We now have generated data for several data stages for which we can't compute manifests yet.
This issue therefore lists all the data stages for which new aggregating functions should be implemented and made available as an option in compute_manifest.py.
The types are the following:
text-reuse (based on the text-reuse passages)
news-agencies (same schema as entities, they will have exactly the same keys and same aggregations, just a different manifest name)
topics (based on s3://42-processed-data-final/topics/)
article embeddings (based on s3://42-processed-data-final/embeddings/articles)
page embeddings (based on s3://42-processed-data-final/embeddings/pages)
linguistic processing (based on s3://42-processed-data-final/lingproc)
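Since news-agencies is described above as sharing the entities schema and aggregations, one way to wire this up is a registry that maps several stage names to the same aggregating function. The following is only a minimal sketch: `AGGREGATORS`, `aggregate_entities`, `compute_stats` and the `mentions` key are hypothetical names chosen for illustration, not the actual contents of compute_manifest.py.

```python
from typing import Any, Callable, Dict, Iterable

Record = Dict[str, Any]
Aggregator = Callable[[Iterable[Record]], Dict[str, int]]


def aggregate_entities(records: Iterable[Record]) -> Dict[str, int]:
    """Hypothetical aggregator: counts records and the mentions they contain.

    The "mentions" key is an assumption, not the real entities schema.
    """
    stats = {"content_items": 0, "mentions": 0}
    for record in records:
        stats["content_items"] += 1
        stats["mentions"] += len(record.get("mentions", []))
    return stats


# Hypothetical registry: stages with the same schema share one aggregator,
# so only the stage (and manifest) name differs.
AGGREGATORS: Dict[str, Aggregator] = {
    "entities": aggregate_entities,
    "news-agencies": aggregate_entities,
}


def compute_stats(stage: str, records: Iterable[Record]) -> Dict[str, int]:
    """Look up the aggregator registered for a data stage and apply it."""
    try:
        aggregator = AGGREGATORS[stage]
    except KeyError:
        raise ValueError(f"No aggregator registered for data stage {stage!r}") from None
    return aggregator(records)
```

With such a registry, supporting a new data stage amounts to adding one entry, and stages without an aggregator fail with an explicit error.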
Note that for all data stages which are part of data processing, we can have multiple versions which all stem from the same input data, and have been generated at the same time, simply with different models or parameters. As a result, in order to prevent confusion inside the impresso-data-release repository, the full s3 partition inside the bucket will also be used as path within the repo.
E.g.:
Topics have three different types of outputs, for French, English and German, each in its own s3 partition (e.g. s3://42-processed-data-final/topics/tm-de-all-v2024.08.29/). The relative path within the git repo for this generated manifest would then be: data-processing/topics/tm-de-all-v2024.08.29/topics_v*-*-*.json.
Optionally, it will be possible to define this relative path explicitly (for example, to make it simpler). It will then be necessary to pay attention to the value used for this git relative path and make sure that it stays consistent from one run to the next; by default, it is the s3 partition.
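The path rule described above could be sketched roughly as follows. This is an illustration under assumptions: the function name `manifest_repo_dir`, its parameters, and the `data-processing` default are hypothetical and only mirror the example given in this issue.

```python
from typing import Optional
from urllib.parse import urlparse


def manifest_repo_dir(
    s3_partition: str,
    stage_group: str = "data-processing",
    override: Optional[str] = None,
) -> str:
    """Return the directory, relative to the git repo root, for a manifest.

    By default the s3 partition (the key prefix inside the bucket) is reused
    as the path; an explicit override replaces it entirely.
    """
    if override is not None:
        # The caller is responsible for keeping this value consistent
        # from one run to the next.
        return override.strip("/")

    parsed = urlparse(s3_partition)
    if parsed.scheme != "s3":
        raise ValueError(f"Expected an s3:// URI, got {s3_partition!r}")

    # parsed.netloc is the bucket name; parsed.path is the partition.
    partition = parsed.path.strip("/")
    return f"{stage_group}/{partition}"
```

For the German topics partition above, this yields `data-processing/topics/tm-de-all-v2024.08.29`, under which the `topics_v*-*-*.json` manifest would be stored.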
Neither embeddings nor linguistic processing are ready to be versioned yet (only some titles are present on S3).
As a result, I'll merge the current branch, which contains some bugfixes and new aggregators, and will repeat the process when adding the embeddings aggregators.
Aggregators for new data stages are added iteratively as the data is ready.
Once the Polar night release is fully done and all manifests have been generated, this issue can be closed.
Currently, the stages for which aggregating functions were added are:
Text-reuse
Topics
news agencies
image embeddings
linguistic processing
Still missing stages are:
stages relating to embeddings (word/doc/entities etc)
stages relating to solr ingestion
e-maud
Title changed from "Add aggregators for all missing data stages" to "Manifests: Add aggregators for all missing data stages" on Jan 16, 2025.