Manifests: Add aggregators for all missing data stages #8

Open
4 of 6 tasks
piconti opened this issue Oct 8, 2024 · 2 comments

piconti commented Oct 8, 2024

We have now generated data for several data stages for which we cannot yet compute manifests.
This issue therefore lists all the data stages that need new aggregating functions, each of which should be added to compute_manifest.py as an option.

The types are the following:

  • text-reuse (based on the text-reuse passages)
  • news-agencies (same schema as entities, they will have exactly the same keys and same aggregations, just a different manifest name)
  • topics (based on s3://42-processed-data-final/topics/)
  • article embeddings (based on s3://42-processed-data-final/embeddings/articles)
  • page embeddings (based on s3://42-processed-data-final/embeddings/pages)
  • linguistic processing (based on s3://42-processed-data-final/lingproc)
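To illustrate how a stage such as news-agencies could reuse the entities aggregation under a different manifest name, here is a minimal sketch. Everything in it is an assumption made for illustration: the `AGGREGATORS` registry, the `aggregate_entities` and `compute_stats` functions, and the `ci_id` / `mentions` record keys are hypothetical and do not reflect the actual contents of compute_manifest.py.

```python
from typing import Any, Callable, Dict, Iterable, Mapping

Record = Mapping[str, Any]
Aggregator = Callable[[Iterable[Record]], Dict[str, int]]


def aggregate_entities(records: Iterable[Record]) -> Dict[str, int]:
    """Hypothetical aggregator: count distinct content items and mentions.

    The keys "ci_id" and "mentions" are placeholders, not the real schema.
    """
    records = list(records)
    return {
        "content_items": len({r["ci_id"] for r in records}),
        "mentions": sum(len(r.get("mentions", [])) for r in records),
    }


# Hypothetical registry: one aggregating function per data stage.
# news-agencies shares the entities schema, so it reuses the same function
# and only differs by the stage (manifest) name.
AGGREGATORS: Dict[str, Aggregator] = {
    "entities": aggregate_entities,
    "news-agencies": aggregate_entities,
}


def compute_stats(stage: str, records: Iterable[Record]) -> Dict[str, int]:
    """Look up the aggregator for a stage and apply it to the records."""
    try:
        aggregator = AGGREGATORS[stage]
    except KeyError:
        raise ValueError(f"No aggregator implemented for data stage {stage!r}") from None
    return aggregator(records)
```

With a layout like this, adding one of the missing stages would amount to writing its aggregating function and registering it under the stage name.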

Note that for all data stages that are part of data processing, there can be multiple versions that all stem from the same input data and were generated at the same time, simply with different models or parameters. To prevent confusion inside the impresso-data-release repository, the full S3 partition inside the bucket will therefore also be used as the path within the repo.
E.g.:
topics have three different types of outputs, for French, English and German, each in its own S3 partition (e.g. s3://42-processed-data-final/topics/tm-de-all-v2024.08.29/). The relative path within the git repo for this generated manifest would then be: data-processing/topics/tm-de-all-v2024.08.29/topics_v*-*-*.json.

Optionally, it will be possible to define this relative path explicitly (for example, to make it simpler). It will then be necessary to pay attention to the value used for this git relative path and make sure it stays consistent from one run to the next; if it is not set, it defaults to the S3 partition.
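The path rule described above can be sketched roughly as follows. This is only an illustration under assumptions: the function name `manifest_repo_dir`, its parameters, and the choice to keep the `data-processing` prefix in front of an explicit override are all hypothetical, not the actual implementation.

```python
from typing import Optional
from urllib.parse import urlparse


def manifest_repo_dir(
    s3_partition: str,
    stage_group: str = "data-processing",
    relative_path: Optional[str] = None,
) -> str:
    """Return the directory inside the git repo where a manifest would be stored.

    By default the path of the S3 partition inside the bucket is reused.
    If relative_path is given, it replaces that partition path
    (assumption: the stage_group prefix is kept in both cases).
    """
    if relative_path is not None:
        partition = relative_path.strip("/")
    else:
        # For "s3://bucket/a/b/", urlparse yields netloc="bucket", path="/a/b/".
        partition = urlparse(s3_partition).path.strip("/")
    return f"{stage_group}/{partition}"


# Default: the S3 partition is used as the path within the repo.
default_dir = manifest_repo_dir(
    "s3://42-processed-data-final/topics/tm-de-all-v2024.08.29/"
)

# Explicit override: the same value must be passed on every run to stay consistent.
custom_dir = manifest_repo_dir(
    "s3://42-processed-data-final/topics/tm-de-all-v2024.08.29/",
    relative_path="topics/de",
)
```

Here `default_dir` is `data-processing/topics/tm-de-all-v2024.08.29`, which matches the directory in the example above; the `topics/de` override is an arbitrary value chosen for illustration.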

@piconti piconti self-assigned this Oct 8, 2024

piconti commented Oct 9, 2024

Neither the embeddings nor linguistic processing are ready to be versioned yet (only some titles are present on S3).
As a result, I'll merge the current branch, which contains some bugfixes and adds new aggregators, and will repeat the process when including the embeddings aggregators.

@piconti piconti mentioned this issue Oct 9, 2024

piconti commented Jan 15, 2025

Aggregators for new data stages are added iteratively as the data becomes ready.
Once the Polar night release is fully done and all manifests have been generated, this issue can be closed.

Currently, the stages for which aggregating functions were added are:

  • Text-reuse
  • Topics
  • news agencies
  • image embeddings
  • linguistic processing

The stages still missing are:

  • stages relating to embeddings (word/doc/entities etc)
  • stages relating to solr ingestion

@e-maud e-maud changed the title Add aggregators for all missing data stages Manifests: Add aggregators for all missing data stages Jan 16, 2025