s3 path conventions: Pending decisions on path component sequence #12

Open
simon-clematide opened this issue Nov 20, 2024 · 0 comments · May be fixed by #14

We should provide a schema that can cope with the following data and processing developments:

  • different media, each processed in various ways; the schema should not be specific to any single processing pipeline:

    • images
    • texts
    • audio
  • different types of enrichments and transformations:

    • embeddings
    • enrichments (langident, topics, entities, textreuse clusters...)
  • different models and processing pipelines applied to these media
  • different versions of the runs of these models and pipelines

Current Google document for brainstorming:

Suggestion:

PATH = "s3://" BUCKET "/" PROCESSING_LABEL "/" RUN_ID [ "/" PROCESSING_STEP ] [ "/" PROVIDER_ALIAS ] "/" MEDIA_ALIAS "/" FILE_STEM ".jsonl.bz2" ;
BUCKET = STAGE_NUMBER "-processed-data-" PHASE ;
STAGE_NUMBER = /\d\d/ ;
PHASE = "sandbox" | "staging" | "final" ;
PROCESSING_LABEL = "image-embeddings" | "entity-embeddings" | "entities" | "langident" | "lingproc" | "ocrqa" | "newsagencies" | "topics" | "textreuse" ;
PROCESSING_SUBTYPE_LABEL = "embeddings" | "images" | "entities" | "component" | "vectors" | "clusters" ;     (* @TODO decide: move it after RUN_ID; components => own media bucket, e.g. 32-image-data-final *)
RUN_ID = PROCESSING_LABEL "-" MODEL_ID "_" RUN_VERSION ;                   (* What should be repeated here at the beginning? *)
MODEL_ID = TASK [ "_" SUBTASK ] "-" MODEL_SPECIFITY [ "_" MODEL_VERSION ] "-" LANG  ;
RUN_VERSION = "v" MAJOR "-" MINOR "-" PATCH ;
MAJOR = /\d+/ ;
MINOR = /\d+/ ;
PATCH = /\d+/ ;
TASK = "ner" | "nel" | "tm" | "emb" | "lid" | "pos" ;                      (* |... @TODO *)
SUBTASK = "newsagency" ;                                                   (* |... @TODO *)
MODEL_SPECIFITY = /[A-Za-z][A-Za-z_]*/ ;                                   (* we could allow "-" here, given that "-" LANG is mandatory *)
MODEL_VERSION = "v" MAJOR "." MINOR "." PATCH ;
LANG = "de" | "fr" | "en" | "lb" | "multilingual" | "" ;                   (* what to do with language of images: "XX" *)
PROVIDER_ALIAS = /[A-Za-z]+/ ;
MEDIA_ALIAS = /[A-Za-z][A-Za-z0-9]*/ ;
FILE_STEM = MEDIA_ALIAS "-" YEAR [ "-" PROCESSING_LABEL ] ;
YEAR = /\d\d\d\d/ ;
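
To make the suggestion easier to test against real keys, here is a minimal sketch that transcribes the grammar above into a Python regular expression. It rests on some assumptions. PROCESSING_STEP is not defined in the grammar, so the values of PROCESSING_SUBTYPE_LABEL are used as a stand-in. The function name `parse_path` and the bucket number, model name, and media alias in the usage example are hypothetical. It is meant as a way to probe the open questions, not as a settled implementation.

```python
import re

# Terminals and non-terminals transcribed from the suggested grammar.
LABEL = (r"(?:image-embeddings|entity-embeddings|entities|langident|lingproc"
         r"|ocrqa|newsagencies|topics|textreuse)")
# ASSUMPTION: PROCESSING_STEP is undefined in the grammar; the values of
# PROCESSING_SUBTYPE_LABEL are used here as a placeholder.
STEP = r"(?:embeddings|images|entities|component|vectors|clusters)"
TASK = r"(?:ner|nel|tm|emb|lid|pos)"
SUBTASK = r"newsagency"
MODEL_SPECIFITY = r"[A-Za-z][A-Za-z_]*"   # identifier spelled as in the grammar
MODEL_VERSION = r"v\d+\.\d+\.\d+"
LANG = r"(?:de|fr|en|lb|multilingual|)"   # the grammar allows the empty string
RUN_VERSION = r"v\d+-\d+-\d+"
MODEL_ID = rf"{TASK}(?:_{SUBTASK})?-{MODEL_SPECIFITY}(?:_{MODEL_VERSION})?-{LANG}"
RUN_ID = rf"{LABEL}-{MODEL_ID}_{RUN_VERSION}"
MEDIA_ALIAS = r"[A-Za-z][A-Za-z0-9]*"

PATH_RE = re.compile(
    r"s3://(?P<bucket>(?P<stage>\d\d)-processed-data-(?P<phase>sandbox|staging|final))"
    rf"/(?P<label>{LABEL})"
    rf"/(?P<run_id>{RUN_ID})"
    rf"(?:/(?P<step>{STEP}))?"
    r"(?:/(?P<provider>[A-Za-z]+))?"
    rf"/(?P<media>{MEDIA_ALIAS})"
    rf"/(?P<stem>{MEDIA_ALIAS}-(?P<year>\d\d\d\d)(?:-{LABEL})?)\.jsonl\.bz2"
)


def parse_path(path):
    """Return the named path components as a dict, or None if the path does not conform."""
    match = PATH_RE.fullmatch(path)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Hypothetical key, for illustration only.
    example = ("s3://42-processed-data-final/langident/"
               "langident-lid-ensemble-multilingual_v1-0-0/GDL/GDL-1900.jsonl.bz2")
    print(parse_path(example))
```

Two things show up when experimenting with this sketch. The optional step and provider segments are both plain alphabetic path components, so a single middle segment such as `entities` is read as a step even if it was meant as a provider alias. And an empty LANG yields a run ID containing `-_`, because the preceding "-" is mandatory.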
@piconti piconti self-assigned this Jan 15, 2025
@piconti piconti linked a pull request Jan 15, 2025 that will close this issue