s3 path conventions: Pending decisions on path component sequence #12

simon-clematide · 2024-11-20T11:27:15Z

We should provide a schema that can cope with the following data and processing developments:

- different media that will be processed in various ways (the schema should not be specific to a certain processing pipeline):
  - images
  - texts
  - audio
- different types of enrichments and transformations:
  - embeddings
  - enrichments (langident, topics, entities, textreuse clusters...)
- different models and processing pipelines applied to these media
- different versions for running these models and pipelines

Current Google document for brainstorming:

Suggestion:

PATH = "s3://" BUCKET "/" PROCESSING_LABEL  "/" RUN_ID [ "/" PROCESSING_STEP] [ "/" PROVIDER_ALIAS ]  "/" MEDIA_ALIAS  "/" FILE_STEM ".jsonl.bz2" ;
BUCKET = STAGE_NUMBER "-processed-data-" PHASE ;
STAGE_NUMBER = /\d\d/ ;
PHASE = "sandbox" | "staging" | "final";
PROCESSING_LABEL =  "image-embeddings" | "entity-embeddings" | "entities" | "langident" | "lingproc" | "ocrqa" | "newsagencies" | "topics" | "textreuse" ;  
PROCESSING_SUBTYPE_LABEL = "embeddings" | "images" | "entities" | "component" | "vectors" | "clusters" ;     (* @Todecide: move it after RUN_ID; components => own media bucket, e.g. 32-image-data-final *)
RUN_ID = PROCESSING_LABEL "-" MODEL_ID "_" RUN_VERSION ;                   (* What should be repeated here at the beginning? *)
MODEL_ID = TASK [ "_" SUBTASK ] "-" MODEL_SPECIFITY [ "_" MODEL_VERSION ] "-" LANG  ;
RUN_VERSION = "v" MAJOR "-" MINOR "-" PATCH ;
MAJOR = /\d+/ ;
MINOR = /\d+/ ;
PATCH = /\d+/ ;
TASK = "ner" | "nel" | "tm" | "emb" | "lid" | "pos" ;                      (* |... @TODO  *)
SUBTASK = "newsagency" ;                                                   (* |... @TODO *)
MODEL_SPECIFITY = /[A-Za-z][A-Za-z_]*/ ;                                   (* we could allow "-" here, given that "-" LANG is mandatory *)
MODEL_VERSION = "v" MAJOR "." MINOR "." PATCH ;
LANG = "de" | "fr" | "en" | "lb" | "multilingual" | "" ;                   (* what to do with language of images: "XX" *)
PROVIDER_ALIAS = /[A-Za-z]+/ ;
MEDIA_ALIAS = /[A-Za-z][A-Za-z0-9]*/ ;
FILE_STEM =  MEDIA_ALIAS "-" YEAR [ "-" PROCESSING_LABEL ] ;
YEAR = /\d\d\d\d/ ;
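
To make the suggestion easier to discuss, here is a minimal Python sketch that transcribes the grammar above into a regular expression and parses one path. It is not the project's parser. Several points are assumptions: `PROCESSING_STEP` is not defined in the grammar, so the sketch reuses the `PROCESSING_SUBTYPE_LABEL` values for it; the function name `parse_path`, the group names, and the example path (stage number `42`, provider `PROV`, media alias `ABC`, model `stacked`) are hypothetical.

```python
import re
from typing import Optional

# Terminals transcribed from the grammar in this issue.
PHASE = r"(?:sandbox|staging|final)"
PROCESSING_LABEL = (
    r"(?:image-embeddings|entity-embeddings|entities|langident|lingproc"
    r"|ocrqa|newsagencies|topics|textreuse)"
)
# ASSUMPTION: PROCESSING_STEP is undefined in the grammar; the values of
# PROCESSING_SUBTYPE_LABEL are used here as a stand-in.
PROCESSING_STEP = r"(?:embeddings|images|entities|component|vectors|clusters)"
TASK = r"(?:ner|nel|tm|emb|lid|pos)"
SUBTASK = r"(?:newsagency)"
LANG = r"(?:de|fr|en|lb|multilingual|)"  # the empty alternative is allowed
NUM = r"\d+"
YEAR = r"\d\d\d\d"
MEDIA_ALIAS = r"[A-Za-z][A-Za-z0-9]*"

MODEL_VERSION = rf"v{NUM}\.{NUM}\.{NUM}"
RUN_VERSION = rf"v{NUM}-{NUM}-{NUM}"
MODEL_ID = rf"{TASK}(?:_{SUBTASK})?-[A-Za-z][A-Za-z_]*(?:_{MODEL_VERSION})?-{LANG}"
RUN_ID = rf"{PROCESSING_LABEL}-{MODEL_ID}_{RUN_VERSION}"

PATH_RE = re.compile(
    rf"s3://(?P<bucket>\d\d-processed-data-{PHASE})"
    rf"/(?P<label>{PROCESSING_LABEL})"
    rf"/(?P<run_id>{RUN_ID})"
    rf"(?:/(?P<step>{PROCESSING_STEP}))?"
    rf"(?:/(?P<provider>[A-Za-z]+))?"
    rf"/(?P<media>{MEDIA_ALIAS})"
    rf"/(?P<file_stem>{MEDIA_ALIAS}-{YEAR}(?:-{PROCESSING_LABEL})?)"
    r"\.jsonl\.bz2"
)


def parse_path(path: str) -> Optional[dict]:
    """Return the named path components, or None if the path does not match."""
    match = PATH_RE.fullmatch(path)
    return match.groupdict() if match else None


# Hypothetical example path, built by hand from the grammar.
EXAMPLE = (
    "s3://42-processed-data-final/entities/"
    "entities-ner-stacked_v1.0.0-multilingual_v1-0-2/"
    "PROV/ABC/ABC-1925-entities.jsonl.bz2"
)

if __name__ == "__main__":
    print(parse_path(EXAMPLE))
```

Writing the grammar out this way exposes one of the pending decisions: because `PROCESSING_STEP` and `PROVIDER_ALIAS` are both optional and sit next to `MEDIA_ALIAS`, a single segment after `RUN_ID` could be read as any of the three. The regular expression resolves this only through its matching order, so a fixed component sequence or mandatory segments would make parsing unambiguous. The sketch also does not check that the `PROCESSING_LABEL` and `MEDIA_ALIAS` values repeated in `RUN_ID` and `FILE_STEM` agree with the directory components.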

piconti self-assigned this Jan 15, 2025

piconti linked a pull request Jan 15, 2025 that will close this issue

S3pathparser and contribution readme.md #14

Open
