PATH="s3://"BUCKET"/"PROCESSING_LABEL"/"RUN_ID [ "/"PROCESSING_STEP] [ "/"PROVIDER_ALIAS ] "/"MEDIA_ALIAS"/"FILE_STEM".jsonl.bz2" ;
BUCKET=STAGE_NUMBER"-processed-data-"PHASE ;
STAGE_NUMBER= /\d\d/ ;
PHASE="sandbox"|"staging"|"final";
PROCESSING_LABEL="image-embeddings"|"entity-embeddings"|"entities"|"langident"|"lingproc"|"ocrqa"|"newsagencies"|"topics"|"textreuse" ;
PROCESSING_SUBTYPE_LABEL="embeddings"|"images"|"entities"|"component"|"vectors"|"clusters" ; (* @Todecide: move it after RUN_ID; components => Own Media bucket; e.g. 32-image-data-final *)
RUN_ID=PROCESSING_LABEL"-"MODEL_ID"_"RUN_VERSION ; (* What should be repeated here at the begin? *)
MODEL_ID=TASK [ "_"SUBTASK ] "-"MODEL_SPECIFITY [ "_"MODEL_VERSION ] "-"LANG ;
RUN_VERSION="v"MAJOR"-"MINOR"-"PATCH ;
MAJOR= /\d+/ ;
MINOR= /\d+/ ;
PATCH= /\d+/ ;
TASK="ner"|"nel"|"tm"|"emb"|"lid"|"pos" ; (* |... @TODO *)
SUBTASK="newsagency" ; (* |... @TODO *)
MODEL_SPECIFITY= /[A-Za-z][A-Za-z_]*/ ; (* we could allow "-" here, given that "-" LANG is mandatory *)
MODEL_VERSION="v"MAJOR"."MINOR"."PATCH ;
LANG="de"|"fr"|"en"|"lb"|"multilingual"|"" ; (* what to do with language of images: "XX" *)
PROVIDER_ALIAS= /[A-Za-z]+/ ;
MEDIA_ALIAS= /[A-Za-z][A-Za-z0-9]*/ ;
FILE_STEM=MEDIA_ALIAS"-"YEAR [ "-"PROCESSING_LABEL ] ;
YEAR= /\d\d\d\d/ ;
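As a rough illustration of how the grammar above could be checked in practice, the following Python sketch translates the rules into a regular expression and builds one path from its parts. It rests on several assumptions: `PROCESSING_STEP` is not defined in the grammar, so that optional segment is left out; the helper name `build_path`, the constant `PATH_RE`, and the example values (`42`, `GDL`, `lid-ensemble-multilingual`) are hypothetical and used only for illustration.

```python
import re

# Terminal and non-terminal rules, transcribed from the grammar above.
PHASE = r"(?:sandbox|staging|final)"
BUCKET = rf"\d\d-processed-data-{PHASE}"
PROCESSING_LABEL = (
    r"(?:image-embeddings|entity-embeddings|entities|langident|lingproc"
    r"|ocrqa|newsagencies|topics|textreuse)"
)
TASK = r"(?:ner|nel|tm|emb|lid|pos)"
SUBTASK = r"newsagency"
MODEL_SPECIFITY = r"[A-Za-z][A-Za-z_]*"
MODEL_VERSION = r"v\d+\.\d+\.\d+"        # dots inside the model version
RUN_VERSION = r"v\d+-\d+-\d+"            # hyphens inside the run version
LANG = r"(?:de|fr|en|lb|multilingual|)"  # the empty alternative is allowed
MODEL_ID = rf"{TASK}(?:_{SUBTASK})?-{MODEL_SPECIFITY}(?:_{MODEL_VERSION})?-{LANG}"
RUN_ID = rf"{PROCESSING_LABEL}-{MODEL_ID}_{RUN_VERSION}"
PROVIDER_ALIAS = r"[A-Za-z]+"
MEDIA_ALIAS = r"[A-Za-z][A-Za-z0-9]*"
FILE_STEM = rf"{MEDIA_ALIAS}-(?P<year>\d\d\d\d)(?:-{PROCESSING_LABEL})?"

# Assumption: the optional PROCESSING_STEP segment is omitted because the
# grammar does not define it; only the optional PROVIDER_ALIAS is handled.
PATH_RE = re.compile(
    rf"s3://(?P<bucket>{BUCKET})/(?P<label>{PROCESSING_LABEL})"
    rf"/(?P<run_id>{RUN_ID})"
    rf"(?:/(?P<provider>{PROVIDER_ALIAS}))?"
    rf"/(?P<media>{MEDIA_ALIAS})/(?P<stem>{FILE_STEM})\.jsonl\.bz2"
)


def build_path(stage, phase, label, model_id, run_version, media, year):
    """Assemble a path without optional segments and validate it."""
    major, minor, patch = run_version
    run_id = f"{label}-{model_id}_v{major}-{minor}-{patch}"
    path = (
        f"s3://{stage}-processed-data-{phase}/{label}/{run_id}"
        f"/{media}/{media}-{year}.jsonl.bz2"
    )
    if PATH_RE.fullmatch(path) is None:
        raise ValueError(f"path does not match the naming grammar: {path}")
    return path


example = build_path(
    "42", "final", "langident", "lid-ensemble-multilingual", (1, 4, 4), "GDL", 1900
)
print(example)
```

Writing the expression this way shows one property of the draft: because `MODEL_SPECIFITY` may contain `_`, the boundary before `MODEL_VERSION` and `RUN_VERSION` is only recovered through regex backtracking, which may be worth keeping in mind when deciding on separators.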
We should provide a schema that can cope with the following data and processing developments:
different media that will be processed in various ways and should not be specific to a certain processing pipeline:
different types of enrichments and transformations:
different models and processing pipelines applied to these media:
different versions for running these models and pipelines
Current Google document for brainstorming:
Suggestion: