The goal of this ticket is to reduce long-lived connections during parse workflows, in order to reduce the incidence of network disconnects causing crashes.
This will be done by writing files locally first, then uploading each file in one faster operation, which can also be retried since the data is preserved locally.
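A minimal sketch of the local-write-then-upload step, assuming the google-cloud-storage Python client; the bucket and blob names here are placeholders, not existing code:

```python
# Sketch: after a local file is finished, upload it in one short-lived,
# retryable request. Assumes google-cloud-storage; names are illustrative.
from google.cloud import storage
from google.cloud.storage.retry import DEFAULT_RETRY


def upload_local_file(local_path: str, bucket_name: str, blob_name: str) -> None:
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    # Safe to retry: the source data is still on local disk, so a failed
    # upload can simply be re-run without losing anything.
    blob.upload_from_filename(local_path, retry=DEFAULT_RETRY)
```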
Cloud Run job containers' filesystem is an in-memory tmpfs. They can mount a bucket as a volume, but this doesn't really help us reduce connection times if the file writers still keep a blob open for writing for 5+ hours.
https://cloud.google.com/run/docs/configuring/jobs/cloud-storage-volume-mounts
We could just expand the in-memory tmpfs, but this is likely to be expensive (how much?) and has a hard upper limit of 32 GB. This limit is okay, though, since the clinvar_vcv_2025_01_14_v2_1_0 dataset is only 3.3 GB in the current version, which uses one NDJSON.gz file per entity_type.
One option to reduce the storage needed, and maybe also speed up the bq-ingest step, is to start partitioning our output files instead of writing one big NDJSON per entity. We could pick some limit like 10,000 and write a maximum of 10,000 lines per file; for 3.3 million VCVs this would be 330 files. We could also pick 100,000, which would only really come into play for the entities with a lot of records (the submitters would still all fit in one file). After a local output file reaches the limit, it gets closed, uploaded to the bucket (and maybe deleted locally to free tmpfs space), and a new file is opened locally to start receiving additional rows; see the sketch below.
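A minimal sketch of such a partitioned writer, under the assumptions above; upload_local_file is the hypothetical helper sketched earlier, and the part-size default, bucket name, and file-naming scheme are placeholders:

```python
import gzip
import os


class PartitionedNdjsonWriter:
    """Writes at most max_lines records per local NDJSON.gz part file.

    When a part fills up it is closed, uploaded, and deleted locally, so the
    in-memory tmpfs only ever holds one open part per entity_type.
    """

    def __init__(self, entity_type: str, local_dir: str, bucket_name: str,
                 max_lines: int = 100_000):
        self.entity_type = entity_type
        self.local_dir = local_dir
        self.bucket_name = bucket_name
        self.max_lines = max_lines
        self.part = 0
        self.lines_in_part = 0
        self._fh = None
        self._current_path = None

    def _open_next_part(self):
        name = f"{self.entity_type}-{self.part:05d}.ndjson.gz"
        self._current_path = os.path.join(self.local_dir, name)
        self._fh = gzip.open(self._current_path, "wt")

    def write_line(self, line: str):
        if self._fh is None:
            self._open_next_part()
        self._fh.write(line + "\n")
        self.lines_in_part += 1
        if self.lines_in_part >= self.max_lines:
            self._finish_part()

    def _finish_part(self):
        self._fh.close()
        # One short upload per completed part, retryable because the file is
        # still on disk; then delete it to free tmpfs space.
        upload_local_file(
            self._current_path,
            self.bucket_name,
            f"{self.entity_type}/{os.path.basename(self._current_path)}",
        )
        os.remove(self._current_path)
        self._fh = None
        self.part += 1
        self.lines_in_part = 0

    def close(self):
        # Flush and upload any partially filled final part.
        if self._fh is not None:
            self._finish_part()
```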