
Change output file writing to use batched local cache and uploads #286

Open
theferrit32 opened this issue Jan 21, 2025 · 0 comments
The goal of this ticket is to reduce long-lived connections during parse workflows, in order to reduce the incidence of network disconnects causing crashes.

This will be done by writing files locally first, then uploading each one in a single, faster operation. The upload can also be retried, since the data is preserved locally.
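Because the data stays on local disk between attempts, the upload step can be wrapped in a simple retry loop. A minimal sketch (the `upload_once` callable is hypothetical, e.g. something like `lambda p: bucket.blob(name).upload_from_filename(p)` with the GCS client):

```python
import time

def upload_with_retry(upload_once, local_path, retries=3, backoff=2.0):
    """Attempt an upload of a local file up to `retries` times with
    exponential backoff. Safe to retry because the file is preserved
    locally between attempts."""
    for attempt in range(retries):
        try:
            return upload_once(local_path)
        except Exception:
            if attempt == retries - 1:
                raise
            # back off before retrying: backoff, 2*backoff, 4*backoff, ...
            time.sleep(backoff * (2 ** attempt))
```

This is only a sketch; real code would likely narrow the caught exception to transient network errors.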

The filesystem in Cloud Run job containers is an in-memory tmpfs. A bucket can be mounted as a volume, but that doesn't really reduce connection lifetimes if the file writers still keep a blob open for writing for 5+ hours.

https://cloud.google.com/run/docs/configuring/jobs/cloud-storage-volume-mounts

We could just expand the in-memory tmpfs, but this is likely to be expensive (how much?) and has a hard upper limit of 32GB. That limit is okay for now, though, since the clinvar_vcv_2025_01_14_v2_1_0 dataset is only 3.3GB in the current version, which writes one NDJSON.gz file per entity_type.

One option to reduce the storage needed, and maybe also speed up the bq-ingest step, is to start partitioning our output files instead of writing one big NDJSON per entity. We could pick some maximum like 10,000 lines per file; for 3.3 million VCVs that would be 330 files. We could also pick 100,000, which would only really come into play for the entities with a lot of records (the submitters would still all fit in one file). Once a local output file reaches the line limit, it gets closed and uploaded to the bucket (and maybe deleted locally to free tmpfs space), and a new file is opened locally to start receiving additional rows.
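The rotate-on-line-limit behavior described above could look something like this sketch. The class name and the `upload_fn` callable are assumptions, not existing project code; `upload_fn` stands in for whatever uploads a finished part to the bucket:

```python
import gzip
import os

class BatchedNdjsonWriter:
    """Write NDJSON lines to local gzip files, rotating to a new part
    file every `max_lines` lines. Each closed part is passed to
    `upload_fn` (hypothetical: e.g. a retried GCS upload) and then
    deleted locally to free tmpfs space."""

    def __init__(self, prefix, upload_fn, max_lines=100_000):
        self.prefix = prefix          # e.g. "/tmp/out/vcv"
        self.upload_fn = upload_fn
        self.max_lines = max_lines
        self.part = 0                 # part number, used in filenames
        self.count = 0                # lines written to the current part
        self.fh = None

    def _open(self):
        self.path = f"{self.prefix}-{self.part:05d}.ndjson.gz"
        self.fh = gzip.open(self.path, "wt")

    def write(self, line):
        if self.fh is None:
            self._open()
        self.fh.write(line.rstrip("\n") + "\n")
        self.count += 1
        if self.count >= self.max_lines:
            self._rotate()

    def _rotate(self):
        # close the full part, upload it, and free the local copy
        self.fh.close()
        self.upload_fn(self.path)
        os.remove(self.path)
        self.fh = None
        self.count = 0
        self.part += 1

    def close(self):
        # flush and upload any final partially-filled part
        if self.fh is not None:
            self._rotate()
```

With `max_lines=10_000`, 3.3 million VCV rows would come out as 330 parts, each independently uploadable and retryable.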
