
Change output file writing to use batched local cache and uploads #286

Open
theferrit32 opened this issue Jan 21, 2025 · 0 comments
The goal of this ticket is to reduce long-lived connections during parse workflows, in order to reduce the incidence of network disconnects causing crashes.

This will be done by writing files locally first, then uploading each one in a single, faster operation. The upload can also be retried, since the data is preserved locally.
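Because the data stays on local disk between attempts, the upload step can be wrapped in a simple retry loop. A minimal sketch (the `upload_once` callable is hypothetical, e.g. something like `lambda p: bucket.blob(name).upload_from_filename(p)` with the GCS client):

```python
import time

def upload_with_retry(upload_once, local_path, retries=3, backoff=2.0):
    """Attempt an upload of a local file up to `retries` times with
    exponential backoff. Safe to retry because the file is preserved
    locally between attempts."""
    for attempt in range(retries):
        try:
            return upload_once(local_path)
        except Exception:
            if attempt == retries - 1:
                raise
            # back off before retrying: backoff, 2*backoff, 4*backoff, ...
            time.sleep(backoff * (2 ** attempt))
```

This is only a sketch; real code would likely narrow the caught exception to transient network errors.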

The filesystem in Cloud Run job containers is an in-memory tmpfs. A bucket can be mounted as a volume, but that doesn't really reduce connection lifetimes if the file writers still keep a blob open for writing for 5+ hours.

https://cloud.google.com/run/docs/configuring/jobs/cloud-storage-volume-mounts

We could just expand the in-memory tmpfs, but this is likely to be expensive (how much?) and has a hard upper limit of 32GB. That limit is okay for now, though, since the clinvar_vcv_2025_01_14_v2_1_0 dataset is only 3.3GB in the current version, which writes one NDJSON.gz file per entity_type.

One option to reduce the storage needed, and maybe also speed up the bq-ingest step, is to start partitioning our output files instead of writing one big NDJSON per entity. We could pick some maximum like 10,000 lines per file; for 3.3 million VCVs that would be 330 files. We could also pick 100,000, which would only really come into play for the entities with a lot of records (the submitters would still all fit in one file). Once a local output file reaches the line limit, it gets closed and uploaded to the bucket (and maybe deleted locally to free tmpfs space), and a new file is opened locally to start receiving additional rows.
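The rotate-on-line-limit behavior described above could look something like this sketch. The class name and the `upload_fn` callable are assumptions, not existing project code; `upload_fn` stands in for whatever uploads a finished part to the bucket:

```python
import gzip
import os

class BatchedNdjsonWriter:
    """Write NDJSON lines to local gzip files, rotating to a new part
    file every `max_lines` lines. Each closed part is passed to
    `upload_fn` (hypothetical: e.g. a retried GCS upload) and then
    deleted locally to free tmpfs space."""

    def __init__(self, prefix, upload_fn, max_lines=100_000):
        self.prefix = prefix          # e.g. "/tmp/out/vcv"
        self.upload_fn = upload_fn
        self.max_lines = max_lines
        self.part = 0                 # part number, used in filenames
        self.count = 0                # lines written to the current part
        self.fh = None

    def _open(self):
        self.path = f"{self.prefix}-{self.part:05d}.ndjson.gz"
        self.fh = gzip.open(self.path, "wt")

    def write(self, line):
        if self.fh is None:
            self._open()
        self.fh.write(line.rstrip("\n") + "\n")
        self.count += 1
        if self.count >= self.max_lines:
            self._rotate()

    def _rotate(self):
        # close the full part, upload it, and free the local copy
        self.fh.close()
        self.upload_fn(self.path)
        os.remove(self.path)
        self.fh = None
        self.count = 0
        self.part += 1

    def close(self):
        # flush and upload any final partially-filled part
        if self.fh is not None:
            self._rotate()
```

With `max_lines=10_000`, 3.3 million VCV rows would come out as 330 parts, each independently uploadable and retryable.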
