To rapidly load data from diverse sources into accessible formats, slice it into useful chunks, identify new records, and apply existing per-record analysis tools, we use a set of custom scripts built on GNU parallel. For example, GISAID's JSON data provision is parsed into metadata TSVs and zstd-compressed FASTA files, batched by GISAID's lineage call; reducing the sequence variation within each batch speeds up downstream operations such as minimap2 alignment. The new data is then diffed against the old to identify new records, and the output and analysis protocols are run.
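
As an illustrative sketch only (not the production scripts — the field names, file layout, and use of Python in place of the shell-level GNU parallel pipeline are all assumptions), the per-lineage batching step might look like this, using the `json` standard library and the `zstandard` package:

```python
import json
from collections import defaultdict
from pathlib import Path

import zstandard  # pip install zstandard


def batch_by_lineage(provision_path: str, out_dir: str) -> None:
    """Split a newline-delimited JSON provision into per-lineage
    metadata TSVs and zstd-compressed FASTA files (illustrative only)."""
    records = defaultdict(list)
    with open(provision_path) as fh:
        for line in fh:
            rec = json.loads(line)
            # Field names such as "covv_lineage" and "sequence" are assumptions.
            records[rec.get("covv_lineage", "unassigned")].append(rec)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cctx = zstandard.ZstdCompressor()
    for lineage, recs in records.items():
        # One metadata TSV per lineage batch.
        with open(out / f"{lineage}.tsv", "w") as tsv:
            tsv.write("strain\tdate\tlineage\n")
            for r in recs:
                tsv.write(f"{r['covv_virus_name']}\t{r['covv_collection_date']}\t{lineage}\n")
        # One zstd-compressed FASTA per lineage batch.
        fasta = "".join(f">{r['covv_virus_name']}\n{r['sequence']}\n" for r in recs)
        with open(out / f"{lineage}.fasta.zst", "wb") as fz:
            fz.write(cctx.compress(fasta.encode()))
```

In practice the resulting per-lineage batches can be fanned out with GNU parallel, so that alignment and diffing run concurrently across batches.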

Our aggregation and analysis pipelines are parallelized with Dask, either through Dask DataFrames or through XArray's Dask integration. Using these tools efficiently requires careful system configuration, data arrangement, and choice of operations, which can make implementation difficult. A set of interconnected scripts computes estimated coronavirus case counts and relative growth rates, along with their associated uncertainty.
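
As a minimal sketch of the Dask DataFrame side (the file pattern, column names, and grouping keys are assumptions for illustration, not the actual pipeline), a parallel aggregation over per-lineage metadata TSVs could be written as:

```python
import dask.dataframe as dd

# Lazily read all per-lineage metadata TSVs as one partitioned DataFrame;
# Dask only schedules the work when .compute() is called.
meta = dd.read_csv(
    "batches/*.tsv",
    sep="\t",
    dtype={"strain": "object", "date": "object", "lineage": "object"},
)

# Count sequences per lineage and collection date in parallel across
# partitions, then bring the (small) result back as a pandas object.
counts = meta.groupby(["lineage", "date"]).size().compute()
counts = counts.rename("n_sequences").reset_index()

print(counts.head())
```

Aggregates of this kind feed the downstream estimates of case counts and relative growth rates; keeping the partitioning aligned with the per-lineage batches avoids expensive shuffles.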