Releases: google/deepvariant
DeepVariant 0.9.0
- In the v0.9.0 release, we introduce best practices for merging DeepVariant samples.
- Added visualizations of variant output for visual QC and inspection.
- Improved Indel accuracy for WGS and WES (error reduction of 36% on the WGS case study) by reducing Indel candidate generation threshold to 0.06.
- Improved WES model accuracy by expanding training regions with a 100bp buffer around capture regions and additional training at lower exome coverages.
- Improved performance for new PacBio Sequel II chemistry and CCS v4 algorithm by training on additional data.
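To illustrate what lowering the Indel candidate generation threshold means, here is a simplified sketch. It is not DeepVariant's candidate-generation code: it assumes a candidate is kept when the fraction of reads supporting the allele reaches the threshold, and the function name and read counts are hypothetical. Only the 0.06 value comes from the notes above.

```python
def is_indel_candidate(supporting_reads: int, total_reads: int,
                       min_fraction: float = 0.06) -> bool:
    """Return True if the allele's read fraction reaches the threshold.

    Hypothetical helper: the real caller applies additional criteria.
    """
    if total_reads <= 0:
        return False
    return supporting_reads / total_reads >= min_fraction


# An indel seen in 3 of 40 reads (7.5%) reaches a 0.06 threshold,
# while one seen in 2 of 40 reads (5%) does not.
print(is_indel_candidate(3, 40))
print(is_indel_candidate(2, 40))
```

A lower threshold admits more low-support alleles as candidates, which is consistent with the increased sensitivity described for this release.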
Full release notes:
New documentation:
- Added a tutorial for merging WES trio.
- Visualization functionality and documentation: VCF stats report.
Changes to Docker images, code, and models:
- Docker images now live in Docker Hub google/deepvariant in addition to gcr.io/deepvariant-docker/deepvariant.
- For WES, added a 100bp buffer to the capture regions when creating training examples.
- For WES, increased training examples with lower coverage exomes, down to 30x.
- For PACBIO, added training data for Sequel II v2 chemistry and samples processed with CCS v4 algorithm.
- Loosened the restriction that BAM files must have exactly one sample_name. If the header contains multiple samples, the first one is used; if it contains none, a default is used.
- Changes in realigner code. Realigner aligns reads to haplotypes first and then realigns them to the reference. With this change some of the haplotypes (with not enough read support) are now discarded. This results in fewer reads needing to be realigned. Theoretically, this fix should improve FP rate. It also helps to resolve a GitHub issue.
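The relaxed sample_name rule above can be sketched as a small selection function. This is an illustration under stated assumptions, not the DeepVariant implementation: the function name is hypothetical, the header samples are assumed to arrive as a list in header order, and the fallback value "default" is a placeholder because the notes do not say what the default is.

```python
DEFAULT_SAMPLE_NAME = "default"  # assumed placeholder, not confirmed by the notes


def choose_sample_name(header_samples):
    """Pick a sample name from the sample names found in a BAM header.

    header_samples: list of sample names in header order (may be empty).
    """
    if header_samples:
        # Multiple samples are tolerated: the first one is used.
        return header_samples[0]
    # No sample in the header: fall back to a default name.
    return DEFAULT_SAMPLE_NAME


print(choose_sample_name(["NA12878", "NA12891"]))
print(choose_sample_name([]))
```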
Changes to flags:
- Added --sample_name flag to run_deepvariant.py.
- Reduced default for vsc_min_fraction_indels to 0.06 for Illumina data (WGS and WES mode) which increases sensitivity.
- Expanded the use of --reads to take multiple BAMs in a comma-separated list.
- Use --ref for CRAM by default. (Set --use_ref_for_cram to true by default)
- Added support for BAM output for realigner debugging. See --realigner_diagnostics and --emit_realigned_reads flags in realigner.py.
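As a rough sketch of how the new --sample_name flag might be passed to run_deepvariant.py, the helper below assembles an argument list. The helper name, the binary path, and all file paths are assumptions for illustration; check the flags against the documentation for the version you run.

```python
import shlex


def build_run_deepvariant_args(ref, reads, output_vcf,
                               model_type="WGS", sample_name=None):
    """Assemble a run_deepvariant argument list (hypothetical helper)."""
    args = [
        "/opt/deepvariant/bin/run_deepvariant",  # assumed install path
        f"--model_type={model_type}",
        f"--ref={ref}",
        f"--reads={reads}",
        f"--output_vcf={output_vcf}",
    ]
    if sample_name is not None:
        # Flag added in v0.9.0 according to the notes above.
        args.append(f"--sample_name={sample_name}")
    return args


args = build_run_deepvariant_args(
    ref="/input/reference.fasta",   # placeholder path
    reads="/input/sample.bam",      # placeholder path
    output_vcf="/output/sample.vcf.gz",
    sample_name="NA12878",
)
print(" ".join(shlex.quote(a) for a in args))
```

Building the command as a list rather than a single string keeps each argument intact when it is later handed to something like subprocess.run.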
DeepVariant 0.8.0
With the v0.8.0 release, we introduce a new DeepVariant model for PacBio CCS data. This model can be run in the same manner as the Illumina WGS and WES models. For more details, see our manuscript with PacBio and our blog post.
This release also includes general improvements to DeepVariant and the Illumina WGS and WES models. These include:
- New script that lets the users run DeepVariant in one command. See Quick Start.
- Improved accuracy for NovaSeq samples, especially PCR-Free ones, achieved by adding NovaSeq samples to the training data. See DeepVariant training data.
- Improved accuracy for low coverage (30x and below), achieved by training on a broader mix of downsampled data. See DeepVariant training data.
- Overall speed improvements which reduce runtime by ~24% on WGS case study:
  - Speed improvements in querying SAM files and doing calculations with Reads and Ranges.
  - Fewer unnecessary copies when constructing de Bruijn graphs.
  - Less memory usage when writing BED, FASTQ, GFF, SAM, and VCF files.
- Speed improvements in postprocess_variants when creating gVCFs - achieved by combining writing and merging for both VCF and gVCF.
- Improved support for CRAM files, allowing the use of a provided reference file instead of the embedded reference. See the use_ref_for_cram flag below.
New optional flags:
make_examples.py
- use_ref_for_cram: Default is False (using the embedded reference in the CRAM file). If set to True, --ref will be used as the reference instead. See CRAM support section for more details.
- parse_sam_aux_fields and use_original_quality_scores: Option to read base quality scores from the OQ tag. To use this option, set both flags to true. The standard GATK process includes a score recalibration stage where base quality scores are recalibrated using special software. DeepVariant produces slightly better accuracy when original scores are used. Original scores are usually stored in a BAM file under the optional OQ tag. This feature allows reading quality scores from the OQ tag instead of the QUAL field.
- min_base_quality: Allowed users to try different thresholds for minimum base quality score.
- min_mapping_quality: Allowed users to try different thresholds for minimum mapping quality score.
call_variants.py
- config_string: Allowed users to specify estimator session configuration through a flag when running on CPU and GPU, thanks to the contribution of @A-Tsai from ATGENOMIX in #159.
- num_mappers: Allowed users to modify the number of dataset mappers through a flag, thanks to the contribution of @fo40225 from National Taiwan University Hospital in #152.
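The choice between the OQ tag and the QUAL field described for use_original_quality_scores can be sketched as follows. This is a minimal illustration, not DeepVariant's reader code: it assumes the read's tags are available as a plain dictionary, that both OQ and QUAL use the Phred+33 ASCII encoding of the SAM format, and that a read without an OQ tag falls back to QUAL (the notes do not say what happens in that case). The function names are hypothetical.

```python
def phred33_to_scores(quality_string):
    """Decode a Phred+33 ASCII quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def base_qualities(qual, tags, use_original_quality_scores=False):
    """Return base qualities from OQ when requested and present, else QUAL.

    qual: the read's QUAL string.
    tags: dict of optional SAM tags for the read (assumed representation).
    """
    if use_original_quality_scores and "OQ" in tags:
        return phred33_to_scores(tags["OQ"])
    # Assumption: fall back to the (possibly recalibrated) QUAL field.
    return phred33_to_scores(qual)


recalibrated = "5555"          # QUAL after recalibration
tags = {"OQ": "IIII"}          # original scores kept by the recalibration tool
print(base_qualities(recalibrated, tags))
print(base_qualities(recalibrated, tags, use_original_quality_scores=True))
```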
DeepVariant 0.7.2
- Htslib updated to v1.9, fixing an outstanding CRAM issue.
- Fix for the issue of non-deterministic output caused by changing the number of shards in the make_examples process.
- Upgrade to TensorFlow v1.12.
- Speed improvements in make_examples via the use of a flat_hash_map.
- Speed improvements in call_variants.
- The genotypes of low-quality (GQ < 20) homozygous reference calls are set to ./. instead of 0/0. The threshold is configurable via --cnn_homref_call_min_gq flag in postprocess_variants.py. This improves downstream cohort merging performance based on our internal investigation in the "Improved non-human variant calling using species-specific DeepVariant models" blog post.
- Google Cloud Runner:
  - Localize BED region files (given via --region flag), fixing an outstanding issue.
  - Make worker logs available in case of a failure inside DeepVariant.
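The GQ rule for homozygous reference calls described above can be sketched like this. It is a simplified illustration, not the postprocess_variants.py code: genotypes are assumed to be tuples of allele indices with -1 standing for a missing allele, and both function names are hypothetical. Only the default threshold of 20 comes from the notes.

```python
def apply_homref_gq_filter(genotype, gq, min_gq=20):
    """Turn a low-quality 0/0 call into a no-call; leave other calls alone."""
    if genotype == (0, 0) and gq < min_gq:
        return (-1, -1)
    return genotype


def format_genotype(genotype):
    """Render allele indices in VCF style, using '.' for missing alleles."""
    return "/".join("." if allele < 0 else str(allele) for allele in genotype)


print(format_genotype(apply_homref_gq_filter((0, 0), gq=15)))
print(format_genotype(apply_homref_gq_filter((0, 0), gq=20)))
print(format_genotype(apply_homref_gq_filter((0, 1), gq=5)))
```

Only homozygous reference calls are affected in this sketch; a low-GQ heterozygous call keeps its genotype.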
DeepVariant 0.7.1
- Fix for postprocess_variants - the previous version crashes if the first shard contains no records.
- Update the TensorFlow version dependency to 1.11.
- Added support to build on Ubuntu 18.04.
- Documentation changes: Moved the commands in the WGS and WES Case Studies into scripts under scripts/ to make them easy to run.
- Google Cloud runner:
  - Added batch_size in case the users need to change it for the call_variants step.
  - Added logging_interval_sec to control how often worker logs are written into Google Cloud Storage.
  - Improved the use of call_variants: only one call_variants is run on each machine for better performance. This improved the GPU cost and speed.
DeepVariant 0.7.0
This release includes numerous performance improvements that collectively reduce the runtime of DeepVariant by about 65%.
A few highlighted changes in this release:
- Update TensorFlow version to 1.9 built by default with Intel MKL support, speeding up call_variants runtime by more than 3x compared to v0.6.
- The components that use TensorFlow (both inference and training) can now be run on Cloud TPUs.
- Extensive optimizations in make_examples which result in significant runtime improvements. For example, make_examples now runs more than 3 times faster in the WGS case study than v0.6.
- New realigner implementation (fast_pass_aligner.cc) with parameters re-tuned using Vizier for better accuracy and performance.
- Changed window selector to use a linear decision model for choosing realignment candidates. This can be controlled by the flag -ws_use_window_selector_model, which is now on by default.
- Many micro-optimizations throughout the codebase.
- Added a new training case study showing how to train and fine-tune DeepVariant models.
- Added support for CRAM files
DeepVariant 0.6.1
- Update the build scripts and header files so that it builds successfully on Debian.
- Include a script that demonstrates how to build the CLIF binary we released.
- Update GCP runner's default #cores.
- Small code fix: Fix the call_variants issue of crashing on empty shards.
DeepVariant 0.6.0
This release has a new WGS model with a major accuracy improvement on PCR+ data. We also released a new WES model with a minor accuracy improvement.
A few important changes in this release:
- Changes in the training data for the WGS model:
  - Addition:
    - 3 replicates of HG001 (PCR+, HiSeqX) provided by DNAnexus
    - 2 replicates of HG001 (PCR+, NovaSeq) from BaseSpace public data.
  - Removal:
    - WES data (In v0.5.0, we trained our WGS model with WGS+WES data. This time we found that it didn’t help with WGS accuracy, so we removed them)
- Improved training data labels. See haplotype_labeler.py
- For direct inputs/outputs from cloud storage, we no longer support direct file I/O (like gs://deepvariant) due to bugs in htslib. Instead we recommend using gcsfuse to read/write data directly on GCS buckets. See “Inputs and Outputs” in DeepVariant user guide.
DeepVariant 0.5.2
This release is a bugfix release for gVCF creation. See #58 for details.
DeepVariant v0.5.1
This release fixes issue #27 and adds support for creating the MIN_DP field in gVCF records.
DeepVariant 0.5.0
- Release two separate models for calling genome and exome sequencing data. Significant improvement of Indel F1 on exome data.
  - On exome sequencing data (HG002):
    - Indel F1 0.936959 --> 0.961724; SNP F1 0.998636 --> 0.998962
  - On whole genome sequencing data (HG002):
    - Indel F1 0.996632 --> 0.996684; SNP F1 0.999495 --> 0.999542
- Provide capability to produce gVCF files as output from DeepVariant [doc]: gVCF files are required as input for analyses that create a set of variants in a cohort of individuals, such as cohort merging or joint genotyping.
- Training data: All models are trained with a benchmarking-compatible strategy: That is, we never train on any data from the HG002 sample, or from chromosome 20 from any sample.
  - Whole genome sequencing model: We used training data from both genome sequencing data as well as exome sequencing data.
    - WGS data:
      - HG001: 1 from PrecisionFDA, and 8 replicates from Verily.
      - HG005: 2 from Verily.
    - WES data:
      - HG001: 11 HiSeq2500, 17 HiSeq4000, 50 NovaSeq.
      - HG005: 1 from Oslo University.
    - In order to increase diversity of training data, we also used the downsample_fraction flag when making training examples.
  - Whole exome sequencing model: We started from a trained WGS model as a checkpoint, then continued to train only on the WES data above. We also use various downsample fractions for the training data.
- DeepVariant now provides deterministic output by rounding QUAL field to one digit past the decimal when writing to VCF.
- Update the model input data representation from 7 channels to 6.
  - Removal of "Op-Len" (CIGAR operation length) as a model feature. In our tests this makes the model more robust to input that has different read lengths.
  - Added an example for visualizing examples.
- Add a post-processing step to variant calls to eliminate rare inconsistent haplotypes [description].
- Expand the excluded contigs list to include common problematic contigs on GRCh38 [GitHub issue].
- It is now possible to run DeepVariant workflows on GCP with pre-emptible GPUs.
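The QUAL rounding mentioned in this release can be illustrated with a short sketch. It assumes, as a plausible reading of the note rather than a confirmed explanation, that runs could differ by tiny floating-point amounts, which disappear once the value is written with one digit past the decimal. The function name and the example values are hypothetical.

```python
def format_qual(qual: float) -> str:
    """Format a QUAL value with one digit past the decimal point."""
    return f"{qual:.1f}"


# Two values that differ only by floating-point noise produce the same text.
run_a = 37.2400001
run_b = 37.2399998
print(format_qual(run_a), format_qual(run_b))
```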