diff --git a/docs/deepvariant-pacbio-model-case-study.md b/docs/deepvariant-pacbio-model-case-study.md
index e8c67157..d497bab1 100644
--- a/docs/deepvariant-pacbio-model-case-study.md
+++ b/docs/deepvariant-pacbio-model-case-study.md
@@ -1,30 +1,29 @@
-# Using DeepVariant for small variant calling from PacBio CCS reads
+# Using DeepVariant for small variant calling from PacBio HiFi reads
 
-In this case study we describe applying DeepVariant to PacBio CCS reads to call
+In this case study we describe applying DeepVariant to PacBio HiFi reads to call
 variants. We will call small variants from a publicly available whole genome
-CCS dataset from PacBio.
+HiFi dataset from PacBio.
 
-Starting from v0.10.0, sequence from amplified libraries is included in our
-PacBio CCS training set, providing a significant accuracy boost to variant
-detection from amplified CCS data.
+In v0.8, DeepVariant released a model for PacBio HiFi data. Starting from
+v0.10.0, sequence from amplified libraries is included in our PacBio HiFi
+training set, providing a significant accuracy boost to variant detection from
+amplified HiFi data.
+In this case study we will apply the PacBio model by specifying `PACBIO` in
+the `model_type` parameter in the `run_pacbio_case_study_docker.sh` script.
 
-Case study is run on a standard Google Cloud instance. There are no special
+This case study is run on a standard Google Cloud instance. There are no special
 hardware or software requirements for running this case study. For consistency
 we use Google Cloud instance with 64 cores and 128 GB of memory. This is NOT the
 fastest or cheapest configuration. For more scalable execution of DeepVariant
 see the [External Solutions] section.
 
-In v0.8 DeepVariant released a model for PacBio CCS data. In this case study we
-will apply PacBio model by specifying `PACBIO` in `model_type` parameter in the
-`run_pacbio_case_study_docker.sh` script.
-
 ## Case study overview
 
 Calling small variants using DeepVariant involves multiple steps:
 
-1. Creating examples. Variants candidates are extracted from input BAM file
+1. Creating examples. Candidate variants are extracted from an input BAM file
    (previously aligned).
-2. Calling Variants. Applying DeepVariant Convolutional Neural Network (CNN)
+2. Calling variants. Applying DeepVariant convolutional neural network (CNN)
    model to infer variants.
 3. Exporting results to VCF.
 
@@ -37,13 +36,13 @@ There are multiple ways to run DeepVariant:
 - Download prebuilt binaries.
 - Download an official DeepVariant Docker image.
 
-This case study is run using official DeepVariant Docker image.
+This case study is run using the official DeepVariant Docker image.
 
 ## Running
 
 For simplicity we provide a script that downloads the input data and runs all
 the steps described above using DeepVariant Docker image. **Please note, that if
-you create your own script make_examples must be called with
+you create your own script `make_examples` must be called with
 `--norealign_reads --vsc_min_fraction_indels 0.12` flag for PacBio long reads.**
 
 1. Create a Google Cloud virtual instance. This command creates a virtual
@@ -74,7 +73,7 @@ curl https://raw.githubusercontent.com/google/deepvariant/r0.10/scripts/run_pacb
 
 ## Script description
 
-Before running DeepVariant steps following input data is downloaded:
+Before running the DeepVariant steps, the following input data is downloaded:
 
 * BAM file: pacbio.8M.30x.bam. Publicly available PacBio BAM file.
   [ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/
@@ -90,14 +89,15 @@ Before running DeepVariant steps following input data is downloaded:
   downloaded from
   [ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh37/](ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh37/)
 
-Next following steps are executed:
+Next the following steps are executed:
 
-* make_examples. This step creates small variant candidates and stores them in
-  TensorFlow format.
+* `make_examples`. This step creates small variant candidates and stores them
+  in TensorFlow format.
 
-* call_variants. This step applys DeepVariant DNN to call small variants.
+* `call_variants`. This step applies DeepVariant DNN to call small variants.
 
-* postprocess_variants. This step converts data from TensorFlow format to VCF.
+* `postprocess_variants`. This step converts data from TensorFlow format to
+  VCF.
 
 * `hap.py` ([https://github.com/Illumina/hap.py]) program from Illumina is used
   to evaluate the resulting vcf file. This serves as a check to ensure
@@ -114,9 +114,9 @@ Step | Wall time
 
 ## Accuracy metrics
 
-PacBio model was trained using HG002 genome (the same genome we use for this
-case study) with chromosomes 20, 21, 22 excluded. Therefore, we run evaluation
-on chr20.
+The PacBio model was trained using the HG002 genome (the same genome we use for
+this case study) with chromosomes 20, 21, 22 excluded. Therefore, we run
+evaluation on chr20.
 
 Type | # TP | # FN | # FP | Recall | Precision | F1\_Score
 ----- | ----- | ---- | ---- | -------- | --------- | ---------