diff --git a/_posts/2024-09-04-blindfolding-deepvariant-surprising-insights-from-hiding-information.markdown b/_posts/2024-09-04-blindfolding-deepvariant-surprising-insights-from-hiding-information.markdown
index a98e450d..704a16ef 100644
--- a/_posts/2024-09-04-blindfolding-deepvariant-surprising-insights-from-hiding-information.markdown
+++ b/_posts/2024-09-04-blindfolding-deepvariant-surprising-insights-from-hiding-information.markdown
@@ -20,7 +20,7 @@ authors: ["msamman", "danielecook", "awcarroll", "lucasbrambrink"]
}
@media (min-width: 1200px) {
max-width: 1100px;
- margin-left: -125px;
+ margin-left: -150px;
}
}
figcaption {
@@ -62,7 +62,7 @@ All DeepVariant models generally contain the following six base channels:
- Figure 1: A single pileup image (called an Example) composed of multiple channels.
+ Figure 1: A single pileup image (called an Example) composed of multiple channels.
The set of channels used by DeepVariant has changed over time. One of the earliest versions of DeepVariant encoded only four features: `read_base`, `base_quality`, `strand`, and `base_differs_from_ref`. Through trial and error, we arrived at the set of base channels listed above for all our models. In `v0.5.0`, we removed a channel that encoded CIGAR operation length (e.g. the length of a deletion or insertion event) to improve the generalizability of models. We have also added channels that are tailored towards specific sequencing platforms to improve accuracy. For example, [we added a haplotype channel](https://google.github.io/deepvariant/posts/2021-02-08-the-haplotype-channel/) to our PacBio model ([Release 1.1.0](https://github.com/google/deepvariant/releases/tag/v1.1.0)), and we added an insert-size channel to our Illumina models ([Release 1.4.0](https://github.com/google/deepvariant/releases/tag/v1.4.0)).
@@ -74,7 +74,7 @@ In order to gain a better understanding of each channel's contribution to overal
-
+
Figure 2(a): A pileup image with the base_differs_from_ref channel ablated.
@@ -85,7 +85,7 @@ The second set of models were trained on just a **single** channel chosen from t
- Figure 2(b): A single channel pileup image, showing only read_base information.
+ Figure 2(b): A single channel pileup image, showing only read_base information.
@@ -95,7 +95,7 @@ Included in our set of single channel experiments is a model trained on a comple
- Figure 2(c): An example of a blank channel, containing no information about reads or reference.
+ Figure 2(c): An example of a blank channel, containing no information about reads or reference.
@@ -112,7 +112,7 @@ We first focus our attention on the ablation models, in which each model is miss
src="{{ site.baseurl }}/assets/images/2024-09-04/figure_5.png"
alt="Figure 3: F1 Scores of ablation models"
class="large-image"/>
-
+
Figure 3: F1 Scores of ablation models. Instead of the traditional six base channels, these models had one channel missing from the examples, effectively hiding the information contained in the ablated channel.
@@ -127,7 +127,7 @@ Before we try to answer what critical information the `read_supports_variant` ch
src="{{ site.baseurl }}/assets/images/2024-09-04/figure_6.png"
alt="Figure 4: F1 Scores of single channel models"
class="large-image"/>
-
+
Figure 4: F1 Scores of single channel models compared to baseline. Instead of the traditional six base channels, these models kept just one channel in the examples. In consequence, these models operated in a much lower information environment.
@@ -150,7 +150,7 @@ To try to answer this question, let’s break up our F1 scores by genotype. Reme
src="{{ site.baseurl }}/assets/images/2024-09-04/figure_7.png"
alt="Figure 5: F1 Scores of ablation models computed per genotype"
class="large-image"/>
-
+
Figure 5: F1 Scores of ablation models computed per genotype, showing the global F1 score in the leftmost column for comparison. A clear drop in hetalt performance is observed when ablating the read_supports_variant channel.
@@ -164,7 +164,7 @@ We can see at a glance that the `ablate_read_supports_variant` model stands out
src="{{ site.baseurl }}/assets/images/2024-09-04/figure_8.png"
alt="Figure 6: Genotype distribution in the HG003 truth set for SNPs and INDELs"
class="large-image"/>
-
+
Figure 6: Genotype distribution in the HG003 truth set for SNPs and INDELs.
@@ -181,7 +181,7 @@ DeepVariant classifies a given example into three classes: `{0/0, 0/1, 1/1}`, th
-
+
Figure 7: A snapshot of an IGV alignment showing two possible SNPs, a comparatively rare multiallelic SNP being shown on the left and a more common biallelic SNP on the right.
@@ -206,7 +206,7 @@ To illustrate this, shown below are the three examples produced for a multiallel
src="{{ site.baseurl }}/assets/images/2024-09-04/figure_10c.png"
alt="Figure 8(c): SNP for chr3:163362557_T->TAC|TACAC"
class="large-image"/>
-
+
Figure 8: The set of examples showing the three possible representations of a single multiallelic locus. Only the read_supports_variant channel encodes different information across the three examples, since it encodes if a given read supports G→A (top row), G→T (second row) or G→A|T (A or T, third row).
@@ -239,7 +239,7 @@ Based on the above reasoning, we would expect to observe the same genotype-speci
src="{{ site.baseurl }}/assets/images/2024-09-04/figure_11.png"
alt="Figure 9: F1 Scores of single channel models computed per genotype"
class="large-image"/>
-
+
Figure 9: F1 Scores of single channel models computed per genotype, showing the global F1 score in the leftmost column for comparison. A clear drop in SNP homalt performance is observed for channels that do not directly encode allele information. This is not observed with INDELs.
@@ -256,7 +256,7 @@ This can be seen even more clearly when we look at the distribution of genotype
src="{{ site.baseurl }}/assets/images/2024-09-04/figure_12.png"
alt="Figure 10: Absolute number of genotypes called by each model"
class="large-image"/>
-
+
Figure 10: Absolute number of genotypes called by each model. It is clearly observed that the blank model deterministically classifies each example as het.
@@ -274,7 +274,7 @@ Let’s look at the homozygous SNP `chr2:522921`, a `G → A` mutation.
src="{{ site.baseurl }}/assets/images/2024-09-04/figure_13.png"
alt="Figure 11: All channel encodings of a homozygous SNP"
class="large-image"/>
-
+
Figure 11: All channel encodings of a homozygous SNP. The three channels in the top row encode allele information, while the channels in the bottom row do not.
@@ -295,7 +295,7 @@ So how is it possible for the bottom row models to call heterozygous variants re
src="{{ site.baseurl }}/assets/images/2024-09-04/figure_14.png"
alt="Figure 12: Absolute number of genotype mistakes made by single channel models"
class="large-image"/>
-
+
Figure 12: Absolute number of genotype mistakes made by single channel models. A clear pattern emerges that homalt SNPs are being called as het, a classification error not observed in INDELs.
@@ -317,7 +317,7 @@ Let’s look at a pair of heterozygous and homozygous deletions that were called
src="{{ site.baseurl }}/assets/images/2024-09-04/figure_15.png"
alt="Figure 13: Two examples of deletions: a heterozygous deletion (top) and homozygous alternate (bottom)"
class="large-image"/>
-
+
Figure 13: Two examples of deletions: a heterozygous deletion (top) and homozygous alternate (bottom). DeepVariant represents deletions as blank spaces within the read.
@@ -331,7 +331,7 @@ The same is not true for insertions. DeepVariant essentially encodes insertions
-
+
Figure 14(a): An example of an insertion illustrates how DeepVariant collapses the alternate alleles to their first base only.
@@ -342,7 +342,7 @@ Which begs the question, how is it possible for DeepVariant to call insertions r
-
+
Figure 14(b): Multiple insertion loci encoded by the mapping_quality channel are shown, illustrating how they appear to contain no discernible information to differentiate genotypes (being het, homalt and het, respectively).
@@ -355,7 +355,7 @@ We would expect that the models that struggle to differentiate `het` and `homalt
src="{{ site.baseurl }}/assets/images/2024-09-04/figure_17.png"
alt="Figure 15: F1 scores of single channel models compared across insertions and deletions"
class="large-image"/>
-
+
Figure 15: F1 scores of single channel models compared across insertions and deletions. There is a clear difference in homalt performance between insertions and deletions.
@@ -369,7 +369,7 @@ The answer lies in the read length distribution. Illumina short-read sequencing
src="{{ site.baseurl }}/assets/images/2024-09-04/figure_18.png"
alt="Figure 16: The distribution of the average read length per example across all candidates in the HG003 Illumina WGS case study"
class="large-image"/>
-
+
Figure 16: The distribution of the average read length per example across all candidates in the HG003 Illumina WGS case study.
@@ -382,7 +382,7 @@ Because DeepVariant collapses the insertions—that is, representing them by the
src="{{ site.baseurl }}/assets/images/2024-09-04/figure_19.png"
alt="Figure 17: The distribution of the average read length per example broken down by SNP, deletion, and multiple ranges of insertion sizes"
class="large-image"/>
-
+
Figure 17: The distribution of the average read length per example broken down by SNP, deletion, and multiple ranges of insertion sizes (1-5, 6-10, 11-15, and 15+, respectively).
@@ -394,7 +394,7 @@ Furthermore, since `het` and `homalt` differ in the number of reads supporting t
src="{{ site.baseurl }}/assets/images/2024-09-04/figure_20.png"
alt="Figure 18: The distribution of the average read length per example comparing het vs homalt variants, across SNP, deletion, and multiple ranges of insertion sizes"
class="large-image"/>
-
+
Figure 18: The distribution of the average read length per example comparing het vs homalt variants, across SNP, deletion, and multiple ranges of insertion sizes.
@@ -410,7 +410,7 @@ For example, suppose the `only_mapping_quality` model encounters an example with
src="{{ site.baseurl }}/assets/images/2024-09-04/figure_21.png"
alt="Figure 19: The distribution of the average read length per example comparing het vs homalt variants across errors (FP+FN) and TPs"
class="large-image"/>
-
+
Figure 19: The distribution of the average read length per example comparing het vs homalt variants across errors (FP+FN) and TPs. A higher mean for homalt errors suggests that DeepVariant incorrectly classifies them according to the read length distribution.
diff --git a/assets/images/2024-09-04/figure_18.png b/assets/images/2024-09-04/figure_18.png
index c211c645..11b11f56 100644
Binary files a/assets/images/2024-09-04/figure_18.png and b/assets/images/2024-09-04/figure_18.png differ