Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overlapping SVs within individuals #1557

Open
maggs-x opened this issue Dec 5, 2024 · 30 comments
Open

Overlapping SVs within individuals #1557

maggs-x opened this issue Dec 5, 2024 · 30 comments

Comments

@maggs-x
Copy link

maggs-x commented Dec 5, 2024

Hi,

Thanks for your help with my previous question, Glenn. I have another question about a different pangenome I'm working on. We built it last year with cactus-minigraph pangenome pipeline (v2.5.1). The input for our pangenome are four chromosome level assemblies for four individuals, the fourth of which was used as the reference. So our vcf has three individuals (One, Two, and Three).

In the hal file and vcf we noticed structural insertions, for example, within individuals that overlap one another. Presumably, these should be represented by a single structural insertion. In more detail here is an example:

On chromosome 1 at site 671683 there is a 238bp insertion in individual One.
Then 8 base pairs away, at site 671691 there is a 270bp insertion in individual One.

I attached screenshots of the two variants in case this helps. I'm curious if you've run into this issue before. Please let me know.
screenshot2

screenshot1

@maggs-x
Copy link
Author

maggs-x commented Dec 5, 2024

Hi, please disregard my initial concern. It makes complete sense to have two structural insertions relatively close together when the coordinates are based on the reference. I just wanted to give you a heads up that when we normalized and left aligned this vcf with bcftools, all 5 structural insertions were reassigned to site 671683 making it seem like single individuals had multiple alleles at the same site. This doesn't make biological sense for our dataset given we did not input multiple haplotypes per individual. In case someone else runs into this issue, I figured its good to put on your radar.

@glennhickey
Copy link
Collaborator

Yeah, this sounds like #1493 and I suspect the problem is mostly in bcftools norm. There's a potential fix here #1536. I don't think it's ideal but likely better than nothing.

@maggs-x
Copy link
Author

maggs-x commented Dec 5, 2024

Thanks Glenn. And it makes complete sense that the CHR POS in the vcf after left-aligning doesn't match the hal file. Correct? [Reminder we ran cactus 2.5.1 so applied bcftools norm independently afterward].

I guess my concern with the merge_duplicates.py is that the appearance of the SV insertions as duplicate alleles is an artifact of bcftools norm. In the vcf after vcfbub, there are clearly 2 SV insertions as site 10 (let's say) and then 3 SV insertions at site 15. You can plot these clearly with a TubeMap. After bcftools norm, all 5 SV insertions are assigned to site 10. As I explained in the last comment, this doesn't make biological sense for our data. The merge_duplicates.py script looks like it can render a sensible looking vcf, but it'll condense all of these variants into a single variant per individual. I have my hesitations about this. Especially with plotting. For example, I can use the .vg file to create figures that align perfectly with the vcf after vcfbub. But it won't align with a vcf after bcftools norm + merge_duplicates.py. Sounds like we should weigh the costs and benefits of bcftools norm. Either analyze the vcf without any of this postprocessing, or postprocess and understand that there will be some inconsistencies between our plots and the vcf.

Thanks for your feedback. It really helped.

@glennhickey
Copy link
Collaborator

glennhickey commented Dec 6, 2024

Yeah, these misgivings are why I hadn't merged #1536, but I've procrastinated following up.

Anyway, I agree. I think we need merge_duplicates.py to merge the alleles, not just the sites.

So if I have

A  AAAAA    0 1 1
A  AAAAA    0 1 0

My new site would be

A  AAAAA,AAAAAAAAA    0 2 1

Since the second sample had two equivalent insertions, they would just be doubled up. I think a similar process should work for indels in general...

Does this make sense? @Han-Cao what do you think?

@maggs-x
Copy link
Author

maggs-x commented Dec 6, 2024 via email

@maggs-x
Copy link
Author

maggs-x commented Dec 6, 2024 via email

@Han-Cao
Copy link

Han-Cao commented Dec 7, 2024

Hi @maggs-x ,

I would like to clarify that merge_duplicates.py will not merge the variants you showed because it always checks the genotypes before merging. If there are 2 alleles at the same POS on the same haplotype, they will not be merged. This is also why it now only supports phased VCF, because we cannot tell whether 2 heterozygous unphased genotypes are ok to merge. For example, the unphased genotypes for below VCFs are the same:

This will be merged:

A  AAAAA    1|0 1|0
A  AAAAA    0|1 0|0

This will not be merged:

A  AAAAA    1|0 1|0
A  AAAAA    1|0 0|0

@glennhickey ,

Yes, the behavior you describe is more reasonable. I think the new version of the script could do:
Input:

A  AAAAA    0 1 1
A  AAAAA    1 0 0
A  AAAAA    0 1 0

Output:

A  AAAAA,AAAAAAAAA    1 2 1

Will let you know when it is ready.

@maggs-x
Copy link
Author

maggs-x commented Dec 7, 2024 via email

@Han-Cao
Copy link

Han-Cao commented Dec 8, 2024

Hi,

I have updated merge_duplicates.py. Please have a try.

For a small test VCF:

#CHROM	POS	ID	REF	ALT	Haplotype	Sample1	Sample2	Sample3
chr1	5	var1	A	AAA	0	0|0	1|0	0|1
chr1	5	var2	A	AAA	1	1|0	0|0	0|0
chr1	5	var3	A	AAA	.	0|0	1|0	0|0
chr1	5	var4	AAA	A	1	0|0	1|0	0|0
chr1	5	var5	AAA	A	1	0|0	1|1	0|1
chr1	6	var7	A	TTT	0	0|0	1|0	.|1
chr1	6	var8	A	TTT	0	0|0	1|0	1|0
chr1	6	var9	AA	TTTTTT	0	0|0	1|0	1|0

Output:

#CHROM	POS	ID	REF	ALT	Haplotype	Sample1	Sample2	Sample3
chr1	5	var1	A	AAA	1	1|0	0|0	0|1
chr1	5	var3	A	AAAAA	.	0|0	1|0	0|0
chr1	5	var4	AAA	A	0	0|0	0|1	0|1
chr1	5	var5	AAAAA	A	1	0|0	1|0	0|0
chr1	6	var7	A	TTT	0	0|0	0|0	1|1
chr1	6	var9	AA	TTTTTT	0	0|0	0|0	1|0
chr1	6	var8	AAAA	TTTTTTTTTTTT	0	0|0	1|0	.|0
  • var1-3: haplotype, sample 1 and 3 have 1 insertion on one haplotype, they merged into 1 record. Sample 2 has 2 alleles on hap1, its genotype and alleles are merged as shown in var3
  • var4-5: similar as above, but for a deletion
  • var7-9: a more complicated site, where sample 2 has a total of 4 copies of A -> TTT. The script will merge var7, 8, 9 recursively. I don't know if this exists in real dataset, just to show how the script works.

I am still not sure how to merge missing genotype with non-missing genotype. For example, if there are 2 duplicated A -> AAA:

  • when merging . and 0, we don't know whether there is 1 insertion (.), but it is impossible to have 2 insertions (0)
  • when merging . and 1, there is at least 1 insertion (1), but not sure if there are 2 insertions (.).

For samples without missing genotypes, having one insertion implies not having two insertions, and vice versa. So, it could be confusing to have any genotypes as missing. I added an option --merge-mis-as-ref to treat missing genotypes as reference when merging with non-missing genotypes. What do you think on merging missing genotypes?

Input:
A	AAA	0	1	1	.
A	AAA	.	.	1	.

Default output:
A	AAA	.	1	0	.
A	AAAAA	0	.	1	.

--merge-mis-as-ref
A	AAA	0	1	0	.
A	AAAAA	0	0	1	.

Finally, some evaluation when comparing with SNPs and small INDELs (LEN < 20) called from a linear reference genome (same as what I did in #1493). The new merging method slightly improves the performance.

no merge previous method new method --merge-mis-as-ref
Genotype concordance 0.6517 0.8509 0.8777 0.8776

@glennhickey
Copy link
Collaborator

Amazing @Han-Cao ! Thanks for your fast update. In my tests, I've been seeing that variants are only concatenated if they have identical alts. Would it be possible to generalize that a bit?

For example,

C CCC
C CCC

gets concatentated
but

C CCC
C CCCC

doesn't seem to. And the same thing seems to be the case for deletions.

@Han-Cao
Copy link

Han-Cao commented Dec 9, 2024

I was also thinking about this after testing the script. Merging repeats should not be difficult. But do you want to merge insertions with different sequences, like:

C CCC
C CTT

If the input VCF is sorted before left-align, would the first variant always be upstream to the second one? Can it be merged to C CCCTT? Complex variants with REF and ALT both longer than 1 base may be more complicated. Do you have any idea on how to merge them?

Anyway, I totally agree this feature is very useful. I am a bit busy this week, will find a time to work on it.

Update: I just realized that, if a variant can be left aligned, the variant I described may not exist, is it correct?

@glennhickey
Copy link
Collaborator

Right, I think left alignment ensures that ref/alt alleles are consistent substrings of each other, so that will hopefully avoid conflicts.

And I've been playing around with bcftools and it looks like the order is preserved. So if you have

C CCC
C CCCTT

Then I think concatenating in order

C CCCCCTT

should be fine.

Likewise, the deletion case should be pretty simple as the order shouldn't matter

@maggs-x
Copy link
Author

maggs-x commented Dec 10, 2024 via email

@Han-Cao
Copy link

Han-Cao commented Dec 10, 2024

Hi Maggs,

If you refer to the genotype:

REF    AGGGG
ALT    AAAGGGGTT

The allele "G GTT" cannot be left-aligned. If an indel is left aligned from pos 5 to pos 1, I think the insertion / deletion sequence must have the same motif for the repeat sequencing, like:

REF    ATTTT
ALT    AAATTTTTT

Then, the concatenated insertion A AAATT correctly describe the haplotype

@maggs-x
Copy link
Author

maggs-x commented Dec 10, 2024 via email

@Han-Cao
Copy link

Han-Cao commented Dec 10, 2024

Hi @glennhickey ,

There could be some complicated cases to handle when repeats are left-align to a multi-allelic non-repeat site.

For example, given reference sequence ATCGAAAAAA and the variants:

DEL1  1   ATCGAA   A
CPX1  1   ATCGAA   TTTT
INSx  4   G        <INSa>,<INSb>,...
DEL2  4   GAA      A
INS1  4   G        AAAA

If the A repeat is long enough in reference genome, DEL1 / CPX1 / INSx and DEL2 / INS1 are possible to exist on the same haplotype (maybe even common). It seems that concatenating such alleles is trying to partially convert the VCF back to the sequence of local haplotype, which may make the VCF sparser and less comparable among different callset, particularly when the sample size becomes larger. I think the major reason people use bcftools norm is to normalize variants and make indels more comparable.

I am considering only concatenating repeats with the same motif as only these variants can be left-aligned to the same position (please correct me if it is wrong), following the CCTT example:

  • Reference: A(CCTT)n
  • Variants:
ID POS REF ALT Motif Type Possible POS before left-align
1 1 A ACCTTCCTT CCTT Tandem repeat >=1
2 1 A ACCTT CCTT Tandem repeat >=1
3 1 A ACC CC Insertion 1
4 1 A T T SNP 1

We only concatenate variants 1 and 2 when there are 2 alleles on the same haplotype. If we have both 2 and 3, I personally prefer not to concatenate, which keep the CCCCTT insertion as 2 parts: CC insertion at POS 1 + CCTT repeat expansion at POS >= 1. This strategy seems to make variants more comparable according to this SV merging paper.

What do you think?

Update Dec 11: Now I think it is hard to decide which variants to concatenate without seeing real data. I will try to implement 3 methods first: concatenate variants with the same allele (now), repeat motif, or position.

@glennhickey
Copy link
Collaborator

Thanks. I take your point that complicated cases can create a mess, especially ones that incorporate snps. For the point about over-merging, I understand where you're coming from but I still think it's preferable to merge where possible then to have an invalid VCF (which is what many tools consider overlapping/conflicting records to be).

In the worse case, bcftools norm -m +any can force these cases into one site, but this is potentially lossy.

And I'll be happy to try out your various methods here. I'd really like to get this into the next HPRC release in one form or another, since the vcfwave-normalized VCF is one of the most widely used outputs...

@Han-Cao
Copy link

Han-Cao commented Dec 15, 2024

@glennhickey ,

I re-write the script and now it can process all variants with same POS. I have to say this is more difficult than I expected, so this post is very long... Besides, I added numpy as another dependency to simultaneously concat all variants in a matrix.

When I work on the new version, I realized that many examples I listed above cannot exist in a real VCF from MC pipeline, and my previous algorithm is totally wrong. If a VCF is valid before bcftools norm (i.e., no overlapping), the only situation we need to consider is to concat a variant A with additional indels, where the overlap is only due to left alignment. And what we should do is to right shift the indel (e.g., variant B) to the end of A's REF allele. The result is:

$$ Allele_{concat} = Allele_A + S_B[n:m] + S_B[0:n] $$

where $Allele_A$ is the REF or ALT allele of variant A to be concatenated, $S_B$ is the left-trimed indel sequence of variant B, $m = len(S_B)$, and $n = [len(REF_A) - 1] \mod m$. You can find a proof on why it is generalized here. Theoretically, this should also work for overlapping variant not with same POS (see below). But I don't have time to implement this yet. Do you think this feature would also help?

var  1  AATCAGGGGGGG  TCG
ins  5  A             AGGG

Importantly, the above algorithm requires the input VCF is:

  1. The input VCF should not have any overlapping alleles on the same haplotype before bcftools norm
  2. The input VCF should be sorted by only chr and position after bcftools norm.
    (Output of bcftools norm not always sorted due to left-align and we need to sort it. For bcftools after v1.7, it sort by chr, pos, ref, and alt, you may need sort -s -k1,1 -k2,2n to sort the VCF)

The first one guarantees the input VCF valid before left-align, so we can convert the invalid records by reverting left align. The second one allow us to sequentially concat overlapping indels (starting from the second variant) to the first one. I have tested on HPRC v1.1 chr20, and it always follow these 2 requirements before vcfwave, while after vcfwave, there are few redundant overlapping as shown below. The script will raise a warning and set the SNP genotype to 0 on specific haplotypes

chr20:642207:T_CCCAGCGGGGGT (trim long INS sequences)
chr20:642207:T_C (the above one is already a SNP + INS, so this can be ignored)

In addition to the HPRCv1.1 chr20, I also tested the script with below example. It looks good by checking manually. I will do more comprehensive test when I have time.
Input:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	hap1	hap2	hap3
chr1	5	var1	GGCT	AAA	.	.	.	GT	1	1	0
chr1	5	var2	G	GATGGGGTA	.	.	.	GT	0	0	1
chr1	5	del1	GGCTA	G	.	.	.	GT	1	0	1
chr1	5	concat1	GGCTAGCT	AAA	.	.	.	GT	0	0	0
chr1	5	ins1	G	GGCTA	.	.	.	GT	.	1	0
chr1	5	concat2	G	AAAA	.	.	.	GT	.	.	0
chr1	5	concat3	GGC	GATGGGG	.	.	.	GT	.	.	0

Output (this is reordered to make it align with input, the real output is sorted by chr, pos, ref, and alt for better visualization):

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	hap1	hap2	hap3
chr1	5	var1	GGCT	AAA	.	.	AC=0;AN=3;AF=0	GT	0	0	0
chr1	5	var2	G	GATGGGGTA	.	.	AC=0;AN=3;AF=0	GT	0	0	0
chr1	5	del1	GGCTA	G	.	.	AC=0;AN=3;AF=0	GT	0	0	0
chr1	5	concat1	GGCTAGCT	AAA	.	.	CONCAT=var1:del1;AC=1;AN=3;AF=0.333333	GT	1	0	0
chr1	5	ins1	G	GGCTA	.	.	AC=0;AN=2;AF=0	GT	.	0	0
chr1	5	concat2	G	AAAA	.	.	CONCAT=var1:ins1;AC=1;AN=2;AF=0.5	GT	.	1	0
chr1	5	concat3	GGC	GATGGGG	.	.	CONCAT=var2:del1;AC=1;AN=1;AF=1	GT	.	.	1

Newly added arguments:
--track: You can track how variants are concat (INFO/CONCAT) or which are duplicates INFO/DUP. It can track either ID or AT. --track AT will record AT in list of {var1_REF_AT}:{var2_REF_AT}_{var1_ALT_AT}:{var2_ALT_AT}, where : means concat. I this is better for VCF before vcfwave as most variants' ID are not unique
--concat: Concatenate variants when they have identical position (default) or repeat motif; none to skip

Finally, performance on chr20, it looks fast (~30s). As it now vectorize all haplotypes, processing more samples should not take a long time.
vcfbub + norm as input:

[2024-12-15 12:30:05] - [INFO]: Merge duplicated variants from: hprc-v1.1-mc.grch38.vcfbub.norm.vcf.gz
[2024-12-15 12:30:05] - [INFO]: Concatenate variants with same position
[2024-12-15 12:30:33] - [INFO]: Read 622693 variants
[2024-12-15 12:30:33] - [INFO]: 4375 variants are merged
[2024-12-15 12:30:33] - [INFO]: 8161 variants are concatenated
[2024-12-15 12:30:33] - [INFO]: Write 626479 variants to test.out.vcfbub.norm.position.vcf.gz

vcfbub + vcfwave + norm as input:

[2024-12-15 12:31:45] - [INFO]: Merge duplicated variants from: hprc-v1.1-mc.grch38.vcfbub.wave.norm.vcf.gz
[2024-12-15 12:31:45] - [INFO]: Concatenate variants with same position
[2024-12-15 12:31:45] - [WARNING]: Ignore redundant non-indel overlapping:
chr20:642207:T_CCCAGCGGGGGTG [trim long sequence]
chr20:642207:T_C
[20 more similar warnings, all with SNPs]
[2024-12-15 12:32:08] - [INFO]: Read 640289 variants
[2024-12-15 12:32:08] - [INFO]: 5363 variants are merged
[2024-12-15 12:32:08] - [INFO]: 9534 variants are concatenated
[2024-12-15 12:32:08] - [INFO]: Write 644460 variants to test.out.wave.norm.position.vcf.gz

@Han-Cao
Copy link

Han-Cao commented Dec 16, 2024

I just found an issue of the script when concat non-indel variants. For example, if a SNP concat with an indel, we may further left align the output:

HPRC example:

Input
chr20   195745  >40785190>40785193      C       T       60      .       AC=3;AF=0.0337079;AN=89;AT=>40785190>40785191>40785193,>40785190<40785192>40785193;NS=45;LV=0
chr20   195745  >40785193>40785203_4    CGTGT   C       60      .       AC=23;AF=0.258427;AN=89;AT=>40785193>40785194>40785195>40785203;NS=45;LV=0;ORIGIN=chr20:195772;LEN=4;TYPE=del

Output
chr20   195745  chr20:195745_0  CGTGT   T       60      .       AC=3;AF=0.0337079;AN=89;CONCAT=>40785190>40785193:>40785193>40785203_4

In the output, both REF and ALT end with T, so it can be further left align to chr20 195744 TCGTG T. However, in the origin VCF, this is a SNP + deletion, while after the second left align, it becomes a deletion. I think this could be misleading... What do you think?

In my test, merging variants with same repeat motif would not have this issue:

> bcftools norm -f GRCh38.fa -c e test.out.wave.norm.position.vcf.gz
Lines   total/split/joined/realigned/skipped:   644460/0/0/1867/0
> bcftools norm -f GRCh38.fa -c e test.out.wave.norm.repeat.vcf.gz
Lines   total/split/joined/realigned/skipped:   638415/0/0/0/0

@glennhickey
Copy link
Collaborator

Thanks @Han-Cao this looks brilliant. It's clearly a more complex problem than I'd thought -- I'll try out the latest version tomorrow.

@glennhickey
Copy link
Collaborator

@Han-Cao I've put a VCF here. It's made with vcfwave -> bcftools norm -f -> sort -k1,1d -k2,2n -s -> bcftools norm -m -any

The latest merge_duplicates.py crashes pretty much right away

raise ValueError(f"{var_add.chrom}:{var_add.pos}:{var_add.ref}_{var_add.alts[0]} " +  # type: ignore
ValueError: chr1:1285:TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTA_T cannot be right shifted to concatenate with AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTAAC.

Is this something you've seen!

@Han-Cao
Copy link

Han-Cao commented Jan 5, 2025

@glennhickey I have tried the VCF you share, the example you showed is easy to fix, but there might be other issues.

Error at chr1:1285

Your example is due to the VCF was not normalized after splitting into biallelic. If you only normalize the multi-allelic VCF, some variants are not fully normalized. For example, if you normalize GAAAA G,GAA, its output is still the same as input, because you cannot trim any bases from REF due to the existence of GAAAA G deletion.

In your example, the problem is that, for the following 2 variants harbored by sample HG02583,

chr1	1285	>70601934>70602061_1	TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTAAC	TGACCCTGACCCTGACCCTGACCAGACCCAGACCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTAAC	60	.	AC=1;AF=0.022727;AN=1;NS=43;LV=0;ORIGIN=chr1:1284;LEN=119;TYPE=ins	GT	1|.
chr1	1285	>70601934>70602061_2	TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTA	T	60	.	AC=1;AF=0.022727;AN=1;NS=43;LV=0;ORIGIN=chr1:1284;LEN=112;TYPE=del	GT	1|.
  • Both REF and ALT of the first variant have AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTAAC, if you right trim these bases, it is just an insertion
  • The second variant is a deletion of AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTA, which delete N copies of AACCCT repeats and an extra A.

To concat these 2 variants, we need to right shift the second variant to the end of REF of the first variant. But because the first variant is not fully normalized, its REF allele conflict with the second variant. merge_duplicates.py has a function to check whether 2 variants are compatible, and it will raise the error message you saw when the check fails.

Once the biallelic VCF is normalized by bcftools norm -f, these 2 variants will be DEL + INS, and it can be concat into:

chr1	1285	chr1:1285_0	TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTA	TGACCCTGACCCTGACCCTGACCAGACCCAGACCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTG

Additional errors

After I further normalize the VCF (with chm13v2.0), I found other errors. One example is these 2 variants on sample HG02155:

chr1	36699	>70623701>70623770	AGGCGGCCGGCGCACGCGGGTTCTCTGTGGCCAGCAGGCGGCGCTGCAGGAGAGGAGATGCCCAGGCCTGGCGGCCGGCGCACGCGGGTTCTCTGTGGCCAGCAGGCGGCGCTGCAGGAGAGGAGATGCCCAGGCCTGGCGGCCGGCGCACGCGGGTTCTCTGTGGCCAGCAGGCGGCGATGCAGGAGAGGAGATGCCCAGGCCT	A
chr1	36783	>70623614>70623692	C	CGGGTTCTCTGTGGCCAGCAGGCGGCGATGCAGGAGAGGAGATGCCCAGGCCTGGCGGCCGGCGCACGCGGGTTCTCTGTGGCCAGCAGGCGGCGCTGCAGGAGAGGAGATGCCCAGGCCTGGCGGCCGGCGCACGC

After left align, they both locate at chr1:36699, but I don't understand how these 2 variants exist on the same haplotype. I think one possible reason is that, before norm the multi-allelic VCF, >70623614>70623692 is at the upstream of >70623701>70623770 (seems possible based on their ID?), but only >70623701>70623770 was left aligned in the multi-allelic VCF. So, they don't have correct order after sorting. If we reverse the order of these 2 variants, then it is possible to concat an INS with a DEL.

If it is due to left align, the input VCF of merge_duplicates.py can only be sorted after normalizing the biallelic VCF, like vcfwave -> bcftools norm -m -any --site-win 0 -f -> sort -k1,1d -k2,2n -s. I just noticed bcftools norm can internally sort variants within a specific window, so we need to disable it by using --site-win 0.

Could you check the position of >70623701>70623770 and >70623614>70623692 before bcftools norm -f? If correctly normalize the VCF still have the error, may I have the VCF before left align? I didn't find such variant in my data.

Bug fix

I found bcftools norm with chm13v2.0 may output alleles in lower case. I just updated merge_duplicates.py to make it convert all alleles to upper case. Please try the latest version if any allele in your final VCF is in lower case.

@maggs-x
Copy link
Author

maggs-x commented Jan 6, 2025 via email

@maggs-x
Copy link
Author

maggs-x commented Jan 6, 2025 via email

@Han-Cao
Copy link

Han-Cao commented Jan 6, 2025

Maggs,

If you just want to visualize your data, normalization may not help. And I don't think the purpose of VCF normalization is to make the VCF more readable to human.

For me, I use the normalized VCF to compare VCFs generated from different sequencing platform or variant-calling method. You can refer to the vt paper to see why normalization is important. For complex sites (frequent in pangenome), normalization and "merge" duplicates are also important to compare the same variant between different haplotypes within the same VCF. For example, if you have a deletion of "A" within a 100bp A repeats, it can have 100 different representations in the VCF, but all of them are biologically the same variant. Leaving these variants unnormalized can cause misleading results for downstream analysis, such as SV merging, LD, GWAS.

And the difficulty we are currently trying to resolve is local haplotype reconstruction. This aims to reconstruct a non-overlapping VCF from a decomposed and normalized VCF. Non-overlapping sites are necessary for many downstream tools (e.g. pangenie). So, I think what we are doing is to make the VCF more "readable" to algorithm, not human.

Besides, many problems I described only becomes serious when there are a lot of haplotypes. If you just have 3 haplotypes in your dataset, I think the raw VCF could be good enough for analysis.

@maggs-x
Copy link
Author

maggs-x commented Jan 6, 2025 via email

@maggs-x
Copy link
Author

maggs-x commented Jan 6, 2025 via email

@Han-Cao
Copy link

Han-Cao commented Jan 6, 2025

non-overlapping is not resolved by choosing only top-level variants after running vcfbub?

After VCF normalization, there are overlap / duplicates around repeat region. This is a limitation of bcftools norm that merge_duplicates.py try to fix. So, this pipeline is only needed if you want to have a normalized and non-overlapping / deduplicated VCF.

Besides, what I posted here are mainly based on my own experience and requirements on VCF. This workflow may not 100% fit your analysis.

@maggs-x
Copy link
Author

maggs-x commented Jan 6, 2025 via email

@glennhickey
Copy link
Collaborator

Thanks so much @Han-Cao . I had no idea bcftools norm behaved differently on the split alleles. Anyway, I will correct this and stat again with the latest merge_duplicates.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants