Overlapping SVs within individuals #1557

maggs-x · 2024-12-05T03:06:35Z

Hi,

Thanks for your help with my previous question, Glenn. I have another question about a different pangenome I'm working on. We built it last year with the cactus-minigraph pangenome pipeline (v2.5.1). The inputs for our pangenome were four chromosome-level assemblies from four individuals, the fourth of which was used as the reference. So our vcf has three individuals (One, Two, and Three).

In the hal file and vcf we noticed structural insertions, for example, within individuals that overlap one another. Presumably, these should be represented by a single structural insertion. In more detail here is an example:

On chromosome 1 at site 671683 there is a 238bp insertion in individual One.
Then 8 base pairs away, at site 671691 there is a 270bp insertion in individual One.

I attached screenshots of the two variants in case this helps. I'm curious if you've run into this issue before. Please let me know.

maggs-x · 2024-12-05T05:15:41Z

Hi, please disregard my initial concern. It makes complete sense to have two structural insertions relatively close together when the coordinates are based on the reference. I just wanted to give you a heads up that when we normalized and left-aligned this vcf with bcftools, all 5 structural insertions were reassigned to site 671683, making it seem like single individuals had multiple alleles at the same site. This doesn't make biological sense for our dataset given we did not input multiple haplotypes per individual. In case someone else runs into this issue, I figured it's good to put on your radar.

glennhickey · 2024-12-05T17:21:13Z

Yeah, this sounds like #1493 and I suspect the problem is mostly in bcftools norm. There's a potential fix here #1536. I don't think it's ideal but likely better than nothing.

maggs-x · 2024-12-05T22:34:22Z

Thanks Glenn. And it makes complete sense that the CHR POS in the vcf after left-aligning doesn't match the hal file. Correct? [Reminder we ran cactus 2.5.1 so applied bcftools norm independently afterward].

I guess my concern with merge_duplicates.py is that the appearance of the SV insertions as duplicate alleles is an artifact of bcftools norm. In the vcf after vcfbub, there are clearly 2 SV insertions at site 10 (let's say) and then 3 SV insertions at site 15. You can plot these clearly with a TubeMap. After bcftools norm, all 5 SV insertions are assigned to site 10. As I explained in the last comment, this doesn't make biological sense for our data. The merge_duplicates.py script looks like it can render a sensible-looking vcf, but it'll condense all of these variants into a single variant per individual. I have my hesitations about this, especially with plotting. For example, I can use the .vg file to create figures that align perfectly with the vcf after vcfbub. But it won't align with a vcf after bcftools norm + merge_duplicates.py. Sounds like we should weigh the costs and benefits of bcftools norm: either analyze the vcf without any of this postprocessing, or postprocess and understand that there will be some inconsistencies between our plots and the vcf.

Thanks for your feedback. It really helped.

glennhickey · 2024-12-06T13:45:42Z

Yeah, these misgivings are why I hadn't merged #1536, but I've procrastinated following up.

Anyway, I agree. I think we need merge_duplicates.py to merge the alleles, not just the sites.

So if I have

A  AAAAA    0 1 1
A  AAAAA    0 1 0

My new site would be

A  AAAAA,AAAAAAAAA    0 2 1

Since the second sample had two equivalent insertions, they would just be doubled up. I think a similar process should work for indels in general...
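A minimal sketch of the "doubling up" rule described above, assuming haploid genotypes coded as 0/1 and records that share the same simple insertion. The function name `merge_duplicate_records` is hypothetical and is not taken from merge_duplicates.py.

```python
def merge_duplicate_records(ref, alt, genotypes_per_record):
    """Merge duplicate insertion records into one multi-allelic site.

    genotypes_per_record holds one list of haploid genotypes (0 or 1) per
    input record. In the result, allele k means k stacked copies of the
    inserted sequence.
    """
    if not alt.startswith(ref):
        raise ValueError("expected a simple insertion whose ALT starts with REF")
    inserted = alt[len(ref):]
    n_samples = len(genotypes_per_record[0])
    # Count how many of the duplicate records each sample carries.
    copies = [sum(rec[i] for rec in genotypes_per_record) for i in range(n_samples)]
    # Build one ALT allele per copy number that can occur.
    alts = [ref + inserted * k for k in range(1, max(copies) + 1)]
    return ref, alts, copies


ref, alts, gts = merge_duplicate_records("A", "AAAAA", [[0, 1, 1], [0, 1, 0]])
print(ref, ",".join(alts), *gts)
```

With the two records from the example, the second sample carries both copies and is assigned allele 2, while the third sample keeps allele 1.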

Does this make sense? @Han-Cao what do you think?

maggs-x · 2024-12-06T14:27:06Z

Yes. Thank you. A tool like this would solve the problem we’re running into. If you make it, could you please make it applicable to phased and unphased datasets? I started fiddling with writing the code myself today but it’d take a while to get right. Thanks so much for your help.

maggs-x · 2024-12-06T15:48:58Z

Here’s the other thing though. The left alignment and normalization also shift the position far in some cases. For example, after bcftools norm there is an 8000bp insertion with a start site that is 140bp away from the original position. I’m thinking it's related to the fact that bcftools norm is based on a single reference. I touched base with bcftools to see if they could help with that, but just wanted to make sure you know. Thanks.

Han-Cao · 2024-12-07T17:50:29Z

Hi @maggs-x ,

I would like to clarify that merge_duplicates.py will not merge the variants you showed because it always checks the genotypes before merging. If there are 2 alleles at the same POS on the same haplotype, they will not be merged. This is also why it now only supports phased VCF, because we cannot tell whether 2 heterozygous unphased genotypes are ok to merge. For example, the unphased genotypes for below VCFs are the same:

This will be merged:

A  AAAAA    1|0 1|0
A  AAAAA    0|1 0|0

This will not be merged:

A  AAAAA    1|0 1|0
A  AAAAA    1|0 0|0
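The genotype check described above can be sketched as follows. This is an illustration under the assumption that genotypes are phased strings such as `1|0`. The helper names `can_merge` and `unphased` are hypothetical and do not come from merge_duplicates.py.

```python
def can_merge(record_a, record_b):
    """True if no haplotype carries the ALT allele in both records."""
    for gt_a, gt_b in zip(record_a, record_b):
        for a, b in zip(gt_a.split("|"), gt_b.split("|")):
            if a == "1" and b == "1":
                return False
    return True


def unphased(gt):
    """Drop phase information: '1|0' and '0|1' both become '0/1'."""
    return "/".join(sorted(gt.split("|")))


first = ["1|0", "1|0"]
print(can_merge(first, ["0|1", "0|0"]))  # alleles sit on different haplotypes
print(can_merge(first, ["1|0", "0|0"]))  # both alleles on haplotype 1 of sample 1
print(unphased("0|1") == unphased("1|0"))
```

Once phase is dropped, the two second records look identical, which is why an unphased VCF gives no basis for this check.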

@glennhickey ,

Yes, the behavior you describe is more reasonable. I think the new version of the script could do:
Input:

A  AAAAA    0 1 1
A  AAAAA    1 0 0
A  AAAAA    0 1 0

Output:

A  AAAAA,AAAAAAAAA    1 2 1

Will let you know when it is ready.

maggs-x · 2024-12-07T22:46:18Z

Thank you!

Han-Cao · 2024-12-08T15:37:28Z

Hi,

I have updated merge_duplicates.py (https://github.com/Han-Cao/collapse-bubble/blob/master/scripts/merge_duplicates.py). Please give it a try.

For a small test VCF:

#CHROM	POS	ID	REF	ALT	Haplotype	Sample1	Sample2	Sample3
chr1	5	var1	A	AAA	0	0|0	1|0	0|1
chr1	5	var2	A	AAA	1	1|0	0|0	0|0
chr1	5	var3	A	AAA	.	0|0	1|0	0|0
chr1	5	var4	AAA	A	1	0|0	1|0	0|0
chr1	5	var5	AAA	A	1	0|0	1|1	0|1
chr1	6	var7	A	TTT	0	0|0	1|0	.|1
chr1	6	var8	A	TTT	0	0|0	1|0	1|0
chr1	6	var9	AA	TTTTTT	0	0|0	1|0	1|0

Output:

#CHROM	POS	ID	REF	ALT	Haplotype	Sample1	Sample2	Sample3
chr1	5	var1	A	AAA	1	1|0	0|0	0|1
chr1	5	var3	A	AAAAA	.	0|0	1|0	0|0
chr1	5	var4	AAA	A	0	0|0	0|1	0|1
chr1	5	var5	AAAAA	A	1	0|0	1|0	0|0
chr1	6	var7	A	TTT	0	0|0	0|0	1|1
chr1	6	var9	AA	TTTTTT	0	0|0	0|0	1|0
chr1	6	var8	AAAA	TTTTTTTTTTTT	0	0|0	1|0	.|0

var1-3: haplotype, sample 1 and 3 have 1 insertion on one haplotype, they merged into 1 record. Sample 2 has 2 alleles on hap1, its genotype and alleles are merged as shown in var3
var4-5: similar as above, but for a deletion
var7-9: a more complicated site, where sample 2 has a total of 4 copies of A -> TTT. The script will merge var7, 8, 9 recursively. I don't know if this exists in real dataset, just to show how the script works.

I am still not sure how to merge missing genotype with non-missing genotype. For example, if there are 2 duplicated A -> AAA:

when merging . and 0, we don't know whether there is 1 insertion (.), but it is impossible to have 2 insertions (0)
when merging . and 1, there is at least 1 insertion (1), but not sure if there are 2 insertions (.).

For samples without missing genotypes, having one insertion implies not having two insertions, and vice versa. So, it could be confusing to have any genotypes as missing. I added an option --merge-mis-as-ref to treat missing genotypes as reference when merging with non-missing genotypes. What do you think on merging missing genotypes?

Input:
A	AAA	0	1	1	.
A	AAA	.	.	1	.

Default output:
A	AAA	.	1	0	.
A	AAAAA	0	.	1	.

--merge-mis-as-ref
A	AAA	0	1	0	.
A	AAAAA	0	0	1	.
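The two behaviours shown above can be written as a small truth table. This is only a sketch that reproduces the example rows for one haplotype and two duplicated records. The function `merge_pair` and its `merge_mis_as_ref` parameter are hypothetical stand-ins, not the script's actual code.

```python
MISSING = "."


def merge_pair(gt_first, gt_second, merge_mis_as_ref=False):
    """Return (single_copy_gt, double_copy_gt) for one haplotype."""
    if gt_first == MISSING and gt_second == MISSING:
        return MISSING, MISSING  # nothing is known for either record
    if merge_mis_as_ref:
        # Treat a missing call as reference when the other call is known.
        gt_first = "0" if gt_first == MISSING else gt_first
        gt_second = "0" if gt_second == MISSING else gt_second
    if MISSING in (gt_first, gt_second):
        known = gt_second if gt_first == MISSING else gt_first
        if known == "0":
            return MISSING, "0"  # maybe one copy, certainly not two
        return "1", MISSING  # at least one copy, maybe two
    copies = int(gt_first) + int(gt_second)
    return ("1" if copies == 1 else "0"), ("1" if copies == 2 else "0")


first = ["0", "1", "1", "."]
second = [".", ".", "1", "."]
for flag in (False, True):
    merged = [merge_pair(a, b, flag) for a, b in zip(first, second)]
    print("A\tAAA\t" + "\t".join(m[0] for m in merged))
    print("A\tAAAAA\t" + "\t".join(m[1] for m in merged))
```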

Finally, some evaluation when comparing with SNPs and small INDELs (LEN < 20) called from a linear reference genome (same as what I did in #1493). The new merging method slightly improves the performance.

	no merge	previous method	new method	--merge-mis-as-ref
Genotype concordance	0.6517	0.8509	0.8777	0.8776

glennhickey · 2024-12-09T17:35:37Z

Amazing @Han-Cao ! Thanks for your fast update. In my tests, I've been seeing that variants are only concatenated if they have identical alts. Would it be possible to generalize that a bit?

For example,

C CCC
C CCC

gets concatenated
but

C CCC
C CCCC

doesn't seem to. And the same thing seems to be the case for deletions.

Han-Cao · 2024-12-09T17:55:16Z

I was also thinking about this after testing the script. Merging repeats should not be difficult. But do you want to merge insertions with different sequences, like:

C CCC
C CTT

If the input VCF is sorted before left-align, would the first variant always be upstream to the second one? Can it be merged to C CCCTT? Complex variants with REF and ALT both longer than 1 base may be more complicated. Do you have any idea on how to merge them?

Anyway, I totally agree this feature is very useful. I am a bit busy this week, will find a time to work on it.

Update: I just realized that, if a variant can be left aligned, the variant I described may not exist, is it correct?

glennhickey · 2024-12-09T18:16:36Z

Right, I think left alignment ensures that ref/alt alleles are consistent substrings of each other, so that will hopefully avoid conflicts.

And I've been playing around with bcftools and it looks like the order is preserved. So if you have

C CCC
C CCCTT

Then I think concatenating in order

C CCCCCTT

should be fine.

Likewise, the deletion case should be pretty simple as the order shouldn't matter

maggs-x · 2024-12-10T04:24:52Z

Hi Han, thank you again. I'm sorry I don't have time to test out the code. My team decided to forgo using bcftools norm because it introduces more errors than we're comfortable with. Your code is definitely aiming at resolving one of the problems. Just keep in mind that if in the pangenome there are two variants within an individual, and those variants are in close proximity to each other (say at site 1, AAA and then at site 5 TTT), and after bcftools norm these get assigned to two overlapping variants (i.e. site 1 AAA and site 1 TTT), then your approach will provide the genotype AAATTT. This is better than nothing, but there are an additional 4 base pairs in between that are shared with the reference. Let's call those GGGG. So, the real genotype should be AAAGGGGTTT. And I imagine you both are more aware, but just to be transparent I'm not sure if bcftools causes this error with small indels. We've only been looking at the structural variants so far.

Han-Cao · 2024-12-10T05:13:58Z

Hi Maggs,

If you refer to the genotype:

REF    AGGGG
ALT    AAAGGGGTT

The allele "G GTT" cannot be left-aligned. If an indel is left-aligned from pos 5 to pos 1, I think the insertion / deletion sequence must have the same motif as the repeat sequence, like:

REF    ATTTT
ALT    AAATTTTTT

Then, the concatenated insertion A AAATT correctly describes the haplotype.

maggs-x · 2024-12-10T06:40:42Z

Thanks Han. I understand. I looked back on the example I was concerned about. I was worried the extra base pairs between two large insertions weren't included in the genotypes after left alignment. Fortunately, they're there. So never mind about my earlier comment. I do still find it difficult to interpret that different nodes in the pangenome graph end up getting assigned to the same start position after left alignment, but I imagine you've considered adding the node IDs of the combined dups to the ID column. Thanks again for your responsiveness. Much appreciated! Update: let me know if you want a vcf of the case I described to help troubleshoot. In this situation, stitching the two alleles directly together would be accurate. There are shared motifs between the two alleles, the two are not identical, and no changes would be needed to the REF.

Han-Cao · 2024-12-10T06:51:57Z

Hi @glennhickey ,

There could be some complicated cases to handle when repeats are left-aligned to a multi-allelic non-repeat site.

For example, given reference sequence ATCGAAAAAA and the variants:

DEL1  1   ATCGAA   A
CPX1  1   ATCGAA   TTTT
INSx  4   G        <INSa>,<INSb>,...
DEL2  4   GAA      A
INS1  4   G        AAAA

If the A repeat is long enough in reference genome, DEL1 / CPX1 / INSx and DEL2 / INS1 are possible to exist on the same haplotype (maybe even common). It seems that concatenating such alleles is trying to partially convert the VCF back to the sequence of local haplotype, which may make the VCF sparser and less comparable among different callset, particularly when the sample size becomes larger. I think the major reason people use bcftools norm is to normalize variants and make indels more comparable.

I am considering only concatenating repeats with the same motif as only these variants can be left-aligned to the same position (please correct me if it is wrong), following the CCTT example:

Reference: A(CCTT)n
Variants:

ID	POS	REF	ALT	Motif	Type	Possible POS before left-align
1	1	A	ACCTTCCTT	CCTT	Tandem repeat	>=1
2	1	A	ACCTT	CCTT	Tandem repeat	>=1
3	1	A	ACC	CC	Insertion	1
4	1	A	T	T	SNP	1

We only concatenate variants 1 and 2 when there are 2 alleles on the same haplotype. If we have both 2 and 3, I personally prefer not to concatenate, which keeps the CCCCTT insertion as 2 parts: a CC insertion at POS 1 + a CCTT repeat expansion at POS >= 1. This strategy seems to make variants more comparable according to this SV merging paper.

What do you think?

Update Dec 11: Now I think it is hard to decide which variants to concatenate without seeing real data. I will try to implement 3 methods first: concatenate variants with the same allele (now), repeat motif, or position.

glennhickey · 2024-12-11T15:17:22Z

Thanks. I take your point that complicated cases can create a mess, especially ones that incorporate snps. For the point about over-merging, I understand where you're coming from but I still think it's preferable to merge where possible than to have an invalid VCF (which is what many tools consider overlapping/conflicting records to be).

In the worst case, bcftools norm -m +any can force these cases into one site, but this is potentially lossy.

And I'll be happy to try out your various methods here. I'd really like to get this into the next HPRC release in one form or another, since the vcfwave-normalized VCF is one of the most widely used outputs...

Han-Cao · 2024-12-15T04:49:59Z

@glennhickey ,

I rewrote the script and now it can process all variants with the same POS. I have to say this is more difficult than I expected, so this post is very long... Besides, I added numpy as another dependency to simultaneously concat all variants in a matrix.

While working on the new version, I realized that many examples I listed above cannot exist in a real VCF from the MC pipeline, and my previous algorithm is totally wrong. If a VCF is valid before bcftools norm (i.e., no overlapping), the only situation we need to consider is concatenating a variant A with additional indels, where the overlap is only due to left alignment. What we should do is right-shift the indel (e.g., variant B) to the end of A's REF allele. The result is:

$$ Allele_{concat} = Allele_A + S_B[n:m] + S_B[0:n] $$

where $Allele_A$ is the REF or ALT allele of variant A to be concatenated, $S_B$ is the left-trimmed indel sequence of variant B, $m = len(S_B)$, and $n = [len(REF_A) - 1] \mod m$. You can find a proof of why it generalizes here. Theoretically, this should also work for overlapping variants that do not share the same POS (see below), but I don't have time to implement this yet. Do you think this feature would also help?

var  1  AATCAGGGGGGG  TCG
ins  5  A             AGGG
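The concatenation formula above can be sketched in a few lines. This is not the code in merge_duplicates.py: the function names are hypothetical, and two points are assumptions read off the chr1:5 test records in this comment, namely that the rotated sequence is appended to REF for a deletion and to ALT for an insertion, and that shared trailing bases are trimmed afterwards. The sketch also skips the compatibility check the script performs before shifting.

```python
def split_indel(ref, alt):
    """Return (kind, sequence) for a left-aligned indel with a one-base anchor."""
    if len(ref) == 1 and len(alt) > 1 and alt[0] == ref:
        return "ins", alt[1:]
    if len(alt) == 1 and len(ref) > 1 and ref[0] == alt:
        return "del", ref[1:]
    raise ValueError(f"{ref}>{alt} is not a simple anchored indel")


def right_shift(seq, ref_a_len):
    """Rotate S_B into S_B[n:m] + S_B[0:n], with n = (len(REF_A) - 1) mod m."""
    m = len(seq)
    n = (ref_a_len - 1) % m
    return seq[n:] + seq[:n]


def concat_variants(ref_a, alt_a, ref_b, alt_b):
    """Concatenate indel B onto variant A at the same POS."""
    kind, seq = split_indel(ref_b, alt_b)
    shifted = right_shift(seq, len(ref_a))
    if kind == "del":
        ref, alt = ref_a + shifted, alt_a
    else:
        ref, alt = ref_a, alt_a + shifted
    # Assumption: trim shared trailing bases while both alleles keep a base.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    return ref, alt


print(concat_variants("GGCT", "AAA", "GGCTA", "G"))      # var1 + del1
print(concat_variants("GGCT", "AAA", "G", "GGCTA"))      # var1 + ins1
print(concat_variants("G", "GATGGGGTA", "GGCTA", "G"))   # var2 + del1
```

Under these assumptions the three calls give the REF/ALT pairs listed for concat1, concat2 and concat3 in the test output of this comment.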

Importantly, the above algorithm has two requirements for the input VCF:

1. The input VCF should not have any overlapping alleles on the same haplotype before bcftools norm.
2. The input VCF should be sorted by only chr and position after bcftools norm.
(The output of bcftools norm is not always sorted due to left alignment, so we need to sort it. bcftools after v1.7 sorts by chr, pos, ref, and alt, so you may need sort -s -k1,1 -k2,2n to sort the VCF.)

The first one guarantees that the input VCF is valid before left alignment, so we can fix the invalid records by reverting the left alignment. The second one allows us to sequentially concat overlapping indels (starting from the second variant) to the first one. I have tested on HPRC v1.1 chr20, and it always meets these 2 requirements before vcfwave, while after vcfwave there are a few redundant overlaps as shown below. For these, the script will raise a warning and set the SNP genotype to 0 on specific haplotypes.

chr20:642207:T_CCCAGCGGGGGT (trim long INS sequences)
chr20:642207:T_C (the above one is already a SNP + INS, so this can be ignored)
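For the sorting requirement above, the effect of `sort -s -k1,1 -k2,2n` can be mimicked in Python, whose built-in sort is stable. This is a sketch on in-memory, tab-separated VCF lines. The function name is hypothetical, and chromosome names are compared as plain strings, which is an assumption about the desired order.

```python
def stable_sort_vcf_lines(lines):
    """Sort records by (CHROM, POS) only, keeping input order within ties."""
    header = [ln for ln in lines if ln.startswith("#")]
    body = [ln for ln in lines if not ln.startswith("#")]

    def key(line):
        chrom, pos = line.split("\t", 2)[:2]
        return chrom, int(pos)

    # sorted() is stable, so records with equal CHROM and POS are not reordered.
    return header + sorted(body, key=key)


lines = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t5\tins1\tG\tGGCTA",
    "chr1\t5\tvar1\tGGCT\tAAA",
    "chr1\t2\tsnp1\tA\tT",
]
for ln in stable_sort_vcf_lines(lines):
    print(ln)
```

Records at the same position keep their upstream-to-downstream input order, which is what sequential concatenation relies on.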

In addition to the HPRCv1.1 chr20, I also tested the script with below example. It looks good by checking manually. I will do more comprehensive test when I have time.
Input:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	hap1	hap2	hap3
chr1	5	var1	GGCT	AAA	.	.	.	GT	1	1	0
chr1	5	var2	G	GATGGGGTA	.	.	.	GT	0	0	1
chr1	5	del1	GGCTA	G	.	.	.	GT	1	0	1
chr1	5	concat1	GGCTAGCT	AAA	.	.	.	GT	0	0	0
chr1	5	ins1	G	GGCTA	.	.	.	GT	.	1	0
chr1	5	concat2	G	AAAA	.	.	.	GT	.	.	0
chr1	5	concat3	GGC	GATGGGG	.	.	.	GT	.	.	0

Output (this is reordered to make it align with input, the real output is sorted by chr, pos, ref, and alt for better visualization):

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	hap1	hap2	hap3
chr1	5	var1	GGCT	AAA	.	.	AC=0;AN=3;AF=0	GT	0	0	0
chr1	5	var2	G	GATGGGGTA	.	.	AC=0;AN=3;AF=0	GT	0	0	0
chr1	5	del1	GGCTA	G	.	.	AC=0;AN=3;AF=0	GT	0	0	0
chr1	5	concat1	GGCTAGCT	AAA	.	.	CONCAT=var1:del1;AC=1;AN=3;AF=0.333333	GT	1	0	0
chr1	5	ins1	G	GGCTA	.	.	AC=0;AN=2;AF=0	GT	.	0	0
chr1	5	concat2	G	AAAA	.	.	CONCAT=var1:ins1;AC=1;AN=2;AF=0.5	GT	.	1	0
chr1	5	concat3	GGC	GATGGGG	.	.	CONCAT=var2:del1;AC=1;AN=1;AF=1	GT	.	.	1

Newly added arguments:
--track: You can track how variants are concatenated (INFO/CONCAT) or which are duplicates (INFO/DUP). It can track either ID or AT. --track AT will record AT in a list of {var1_REF_AT}:{var2_REF_AT}_{var1_ALT_AT}:{var2_ALT_AT}, where : means concat. I think this is better for a VCF before vcfwave, as most variants' IDs are not unique.
--concat: Concatenate variants when they have identical position (default) or repeat motif; none to skip

Finally, performance on chr20, it looks fast (~30s). As it now vectorize all haplotypes, processing more samples should not take a long time.
vcfbub + norm as input:

[2024-12-15 12:30:05] - [INFO]: Merge duplicated variants from: hprc-v1.1-mc.grch38.vcfbub.norm.vcf.gz
[2024-12-15 12:30:05] - [INFO]: Concatenate variants with same position
[2024-12-15 12:30:33] - [INFO]: Read 622693 variants
[2024-12-15 12:30:33] - [INFO]: 4375 variants are merged
[2024-12-15 12:30:33] - [INFO]: 8161 variants are concatenated
[2024-12-15 12:30:33] - [INFO]: Write 626479 variants to test.out.vcfbub.norm.position.vcf.gz

vcfbub + vcfwave + norm as input:

[2024-12-15 12:31:45] - [INFO]: Merge duplicated variants from: hprc-v1.1-mc.grch38.vcfbub.wave.norm.vcf.gz
[2024-12-15 12:31:45] - [INFO]: Concatenate variants with same position
[2024-12-15 12:31:45] - [WARNING]: Ignore redundant non-indel overlapping:
chr20:642207:T_CCCAGCGGGGGTG [trim long sequence]
chr20:642207:T_C
[20 more similar warnings, all with SNPs]
[2024-12-15 12:32:08] - [INFO]: Read 640289 variants
[2024-12-15 12:32:08] - [INFO]: 5363 variants are merged
[2024-12-15 12:32:08] - [INFO]: 9534 variants are concatenated
[2024-12-15 12:32:08] - [INFO]: Write 644460 variants to test.out.wave.norm.position.vcf.gz

Han-Cao · 2024-12-16T07:00:00Z

I just found an issue with the script when concatenating non-indel variants. For example, if a SNP is concatenated with an indel, the output can be left-aligned further:

HPRC example:

Input
chr20   195745  >40785190>40785193      C       T       60      .       AC=3;AF=0.0337079;AN=89;AT=>40785190>40785191>40785193,>40785190<40785192>40785193;NS=45;LV=0
chr20   195745  >40785193>40785203_4    CGTGT   C       60      .       AC=23;AF=0.258427;AN=89;AT=>40785193>40785194>40785195>40785203;NS=45;LV=0;ORIGIN=chr20:195772;LEN=4;TYPE=del

Output
chr20   195745  chr20:195745_0  CGTGT   T       60      .       AC=3;AF=0.0337079;AN=89;CONCAT=>40785190>40785193:>40785193>40785203_4

In the output, both REF and ALT end with T, so it can be further left-aligned to chr20 195744 TCGTG T. However, in the original VCF, this is a SNP + deletion, while after the second left alignment it becomes a deletion. I think this could be misleading... What do you think?
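To make the extra left-alignment step concrete, here is a sketch of a single normalization step applied to the concatenated record. The function `left_align_step` is a hypothetical helper, and the preceding reference base (T at chr20:195744, as implied by the result quoted above) is passed in by hand rather than read from a FASTA file.

```python
def left_align_step(pos, ref, alt, previous_ref_base):
    """Drop one shared trailing base; pad on the left if an allele empties."""
    if ref[-1] != alt[-1]:
        return pos, ref, alt  # nothing to trim
    ref, alt = ref[:-1], alt[:-1]
    if not ref or not alt:
        # An empty allele is not allowed, so extend both alleles one base left.
        ref, alt = previous_ref_base + ref, previous_ref_base + alt
        pos -= 1
    return pos, ref, alt


print(left_align_step(195745, "CGTGT", "T", "T"))
```

The step turns the SNP + deletion record CGTGT/T into a pure deletion one base upstream, which is the change of interpretation raised above.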

In my test, merging variants with same repeat motif would not have this issue:

> bcftools norm -f GRCh38.fa -c e test.out.wave.norm.position.vcf.gz
Lines   total/split/joined/realigned/skipped:   644460/0/0/1867/0
> bcftools norm -f GRCh38.fa -c e test.out.wave.norm.repeat.vcf.gz
Lines   total/split/joined/realigned/skipped:   638415/0/0/0/0

glennhickey · 2024-12-17T01:35:05Z

Thanks @Han-Cao this looks brilliant. It's clearly a more complex problem than I'd thought -- I'll try out the latest version tomorrow.

glennhickey · 2025-01-03T20:36:41Z

@Han-Cao I've put a VCF here. It's made with vcfwave -> bcftools norm -f -> sort -k1,1d -k2,2n -s -> bcftools norm -m -any

The latest merge_duplicates.py crashes pretty much right away

raise ValueError(f"{var_add.chrom}:{var_add.pos}:{var_add.ref}_{var_add.alts[0]} " +  # type: ignore
ValueError: chr1:1285:TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTA_T cannot be right shifted to concatenate with AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTAAC.

Is this something you've seen?

Han-Cao · 2025-01-05T17:49:52Z

@glennhickey I have tried the VCF you shared. The example you showed is easy to fix, but there might be other issues.

Error at chr1:1285

Your example fails because the VCF was not normalized after splitting into biallelic records. If you only normalize the multi-allelic VCF, some variants are not fully normalized. For example, if you normalize GAAAA G,GAA, the output is the same as the input, because you cannot trim any bases from REF due to the GAAAA G deletion.
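The GAAAA G,GAA case can be illustrated with a sketch of the right-trimming part of normalization only. It ignores left extension against the reference, so it is not a substitute for bcftools norm, and `right_trim` is a hypothetical helper.

```python
def right_trim(alleles):
    """Trim shared trailing bases while every allele keeps at least one base."""
    alleles = list(alleles)
    while all(len(a) > 1 for a in alleles) and len({a[-1] for a in alleles}) == 1:
        alleles = [a[:-1] for a in alleles]
    return alleles


print(right_trim(["GAAAA", "G", "GAA"]))  # multi-allelic site: the G allele blocks trimming
print(right_trim(["GAAAA", "GAA"]))       # same ALT after splitting: becomes GAA -> G
```

In the multi-allelic record the one-base G allele prevents any trimming, while the split GAAAA/GAA record can still be reduced. This is why normalizing again after splitting changes the result.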

In your example, the problem is that, for the following 2 variants harbored by sample HG02583,

chr1	1285	>70601934>70602061_1	TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTAAC	TGACCCTGACCCTGACCCTGACCAGACCCAGACCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTAAC	60	.	AC=1;AF=0.022727;AN=1;NS=43;LV=0;ORIGIN=chr1:1284;LEN=119;TYPE=ins	GT	1|.
chr1	1285	>70601934>70602061_2	TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTA	T	60	.	AC=1;AF=0.022727;AN=1;NS=43;LV=0;ORIGIN=chr1:1284;LEN=112;TYPE=del	GT	1|.

Both REF and ALT of the first variant have AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTAAC, if you right trim these bases, it is just an insertion
The second variant is a deletion of AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTA, which delete N copies of AACCCT repeats and an extra A.

To concat these 2 variants, we need to right-shift the second variant to the end of the REF of the first variant. But because the first variant is not fully normalized, its REF allele conflicts with the second variant. merge_duplicates.py has a function to check whether 2 variants are compatible, and it raises the error message you saw when the check fails.

Once the biallelic VCF is normalized by bcftools norm -f, these 2 variants will be DEL + INS, and it can be concat into:

chr1	1285	chr1:1285_0	TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTA	TGACCCTGACCCTGACCCTGACCAGACCCAGACCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTG

Additional errors

After I further normalized the VCF (with chm13v2.0), I found other errors. One example is these 2 variants in sample HG02155:

chr1	36699	>70623701>70623770	AGGCGGCCGGCGCACGCGGGTTCTCTGTGGCCAGCAGGCGGCGCTGCAGGAGAGGAGATGCCCAGGCCTGGCGGCCGGCGCACGCGGGTTCTCTGTGGCCAGCAGGCGGCGCTGCAGGAGAGGAGATGCCCAGGCCTGGCGGCCGGCGCACGCGGGTTCTCTGTGGCCAGCAGGCGGCGATGCAGGAGAGGAGATGCCCAGGCCT	A
chr1	36783	>70623614>70623692	C	CGGGTTCTCTGTGGCCAGCAGGCGGCGATGCAGGAGAGGAGATGCCCAGGCCTGGCGGCCGGCGCACGCGGGTTCTCTGTGGCCAGCAGGCGGCGCTGCAGGAGAGGAGATGCCCAGGCCTGGCGGCCGGCGCACGC

After left-alignment, both are located at chr1:36699, but I don't understand how these 2 variants can exist on the same haplotype. One possible reason is that, before normalizing the multi-allelic VCF, >70623614>70623692 was upstream of >70623701>70623770 (which seems possible based on their IDs?), but only >70623701>70623770 was left-aligned in the multi-allelic VCF. So they are not in the correct order after sorting. If we reverse the order of these 2 variants, it is possible to concatenate an INS with a DEL.

If this is caused by left-alignment, the input VCF of merge_duplicates.py can only be sorted after normalizing the biallelic VCF, like vcfwave -> bcftools norm -m -any --site-win 0 -f -> sort -k1,1d -k2,2n -s. I just noticed that bcftools norm can internally sort variants within a specific window, so we need to disable that with --site-win 0.
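The point of the stable sort (-s) is that records landing on the same CHROM and POS keep their incoming order. A small sketch of that behaviour in Python, assuming tab-separated VCF lines already read into a list; `stable_sort_vcf` is a hypothetical helper, and plain string comparison of CHROM is only an approximation of sort's dictionary order.

```python
def stable_sort_vcf(lines):
    """Sort VCF records by (CHROM, POS), keeping header lines first.

    Python's sorted() is stable, so records with identical CHROM and POS
    stay in their original relative order.
    """
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if line and not line.startswith("#")]

    def key(line):
        chrom, pos = line.split("\t", 2)[:2]
        return chrom, int(pos)

    return header + sorted(records, key=key)


vcf_lines = [
    "##fileformat=VCFv4.2",
    "chr1\t36783\tlate\tC\tCG",
    "chr1\t36699\tfirst_in\tAG\tA",
    "chr1\t36699\tsecond_in\tC\tCGG",
]
sorted_lines = stable_sort_vcf(vcf_lines)
sorted_ids = [line.split("\t")[2] for line in sorted_lines[1:]]
```

The two records at 36699 stay in the order they arrived, which is the property the merge step relies on.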

Could you check the positions of >70623701>70623770 and >70623614>70623692 before bcftools norm -f? If the error persists with a correctly normalized VCF, may I have the VCF from before left-alignment? I didn't find such variants in my data.

Bug fix

I found that bcftools norm with chm13v2.0 may output alleles in lower case. I just updated merge_duplicates.py to convert all alleles to upper case. Please try the latest version if any allele in your final VCF is in lower case.
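For anyone who wants to check or fix case outside the script, a minimal sketch of upper-casing the REF and ALT columns of a record line. `uppercase_alleles` is a hypothetical helper, not the code in merge_duplicates.py, and it assumes plain sequence alleles (symbolic alleles such as <DEL> are not treated specially).

```python
def uppercase_alleles(line):
    """Upper-case REF (column 4) and ALT (column 5) of a VCF record line."""
    if line.startswith("#"):
        return line  # header lines are passed through untouched
    fields = line.split("\t")
    fields[3] = fields[3].upper()
    fields[4] = fields[4].upper()
    return "\t".join(fields)


fixed = uppercase_alleles("chr1\t10\t.\tacgt\tA,ac\t60")
header = uppercase_alleles("#CHROM\tPOS\tID\tREF\tALT")
```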

maggs-x · 2025-01-06T10:01:56Z

Thanks for looking at these specifics Glenn and Han. I’m curious if you’re both committed to normalizing the VCF? Everything looks great in my results without it. The question I ask myself is "is it reasonable to standardize every SV? Isn't that clouding the best estimate we have for the location of the SVs". (Yes). What do you think? The normalization of SVs causes more problems than it solves, perhaps. Minigraph cactus performs so well without it. If folks want things to be more human readable in a VCF, I understand. But it's as easy as plotting a Tube Map to look at complex regions. Pangenomes are complex and there are limits to how human readable they can be. Of course, it's wonderful if you solve all the problems. I appreciate the thought you're giving this. Thank you. Maggs X they/them

maggs-x · 2025-01-06T11:46:22Z

Do correct me if I’m wrong (of course). I forget the specifics I learned but wasn’t merge_duplicates at least partially developed as a fix for bcftools normalization? So, without it there’d be no need to merge_duplicates. If I’m correct, that’d make for an easier pipeline. Maggs X they/them

Han-Cao · 2025-01-06T13:02:41Z

Maggs,

If you just want to visualize your data, normalization may not help. And I don't think the purpose of VCF normalization is to make the VCF more human-readable.

For me, I use the normalized VCF to compare VCFs generated from different sequencing platforms or variant-calling methods. You can refer to the vt paper (https://doi.org/10.1093/bioinformatics/btv112) to see why normalization is important. For complex sites (frequent in pangenomes), normalization and "merging" duplicates are also important for comparing the same variant between different haplotypes within the same VCF. For example, if you have a deletion of one "A" within a 100 bp run of A's, it can have 100 different representations in the VCF, but all of them are biologically the same variant. Leaving these variants unnormalized can cause misleading results in downstream analyses such as SV merging, LD, and GWAS.
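The repeat example can be checked on a toy sequence. The sketch below uses a 10-base poly-A run instead of 100, with 0-based positions; `apply_deletion` and `left_align_deletion` are hypothetical helpers written for this illustration, not functions from bcftools or vt.

```python
def apply_deletion(seq, pos, ref, alt):
    """Apply a variant (0-based pos) to seq and return the new haplotype."""
    assert seq[pos:pos + len(ref)] == ref, "REF does not match the sequence"
    return seq[:pos] + alt + seq[pos + len(ref):]


def left_align_deletion(seq, pos, ref, alt):
    """Shift an anchored deletion (ALT is the single anchor base) to the left.

    Moving one base left is equivalent when the last deleted base equals
    the current anchor base, so repeat until that no longer holds.
    """
    while pos > 0 and seq[pos + len(ref) - 1] == seq[pos]:
        pos -= 1
        ref = seq[pos:pos + len(ref)]
        alt = ref[0]
    return pos, ref, alt


toy = "C" + "A" * 10 + "G"
# Every way of writing "delete one A": anchor at i, deleted base at i + 1.
records = [(i, toy[i:i + 2], toy[i]) for i in range(10)]
haplotypes = {apply_deletion(toy, *rec) for rec in records}
normalized = {left_align_deletion(toy, *rec) for rec in records}
```

All 10 records give the same haplotype, and left-alignment collapses them to the single record at the start of the run, which is what makes the variant comparable across haplotypes and callers.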

The difficulty we are currently trying to resolve is local haplotype reconstruction, which aims to reconstruct a non-overlapping VCF from a decomposed and normalized VCF. Non-overlapping sites are necessary for many downstream tools (e.g. pangenie). So I think what we are doing is making the VCF more "readable" to algorithms, not humans.

Besides, many of the problems I described only become serious when there are a lot of haplotypes. If you just have 3 haplotypes in your dataset, I think the raw VCF could be good enough for analysis.

maggs-x · 2025-01-06T14:25:17Z

Hi Han,

Thanks. It's good to hear these problems aren't a big deal in a small pangenome. I appreciate that normalization is important when comparing across sequencing platforms and variant-calling methods, and that "readable" applies to both human readable and algorithmically readable. Thank you for working on it.

Do you know much about how haplotype compressed genomes are impacted by the issues bcftools norm aims to solve?

And, a deletion like this: "For example, if you have a deletion of "A" within a 100bp A repeats, it can have 100 different representations in the VCF"....so based on the mapping the position of the deletion is truly unknown? That means it's unknown in all the pangenome formats: the vg, GFA and HAL files.

"And the difficulty we are currently trying to resolve is local haplotype reconstruction." Sorry if I interfered with your conversation. I care and am interested. Thank you.

I have one more question. Here, "This aims to reconstruct a non-overlapping VCF from a decomposed and normalized VCF."--- non-overlapping is not resolved by choosing only top-level variants after running vcfbub?

Again, I appreciate your thought and feedback. Thank you!

Maggs X they/them

maggs-x · 2025-01-06T14:59:50Z

Sorry. I think you meant something like 400bp A repeats, with a 100bp deletion. Then where does it land. Yes. Understandable to normalize for that. Maggs X they/them

Han-Cao · 2025-01-06T15:17:31Z

"non-overlapping is not resolved by choosing only top-level variants after running vcfbub?"

After VCF normalization, there are overlaps / duplicates around repeat regions. This is a limitation of bcftools norm that merge_duplicates.py tries to fix. So this pipeline is only needed if you want a normalized and non-overlapping / deduplicated VCF.

Besides, what I posted here is mainly based on my own experience and requirements for VCFs. This workflow may not fit your analysis 100%.

maggs-x · 2025-01-06T15:35:01Z

I understand Han. Thanks, Maggs Maggs X they/them

glennhickey · 2025-01-07T18:36:51Z

Thanks so much @Han-Cao. I had no idea bcftools norm behaved differently on the split alleles. Anyway, I will correct this and start again with the latest merge_duplicates.py.

maggs-x mentioned this issue Dec 10, 2024

strange handling of SVs by bcftools norm --fasta-ref samtools/bcftools#2330

Open
