Overlapping SVs within individuals #1557
Comments
Hi, please disregard my initial concern. It makes complete sense to have two structural insertions relatively close together when the coordinates are based on the reference. I just wanted to give you a heads-up that when we normalized and left-aligned this VCF with bcftools, all 5 structural insertions were reassigned to site 671683, making it seem like single individuals had multiple alleles at the same site. This doesn't make biological sense for our dataset, given that we did not input multiple haplotypes per individual. In case someone else runs into this issue, I figured it's good to put on your radar. |
Thanks Glenn. And it makes complete sense that the CHROM/POS in the VCF after left-aligning doesn't match the HAL file, correct? [Reminder: we ran cactus 2.5.1, so we applied bcftools norm independently afterward.] I guess my concern with merge_duplicates.py is that the appearance of the SV insertions as duplicate alleles is an artifact of bcftools norm. In the VCF after vcfbub, there are clearly 2 SV insertions at site 10 (let's say) and then 3 SV insertions at site 15. You can plot these clearly with a TubeMap. After bcftools norm, all 5 SV insertions are assigned to site 10. As I explained in the last comment, this doesn't make biological sense for our data. The merge_duplicates.py script looks like it can render a sensible-looking VCF, but it'll condense all of these variants into a single variant per individual. I have my hesitations about this, especially with plotting. For example, I can use the .vg file to create figures that align perfectly with the VCF after vcfbub, but they won't align with a VCF after bcftools norm + merge_duplicates.py. Sounds like we should weigh the costs and benefits of bcftools norm: either analyze the VCF without any of this postprocessing, or postprocess and understand that there will be some inconsistencies between our plots and the VCF. Thanks for your feedback. It really helped. |
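For anyone wanting to check their own VCF for the pattern described above, here is a minimal sketch (not part of cactus or merge_duplicates.py) that lists haplotypes carrying a non-reference allele in more than one record at the same CHROM/POS. It assumes a plain-text, tab-delimited, phased VCF; the function name `find_stacked_alleles` and the toy records are hypothetical and only for illustration.

```python
from collections import defaultdict


def find_stacked_alleles(vcf_text):
    """Map (chrom, pos, sample, haplotype index) to the record IDs whose
    ALT allele that haplotype carries, keeping only haplotypes that carry
    an ALT allele in more than one record at the same CHROM/POS."""
    samples = []
    carriers = defaultdict(list)
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        chrom, pos, rec_id = fields[0], int(fields[1]), fields[2]
        gt_index = fields[8].split(":").index("GT")
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[gt_index]
            if "/" in gt:
                # Unphased calls carry no haplotype assignment; skip them.
                continue
            for hap, allele in enumerate(gt.split("|")):
                if allele not in ("0", "."):
                    carriers[(chrom, pos, sample, hap)].append(rec_id)
    return {key: ids for key, ids in carriers.items() if len(ids) > 1}


# Toy input (hypothetical records): S1 carries two insertions on haplotype 0.
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
                    "FILTER", "INFO", "FORMAT", "S1", "S2"])
rows = [
    ["chr1", "10", "ins_a", "A", "AT", ".", ".", ".", "GT", "1|0", "0|0"],
    ["chr1", "10", "ins_b", "A", "AG", ".", ".", ".", "GT", "1|0", "0|1"],
]
toy_vcf = "\n".join([header] + ["\t".join(r) for r in rows])
stacked = find_stacked_alleles(toy_vcf)
print(stacked)
```

Running this on the VCF before and after bcftools norm would show whether the stacked alleles were already present or only appear after left-alignment.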
Yeah, these misgivings are why I hadn't merged #1536, but I've procrastinated following up.
Anyway, I agree. I think we need merge_duplicates.py to merge the alleles, not just the sites.
So if I have
A AAAAA 0 1 1
A AAAAA 0 1 0
My new site would be
A AAAAA,AAAAAAAAAA 0 2 1
Since the second sample had two equivalent insertions, they would just be doubled up. I think a similar process should work for indels in general...
Does this make sense? @Han-Cao what do you think? |
Yes. Thank you. A tool like this would solve the problem we’re running into. If you make it, could you please make it applicable to phased and unphased datasets? I started fiddling with writing the code myself today but it’d take a while to get right. Thanks so much for your help.
Maggs X
they/them
|
Here’s the other thing though. Left-alignment and normalization also shift the position far in some cases. For example, after bcftools norm there is an 8000 bp insertion whose start site is 140 bp away from the original position. I’m thinking it's related to the fact that bcftools norm is based on a single reference. I touched base with the bcftools developers to see if they could help with that, but I just wanted to make sure you know. Thanks
Maggs X
they/them
|
Hi @maggs-x,
I would like to clarify that merge_duplicates.py will not merge the variants you showed, because it always checks the genotypes before merging. If there are 2 alleles at the same POS on the same haplotype, they will not be merged. This is also why it currently only supports phased VCFs: we cannot tell whether 2 heterozygous unphased genotypes are OK to merge. For example, the unphased genotypes for the VCFs below are the same:
This will be merged:
A AAAAA 1|0 1|0
A AAAAA 0|1 0|0
This will not be merged:
A AAAAA 1|0 1|0
A AAAAA 1|0 0|0
@glennhickey,
Yes, the behavior you describe is more reasonable. I think the new version of the script could do:
Input:
A AAAAA 0 1 1
A AAAAA 1 0 0
A AAAAA 0 1 0
Output:
A AAAAA,AAAAAAAAA 1 2 1
Will let you know when it is ready. |
Thank you!
Maggs X
they/them
|
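The Input/Output example in Han's comment above can be sketched as a copy-counting merge. This is only an illustration of the idea under simplifying assumptions (haploid 0/1 genotypes, identical duplicate insertion records, no missing calls), not the actual merge_duplicates.py implementation; the function name `merge_duplicate_insertions` is made up.

```python
def merge_duplicate_insertions(ref, alt, records):
    """Merge duplicate insertion records that share REF and ALT.

    records: one list of 0/1 haploid genotypes per duplicate record,
    all in the same sample order. Returns (alts, genotypes), where
    alts[i] carries i+1-th smallest observed copy number of the insertion.
    """
    if not alt.startswith(ref) or len(alt) <= len(ref):
        raise ValueError("sketch only handles insertions whose ALT starts with REF")
    inserted = alt[len(ref):]
    # Number of copies per haplotype = how many duplicate records it carries.
    copies = [sum(column) for column in zip(*records)]
    copy_numbers = sorted({c for c in copies if c > 0})
    # One ALT allele per observed copy number: REF plus c copies of the insert.
    alts = [ref + inserted * c for c in copy_numbers]
    allele_index = {c: i + 1 for i, c in enumerate(copy_numbers)}
    genotypes = [allele_index.get(c, 0) for c in copies]
    return alts, genotypes


alts, genotypes = merge_duplicate_insertions(
    "A", "AAAAA",
    [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]],
)
print("A", ",".join(alts), *genotypes)
```

Note that the doubled allele here is built as REF plus two copies of the inserted bases, which is how the Output line above writes it.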
Hi, I have updated merge_duplicates.py (https://github.com/Han-Cao/collapse-bubble/blob/master/scripts/merge_duplicates.py). Please have a try.
For a small test VCF:
#CHROM POS ID REF ALT Haplotype Sample1 Sample2 Sample3
chr1 5 var1 A AAA 0 0|0 1|0 0|1
chr1 5 var2 A AAA 1 1|0 0|0 0|0
chr1 5 var3 A AAA . 0|0 1|0 0|0
chr1 5 var4 AAA A 1 0|0 1|0 0|0
chr1 5 var5 AAA A 1 0|0 1|1 0|1
chr1 6 var7 A TTT 0 0|0 1|0 .|1
chr1 6 var8 A TTT 0 0|0 1|0 1|0
chr1 6 var9 AA TTTTTT 0 0|0 1|0 1|0
Output:
#CHROM POS ID REF ALT Haplotype Sample1 Sample2 Sample3
chr1 5 var1 A AAA 1 1|0 0|0 0|1
chr1 5 var3 A AAAAA . 0|0 1|0 0|0
chr1 5 var4 AAA A 0 0|0 0|1 0|1
chr1 5 var5 AAAAA A 1 0|0 1|0 0|0
chr1 6 var7 A TTT 0 0|0 0|0 1|1
chr1 6 var9 AA TTTTTT 0 0|0 0|0 1|0
chr1 6 var8 AAAA TTTTTTTTTTTT 0 0|0 1|0 .|0
* var1-3: Haplotype, Sample1 and Sample3 have 1 insertion on one haplotype, so they are merged into 1 record. Sample2 has 2 alleles on hap1; its genotype and alleles are merged as shown in var3.
* var4-5: similar to the above, but for a deletion.
* var7-9: a more complicated site, where Sample2 has a total of 4 copies of A -> TTT. The script will merge var7, 8 and 9 recursively. I don't know if this exists in a real dataset; it is just to show how the script works.
I am still not sure how to merge a missing genotype with a non-missing genotype. For example, if there are 2 duplicated A -> AAA:
* when merging . and 0, we don't know whether there is 1 insertion (.), but it is impossible to have 2 insertions (0)
* when merging . and 1, there is at least 1 insertion (1), but we are not sure if there are 2 insertions (.)
For samples without missing genotypes, having one insertion implies not having two insertions, and vice versa. So, it could be confusing to have any genotypes as missing. I added an option --merge-mis-as-ref to treat missing genotypes as reference when merging with non-missing genotypes. What do you think about merging missing genotypes?
Input:
A AAA 0 1 1 .
A AAA . . 1 .
Default output:
A AAA . 1 0 .
A AAAAA 0 . 1 .
--merge-mis-as-ref
A AAA 0 1 0 .
A AAAAA 0 0 1 .
Finally, some evaluation when comparing with SNPs and small INDELs (LEN < 20) called from a linear reference genome (same as what I did in #1493). The new merging method slightly improves the performance.
Method: Genotype concordance
no merge: 0.6517
previous method: 0.8509
new method: 0.8777
--merge-mis-as-ref: 0.8776
|
Amazing @Han-Cao! Thanks for your fast update. In my tests, I've been seeing that variants are only concatenated if they have identical alts. Would it be possible to generalize that a bit? For example,
gets concatenated
doesn't seem to. And the same thing seems to be the case for deletions. |
I was also thinking about this after testing the script. Merging repeats should not be difficult. But do you want to merge insertions with different sequences, like:
If the input VCF is sorted before left-align, would the first variant always be upstream of the second one? Can it be merged to
Anyway, I totally agree this feature is very useful. I am a bit busy this week; I will find time to work on it.
Update: I just realized that, if a variant can be left-aligned, the variant I described may not exist. Is that correct? |
Right, I think left alignment ensures that ref/alt alleles are consistent substrings of each other, so that will hopefully avoid conflicts. And I've been playing around with
Then I think concatenating in order
should be fine. Likewise, the deletion case should be pretty simple as the order shouldn't matter |
Hi Han, thank you again. I'm sorry I don't have time to test out the code. My team decided to forgo using bcftools norm because it introduces more errors than we're comfortable with. Your code is definitely aiming at resolving one of the problems. Just keep in mind that if in the pangenome there are two variants within an individual, and those variants are in close proximity to each other (say at site 1, AAA and then at site 5 TTT), and after bcftools norm these get assigned to two overlapping variants (i.e. site 1 AAA and site 1 TTT), then your approach will provide the genotype AAATTT. This is better than nothing, but there are an additional 4 base pairs in between that are shared with the reference. Let's call those GGGG. So, the real genotype should be AAAGGGGTTT.
And I imagine you both are more aware, but just to be transparent, I'm not sure if bcftools causes this error with small indels. We've only been looking at the structural variants so far.
Maggs X
they/them
|
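The missing-genotype rules from Han's `--merge-mis-as-ref` example earlier in this exchange can be written down for the simplest case of exactly two duplicate records. This is a sketch reconstructed from the Input/Output rows shown in the thread, not the script's real code; `merge_pair` and its `mis_as_ref` argument are hypothetical names.

```python
def merge_pair(g1, g2, mis_as_ref=False):
    """Merge the haploid genotypes ('0', '1' or '.') that one haplotype has
    in two duplicate insertion records.

    Returns (one_copy_gt, two_copy_gt): the genotype for the single
    insertion allele and for the doubled insertion allele.
    """
    if g1 == "." and g2 == ".":
        return ".", "."              # nothing is known, stays missing
    if mis_as_ref:
        # Treat a missing call as reference when the other call is known.
        g1 = "0" if g1 == "." else g1
        g2 = "0" if g2 == "." else g2
    if "." in (g1, g2):
        known = g2 if g1 == "." else g1
        if known == "0":
            return ".", "0"          # maybe one copy, cannot be two
        return "1", "."              # at least one copy, maybe two
    copies = int(g1) + int(g2)
    return ("1" if copies == 1 else "0", "1" if copies == 2 else "0")


record_1 = ["0", "1", "1", "."]
record_2 = [".", ".", "1", "."]
for flag in (False, True):
    one, two = zip(*(merge_pair(a, b, flag) for a, b in zip(record_1, record_2)))
    print("mis_as_ref =", flag)
    print("A AAA  ", *one)
    print("A AAAAA", *two)
```

Generalizing beyond two records would need a rule for partially missing copy counts, which the thread leaves open.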
Hi Maggs,
If you refer to the genotype:
REF AGGGG
ALT AAAGGGGTT
The allele "G GTT" cannot be left-aligned. If an indel is left-aligned from pos 5 to pos 1, I think the insertion / deletion sequence must have the same motif as the repeat sequence, like:
REF ATTTT
ALT AAATTTTTT
Then, the concatenated insertion A AAATT correctly describes the haplotype. |
Thanks Han. I understand. I looked back on the example I was concerned about. I was worried the extra base pairs between two large insertions weren't included in the genotypes after left alignment. Fortunately, they're there. So never mind about my earlier comment.
I do still find it difficult for interpretation that different nodes in the pangenome graph end up getting assigned to the same start position after left alignment, but I imagine you've considered adding the node IDs of the combined dups to the ID column.
Thanks again for your responsiveness. Much appreciated!
Update: let me know if you want a VCF of the case I described to help troubleshoot. In this situation, stitching the two alleles directly together would be accurate. There are shared motifs between the two alleles, they are not identical, and no changes would be needed to the REF.
Maggs X
they/them
|
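Han's ATTTT / AGGGG argument can be checked with a small left-alignment sketch. The `left_align` function below follows the usual trim-right-then-extend-left procedure for a single biallelic record on a toy sequence; it is an illustration with made-up helper names and 0-based positions, not bcftools or merge_duplicates.py code, and it does not handle variants that would need to extend past the start of the sequence.

```python
def left_align(seq, pos, ref, alt):
    """Left-align one REF/ALT pair against `seq` (0-based `pos`)."""
    ref, alt = ref.upper(), alt.upper()
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, pull in the base to the left.
        if (not ref or not alt) and pos > 0:
            pos -= 1
            ref, alt = seq[pos] + ref, seq[pos] + alt
            changed = True
    # Drop shared leading bases while both alleles keep one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def apply_variant(seq, pos, ref, alt):
    """Return `seq` with REF at 0-based `pos` replaced by ALT."""
    assert seq[pos:pos + len(ref)] == ref, "REF does not match the sequence"
    return seq[:pos] + alt + seq[pos + len(ref):]


# A TT insertion after the last T of ATTTT slides to the first base.
print(left_align("ATTTT", 4, "T", "TTT"))
# A TT insertion after the last G of AGGGG shares no motif and stays put.
print(left_align("AGGGG", 4, "G", "GTT"))

# Applying the two insertions separately, or the concatenated A -> AAATT
# at the first base, gives the same haplotype.
separate = apply_variant(apply_variant("ATTTT", 4, "T", "TTT"), 0, "A", "AAA")
combined = apply_variant("ATTTT", 0, "A", "AAATT")
print(separate, combined)
```

The second call is the case Maggs was worried about: because the inserted TT does not match the G run, left-alignment cannot move it to the first position, so the two variants never end up stacked at the same POS.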
Hi @glennhickey, there could be some complicated cases to handle when repeats are left-aligned to a multi-allelic non-repeat site. For example, given reference sequence
If the A repeat is long enough in the reference genome, DEL1 / CPX1 / INSx and DEL2 / INS1 can exist on the same haplotype (this may even be common). It seems that concatenating such alleles partially converts the VCF back to the sequence of the local haplotype, which may make the VCF sparser and less comparable among different callsets, particularly when the sample size becomes larger. I think the major reason people use I am considering only concatenating repeats with the same motif, as only these variants can be left-aligned to the same position (please correct me if this is wrong), following the
We only concatenate variants 1 and 2 when there are 2 alleles on the same haplotype. If we have both 2 and 3, I personally prefer not to concatenate, which keep the What do you think? Update Dec 11: Now I think it is hard to decide which variants to concatenate without seeing real data. I will try to implement 3 methods first: concatenate variants with the same allele (now), repeat motif, or position. |
Thanks. I take your point that complicated cases can create a mess, especially ones that incorporate SNPs. On the point about over-merging, I understand where you're coming from, but I still think it's preferable to merge where possible than to have an invalid VCF (which is what many tools consider overlapping/conflicting records to be). In the worst case, And I'll be happy to try out your various methods here. I'd really like to get this into the next HPRC release in one form or another, since the vcfwave-normalized VCF is one of the most widely used outputs... |
I rewrote the script, and now it can process all variants with the same POS. I have to say this is more difficult than I expected, so this post is very long... Besides, I added numpy as another dependency to simultaneously concat all variants in a matrix. While working on the new version, I realized that many examples I listed above cannot exist in a real VCF from the MC pipeline, and my previous algorithm was totally wrong. If a VCF is valid before where
Importantly, the above algorithm requires that the input VCF is:
The first one guarantees that the input VCF was valid before left-alignment, so we can convert the invalid records by reverting the left-alignment. The second one allows us to sequentially concat overlapping indels (starting from the second variant) onto the first one. I have tested on HPRC v1.1 chr20, and it always follows these 2 requirements before
In addition to HPRC v1.1 chr20, I also tested the script with the example below. It looks good on manual inspection. I will do more comprehensive tests when I have time.
Output (this is reordered to align with the input; the real output is sorted by chr, pos, ref, and alt for better visualization):
Newly added arguments: Finally, performance on chr20: it looks fast (~30s). As it now vectorizes all haplotypes, processing more samples should not take a long time.
vcfbub + vcfwave + norm as input:
|
I just found an issue with the script when concatenating non-indel variants. For example, if a SNP is concatenated with an indel, the output may be further left-alignable: HPRC example:
In the output, both REF and ALT end with T, so it can be further left-aligned to In my test, merging variants with the same repeat motif does not have this issue:
|
Thanks @Han-Cao this looks brilliant. It's clearly a more complex problem than I'd thought -- I'll try out the latest version tomorrow. |
@Han-Cao I've put a VCF here. It's made with The latest
Is this something you've seen? |
@glennhickey I have tried the VCF you shared. The example you showed is easy to fix, but there might be other issues.
Error at chr1:1285
Your example arises because the VCF was not normalized after splitting into biallelic records. If you only normalize the multi-allelic VCF, some variants are not fully normalized. For example, if you normalize GAAAA G,GAA, the output is still the same as the input, because you cannot trim any bases from REF due to the existence of the GAAAA G deletion.
In your example, the problem is that, for the following 2 variants harbored by sample HG02583,
chr1 1285 >70601934>70602061_1 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTAAC TGACCCTGACCCTGACCCTGACCAGACCCAGACCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGACCCTGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTAAC 60 . AC=1;AF=0.022727;AN=1;NS=43;LV=0;ORIGIN=chr1:1284;LEN=119;TYPE=ins GT 1|.
chr1 1285 >70601934>70602061_2 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTA T 60 . AC=1;AF=0.022727;AN=1;NS=43;LV=0;ORIGIN=chr1:1284;LEN=112;TYPE=del GT 1|.
* Both REF and ALT of the first variant end with AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTAAC; if you right-trim these bases, it is just an insertion.
* The second variant is a deletion of AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTA, which deletes N copies of the AACCCT repeat and an extra A.
To concat these 2 variants, we need to right-shift the second variant to the end of the REF of the first variant. But because the first variant is not fully normalized, its REF allele conflicts with the second variant. merge_duplicates.py has a function to check whether 2 variants are compatible, and it will raise the error message you saw when the check fails.
Additional errors
After I further normalized the VCF (with chm13v2.0), I found other errors. One example is these 2 variants on sample HG02155:
chr1 36699 >70623701>70623770 AGGCGGCCGGCGCACGCGGGTTCTCTGTGGCCAGCAGGCGGCGCTGCAGGAGAGGAGATGCCCAGGCCTGGCGGCCGGCGCACGCGGGTTCTCTGTGGCCAGCAGGCGGCGCTGCAGGAGAGGAGATGCCCAGGCCTGGCGGCCGGCGCACGCGGGTTCTCTGTGGCCAGCAGGCGGCGATGCAGGAGAGGAGATGCCCAGGCCT A
chr1 36783 >70623614>70623692 C CGGGTTCTCTGTGGCCAGCAGGCGGCGATGCAGGAGAGGAGATGCCCAGGCCTGGCGGCCGGCGCACGCGGGTTCTCTGTGGCCAGCAGGCGGCGCTGCAGGAGAGGAGATGCCCAGGCCTGGCGGCCGGCGCACGC
After left-alignment, they are both located at chr1:36699, but I don't understand how these 2 variants exist on the same haplotype. I think one possible reason is that, before normalizing the multi-allelic VCF, >70623614>70623692 is upstream of >70623701>70623770 (seems possible based on their IDs?), but only >70623701>70623770 was left-aligned in the multi-allelic VCF. So, they don't have the correct order after sorting. If we reverse the order of these 2 variants, then it is possible to concat an INS with a DEL.
If it is due to left-alignment, the input VCF of merge_duplicates.py can only be sorted after normalizing the biallelic VCF, like vcfwave -> bcftools norm -m -any --site-win 0 -f -> sort -k1,1d -k2,2n -s. I just noticed bcftools norm can internally sort variants within a specific window, so we need to disable it by using --site-win 0.
Could you check the positions of >70623701>70623770 and >70623614>70623692 before bcftools norm -f? If the correctly normalized VCF still has the error, may I have the VCF before left-alignment? I didn't find such a variant in my data.
Bug fix
I found bcftools norm may output alleles in lower case. I just updated merge_duplicates.py to make it perform case-insensitive allele comparison. Please try the latest version if any allele in your VCF is in lower case.
|
Thanks for looking at these specifics Glenn and Han. I’m curious if you’re both committed to normalizing the VCF? Everything looks great in my results without it. The question I ask myself is "is it reasonable to standardize every SV? Isn't that clouding the best estimate we have for the location of the SVs". (Yes).
What do you think? The normalization of SVs causes more problems than it solves, perhaps. Minigraph cactus performs so well without it. If folks want things to be more human readable in a VCF, I understand. But it's as easy as plotting a Tube Map to look at complex regions. Pangenomes are complex and there are limits to how human readable they can be.
Of course, it's wonderful if you solve all the problems. I appreciate the thought you're giving this. Thank you.
Maggs X
they/them
|
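Han's GAAAA G,GAA example from this exchange (why a multi-allelic record can look normalized while one of its alleles is not) can be reproduced with a small right-trimming sketch. The helper `right_trim` is hypothetical and deliberately simplified: it only removes shared trailing bases while every allele keeps at least one base, and it upper-cases alleles first, in the spirit of the case-insensitive comparison Han mentions.

```python
def right_trim(alleles):
    """Trim trailing bases shared by every allele in [REF, ALT1, ALT2, ...],
    stopping when the last bases differ or an allele would become empty."""
    alleles = [a.upper() for a in alleles]
    while (len({a[-1] for a in alleles}) == 1
           and min(len(a) for a in alleles) > 1):
        alleles = [a[:-1] for a in alleles]
    return alleles


# Multi-allelic record: the G allele ends in a different base than the
# others, so nothing can be trimmed and the record is left unchanged.
print(right_trim(["GAAAA", "G", "GAA"]))

# After splitting into biallelic records, the second ALT can be trimmed.
print(right_trim(["GAAAA", "G"]))
print(right_trim(["GAAAA", "GAA"]))
```

This is why the order "split to biallelic, then normalize" gives a different result from normalizing the multi-allelic record directly.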
Do correct me if I’m wrong (of course). I forget the specifics I learned but wasn’t merge_duplicates at least partially developed as a fix for bcftools normalization? So, without it there’d be no need to merge_duplicates. If I’m correct, that’d make for an easier pipeline.
Maggs X
they/them
|
Maggs,
If you just want to visualize your data, normalization may not help. And I don't think the purpose of VCF normalization is to make the VCF more readable to human.
For me, I use the normalized VCF to compare VCFs generated from different sequencing platform or variant-calling method. You can refer to the vt paper (https://doi.org/10.1093/bioinformatics/btv112) to see why normalization is important. For complex sites (frequent in pangenome), normalization and "merge" duplicates are also important to compare the same variant between different haplotypes within the same VCF. For example, if you have a deletion of "A" within a 100bp A repeats, it can have 100 different representations in the VCF, but all of them are biologically the same variant. Leaving these variants unnormalized can cause misleading results for downstream analysis, such as SV merging, LD, GWAS.
And the difficulty we are currently trying to resolve is local haplotype reconstruction. This aims to reconstruct a non-overlapping VCF from a decomposed and normalized VCF. Non-overlapping sites are necessary for many downstream tools (e.g. pangenie). So, I think what we are doing is to make the VCF more "readable" to algorithm, not human.
Besides, many problems I described only become serious when there are a lot of haplotypes. If you just have 3 haplotypes in your dataset, I think the raw VCF could be good enough for analysis. |
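The point about a single-base deletion in a 100 bp A repeat having 100 equivalent representations can be verified directly. The sketch below uses a made-up reference (two flanking bases on each side of the repeat) and a hypothetical helper `delete_base`; it only shows that every placement of the deletion yields the same haplotype sequence.

```python
def delete_base(seq, pos):
    """Return `seq` with the base at 0-based `pos` removed."""
    return seq[:pos] + seq[pos + 1:]


left_flank, right_flank = "GC", "TG"       # arbitrary flanking bases
repeat_length = 100
reference = left_flank + "A" * repeat_length + right_flank

# Delete each A in turn: 100 different positions in the repeat.
haplotypes = {
    delete_base(reference, len(left_flank) + offset)
    for offset in range(repeat_length)
}
print(len(haplotypes))

# The left-aligned record anchors on the base just before the repeat.
anchor = len(left_flank) - 1               # 0-based index of the "C"
left_aligned = (anchor + 1, reference[anchor:anchor + 2], reference[anchor])
print(left_aligned)                        # (1-based POS, REF, ALT)
```

All 100 placements collapse to one haplotype, so the position of the deleted base is a matter of representation rather than something the alignment failed to resolve.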
Hi Han,
Thanks. It's good to hear these problems aren't a big deal in a small pangenome. I appreciate that normalization is important when comparing VCFs from different sequencing platforms or variant-calling methods, and that "readable" applies to both human-readable and algorithmically readable. Thank you for working on it. Do you know much about how haplotype-compressed genomes are impacted by the issues bcftools norm aims to solve?
And a deletion like this: "For example, if you have a deletion of "A" within a 100bp A repeats, it can have 100 different representations in the VCF"... so, based on the mapping, the position of the deletion is truly unknown? That would mean it's unknown in all the pangenome formats: the vg, gfa, and hal files.
"And the difficulty we are currently trying to resolve is local haplotype reconstruction." Sorry if I interfered with your conversation. I care and am interested. Thank you. I have one more question.
Here, "This aims to reconstruct a non-overlapping VCF from a decomposed and normalized VCF.": is non-overlapping not resolved by choosing only top-level variants after running vcfbub?
Again, I appreciate your thoughts and feedback.
Thank you!
Maggs
Maggs X
they/them
|
Sorry. I think you meant something like a 400bp A repeat with a 100bp deletion, where the question is where the deletion lands. Yes, it's understandable to normalize for that.
Maggs X
they/them
|
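After VCF normalization, there are overlaps / duplicates around repeat regions. This is a limitation of bcftools norm that merge_duplicates.py tries to fix. So, this pipeline is only needed if you want to have a normalized and non-overlapping / deduplicated VCF.
Besides, what I posted here is mainly based on my own experience and requirements for VCFs. This workflow may not 100% fit your analysis. |
I understand, Han. Thanks,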
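To see whether a normalized VCF has this kind of overlap, one could list the sites that carry more than one record. The Python sketch below does only that on plain VCF text lines; it flags candidate sites and does not merge anything, so it is not a substitute for merge_duplicates.py. The function name and the toy records are invented for illustration.

```python
from collections import defaultdict


def sites_with_multiple_records(vcf_lines):
    """Map (CHROM, POS) to its (REF, ALT) pairs for sites seen more than once."""
    by_site = defaultdict(list)
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        # Uppercase the alleles so lowercase output does not hide duplicates.
        by_site[(chrom, pos)].append((ref.upper(), alt.upper()))
    return {site: alleles for site, alleles in by_site.items() if len(alleles) > 1}


# Toy records (not real data): two insertions sharing one position.
toy_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t671683\t.\tA\tATTT\t.\t.\t.",
    "chr1\t671683\t.\ta\taggg\t.\t.\t.",
    "chr1\t700000\t.\tG\tGC\t.\t.\t.",
]

shared = sites_with_multiple_records(toy_vcf)
```

Running the same check on the VCF before and after bcftools norm would show which shared sites were introduced by normalization.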
Maggs
Maggs X
they/them
|
Thanks so much @Han-Cao . I had no idea |
Hi,
Thanks for your help with my previous question, Glenn. I have another question about a different pangenome I'm working on. We built it last year with the cactus-minigraph pangenome pipeline (v2.5.1). The inputs for our pangenome are four chromosome-level assemblies from four individuals, the fourth of which was used as the reference. So our vcf has three individuals (One, Two, and Three).
In the hal file and vcf we noticed structural variants within individuals that overlap one another, for example structural insertions. Presumably, these should be represented by a single structural insertion. Here is an example in more detail:
On chromosome 1 at site 671683 there is a 238bp insertion in individual One.
Then, 8 base pairs away at site 671691, there is a 270bp insertion in individual One.
I attached screenshots of the two variants in case this helps. I'm curious if you've run into this issue before. Please let me know.