Output VCF file is empty #2

ldenti · 2018-11-12T15:14:24Z

Hello,
I tried to run vargeno (last commit, 00ee0f0) on the following data: reference, VCFs (provided by 1000genomes), and sample sequenced from NA12878 individual.

I ran some tests on different chromosomes but I always got an output VCF file that contains only the header. I tried to figure out the reasons of this and these are my hypotheses.

I think the main problem is in the index step. Indeed, in my various tests, I obtain

...
[BloomFilter constructBfFromVCF] bit vector: 0/1120000000
...

From here, I started digging in the code and I saw that there could be some problems on how you index the input reference. In more details:

when you read the VCF file and extract the chromosome name from each line, you add "chr" at its start:

vargeno/src/generate_bf.cc

Line 206 in 00ee0f0

if(chr_name[0] != 'c') chr_name = "chr" + chr_name;

but you don't do the same when you read the headers from the input FASTA file:

vargeno/src/generate_bf.cc

Line 47 in 00ee0f0

id = line.substr(1);

For instance, this is a problem if the headers in the FASTA file contain only the chromosome number. I think this problem affects also the way you store information in the .chrlens file of your index:

vargeno/src/qv.cc

Line 2345 in 00ee0f0

fprintf(chrlens, "%s %lu\n", ref.seqs[i].name, ref.seqs[i].size);
when a header in the FASTA file contains the unique identifier for the sequence and also additional information (such as: ">22 dna:chromosome chromosome:GRCh37:22:1:51304566:1"), you consider all the line as unique identifier:

vargeno/src/generate_bf.cc

Line 47 in 00ee0f0

id = line.substr(1);

but when you parse the VCF file, you consider as chromosome name only the unique identifier since you get the chromosome number from the first column of the VCF (this should not affect the .chrlens file since you use a different function to read the reference FASTA when you write the .chrlens file)

I tried to solve these two problems by changing some lines in your code but I'm not sure if what I've done is right and enough (I'll anyway open a pull request: it could be a good starting point for you). With my fixes, now the output is not empty anymore.

Moreover, I think that this behaviour occurs also if the input VCF contains the field GT specified in the header and the GT columns (as in the VCFs provided by the 1000genomes project). If I run vargeno index, I obtain:

...
SNP Dictionary               
Total k-mers:        0
Unambig k-mers:      0
Ambig unique k-mers: 0
Ambig total k-mers:  0
...

Currently, I solved this problem by removing out from the VCFs the line in the header and the ~2500 columns of the samples. Maybe you could find a better solution to this or maybe you can update the readme accordingly.

Thanks in advance!

Best,
Luca

The text was updated successfully, but these errors were encountered:

bbsunchen · 2019-02-10T01:39:41Z

Hi Luca, thank you for catching this.
For now, there are two chromosome naming standards: Genome Reference Consortium (e.g. GRCh37) vs University of California at Santa Cruz(e.g. hg19).

GRC standard does not contain "chr", and any VCF files generated based on GRC reference genome won't contain "chr".
VarGeno now requires the reference genome and VCF sharing the same chromosome name standard.

ldenti · 2019-03-14T16:08:05Z

Ok, I see... but the problem occurs exactly when the reference genome and the VCF share the same chromosome name standard. Indeed, both the VCF and the reference I linked in my previous message follow the GRC standard. The problem, if I understood correctly your code, is that you add "chr" before the reference of each VCF line but you don't do that when parsing the reference FASTA. It's like vargeno assumes that the VCF follows the GRC standard whereas the reference doesn't: the files in the test folder don't follow the same standard (the ID in the FASTA file contain "chr", the IDs in the VCF don't contain "chr").

Moreover, I think that there is also a (possible) problem when the reference genome contains a character different from A,C,G,T,N (ie. a character from the IUPAC code such as an M or an R, as happens in the reference genome I linked in the previous message). In such a context, vargeno crashes during the index step printing:

vargeno: src/util.c:122: kmer_t shift_kmer(kmer_t, char): Assertion `0' failed.
Aborted (core dumped)

Right now, I solved this by modifying the input FASTA.

ldenti mentioned this issue Nov 12, 2018

Fixes some problems related to FASTA header management #3

Open

ldenti mentioned this issue Mar 14, 2019

Segmentation fault during geno step #5

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Output VCF file is empty #2

Output VCF file is empty #2

ldenti commented Nov 12, 2018

bbsunchen commented Feb 10, 2019 •

edited

Loading

ldenti commented Mar 14, 2019

Output VCF file is empty #2

Output VCF file is empty #2

Comments

ldenti commented Nov 12, 2018

bbsunchen commented Feb 10, 2019 • edited Loading

ldenti commented Mar 14, 2019

bbsunchen commented Feb 10, 2019 •

edited

Loading