VCFIXER

Updated version of dataset_fixer (discontinued); it no longer relies on any other program or library. It checks VCF files and removes loci and samples with too many missing genotypes. It can also fill in (impute) the remaining missing genotypes, with several options for how to do so.

It works smoothly with VCF files generated by the populations program from Stacks (populations.snps.vcf).

  • I am not affiliated with the Stacks developers in any way. It is simply a good, useful pipeline for SNP calling.
  • By "smoothly" I mean on our lab server.
  • I included some options to adapt it to other VCF file formats, so it should work fine with most of them.

Maybe "smoothly" is an overstatement: IT WILL TAKE A LONG TIME to run, especially on huge files, so run it on a server. Multi-threading is still pending. This was written a long time ago; eventually it will be moved to Python, and the code should be refactored into a few reusable functions instead of repeating so many lines.

Some population-based analyses could fail if the sample names in the VCF file do not include the population name in their code (e.g. POP1_001, popA002, CoolPlace042). You can solve that by using a popmap: a file with one sample per row, with the sample name in the first column and the population name in the second. There is a popmap_maker in another of my repositories.
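To make the two ways of assigning populations concrete, here is a minimal Python sketch. It is not vcfixer's code (vcfixer is a Perl script); the function names `popmap_from_names`, `format_popmap` and `parse_popmap` are hypothetical, and the sketch assumes a well-formed, tab-separated popmap with the sample name in the first column and the population in the second.

```python
def popmap_from_names(sample_names, poplength=2):
    """Assign each sample the first `poplength` characters of its name
    as population code (the idea behind --poplength)."""
    return {name: name[:poplength] for name in sample_names}


def format_popmap(popmap):
    """Render a popmap as text: one 'sample<TAB>population' row per sample."""
    return "".join(f"{sample}\t{pop}\n" for sample, pop in popmap.items())


def parse_popmap(text):
    """Read popmap text back into a {sample: population} dict.
    Assumes every non-empty row has at least two tab-separated columns."""
    popmap = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank rows
        sample, pop = line.split("\t")[:2]
        popmap[sample] = pop
    return popmap


if __name__ == "__main__":
    samples = ["popA002", "popA003", "popB001"]
    popmap = popmap_from_names(samples, poplength=4)
    print(format_popmap(popmap), end="")
```

If the population code has a different length in different sample names, a prefix cannot separate them reliably, and a popmap file is the safer choice.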

vcfixer has many options and flags that can be used to change its behaviour or to adapt it to your file format.

To see the full usage information, arguments, and settings, use any of the usual help flags (-h, --h, help, -help, --help).

vcfixer.pl --help

       vcfixer_1.5.pl   Help Information 30.VII.2020
    --------------------------------------------------

    This program will read a VCF file, first will delete empty or monomorphic SNP, then empty samples,
    then will delete loci with high missing rate, then samples with high missing rate,
    and finally  will input genotypes in the missing values left "./.".
    It is designed to work with the VCF file generated by the program "populations" (Stacks).
    Populations can be read either from sample names - saving the first x characters from each individual ID
    or from a popmap file - Stacks format: tab separated, one sample and one population per row.
    A log file, a list with deleted and kept SNPs, deleted and kept samples, and a new popmap will be saved in the output directory


    Command line arguments and defaults:

    --input / --vcf           name (or path) of the VCF file. Default: populations.snps.vcf

    --infocols                number of locus information columns before the first sample column. Default: 9

    --poplength               [int] how many characters of the sample name belong to the population code? Default: 2
    --popmap                  Alternatively provide a popmap to read which sample belongs to which population. No default

    --empty                   [float] missing rate from which a sample will be considered "empty" and deleted. Default: 0.8

    --miss_loci               [float] missing rate from which a loci should be deleted. Default: 0.3

    --miss_samples            [float] missing rate from which a sample should be deleted. Default: 0.3

    --minpop                  [int] minimum number of samples a population must have in order to keep it. Default: 8

    --gral_miss               How to replace the regular missing values? There are four options. Default: pop
                              "pop" to replace it with the population mode* (most frequent genotype)
                              "global" to replace the missing with the whole dataset mode*.
                              "miss" to leave it as missing. "2/2" or any other value to input that value.

    --pop_miss                What to input if a SNP is missing in an entire population? Three options available. Default: global
                              "global" to input the global mode*, "miss" to keep them as missing,
                              "2/2", or "5/5", or any other value: to input a new genotype and remark its difference from the rest.

    --noquality               [flag] add this if you do not want vcfixer_1.5.pl to input quality/probability information in missing
                              Some software cannot process genotypes without quality metadata, by default will be also inputed,
                              quality information is not editable from command line and should be handled carefully, values are:
                                    0/0:1:1,0:12:-0.05,-0.82,-1.42
                                    0/1:1:1,1:12:-1.01,-0.05,-1.01
                                    1/1:1:0,1:12:-1.42,-0.82,-0.05

    --summary                 path/name for a summary table that will gather some details of populations and vcf_fixer outputs.
                              if '--summary no' the file will not be created. Default: summary_table_vcf_fixer.txt

    --poplog                  [optional] path or name of the populations (Stacks) log file. Default: ref_map.log
                              if '--poplog no' it will not look for a logfile. Only used for summary table.
                              If no path (only name) provided will look for file in vcf file location.

    --out                     path or new name for the output file. By default will be "input name" + "tail" + ".vcf"

    --tail                    Something to the end of the input file name to generate the output name.
                              If no tail provided, will add to the file name the sample number, SNP number, and missing handling:
                               1p: regular missing replaced with population mode
                               1g: regular missing replaced with global mode
                               1m: regular missing not handled - left as missing.
                               1x: regular missing replaced with custome input
                               0m: if not regular missing found in the file
                               2g: when a SNP is missing in the whole population, the global mode is input
                               2m: when missing in the whole population is left as missing
                               2x: custom input
                               0p: there are not SNPs entirely missing in any population


    Command line call example:
    vcfixer.pl --input /usr/home/refmap/populations.snps.vcf --poplength 3 --gral_miss global --minpop 1 --summary no


            * When two or more alleles have the highest frequency, one of them will be picked randomly for each SNP
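The replacement rules described by --gral_miss and --pop_miss can be sketched in Python as follows. This is an illustration under stated assumptions, not vcfixer's actual Perl code: the names `mode_genotype`, `resolve` and `impute_snp` are hypothetical, genotypes are treated as bare strings such as "0/1" (the quality fields are ignored), and ties are broken at random between the most frequent genotypes.

```python
import random
from collections import Counter

MISSING = "./."


def mode_genotype(genotypes, rng=random):
    """Most frequent non-missing genotype; ties are broken at random.
    Returns None when every genotype is missing."""
    counts = Counter(g for g in genotypes if g != MISSING)
    if not counts:
        return None
    top = max(counts.values())
    candidates = sorted(g for g, c in counts.items() if c == top)
    return rng.choice(candidates)


def resolve(option, pop_mode, global_mode):
    """Translate an option value into the genotype to write."""
    if option == "pop":
        choice = pop_mode
    elif option == "global":
        choice = global_mode
    elif option == "miss":
        choice = MISSING
    else:
        choice = option  # custom value such as "2/2"
    return MISSING if choice is None else choice


def impute_snp(genotypes, populations, gral_miss="pop", pop_miss="global", rng=random):
    """Fill the missing genotypes of one SNP.
    `genotypes` and `populations` are parallel lists, one entry per sample."""
    global_mode = mode_genotype(genotypes, rng)
    pop_modes = {}
    for pop in dict.fromkeys(populations):  # unique populations, input order
        in_pop = [g for g, p in zip(genotypes, populations) if p == pop]
        pop_modes[pop] = mode_genotype(in_pop, rng)
    filled = []
    for g, pop in zip(genotypes, populations):
        if g != MISSING:
            filled.append(g)
        elif pop_modes[pop] is None:
            # SNP missing in the entire population: --pop_miss applies
            filled.append(resolve(pop_miss, None, global_mode))
        else:
            # regular missing value: --gral_miss applies
            filled.append(resolve(gral_miss, pop_modes[pop], global_mode))
    return filled


if __name__ == "__main__":
    gts = ["0/0", "0/0", "0/0", "./.", "1/1", "1/1", "./.", "./.", "./."]
    pops = ["A", "A", "A", "A", "B", "B", "B", "C", "C"]
    print(impute_snp(gts, pops))
    # ['0/0', '0/0', '0/0', '0/0', '1/1', '1/1', '1/1', '0/0', '0/0']
```

In this example population C has no genotypes at all, so its samples receive the global mode, while the missing samples in A and B receive their own population's mode. Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible.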