Skip to content

Simple scripts for counting the frequency of known DNA variants (SNPs/MNPs) in next generation sequencing (NGS) data

License

Notifications You must be signed in to change notification settings

jcomand/VariantCounting

Repository files navigation

* These simple linux/perl scripts accompany the article "Characterizing variants of unknown significance in rhodopsin: a functional genomics approach".  They were used to quantify the frequency of individual library variants in fastq files produced from NGS amplicon sequencing of non-barcoded pools of DNA variants.  They are released here for completeness but have not been rewritten for robust, generalized use (including: hardcoded configuration file name, hardcoded temporary file names, input files not checked for errors, and no tool provided to easily make new probe.txt files for your own data). 

* Context: At the time these scripts were written, no available tools would accurately quantify low frequency, complex changes in NGS data, such as multinucleotide polymorphisms (MNPs), and some of the variants in our libraries were MNPs (overlapping with SNPs!). Some packages also used fixed statistical filtering of variants and would not report on all of the variants in the data. (At this time, there may be better, full-featured packages for low-frequency SNP/MNP quantification in fastq or bam files, but we have not yet had the need to evaluate them recently.  We also have used alignment-based scripts for counting variants in .bam files but have yet not validated them.)

* To analyze test data:
	"sh sampleCommands.sh"

* Files:

-sampleCommands.sh
	Run this shell script to analyze the sample data.

-censor4EFGHIJ.pl
	This script is run as the first stage in processing. It censors low quality basecalls at the basepair level. Currently hardcoded to allow Illumina Q=36-41 (codes E through J; See https://en.wikipedia.org/wiki/FASTQ_format#Encoding)

-countscriptInputarray.pl
	This script is run as the second stage of processing.  It uses the censored fastq files to count the frequency of each regular expression listed in probelist.txt.

-probelist.txt
	List of regular expressions ("probes") that will be searched.  In this context, the probes contain sequences from a known library variant and the reverse complement (as our sequencing was not stranded).  For quality control, three sets of probes were constructed – the wildtype sequence to determine coverage, the mutant sequence to be quantified, and an alternate "noise" sequence which should not be present in the data but was used to evaluate the noise in the sequencing data.

-SamplePooledLibrary.fastq
	Sample sequencing data from a pooled library of rhodopsin variants

-SampleWtPlasmid.fastq
	Sample sequencing data from a wildtype plasmid (to evaluate the frequency of sequencing errors)

-sampleQsubCommands.txt
	Example of using qsub to submit batches of files for parallel processing.  R1 and R2 files are combined, gunzippped, and submitted to the two processing scrips.

-countall.txt
	Frequency results are accumulated in this file, and typically analyzed in another program.

-sample excel sheet to analyze output.xlsx (generated by MS Windows Excel application)
	A sample analysis based on the countall.out file generating by sampleCommands.sh.

*To try this on your own data (please read "Context" section first.):
-Make of list of SNPs/MNPs that you know are in the dataset. (This tool is not for discovering SNPS.) For each variant, make a "probe" (substring) containing the DNA sequence you want to detect.  Usually the SNP is near the center of the probe.
-Make corresponding probes for the wildtype sequence
-optionally make corresponding probes for a SNP in the same position that is not expected to be in the data (to evaluate noise)
-combine these probes into a list, take reverse complement, and make a file formatted like probes.txt
-gunzip your fastq files
-run the commands listed in sampleCommmands.sh, modifying the filenames of the fastq files as needed (or from sampleQsubCommands.txt if your system has the qsub job submission system and you are processing many samples)
-copy the contents of countall.out to the sample excel file for analysis and follow bolded instructions with that file

*Jason Comander 8-10-2018
minor changes 1-15-2019

About

Simple scripts for counting the frequency of known DNA variants (SNPs/MNPs) in next generation sequencing (NGS) data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages