DiagLockPick

Updated 7-26-16 by Bradford Condon and Mark Farman, Department of Plant Pathology, University of Kentucky

Project overview

DiagLockPick is a tool for the rapid identification and amplification of diagnostic loci based on the whole genome sequence of known taxa.
The goal of this project was to take the guesswork out of designing diagnostic loci. It is not always feasible to completely sequence the genome of an unknown organism. Nevertheless, determining whether the unknown organism belongs to a known clade can be crucial. With DiagLockPick, you can use the genomes of fully sequenced taxa and their known relationships to design a subset of loci that will, when sequenced, recreate those known relationships. Sequencing these loci in an unknown taxon will allow you to rapidly place it in the context of your known taxa.

Usage

Dependencies

Input

required

  • Set of input sequences
  • List mapping input sequences to clades
  • One strain designated as the output clade

optional

  • Minimum number of SNPs for a candidate locus to be considered (default: 1)
  • Loci size (default: 400bp)
  • Number of loci to include per set (default: 10)
  • Number of random loci combinations to try (default: 10000)
  • E-value cutoff for determining whether loci are uniquely present
  • Primer design specifications

Output

  • Primer sets for amplification of the specified number of loci
  • Phylogenetic tree generated by the concatenation of those loci

Development

DiagLockPick was developed by Bradford Condon, a postdoctoral associate with Mark Farman in the Department of Plant Pathology, University of Kentucky.

Please contact Mark Farman with questions and requests.

Generating the data

Part one: Sliding window

This component is no longer strictly necessary, but it is still of interest. A sliding-window Perl script moves along the reference genome and counts the SNPs present in each window. As input, it takes a SNP report table (such as the MUMmer SNP report, or a custom report generated by Dr. Farman's unpublished program) and the scaffold information for the reference strain.

# Load the tab-separated file list.txt and label its two columns
clade_map <- read.table(file = "list.txt", header = FALSE, sep = "\t")
colnames(clade_map) <- c("Genome comparison", "clade assignment")
head(clade_map[, 1:2])
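
The sliding-window count described in Part one can be sketched roughly as follows. This is a minimal Python illustration, not the original Perl script: the function name `sliding_window_snp_counts`, the `step` parameter, and the input shapes (a list of scaffold/position pairs and a dictionary of scaffold lengths) are assumptions made for the example. The window size and minimum SNP count mirror the defaults listed under Input.

```python
from bisect import bisect_left, bisect_right


def sliding_window_snp_counts(snp_positions, scaffold_lengths,
                              window=400, step=400, min_snps=1):
    """Count SNPs per window along each scaffold of the reference.

    snp_positions: iterable of (scaffold, 1-based position) pairs
                   (hypothetical shape; a real SNP report needs parsing first).
    scaffold_lengths: dict mapping scaffold name to its length in bp.
    Returns (scaffold, start, end, snp_count) for every window that holds
    at least min_snps SNPs.
    """
    by_scaffold = {}
    for scaffold, pos in snp_positions:
        by_scaffold.setdefault(scaffold, []).append(pos)

    candidates = []
    for scaffold, positions in by_scaffold.items():
        positions.sort()
        length = scaffold_lengths[scaffold]
        start = 1
        while start <= length:
            end = min(start + window - 1, length)
            # Number of sorted positions that fall inside [start, end]
            n = bisect_right(positions, end) - bisect_left(positions, start)
            if n >= min_snps:
                candidates.append((scaffold, start, end, n))
            start += step
    return candidates


# Toy data, invented for illustration only
snps = [("scaf1", 10), ("scaf1", 390), ("scaf1", 401), ("scaf2", 5)]
lengths = {"scaf1": 1000, "scaf2": 300}
for row in sliding_window_snp_counts(snps, lengths, window=400, step=200, min_snps=2):
    print(row)
```

Setting `step` equal to `window` gives non-overlapping windows, which corresponds to regularly dividing the genome.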

Part two: BLAST and BLAST filtering

Once candidate loci are defined in the previous step (either by sliding window or by regularly dividing the genome), they are BLASTed against all other strains.

  • BLAST of candidate reference locus sequences against all strains

  • Filtering the BLAST report. Several considerations are important at this step:

    • Loci need to be present in all strains. Therefore, a minimum % identity and a minimum length are required.
    • Whether outlier strains are exempt from the above requirement. In our analysis, DSLiz was too much of an outlier, and requiring its presence resulted in too many excluded loci.

    Once candidate loci are filtered, sequences from both the reference and every included genome are extracted and stored. These sequences will be aligned in the next step.
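
The filtering rules above could look something like the sketch below. It assumes one BLAST report per strain in the standard 12-column tabular format (query ID, subject ID, % identity, alignment length, and so on); the function names, the thresholds of 90% identity and 300 bp, and the toy rows are all hypothetical and are not taken from DiagLockPick itself.

```python
def passing_loci(blast_lines, min_identity=90.0, min_length=300):
    """Return the query loci with at least one hit meeting both thresholds.

    blast_lines: lines of a tabular BLAST report where column 1 is the
    query (locus) ID, column 3 is % identity, and column 4 is alignment length.
    """
    kept = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        locus, pident, length = fields[0], float(fields[2]), int(fields[3])
        if pident >= min_identity and length >= min_length:
            kept.add(locus)
    return kept


def loci_present_in_all(reports, excluded_strains=(), **thresholds):
    """Keep loci that pass the filter in every strain that is not excluded.

    reports: dict mapping strain name to the lines of its BLAST report.
    excluded_strains: outlier strains that should not veto a locus.
    """
    required = [s for s in reports if s not in excluded_strains]
    sets = [passing_loci(reports[s], **thresholds) for s in required]
    return set.intersection(*sets) if sets else set()


# Toy reports, invented for illustration only
reports = {
    "strainA": [
        "locus1\tctg1\t98.5\t400\t6\t0\t1\t400\t100\t499\t1e-180\t700",
        "locus2\tctg2\t97.0\t395\t12\t0\t1\t395\t50\t444\t1e-170\t650",
    ],
    "strainB": [
        "locus1\tctg7\t96.0\t400\t16\t0\t1\t400\t10\t409\t1e-160\t620",
        "locus2\tctg9\t85.0\t400\t60\t0\t1\t400\t20\t419\t1e-90\t350",
    ],
    "DSLiz": [
        "locus1\tctg3\t80.0\t150\t30\t0\t1\t150\t5\t154\t1e-20\t100",
    ],
}
print(loci_present_in_all(reports))
print(loci_present_in_all(reports, excluded_strains=("DSLiz",)))
```

With the outlier included, no locus survives in this toy data; excluding it lets `locus1` through, which is the trade-off described for DSLiz.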

As of this writing, the algorithm randomly builds x sets of y loci (both user-defined; the defaults are 10,000 sets of 10 loci). It may be advantageous to build trees for each locus individually, but doing so appears to be computationally unrealistic.
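
As a rough illustration of the random set building, the sketch below draws x sets of y loci without replacement within each set. The function name `random_locus_sets` and the `seed` argument are assumptions added for reproducibility; the source does not say how DiagLockPick seeds or deduplicates its sets.

```python
import random


def random_locus_sets(loci, n_sets=10000, set_size=10, seed=None):
    """Draw n_sets random combinations of set_size loci each.

    Loci are unique within a set, but the same set may be drawn more than
    once (an assumption; the original behaviour is not documented).
    """
    loci = sorted(loci)  # fixed order so a given seed is reproducible
    if set_size > len(loci):
        raise ValueError("set_size exceeds the number of candidate loci")
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(loci, set_size))) for _ in range(n_sets)]


# Hypothetical candidate locus names
candidates = [f"locus{i}" for i in range(50)]
sets = random_locus_sets(candidates, n_sets=100, set_size=10, seed=1)
print(len(sets), sets[0])
```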

Part three: Alignment and inferring trees

FASTA sequences are aligned with MUSCLE, and trees are built with CLUSTAL (via the command line). R integration may be possible, but assuming this step can run server-side, I don't think it's necessary.

Part four: Phylogenetics

At this point, trees are loaded into R. Three main comparisons are made for each tree:

  • How close is each tree to the reference tree?
  • How well does each tree keep defined clades together (In-clade distance)?
  • How well does each tree separate defined clades (Out-clade distance)? In-clade and out-clade distances can also be considered for specific clade-clade comparisons (for example, Festuca-Lolium and Triticum isolates).
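
One possible reading of the in-clade and out-clade comparisons is sketched below. It assumes that tip-to-tip distances have already been extracted from a tree (for example, as a cophenetic distance matrix) and defines both scores as mean pairwise distances; that definition, the function name `clade_distances`, and the toy values are assumptions, not the R implementation used here.

```python
from itertools import combinations
from statistics import mean


def clade_distances(pairwise, clade_of):
    """Return (mean in-clade distance, mean out-clade distance).

    pairwise: dict mapping frozenset({taxon_a, taxon_b}) to tree distance.
    clade_of: dict mapping taxon name to its clade label.
    A value is None when no pair of that kind exists.
    """
    within, between = [], []
    for a, b in combinations(sorted(clade_of), 2):
        d = pairwise[frozenset((a, b))]
        if clade_of[a] == clade_of[b]:
            within.append(d)   # same clade: should be small in a good tree
        else:
            between.append(d)  # different clades: should be large
    return (mean(within) if within else None,
            mean(between) if between else None)


# Toy distances and clade labels, invented for illustration only
clade_of = {"T1": "Triticum", "T2": "Triticum", "L1": "Lolium", "L2": "Lolium"}
pairwise = {
    frozenset(("T1", "T2")): 0.25,
    frozenset(("L1", "L2")): 0.75,
    frozenset(("T1", "L1")): 1.0,
    frozenset(("T1", "L2")): 1.0,
    frozenset(("T2", "L1")): 1.5,
    frozenset(("T2", "L2")): 0.5,
}
in_clade, out_clade = clade_distances(pairwise, clade_of)
print(in_clade, out_clade)
```

Under this definition, a locus set is more useful when its tree gives a small in-clade value and a large out-clade value. Restricting `clade_of` to two clades gives a specific clade-clade comparison.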

The code for the reference distance still needs to be written. This is the APE plot version. I've also developed a ggplot-based (ggtree) version, but it has many bugs, so I'm not relying on it at the moment.

Part five: Primer design

After users select a suitable tree, primers are designed.

  • Retrieve the reference sequences based on the loci names (which include the scaffold, start, and end)
  • Run Primer3
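
The two steps above might be wired together as in this sketch, which parses a locus name, slices the reference scaffold, and writes a Primer3 Boulder-IO input record. The locus name format `scaffold_start_end` (1-based, inclusive) is an assumption, as are the function names and the product size range; the tags used (`SEQUENCE_ID`, `SEQUENCE_TEMPLATE`, `PRIMER_TASK`, `PRIMER_PRODUCT_SIZE_RANGE`, `PRIMER_NUM_RETURN`) are standard Primer3 input tags.

```python
def parse_locus_name(name):
    """Split a locus name of the assumed form scaffold_start_end.

    rsplit keeps underscores inside the scaffold name intact.
    """
    scaffold, start, end = name.rsplit("_", 2)
    return scaffold, int(start), int(end)


def primer3_record(name, scaffolds, product_size_range="300-400"):
    """Build one Boulder-IO record for the locus.

    scaffolds: dict mapping scaffold name to its sequence string.
    Coordinates are treated as 1-based and inclusive (an assumption).
    """
    scaffold, start, end = parse_locus_name(name)
    template = scaffolds[scaffold][start - 1:end]
    lines = [
        f"SEQUENCE_ID={name}",
        f"SEQUENCE_TEMPLATE={template}",
        "PRIMER_TASK=generic",
        f"PRIMER_PRODUCT_SIZE_RANGE={product_size_range}",
        "PRIMER_NUM_RETURN=1",
        "=",  # a lone '=' terminates a Boulder-IO record
    ]
    return "\n".join(lines) + "\n"


# Toy reference, invented for illustration only
scaffolds = {"scaf1": "A" * 10 + "CGTACGTA" + "T" * 10}
record = primer3_record("scaf1_11_18", scaffolds)
print(record)
```

A file of such records can then be passed to `primer3_core` on standard input. The toy template here is far too short for real primer design and only demonstrates the record layout.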