version 1.0.2
Whole-genome amplification (WGA) by multiple displacement amplification (MDA) can introduce artificial concatemer sequences that interfere with whole-genome assembly. Here we propose CADECT, a concatemer detection tool for reads from these WGA assays.
Figure. Impact of MDA-Generated Concatemers on the Genome Assembly. (A) Concatemers generated by template switching; (B) Graph representation of the effect of concatemers on genome assembly (bubble fragmentation effect). (Agyabeng-Dadzie et al. 2024)
CADECT splits the reads into separate files and cuts each read into sliding windows, using the window size and the gap between windows chosen by the user. For ONT amplified reads, we suggest windows >= 500 bp with no overlaps (e.g. -w 500). A read that cannot generate more than one window (< 500 bp in the 500 bp window example) is classified as a "short" read and stored in the short.fasta/fastq output file. Reads with more than one window are treated as longer sequences: their windows are aligned against each other (global alignment), and reads in which overlaps are found are classified as putative concatemers. Longer sequences with no overlaps are classified as non-concatemers. A classification table is generated containing the read IDs, classification, number of windows generated and number of alignments found (note: the number of alignments is not equivalent to the number of repeats/copies). Both fastq and fasta formats are supported. The default global alignment coverage threshold is 0.7.
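The classification logic described above can be sketched in a few lines of Python. This is an illustration only, not the CADECT source: the function names, the `gap` and `min_similarity` parameters, and the decision to keep a trailing partial window (chosen so that short reads report one window, as in the example table) are all assumptions. The sketch also swaps the global alignment for `difflib.SequenceMatcher` from the standard library, so its similarity score is not the alignment coverage that CADECT computes.

```python
from difflib import SequenceMatcher
from itertools import combinations


def split_windows(seq, window=500, gap=0):
    """Cut seq into consecutive windows, skipping `gap` bases between them.

    Assumption: a trailing window shorter than `window` is kept.
    """
    step = window + gap
    return [seq[i:i + window] for i in range(0, len(seq), step)]


def classify_read(seq, window=500, gap=0, min_similarity=0.7):
    """Return (label, number of windows, number of similar window pairs)."""
    windows = split_windows(seq, window, gap)
    if len(windows) <= 1:
        return "short", len(windows), 0
    # Stand-in for the global alignment step: ratio() is a similarity score
    # in [0, 1], NOT the alignment coverage used by CADECT.
    overlaps = sum(
        1
        for a, b in combinations(windows, 2)
        if SequenceMatcher(None, a, b, autojunk=False).ratio() >= min_similarity
    )
    label = "putative_concatemers" if overlaps else "non_concatemers"
    return label, len(windows), overlaps
```

Because every window is compared with every other window, a read with n windows needs n(n-1)/2 comparisons, which is also why the overlap count can exceed the window count.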
Requirements:
- Python 3.6 or higher
- BioPython v1.83 (tested)
Easy install using conda/mamba
mamba create -n cadect -c bioconda -c conda-forge biopython
git clone https://github.com/rpbap/CADECT.git
conda activate cadect
python CADECT_7.py [OPTIONS] -R <Reads.fastq/fasta> -o <output_dir> -w <window size>
Flag description:
Required:
-R --reads fastq (or fasta) file with reads generated by WGA sequencing using ONT (required)
-o --output_dir Output directory name (required)
Options:
-w --window length of desired window sequences in bp (default = 500)
Output File | Description |
---|---|
classification_table.txt | File statistics of the CADECT pipeline |
non_concatemers.fastq | fastq/fasta file containing non-concatemeric reads |
putative_concatemers.fastq | fastq/fasta file containing putative concatemeric reads |
short.fastq | fastq/fasta file containing short reads |
progress.log | Classification progress report |
Read ID | Classification | Num Windows | Num Overlaps |
---|---|---|---|
3e8417bd-1c3d-4209-a2bd-b443822a7c27 | short | 1 | 0 |
1f3c3a56-b6a5-49dc-b9c7-2267440e094d | short | 1 | 0 |
b7ec9679-37df-42b5-8b4e-00b6fa5fe504 | non_concatemers | 8 | 0 |
d159b5a3-ee3b-4cc4-92ad-1422bf7a5a28 | putative_concatemers | 24 | 6 |
159ffb63-2583-4a7d-88a5-639111d4fe99 | putative_concatemers | 26 | 27 |
6d5ce662-395e-4af2-a68c-37015af5913b | putative_concatemers | 18 | 38 |
c3974c91-cf3d-4a0e-b7bd-0688ec05ea33 | non_concatemers | 8 | 0 |
b8194fa6-aa7b-4017-bd55-5538b8f31039 | putative_concatemers | 28 | 84 |
a6b76c03-832a-47a1-bb80-0a57b862118a | putative_concatemers | 19 | 7 |
- The current version uses Bio.pairwise2 for the global alignment, which has been deprecated in Biopython. We are working to replace it with something like Bio.Align.PairwiseAligner in a future version. If the message below appears when you run the pipeline, don't worry: it is only a warning and the pipeline is still working.
...python3.12/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
warnings.warn(
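If the warning clutters your logs, one option is to filter that single message with Python's standard `warnings` module. The helper below is a hypothetical sketch, not part of CADECT: the function name is made up, and it only takes effect if it is called before `Bio.pairwise2` is imported (for example at the top of a wrapper script).

```python
import warnings


def silence_pairwise2_deprecation():
    """Ignore only the Bio.pairwise2 deprecation notice; other warnings still show."""
    warnings.filterwarnings(
        "ignore",
        # Regex matched against the start of the warning text quoted above.
        message=r"Bio\.pairwise2 has been deprecated",
    )
```

Filtering by message text keeps the sketch free of a Biopython import and leaves every other warning visible.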
- Useful command line to get the global stats from the classification table:
cut -f 2 classification_table.txt | sort | uniq -c
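The same summary can be produced from Python, which is convenient inside a notebook or a downstream script. This is a minimal sketch that assumes the table is tab-separated (as the `cut -f 2` command implies) and starts with a single header row; `summarize_classifications` is a hypothetical helper name, not something shipped with CADECT.

```python
import csv
from collections import Counter


def summarize_classifications(table_path):
    """Count reads per class using the second tab-separated column."""
    counts = Counter()
    with open(table_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row (assumed to be present)
        for row in reader:
            if len(row) >= 2:
                counts[row[1]] += 1
    return counts
```

Calling `summarize_classifications("classification_table.txt")` returns a `Counter` keyed by class name, for example `short`, `non_concatemers` and `putative_concatemers`.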
Total number of reads | Cumulative read length | Processing time | OS tested |
---|---|---|---|
1,000 reads | 4,099,269 bp | ~109 seconds | MacOS Ventura |
40,000 reads | 47,837,224 bp | ~486 seconds | MacOS Ventura |
494,419 reads | 699,495,625 bp | ~4.3 hours (~15,788 seconds) | MacOS Ventura |
1,000 reads | 6,439,871 bp | ~1,106 seconds | Ubuntu 22.04 |
40,000 reads | 261,519,967 bp | ~13 hours (~48,614 seconds) | Ubuntu 22.04 |
Computer specs tested:
- OS: Ubuntu 22.04; MacOS Ventura 13.3.1
- Memory: 64GiB
- Processor: Intel Xeon(R) CPU @ 3.90GHz x 16; Apple M1 Max
We are working on a multithreaded implementation to reduce run time. In the meantime, we provide a fasta/fastq parser script under extras (split_input.py) that splits your input file into subsets, so that you can submit multiple jobs in parallel and shorten the overall run time.
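To illustrate the idea of splitting the input, here is a minimal sketch that cuts a FASTQ file into consecutive subsets. It is not the split_input.py script from extras: the function name, the `reads_per_chunk` parameter and the chunk file names are hypothetical, and the code assumes standard four-line FASTQ records (no wrapped sequences).

```python
from itertools import islice
from pathlib import Path


def split_fastq(input_path, output_dir, reads_per_chunk=10000):
    """Write consecutive subsets of a 4-line-per-record FASTQ; return the chunk paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with open(input_path) as handle:
        while True:
            # Read the next block of records (4 lines each) without loading the whole file.
            lines = list(islice(handle, 4 * reads_per_chunk))
            if not lines:
                break
            chunk_path = output_dir / f"chunk_{len(chunk_paths) + 1:04d}.fastq"
            chunk_path.write_text("".join(lines))
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Each resulting chunk can then be passed to its own `python CADECT_7.py -R <chunk> -o <output_dir> -w 500` job, with a separate output directory per job.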
- Agyabeng-Dadzie et al. (2024) "Evaluating the benefits and limits of multiple displacement amplification with whole-genome Oxford Nanopore Sequencing." bioRxiv.
- Rodrigo P. Baptista, PhD