Gene copy number variation (CNV) in generalist herbivore of Tetranychus urticae (the two-spotted spider mite)
- Gene CNV estimation
- Sequencing depth for genomic regions of interest
- Amino acid at target-site
- Genome scaffolding
- python 3+ (modules: pysam, pandas, numpy, Biopython, mpi4py)
- R v4.1
- Liftoff v1.6+
[NOTE] To enable parallel processing, python model mpi4py need to be installed.
This pipeline was developed for estimating CNV that focus on gene coding regions.
Input data:
- Reference genome in fasta format
- GTF annotation file for the reference genome (sorted and indexed)
- Illumina DNA-seq alignment in BAM format (sorted and indexed)
All the scripts under "gCNV" folder:
Step 1: count coverage on gene coding region (default stepsize 1 bp)
mpiexec -n 10 python gene_coverage.py [ref] [gtf] [bam] -step 5 -O [out]
# -n: core number
# ref: reference genome in fasta file <br>
# gtf: gtf file for the reference genome <br>
# bam: bam of DNA-read aligned to reference genome <br>
# -step: stepsize assigned as 5 bp
# out: output folder
Step 2: estimate single-copy coverage depth for the BAM file
Rscript single_depth.R [out]/pos_depth.txt [cov_est] -O [out]
# cov_est: the start estimation value for BAM file coverage <br>
Step 3: estimate gene CNV based on gene coding region coverage depth
mpiexec -n 10 python gene_CNV.py [out]/pos_depth.txt [out]/single_cov.txt -O [out]
# -n: core number
Outputs:
pos_depth.txt: raw data for coverage at gene coding positions
single_cov.txt: single copy coverage value
histogram.pdf: coverage distribution using a histogram
gene_cnv.txt: gene CNV report file
To run all three steps together:
mpiexec -n 10 python gene_coverage.py [ref] [gtf] [bam] -step 5 -O [out] && Rscript single_depth.R [out]/pos_depth.txt [cov_est] -O [out] && mpiexec -n 10 python gene_CNV.py [out]/pos_depth.txt [out]/single_cov.txt -O [out]
To count coverage at a specific regions of interest:
Input data:
- Reference genome in fasta format
- Illumina DNA-seq alignment in BAM format (sorted and indexed)
Run the script "region_coverage.py":
python region_coverage.py [ref] [bam] -chr [chr] -range [start] [end] -step [size] -O [out]
# ref: reference genome in fasta file <br>
# bam: bam of DNA-read aligned to reference genome <br>
# chr: chromosome of interest <br>
# start: start position for coverage counting <br>
# end: end position for coverage counting <br>
# size: step size (default 100 bp) <br>
# out: output file (prefix) <br>
Scaffold contigs to pseudo-chromosome level assembly using reference-based approach.
Input data:
- contigs from assemblers in fasta format
- reference genome in fasta format
- GTF annotation file for the reference genome (sorted and indexed)
All the scripts under "scaffold" folder:
Step 1: generate GTF file for the contigs by aligning gene sequences from reference genome to the contigs using [Liftoff](https://github.com/agshumate/Liftoff)
liftoff [target] [ref] -g [gff] -o [out] -p 24
# target: contigs fasta file
# ref: reference fasta file
# -g: gff annotation for the reference genome
# -o: output name
# -p:threads
Step 2: parse GTF file from last step, and order and orientate contigs based on genes aligned from reference genome
python gff_parse.py [target_gff] [ref_gff] -vis vis_gff.R -O [out]
# target_gff: gff output from last step, as for the target contigs
# ref_gff: gff for the reference genome
# to visualize the contig to reference alignment, assign vis_gff.R
# -O: output file name (prefix)
Step 3: Scaffold contigs based on the parsed gff file
python scaffold_tig.py
To caculate the N50 for contigs and scaffolds, run:
python N50.py [fasta]
# fasta file for which N50 value calculated on
python target_site_allele.py [bam] [fof] -btype <DNA/RNA> -out [out]
# bam: bam file used for recovering <br>
# fof: formated target site position <br>
# btype: either be "DNA" or "RNA" <br>
# out: output file name <br>
For example about how to format "fof" the target-site position file, see folder "data"