Gene copy number variation (CNV) in generalist herbivore of Tetranychus urticae (the two-spotted spider mite)

Programs used / Dependencies

python 3+ (modules: pysam, pandas, numpy, Biopython, mpi4py)
R v4.1
Liftoff v1.6+

[NOTE] To enable parallel processing, python model mpi4py need to be installed.

Gene CNV estimation

This pipeline was developed for estimating CNV that focus on gene coding regions.
Input data:

Reference genome in fasta format
GTF annotation file for the reference genome (sorted and indexed)
Illumina DNA-seq alignment in BAM format (sorted and indexed)

All the scripts under "gCNV" folder:
Step 1: count coverage on gene coding region (default stepsize 1 bp)

mpiexec -n 10 python gene_coverage.py [ref] [gtf] [bam] -step 5 -O [out]
# -n: core number 
# ref: reference genome in fasta file <br>
# gtf: gtf file for the reference genome <br>
# bam: bam of DNA-read aligned to reference genome <br>
# -step: stepsize assigned as 5 bp
# out: output folder

Step 2: estimate single-copy coverage depth for the BAM file

Rscript single_depth.R [out]/pos_depth.txt [cov_est] -O [out]
# cov_est: the start estimation value for BAM file coverage  <br>

Step 3: estimate gene CNV based on gene coding region coverage depth

mpiexec -n 10 python gene_CNV.py [out]/pos_depth.txt [out]/single_cov.txt -O [out]
# -n: core number

Outputs:
pos_depth.txt: raw data for coverage at gene coding positions
single_cov.txt: single copy coverage value
histogram.pdf: coverage distribution using a histogram
gene_cnv.txt: gene CNV report file

To run all three steps together:

mpiexec -n 10 python gene_coverage.py [ref] [gtf] [bam] -step 5 -O [out] && Rscript single_depth.R [out]/pos_depth.txt [cov_est] -O [out] && mpiexec -n 10 python gene_CNV.py [out]/pos_depth.txt [out]/single_cov.txt -O [out]

Sequencing coverage for genomic regions of interest

To count coverage at a specific regions of interest:
Input data:

Reference genome in fasta format
Illumina DNA-seq alignment in BAM format (sorted and indexed)

Run the script "region_coverage.py":

python region_coverage.py [ref] [bam] -chr [chr] -range [start] [end] -step [size] -O [out]
# ref: reference genome in fasta file <br>
# bam: bam of DNA-read aligned to reference genome <br>
# chr: chromosome of interest <br>
# start: start position for coverage counting <br>
# end: end position for coverage counting <br>
# size: step size (default 100 bp) <br>
# out: output file (prefix) <br>

Genome scaffolding

Scaffold contigs to pseudo-chromosome level assembly using reference-based approach. Input data:

contigs from assemblers in fasta format
reference genome in fasta format
GTF annotation file for the reference genome (sorted and indexed)

All the scripts under "scaffold" folder:
Step 1: generate GTF file for the contigs by aligning gene sequences from reference genome to the contigs using [Liftoff](https://github.com/agshumate/Liftoff)

liftoff [target] [ref] -g [gff] -o [out] -p 24
# target: contigs fasta file
# ref: reference fasta file
# -g: gff annotation for the reference genome
# -o: output name 
# -p:threads

Step 2: parse GTF file from last step, and order and orientate contigs based on genes aligned from reference genome

python gff_parse.py [target_gff] [ref_gff] -vis vis_gff.R -O [out]
# target_gff: gff output from last step, as for the target contigs
# ref_gff: gff for the reference genome
# to visualize the contig to reference alignment, assign vis_gff.R
# -O: output file name (prefix)

Step 3: Scaffold contigs based on the parsed gff file

python scaffold_tig.py

To caculate the N50 for contigs and scaffolds, run:

python N50.py [fasta]
# fasta file for which N50 value calculated on

Amino acid at target-site

python target_site_allele.py [bam] [fof] -btype <DNA/RNA> -out [out]
# bam: bam file used for recovering <br>
# fof: formated target site position <br>
# btype: either be "DNA" or "RNA" <br>
# out: output file name <br>

For example about how to format "fof" the target-site position file, see folder "data"

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Gene copy number variation (CNV) in generalist herbivore of Tetranychus urticae (the two-spotted spider mite)

Table of Contents

Programs used / Dependencies

Gene CNV estimation

Sequencing coverage for genomic regions of interest

Genome scaffolding

Amino acid at target-site

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
data		data
gCNV		gCNV
scaffold		scaffold
LICENSE		LICENSE
README.md		README.md
coverage.py		coverage.py
region_coverage.py		region_coverage.py
target_site_allele.py		target_site_allele.py

License

sotuamax/cnv_Turticae

Folders and files

Latest commit

History

Repository files navigation

Gene copy number variation (CNV) in generalist herbivore of Tetranychus urticae (the two-spotted spider mite)

Table of Contents

Programs used / Dependencies

Gene CNV estimation

Sequencing coverage for genomic regions of interest

Genome scaffolding

Amino acid at target-site

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages