Skip to content

sotuamax/cnv_Turticae

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

88 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Gene copy number variation (CNV) in generalist herbivore of Tetranychus urticae (the two-spotted spider mite)

Table of Contents

Programs used / Dependencies

  • python 3+ (modules: pysam, pandas, numpy, Biopython, mpi4py)
  • R v4.1
  • Liftoff v1.6+

[NOTE] To enable parallel processing, python model mpi4py need to be installed.

Gene CNV estimation

This pipeline was developed for estimating CNV that focus on gene coding regions.
Input data:

  1. Reference genome in fasta format
  2. GTF annotation file for the reference genome (sorted and indexed)
  3. Illumina DNA-seq alignment in BAM format (sorted and indexed)

All the scripts under "gCNV" folder:
Step 1: count coverage on gene coding region (default stepsize 1 bp)

mpiexec -n 10 python gene_coverage.py [ref] [gtf] [bam] -step 5 -O [out]
# -n: core number 
# ref: reference genome in fasta file <br>
# gtf: gtf file for the reference genome <br>
# bam: bam of DNA-read aligned to reference genome <br>
# -step: stepsize assigned as 5 bp
# out: output folder

Step 2: estimate single-copy coverage depth for the BAM file

Rscript single_depth.R [out]/pos_depth.txt [cov_est] -O [out]
# cov_est: the start estimation value for BAM file coverage  <br>

Step 3: estimate gene CNV based on gene coding region coverage depth

mpiexec -n 10 python gene_CNV.py [out]/pos_depth.txt [out]/single_cov.txt -O [out]
# -n: core number 

Outputs:
pos_depth.txt: raw data for coverage at gene coding positions
single_cov.txt: single copy coverage value
histogram.pdf: coverage distribution using a histogram
gene_cnv.txt: gene CNV report file

To run all three steps together:

mpiexec -n 10 python gene_coverage.py [ref] [gtf] [bam] -step 5 -O [out] && Rscript single_depth.R [out]/pos_depth.txt [cov_est] -O [out] && mpiexec -n 10 python gene_CNV.py [out]/pos_depth.txt [out]/single_cov.txt -O [out]

Sequencing coverage for genomic regions of interest

To count coverage at a specific regions of interest:
Input data:

  1. Reference genome in fasta format
  2. Illumina DNA-seq alignment in BAM format (sorted and indexed)

Run the script "region_coverage.py":
python region_coverage.py [ref] [bam] -chr [chr] -range [start] [end] -step [size] -O [out]
# ref: reference genome in fasta file <br>
# bam: bam of DNA-read aligned to reference genome <br>
# chr: chromosome of interest <br>
# start: start position for coverage counting <br>
# end: end position for coverage counting <br>
# size: step size (default 100 bp) <br>
# out: output file (prefix) <br>

Genome scaffolding

Scaffold contigs to pseudo-chromosome level assembly using reference-based approach. Input data:

  1. contigs from assemblers in fasta format
  2. reference genome in fasta format
  3. GTF annotation file for the reference genome (sorted and indexed)

All the scripts under "scaffold" folder:
Step 1: generate GTF file for the contigs by aligning gene sequences from reference genome to the contigs using [Liftoff](https://github.com/agshumate/Liftoff)
liftoff [target] [ref] -g [gff] -o [out] -p 24
# target: contigs fasta file
# ref: reference fasta file
# -g: gff annotation for the reference genome
# -o: output name 
# -p:threads 

Step 2: parse GTF file from last step, and order and orientate contigs based on genes aligned from reference genome

python gff_parse.py [target_gff] [ref_gff] -vis vis_gff.R -O [out]
# target_gff: gff output from last step, as for the target contigs
# ref_gff: gff for the reference genome
# to visualize the contig to reference alignment, assign vis_gff.R
# -O: output file name (prefix)

Step 3: Scaffold contigs based on the parsed gff file

python scaffold_tig.py 

To caculate the N50 for contigs and scaffolds, run:

python N50.py [fasta]
# fasta file for which N50 value calculated on 

Amino acid at target-site

python target_site_allele.py [bam] [fof] -btype <DNA/RNA> -out [out]
# bam: bam file used for recovering <br>
# fof: formated target site position <br>
# btype: either be "DNA" or "RNA" <br>
# out: output file name <br>

For example about how to format "fof" the target-site position file, see folder "data"

About

copy number variation of T. urticae

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published