Felix Grünberger1, Robert Reichelt1, Ingrid Waege1, Verena Niedermaier1, Korbinian Bronner1, Korbinian Bronner1, Marcell Kaljanac2, Nina Weber1, Zubeir El Ahmad1, Lena Knauss1, Gregor Madej2, Christine Ziegler2, Dina Grohmann1, Winfried Hausner1°
1 Institute of Microbiology and Archaea Centre, University of Regensburg, Universitätsstraße 31, 93053 Regensburg, Germany
2 Department of Structural Biology, Institute of Biophysics and Physical Biochemistry, University of Regensburg, Universitätsstraße 31, 93053 Regensburg, Germany
° Corresponding author
This is the repository for the manuscript “CopR, a global regulator of transcription to maintain copper homeostasis in Pyrococcus furiosus”. It contains a description of the bioinformatical tools and custom Rscripts to process data and a description of the downstream analysis.
The repository is currently actively developed.
Quality control at the raw data level using
FastQC
was done to examine most common quality parameters (total number of
reads sequenced, GC content and the overall base quality score). Quality
filtering parameters were based on this analysis.
We used
trimmomatic
(v. 0.36) in single-end-mode to trim raw reads (Bolger, Lohse, and
Usadel (2014)).
We start trimming the remaining reads by certain quality scores (PHRED33 based, cut.off Phread score 20) with calculations in a sliding window of 4 bases and a minumum length of 12 bases (40 for ChIP-seq).
#!/bin/bash
for file in $data_folder/raw_data/*_rna.fastq
do
filename_extended=$file##*/
filename=$filename_extended%%.*
java -jar trimmomatic-0.36.jar SE -phred33 -threads 4 $file $data_folder"/clean_data/"$filename"_trimmed.fq.gz" LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:12
echo $filename "trimming finished"
done
After trimming, reads were mapped to the genome using STAR
(v.2.5.4)
(Dobin et al. 2013).
#!/bin/bash
# generate a index file for the genome based on .gtf file
./STAR.dms \
--runThreadN 4 \
--runMode genomeGenerate \
--genomeDir $data_folder"/genome_data/star" \
--genomeFastaFiles $data_folder"/genome_data/CP023154.fa" \
--sjdbGTFfile $data_folder"/genome_data/CP023154.gtf" \
--genomeSAindexNbases 10
# align reads
for file in $data_folder/clean_data/*.fq.gz
do
filename_extended=$file##*/
filename=$filename_extended%%.*
mkdir $data_folder"/mapped_data/star/"$filename
./STAR \
--runMode alignReads \
--runThreadN 4 \
--readFilesCommand gunzip -c \
--genomeDir $data_folder"/genome_data/star" \
--readFilesIn $file \
--outFileNamePrefix $data_folder"/mapped_data/star/"$filename/$filename"_" \
--outSAMtype BAM SortedByCoordinate \
--limitIObufferSize 300000000 \
--limitBAMsortRAM 4199333281
echo $filename "mapping finished"
done
After mapping files were converted to other formats (BigWig, wiggle)
that can be used used for viewing in a genome browser or for further
analysis.
Strand-specific wig and bigwig files were finally created using
bam2wig
(Version 1.5, https://github.com/MikeAxtell/bam2wig).
#!/bin/bash
for file in $data_folder/mapped_data/star/*/*.bam
do
filename_extended=$file##*/
filename=$filename_extended%%.*
./bam2wig $file
echo $filename "bam2wig finished"
done
Reads were mapped to the P. furiosus genome using Bowtie 2
(v.
2.2.3) with default settings (Langmead and Salzberg
2012).
#!/bin/bash
# Indexing a reference genome
cd $data_folder/mapped_data
bowtie2-build $data_folder/genome_data/CP023154.fasta CP023154
# Align reads
for file in $data_folder/trimmed_data/*.fastq.gz
do
filename_extended=${file##*/}
filename=${filename_extended%%.*}
(bowtie2 -x CP023154 -U $file -S $data_folder"/mapped_data/"$filename".sam") 2>$data_folder"/mapped_data/"$filename".sam".log
echo $filename mapping finished
done
SAM files were converted to sorted BAM files using samtools and extended towards the 3´direction to their fragment-size (200 bp size selection during library preparation) to better represent the precise protein-DNA interaction (Li et al. 2009).
#!/bin/bash
# sam to indexed sorted bam
for file in $data_folder/mapped_data/*.sam
do
filename_extended=$file##*/
filename=$filename_extended%%.*
samtools view -bS -q 40 $file > $data_folder"/mapped_data/"$filename."bam"
samtools sort $data_folder"/mapped_data/"$filename."bam" -o $data_folder"/mapped_data/"$filename."sorted.bam"
samtools index $data_folder"/mapped_data/"$filename."sorted.bam"
echo $filename sam_to_bam finished
done
# extend reads towards their 3´end
for file in $data_folder/mapped_data/*.sorted.bam
do
filename_extended=$file##*/
filename=$filename_extended%%.*
bamToBed -i $file | slopBed -i - -g $data_folder"/genome_data/chrom.sizes.txt" -s -r 150 -l 0 | bedToBam -i - -g $data_folder"/genome_data/chrom.sizes.txt" > $data_folder"/mapped_data/bedtools_extended/"$filename"_extended.bam"
echo $filename extension finished
done
# sort & index
for file in $data_folder/mapped_data/bedtools_extended/*_extended.bam
do
filename_extended=$file##*/
filename=$filename_extended%%.*
samtools sort $file -o $data_folder/mapped_data/bam_files/$filename".sorted.extended.bam"
samtools index $data_folder/mapped_data/bam_files/$filename".sorted.extended.bam"
done
# convert to bedgraph (for log2-calculations in R)
for file in $data_folder/mapped_data/bam_files/*.sorted.extended.bam
do
filename_extended=${file##*/}
filename=${filename_extended%%.*}
genomeCoverageBed -ibam $file -d > $data_folder"/mapped_data/bed_files/"$filename".sorted.extended.position.bedgraph"
echo $filename position specific genomecoverage finished
done
Position-specific enrichments in BED format were calculated by i)
scaling the reads in each dataset according to sequencing depth, ii)
calculation of the ratio between IP and input for each replicate, and
iii) averaging of the IP/input ratio from the biological triplicates and
taking the log2 for comparison (compare
normalise_chipseq
).
We performed differential expression analysis using DESeq2 (Love,
Anders, and Huber 2014). The analysis performed in R,
including preparation of count matrices, statistical testing and
exploratory data analysis can be retraced using the
deseq2_analysis
script.
The steps used to perform ChIP-seq enrichment analysis and integration
with RNA-seq data are shown in the
chipseq_downstream
script.
Raw sequence data have been uploaded to the NCBI sequence read archive (SRA) and are available under project accession number PRJNA603674.
This project is under the general MIT License - see the LICENSE file for details
Bolger, Anthony M., Marc Lohse, and Bjoern Usadel. 2014. “Trimmomatic: A flexible trimmer for Illumina sequence data.” Bioinformatics. https://doi.org/10.1093/bioinformatics/btu170.
Dobin, Alexander, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. 2013. “STAR: Ultrafast universal RNA-seq aligner.” Bioinformatics 29 (1): 15–21. https://doi.org/10.1093/bioinformatics/bts635.
Langmead, Ben, and Steven L. Salzberg. 2012. “Fast gapped-read alignment with Bowtie 2.” Nature Methods. https://doi.org/10.1038/nmeth.1923.
Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. 2009. “The Sequence Alignment/Map format and SAMtools.” Bioinformatics 25 (16): 2078–9. https://doi.org/10.1093/bioinformatics/btp352.
Love, M. I., Simon Anders, and Wolfgang Huber. 2014. Differential analysis of count data - the DESeq2 package. Vol. 15. 12. https://doi.org/110.1186/s13059-014-0550-8.