Merge pull request #4 from kauralasoo/CAGE

Build new promoter annotations based on CAGE peaks from FANTOM5
kauralasoo · Mar 10, 2022 · badb87f · badb87f
2 parents 6976042 + 81e3726
commit badb87f
Show file tree

Hide file tree

Showing 14 changed files with 94,112 additions and 98 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,12 +1,13 @@
-
 tests/data/cdss.rds
 tests/data/exons.rds
 tests/data/Homo_sapiens.GRCh37.74.TranscriptDb.db
 *.DS_Store
 tests/testthat/.Rhistory
-.Rhistory
+*.Rhistory
 .Rproj.user
 .Rbuildignore
 txrevise.Rproj
 scripts/.snakemake
 scripts/processed
+scripts/SlurmOut
+*.img
diff --git a/Dockerfile b/Dockerfile
@@ -1,6 +1,8 @@
-FROM bioconductor/bioconductor_docker:RELEASE_3_10
-LABEL authors="Kaur Alasoo" \
+FROM bioconductor/bioconductor_docker:RELEASE_3_11
+LABEL authors="Andreas Vija" \
       description="Docker image containing all requirements for txrevise R package and pipeline"
 
-RUN R -e "BiocManager::install(c('GenomicFeatures','devtools', 'optparse', 'rtracklayer','purrrlyr','readr', 'tidyr'))"
+RUN apt-get update && apt-get install -y libcurl4
+RUN R -e "install.packages(install.packages('XML', repos = 'http://www.omegahat.net/R'))"
+RUN R -e "BiocManager::install(c('GenomicFeatures', 'devtools', 'optparse', 'rtracklayer', 'purrrlyr', 'readr', 'tidyr', 'data.table', 'ggplot2'))"
 RUN R -e "devtools::install_github('kauralasoo/txrevise')"
diff --git a/README.md b/README.md
@@ -1,19 +1,44 @@
 # _txrevise_
-_txrevise_ R package provides utilites to pre-process Ensembl transcript annotations to quantify differences in transcript strucuture (alternative promoters, alternative splicing, alternative poly-adenylation) either between experimental conditions or genotypes (e.g. for transcript usage quntitative trait loci (tuQTL) mapping). 
+_txrevise_ R package provides utilites to pre-process Ensembl transcript annotations to quantify differences in transcript strucuture (alternative promoters, alternative splicing, alternative poly-adenylation) either between experimental conditions or genotypes (e.g. for transcript usage quntitative trait loci (tuQTL) mapping).
 
 If you use txrevise in your research, please cite the following paper: [Alasoo, Kaur, et al. "Genetic effects on promoter usage are highly context-specific and contribute to complex traits." Elife 8 (2019): e41673](https://doi.org/10.7554/eLife.41673)
 
 
+## Pre-computed event annotations
+Running _txrevise_ on the latest version of Ensembl can be quite timeconsuming. Thus, to make it easier to get started, we have pre-computed alternatve transcription events in the GFF3 format for both GRCh37 and GRCh38 reference genomes.
+
+We have constructed two types of events. In the main event files we have masked the alternative internal exons in the promoter and 3' end events (labeled upstream and downstream). This ensures that we can reliably distinguish promoter usage and 3' end usage from alternative splicing, which are likely to be driven by distinct molecular mechanisms. The main annotation files are:
+
+- [Homo_sapiens.GRCh37.87.version_1.tar.gz](https://zenodo.org/record/1302499/files/Homo_sapiens.GRCh37.87.version_1.tar.gz)
+- [Homo_sapiens.GRCh38.92.version_1.tar.gz](https://zenodo.org/record/1302499/files/Homo_sapiens.GRCh38.92.version_1.tar.gz)
+- [Homo_sapiens.GRCh38.96.version_1.tar.gz](https://zenodo.org/record/3232932/files/Homo_sapiens.GRCh38.96.version_1.tar.gz)
+
+We have also constructed an alternative set of "raw" annotation files where the alternative internal exons in promoter and 3' end events have not masked. This maximises QTL discovery, because we can now also detect splicing event near promoters and 3 ends, but it comes at the expense that we are no longer able to distinguish between different molecular mechanisms. The raw annotation files are:
+
+-   [Homo_sapiens.GRCh37.87.raw_events.version_1.tar.gz](https://zenodo.org/record/1302499/files/Homo_sapiens.GRCh37.87.raw_events.version_1.tar.gz)
+-   [Homo_sapiens.GRCh38.92.raw_events.version_1.tar.gz](https://zenodo.org/record/1302499/files/Homo_sapiens.GRCh38.92.raw_events.version_1.tar.gz)
+
 
 ## Constructing transcription events
-This section contains step-by-step instruction for how to construct transcriptional events based from Ensembl transcrtipt annotations. We have also written a [Snakefile](https://github.com/kauralasoo/txrevise/blob/master/scripts/Snakefile) for [snakemake](http://snakemake.readthedocs.io/en/stable/) that automates all of the steps and can be easily [executed in parallel](http://snakemake.readthedocs.io/en/latest/executable.html) on multiple cores or on a compute cluster. 
 
-### Dependencies
+### Automatically
+We have written a [Snakefile](https://github.com/kauralasoo/txrevise/blob/master/scripts/Snakefile) for [snakemake](http://snakemake.readthedocs.io/en/stable/) that
+ - automates all of the steps
+ - can be easily [executed in parallel](http://snakemake.readthedocs.io/en/latest/executable.html) on multiple cores or on a compute cluster if desired
+ - can use a [Singularity](https://sylabs.io/guides/3.6/user-guide/) container, preventing R package version conflicts and ensuring reproducible results
+Thus, using the Snakefile is the recommended method. Building the Singlarity container is described [here](https://github.com/kauralasoo/txrevise/blob/master/build_container.sh) and running the snakemake pipeline [here](https://github.com/kauralasoo/txrevise/blob/master/scripts/run_snakemake.sh).
+
+### Manually
+This section contains step-by-step instructions for manually constructing transcriptional events based on Ensembl transcript annotations.
+
+#### Dependencies
 Make sure that you have R 3.5 installed together with the following packages:
 
  - [optparse](https://cran.r-project.org/package=optparse)
  - [dplyr](https://cran.r-project.org/package=dplyr)
  - [purrr](https://cran.r-project.org/package=purrr)
+ - [data.table](https://cran.r-project.org/package=data.table)
+ - [ggplot2](https://cran.r-project.org/package=ggplot2)
  - [devtools](https://cran.r-project.org/package=devtools)
  - [rtracklayer](https://bioconductor.org/packages/rtracklayer/)
  - [GenomicFeatures](https://bioconductor.org/packages/GenomicFeatures/)
@@ -24,67 +49,43 @@ You can install _txrevise_ directly from GitHub using the following command:
 	library("devtools")
 	install_github("kauralasoo/txrevise")
 
-### Step 1: Download the GTF file
-First, you need to download the GTF file from the Ensembl website. For example, if you want to use Ensembl version 92, you should download the following file:
+#### Step 1: Download the GTF file
+First, you need to download the GTF file from the Ensembl website. For example, if you want to use Ensembl version 96, you should download the following file:
 
-	wget ftp://ftp.ensembl.org/pub/release-92/gtf/homo_sapiens/Homo_sapiens.GRCh38.92.gtf.gz
+	wget ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/Homo_sapiens.GRCh38.96.gtf.gz
 
-### Step 2: Extract tanscript tags from the GTF file
+#### Step 2: Extract tanscript tags from the GTF file
 Ensembl GTF file contains a tags field marking protein coding transcript that are truncated at 5' or 3' ends. Txrevise uses this information to extend truncated transcripts. Unfortunately, the `import` function from rtracklayer does not import the tags field correctly. Thus, we need to first extract the transcript tags manually using a custom Python 3 script that comes with txrevise.
 
-	python scripts/extractTranscriptTags.py --gtf Homo_sapiens.GRCh38.92.gtf.gz > Homo_sapiens.GRCh38.92.transcript_tags.txt
+	python scripts/extractTranscriptTags.py --gtf Homo_sapiens.GRCh38.96.gtf.gz > Homo_sapiens.GRCh38.96.transcript_tags.txt
 
-### Step 3: Prepare transcript annotations for event construction
+#### Step 3: Prepare transcript annotations for event construction
 Next, we need to convert the transcript annotations (in GTF format) to a binary representation that can be efficiently used by txrevise.
 
-	Rscript scripts/prepareAnnotations.R --gtf Homo_sapiens.GRCh38.92.gtf.gz --tags Homo_sapiens.GRCh38.92.transcript_tags.txt --out Homo_sapiens.GRCh38.92.txrevise_annotations.rds
+	Rscript scripts/prepareAnnotations.R --gtf Homo_sapiens.GRCh38.96.gtf.gz --tags Homo_sapiens.GRCh38.96.transcript_tags.txt --out Homo_sapiens.GRCh38.96.txrevise_annotations.rds
 
-### Step 4: Construct transcription events
-Finally, we can use the `constructEvents.R` script to construct alternative transcription events. Since the implementation of the various algorithms in txrevise have not been optimised for efficiency, processing the full Ensembl GTF file can take several days. However, different genes can trivially processed in parallel. To simplify parallel execution, `constructEvents.R` has the `--batch` option that takes as an input two integers separated by a space. For example, '1 200' would mean that all genes are split into 200 batches and and only genes in the first are be processed. To process all genes, you simply need to iterate from '1 200' to '200 200'. 
-
-	Rscript scripts/constructEvents.R --annot Homo_sapiens.GRCh38.92.txrevise_annotations.rds --batch '1 2000' --out txrevise_events  --fill TRUE
+#### Step 4: Construct transcription events
+Finally, we can use the `constructEvents.R` script to construct alternative transcription events. Since the implementation of the various algorithms in txrevise have not been optimised for efficiency, processing the full Ensembl GTF file can take several days. However, different genes can trivially processed in parallel. To simplify parallel execution, `constructEvents.R` has the `--batch` option that takes as an input two integers separated by a space. For example, '1 200' would mean that all genes are split into 200 batches and and only genes in the first are be processed. To process all genes, you simply need to iterate from '1 200' to '200 200'.
 
-### Step 5: Merge output files
+	Rscript scripts/constructEvents.R --annot Homo_sapiens.GRCh38.96.txrevise_annotations.rds --batch '1 2000' --out txrevise_events  --fill TRUE
+
+#### Step 5: Merge output files
 Each run of `constructEvents.R` produces up to six output files: alternative promoter, internal exon and 3' end events (labeled as upstream, contained and downstream) for two possible sets of shared exons (grp_1 and grp_2). See the [vignette](http://htmlpreview.github.io/?https://github.com/kauralasoo/txrevise/blob/master/inst/doc/construct_events.html) for more details on how the events are constructed.
 
 To be able to use these events with transcript quantification software such as [Salmon](http://salmon.readthedocs.io/en/latest/) or [kallisto](https://pachterlab.github.io/kallisto/), we first need to merge all files from different batches. The `grep -v "^#"` command removes the comment lines from the beginning of each individual GFF3 file.
 
 	cat txrevise_events/txrevise.grp_1.upstream.* | grep -v "^#" > txrevise.grp_1.upstream.gff3
 	cat txrevise_events/txrevise.grp_1.contained.* | grep -v "^#" > txrevise.grp_1.contained.gff3
 	cat txrevise_events/txrevise.grp_1.downstream.* | grep -v "^#" > txrevise.grp_1.downstream.gff3
-	
+
 	cat txrevise_events/txrevise.grp_2.upstream.* | grep -v "^#" > txrevise.grp_2.upstream.gff3
 	cat txrevise_events/txrevise.grp_2.contained.* | grep -v "^#" > txrevise.grp_2.contained.gff3
 	cat txrevise_events/txrevise.grp_2.downstream.* | grep -v "^#" > txrevise.grp_2.downstream.gff3
 
-### Step 6: Quantify event expression
+#### Step 6: Quantify event expression
 The GFF3 files constructed by _txrevise_ can be used with any transcript quantification algorithm (e.g. [Salmon](http://salmon.readthedocs.io/en/latest/) or [kallisto](https://pachterlab.github.io/kallisto/)). **Importantly, since the promoter, internal exon and 3' end events constructed by txrevise share some of the same exons, they should always quantified independently**. For example, if you have six GFF3 files from txrevise (txrevise.grp_1.upstream.gff3, txrevise.grp_1.contained.gff3, txrevise.grp_1.downstream.gff3, txrevise.grp_2.upstream.gff3, txrevise.grp_2.contained.gff3, txrevise.grp_2.downstream.gff3), you should also run the quantification algorithm six times with each of the GFF3 file independently.
 
-## Pre-computed event annotations
-Running _txrevise_ on the latest version of Ensembl can be quite timeconsuming. Thus, to make it easier to get started, we have pre-computed alternatve transcription events in the GFF3 format for both GRCh37 and GRCh38 reference genomes.
-
-We have constructed two types of events. In the main event files we have masked the alternative internal exons in the promoter and 3' end events (labeled upstream and downstream). This ensures that we can reliably distinguish promoter usage and 3' end usage from alternative splicing, which are likely to be driven by distinct molecular mechanisms. The main annotation files are:
-
-- [Homo_sapiens.GRCh37.87.version_1.tar.gz](https://zenodo.org/record/1302499/files/Homo_sapiens.GRCh37.87.version_1.tar.gz)
-- [Homo_sapiens.GRCh38.92.version_1.tar.gz](https://zenodo.org/record/1302499/files/Homo_sapiens.GRCh38.92.version_1.tar.gz)
-- [Homo_sapiens.GRCh38.96.version_1.tar.gz](https://zenodo.org/record/3232932/files/Homo_sapiens.GRCh38.96.version_1.tar.gz)
-
-We have also constructed an alternative set of "raw" annotation files where the alternative internal exons in promoter and 3' end events have not masked. This maximises QTL discovery, because we can now also detect splicing event near promoters and 3 ends, but it comes at the expense that we are no longer able to distinguish between different molecular mechanisms. The raw annotation files are:
-
--   [Homo_sapiens.GRCh37.87.raw_events.version_1.tar.gz](https://zenodo.org/record/1302499/files/Homo_sapiens.GRCh37.87.raw_events.version_1.tar.gz)
--   [Homo_sapiens.GRCh38.92.raw_events.version_1.tar.gz](https://zenodo.org/record/1302499/files/Homo_sapiens.GRCh38.92.raw_events.version_1.tar.gz)
-
 ## Extracting event sequences for quantification
 Many transcript expression quantification tools (e.g. Salmon or kallisto) do not directly work with transcript annotations in GFF3 format and require the transcript sequences in FASTA format instead. The simplest way to convert GFF3 annotations into transcript (or event) sequences is to use the `gffread` tool from the [cufflinks](http://cole-trapnell-lab.github.io/cufflinks/) package:
 
 	gffread -w <output.fa> -g <reference_genome.fa> <input.gff3>
-
-<!--stackedit_data:
-eyJoaXN0b3J5IjpbLTEzMzc1MDcwODUsNTkyNTM4MTM4LC04Mz
-E3NTU5OTQsLTg2MzgwMzU3MywxNTkyMjg2MTUsMTA3NzMwMzk0
-MCwtMTgwNTc5NzIxMiwxODE0ODA1MjI4LDg4MzA3MzMxOCwtNT
-czNDg4MTc4LDE2NTY1MDQ1MDMsMTEyNjAwNjg2NSwyMDAxOTE1
-NTA1LDE0Mjk3ODk5NzIsMTU1NDYyOTIyMSwxNjQxOTM2Mzk5LD
-c1NjI1MDcwLC0xMzU0MjI0NTAsLTE0MDcxMjc3MTUsMTY1MzMx
-OTM2NV19
--->
diff --git a/build_container.sh b/build_container.sh
@@ -1,8 +1,9 @@
 ### Build the Docker container
-sudo docker build -t kauralasoo/txrevise .
+#sudo docker build -t andreasvija/txrevise .
 
 ### Push to DockerHub
-docker push kauralasoo/txrevise
+#docker push andreasvija/txrevise
 
 ### Build a local copy of the Singularity container
-singularity build txrevise.img docker://kauralasoo/txrevise:latest
+cd scripts
+singularity build txrevise.img docker://andreasvija/txrevise:latest