Skip to content

SUPPA2 tutorial

JLTrincado edited this page Feb 15, 2018 · 13 revisions

In this section we are gonna explain the usual steps for analyzing the PSI changes between 2 conditions in an available and public dataset. We hope that this tutorial will give some kind of advice to users at the time of analyzing alternative splicing (AS) on their own datasets.

We are going to use the data included in Best et al, 2015¹. In this work they analyze RNA-seq data for the knockdown of Tra2α and Tra2β in MDA-MB-231 cells and controls with 3 replicates per condition (GSE59335). These two RNA binding proteins have both been implicated in the regulation of AS. In this tutorial, we will show how to to study the impact on the transcriptome associated to the depletion of TRA2 using SUPPA2.

Transcript quantification

SUPPA2 obtains the inclusion levels of AS events exploiting transcript quantification. There are different measures for isoform quantification (counts, TMM, FPKM...). We strongly recommend the use of TPM (Transcripts Per Million), because is normalized for gene length first, and then for sequencing depth. This allows direct comparison between samples (check the following video if you want to know more).

There are several tools for transcript quantification (this is an interesting review² on this matter). Each one has different pros and cons and the user is free to use the one that fits better to his requirements. We recommend to use Salmon³, because is one of the fastest methods for quantification without compromising accuracy and it also offers some other interesting options like GC bias correction or bootstrapping among others. The commands indicated here are oriented for using this tools.

Index

Like many tools for alignment or mapping, Salmon needs to generate first an index with the transcript annotation (IMPORTANT: the index has to be created on the transcriptome, not on the genome. If you see that the index creation is taking too much time or memory, maybe you are using a genome instead). Again, there are different annotations (UCSC, Gencode, RefSeq...) and different versions. The user should think which one fits better for his analysis. In this example, we are going to use Ensembl hg19. Once it is downloaded and unzipped, we can now generate the index with Salmon:

# Sometimes is necessary to explicitly export the libraries to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=~/salmon/lib:$LD_LIBRARY_PATH
~/salmon/bin/salmon index -t hg19_EnsenmblGenes_sequence_ensenmbl.fasta -i Ensembl_hg19_salmon_index

Quantification

Once it has finished, we are ready to run the quantification. Since our reads are in FASTQ files, we are going to make use of the quasi-mapping-based mode. In order to get details about the different parameters, you have an extended tutorial about Salmon usage.

# Example for SRR1513329
~/salmon/bin/salmon quant -i Ensembl_hg19_salmon_index -l ISF --gcBias -1 SRR1513329_1.fastq -2 SRR1513329_1.fastq -p 4 -o SRR1513329

As we mentioned at the beginning, we want the TPM values of the isoforms. SUPPA2 includes a script for the extraction of this column of the Salmon output (4th columns from quant.sf files)

# It requires python3 and pandas library (pip install pandas)
# -k indicates the row used as index
# -f indicates the column to be extracted from the Salmon output
python ~/SUPPA/multipleFieldSelection.py -i ~/Salmon_output/*/quant.sf -k 1 -f 4 -o iso_tpm.txt

For the calculation of the events, SUPPA2 requires a GTF file. Here we provide one from Ensembl. The transcripts ids in our iso_tpm file should be equal as the ids in the GTF. We have a small script in R for formatting this ids

~/Rscript ~/scripts/format_Ensembl_ids.R iso_tpm.txt

Here is the file already formatted with the TPM of the isoforms.

Event calculation

Before performing the PSI calculation of the events, we need to extract from the annotation the event coordinates . SUPPA2 generates the different AS events reading transcript and gene information from the "exon" lines and creates an ioe file.

In this analysis, we are interested in testing all the different types of events:

#Generate the ioe files: 
python ~/SUPPA/suppa.py generateEvents -i annotation/Homo_sapiens.GRCh37.75.formatted.gtf -o ensembl_hg19.events -e SE SS MX RI FL
#Put all the ioe events in the same file:
awk '
    FNR==1 && NR!=1 { while (/^<header>/) getline; }
    1 {print}
' *.ioe > ensembl_hg19.events.ioe

IMPORTANT: In this ocassion we have used Ensembl. If the user decide to use RefSeq annotation, we suggest to use the --pooled-genes option. RefSeq genes are labeled according to the mRNA sequences they contain, not by the gene locus. This could cause that the same id is mapped to two different places in the genome. Also, sometimes two overlapping isoforms that share some exon or splice-sites are labelled as different genes. The pooled-genes option redefines a gene as a locus with transcripts on the same strand that overlap each other in genome extent and share at least one splice-site pairwise. These issues are a big deal to calculate events and estimate PSIs, since you would be assigning wrong contributions to the events because those transcripts should be considered together as part of the same gene. There are also cases like that in Gencode and Ensembl annotation, although it is less severe.

With the ioe file generated with the event cordinates, we are ready to obtain their PSI values:

python ~/SUPPA/suppa.py psiPerEvent -i ensembl_hg19.events.ioe -e iso_tpm_formatted.txt -o TRA2_events

The generated file (TRA2_events.psi) contains the PSI values per event and sample. In Best et al, they identify and validated 53 human splicing factors which AS has been after Tra2 protein depletion:

53 splicing factors

Let's choose one of them: CHEK1 (ENSG00000149554), which encodes a key DNA damage response protein. They describe a decrease in the inclusion of exon chr11:125497502-125497725 of this gene. Taking into account that SRR1513329,SRR1513330 and SRR1513331 are negative controls (NC) and SRR1513332,SRR1513333 and SRR1513334 are the ones with the depletion of TRA2 proteins (KD), we look for the SE event associated to this exon in our file:

Event_ID	SRR1513329	SRR1513330	SRR1513331	SRR1513332	SRR1513333	SRR1513334
ENSG00000149554;SE:chr11:125496728-125497502:125497725-125499127:+	0.7471194793	0.7892328982	0.781078454	0.4554597402	0.4493233894	0.4237912839

We provide and auxiliary script to generate boxplots with the PSI values of an event (the user should indicate which samples belong to one condition or another):

python ~/scripts/generate_boxplot_event.py -i events.psi -e "ENSG00000149554;SE:chr11:125496728-125497502:125497725-125499127:+" -g 1-3,4-6 -c NC,KD -o /home/juanluis/Desktop/Work/Master_class/

CHEK1 Boxplot PSI

Indeed, we see that this decrease is also occurring in our PSI values generated with SUPPA2.

Differential splicing with local events

SUPPA2 provides another tool for performing a differential splicing analysis between multiple conditions. With our PSI file, we just need to provide the TPM and the PSI of our samples in separate files according to the condition. The user can easily perform this with the following:

# Split the PSI and TPM files between the 2 conditions:
~/scripts/split_file.R iso_tpm_formatted.txt SRR1513329,SRR1513330,SRR1513331 SRR1513332,SRR1513333,SRR1513334 TRA2_NC_iso.tpm TRA2_KD_iso.tpm -i
~/scripts/split_file.R events.psi SRR1513329,SRR1513330,SRR1513331 SRR1513332,SRR1513333,SRR1513334 TRA2_NC_events.psi TRA2_KD_events.psi -e

This creates a separate file (PSI and TPM per condition). Now, we are ready to perform the differential splicing analysis:

python ~/SUPPA/suppa.py diffSplice -m empirical -gc -i ./annotation/ensembl_hg19.events.ioe -p TRA2_KD_events.psi TRA2_NC_events.psi -e TRA2_KD_iso.tpm TRA2_NC_iso.tpm -o TRA2_diffSplice

This is the output. The file TRA2_diffSplice.dpsi gives the DeltaPSI as the difference of the mean PSI between conditions, and the p_value of this difference. We can check how is the distribution of the PSI along the expression of the events (this is the script used for generating the following plot):

MA_plot

We observe that the smaller the expression, the greater the DeltaPSI has to be in order to be detected as significant. There is a total of 979 significant events (p_value < 0.05). Our event of CHEK1 is included in our cloud of significant blue dots. We can check the associated SE events to this gene on TRA2_diffSplice.dpsi:

>grep ENSG00000149554 TRA2_diffSplice.dpsi | grep SE
ENSG00000149554;SE:chr11:125495924-125496312:125496470-125496644:+	nan	1.0
ENSG00000149554;SE:chr11:125496715-125497502:125497725-125499127:+	0.0	0.37795537799999995
**ENSG00000149554;SE:chr11:125496728-125497502:125497725-125499127:+	0.32961880600000004	0.0**
ENSG00000149554;SE:chr11:125514163-125514407:125514538-125523641:+	0.0021020729000000003	0.3806193806
ENSG00000149554;SE:chr11:125514538-125523641:125523742-125525120:+	-0.0078827407	0.37795537799999995
ENSG00000149554;SE:chr11:125525242-125526101:125526230-125545823:+	-0.2614825116	0.0774225774

We observe that the only one significant is the third event, that matches with the event measured in Best et al. So, we can confirm that this events is differentially spliced between this conditions.

Cluster analysis

Using the PSI values in all samples, and the information of which events change significantly in at least one comparison, SUPPA2 calculates clusters of events using a density-based clustering. Density-based clustering has the advantage that it can put together into the same cluster events that even though they might not have similar PSI values, their behavior is similar across conditions. This functionality is specially useful for analyzing experiments with multiple conditions.

We are gonna use SUPPA2 clustering on the following dataset from Busskamp et al 2014. In this work, a 4-day time-course of differentiation of human induced pluripotent stem cell (iPSCs) into bipolar neurons is analyzed. SUPPA2 offers two density-based clustering methods: DBSCAN [5] and OPTICS [6]. OPTICS is an optimization of DBSCAN and we expect to obtain better clustering, with more computing time. Since our data is not so big, we are gonna use of OPTICS.

First, we need to run differential splicing step on this data. We learned before how to do this, so we provide the .dpsi and .psivec files (just using the skipping exon (SE) events). We run the following the command:

python ~/SUPPA/suppa.py clusterEvents --dpsi ~/additional_files/Busskamp_multicondition.SE.dpsi --psivec ~/additional_files/Busskamp_multicondition.SE.psivec --sig-threshold 0.05 --eps 0.2 --separation 0.11 -dt 0.2 --min-pts 10 --groups 1-3,4-6,7-9,10-12 -c OPTICS -o ~/Busskamp_OPTICS

This command generates two files: a .log file with the information of the clusters found and a .clustvec, with the events, their mean psi values per condition and the clusters association. If an event has no associated cluster it will be assigned to -1.

event_id  cluster_id  cond1_PSI_avg cond2_PSI_avg cond3_PSI_avg cond4_PSI_avg
ENSG00000108176;SE:chr10:69583150-69587300:69587441-69597692:-	2	0.245977589185	0.0242424332922	0.0522646864829	0.0450961485589
ENSG00000110906;SE:chr12:109895487-109895797:109895886-109898441:-	2	0.371902642559	0.0807209159074	0.116974207413	0.117733712549
ENSG00000129946;SE:chr19:439030-439233:439491-440862:-	2	0.356220270706	0.0483798341422	0.0356880417646	0.0212254475686
ENSG00000140386;SE:chr15:77067458-77085574:77085732-77087621:-	2	0.36711447746	0.0542151802375	0.0414659102421	0.0412346322595
ENSG00000140386;SE:chr15:77134272-77148151:77148217-77150150:-	2	0.343608286307	0.0542151802375	0.0548847391885	0.0383696931693

We obtained 3 well-differentiated clusters (silhouette score = 0.572). Cluster 0 increased inclusion at late steps of differentiation and showed an enrichment in microexons (32 out of 115 events). In contrast, clusters 1 and 2 decreased inclusion with differentiation, and contained 2 out of 20 events, and no microexons, respectively. These results are in agreement with previously observed enrichment of microexon inclusion in differentiated neurons in other works [7,8]. Extracted from SUPPA2 paper, here we have how the different psi values in the clusters look like. In blue are the microexonic events (shorter than 28 nt):

Cluster_plots

In this work, we further validated this clusters performing motif enrichment analysis between the regulated clusters in the events compared to non-regulated using another software developed in our lab MoSEA. We show that this functionality is useful for identifying groups of events changing consistently and could open new ways of finding the biological meaning of alternative splicing changes

Differential transcript usage

We have developed a new option in SUPPA2 in order to test differential usage of transcripts. Transcript events are important as they can describe splicing variations that are complex and cannot be encapsulated in a simple event. It's based in the same method for differential splicing with events. We can perform the same analysis on our TRA2 data.

IMPORTANT: This functionality is still on test. If you want to try it, clone the "testing" branch.

We are going to come back again to our TRA2 data. First, it's necessary to obtain an ioi file from the annotation. The ioi file provides for each transcript in a gene, the set of all transcripts from that gene from which the transcript relative abundance is calculated.

#Run SUPPA2 for obtaining the ioi file from the annotation (we will take a look just on the SE)
python ~/SUPPA-testing/suppa.py generateEvents -f ioi -i ~/annotation/Homo_sapiens.GRCh37.75.formatted.gtf -o ./ensembl_hg19.isoforms -e SE 

We need also to compute psi of the isoforms:

python ~/SUPPA-testing/suppa.py psiPerIsoform -g ~/annotation/Homo_sapiens.GRCh37.75.formatted.gtf -e ./iso_tpm_formatted.txt -o ./iso

Finally, we can run the analysis of the differential transcript usage:

#Split the PSI files between 2 conditions:
/usr/bin/Rscript ~/scripts/split_file.R ./iso_isoform.psi SRR1513329,SRR1513330,SRR1513331 SRR1513332,SRR1513333,SRR1513334 TRA2_NC_iso.psi TRA2_KD_iso.psi -i
#Run SUPPA2
python ~/SUPPA-testing/suppa.py diffSplice -m empirical -gc -i ./ensembl_hg19.isoforms.ioi -p ./TRA2_KD_iso.psi ./TRA2_NC_iso.psi -e ./TRA2_KD_iso.tpm ./TRA2_NC_iso.tpm -o ./TRA2_diffSplice_iso

This is the output. If we take a look on the transcript associated to our gene (CHEK1), we will see that there, in this occasion, there are 3 transcripts significant. Two of them overlap with the SE seen previously (ENST00000526937, ENST00000534070). Maybe that's the reason why now the dPSI is smaller on this transcripts, because it has been splitted among them (you can take a look on the transcripts associated to CHEK1 on the UCSC Genome Browser:

>grep ENSG00000149554 TRA2_diffSplice_iso.dpsi 
ENSG00000149554;ENST00000278916	-0.0012407707	0.3686313686
ENSG00000149554;ENST00000427383	-0.0139009432	0.3647067218
ENSG00000149554;ENST00000428830	0.0055951176	0.3686313686
ENSG00000149554;ENST00000438015	-0.0056191302	0.3686313686
ENSG00000149554;ENST00000498122	-0.0519374527	0.18743756239999998
ENSG00000149554;ENST00000524737	0.0061450515	0.3686313686
ENSG00000149554;ENST00000525396	0.0409130748	0.2068931069
**ENSG00000149554;ENST00000526937	0.10000802240000001	0.0474525475**
ENSG00000149554;ENST00000527013	0.0015880468	0.3686313686
ENSG00000149554;ENST00000528276	-0.0004250792	0.3686313686
ENSG00000149554;ENST00000528761	0.0	0.3686313686
ENSG00000149554;ENST00000531062	-0.0006220255000000001	0.3686313686
ENSG00000149554;ENST00000531607	0.036123104700000006	0.20720945719999997
**ENSG00000149554;ENST00000532449	-0.23419656989999998	0.0094905095**
ENSG00000149554;ENST00000532669	0.0	0.3686313686
ENSG00000149554;ENST00000533778	-0.0047879849	0.3686313686
**ENSG00000149554;ENST00000534070	0.1200552391	0.0474525475**
ENSG00000149554;ENST00000534685	0.0	0.3686313686
ENSG00000149554;ENST00000544373	0.0023022994	0.3686313686

If we count the number of transcripts with differentially usage there is a total of 1.822, the double of the number of events. This evidence the fact that both analysis (events and transcripts) are complementary and don't necessary give the same information.

References

  1. "Human Tra2 proteins jointly control a CHEK1 splicing switch among alternative and constitutive target exons", Best et al, Nature Communications, 2014.
  2. "A survey of best practices for RNA-seq data analysis", Conesa et al, Genome Biology, 2016.
  3. "Salmon provides fast and bias-aware quantification of transcript expression", Patro et al, Nature Methods, 2017.
  4. "Rapid neurogenesis through transcriptional activation in human stem cells", Busskamp et al, Mol. Syst. Biol, 2014.
  5. "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Ester et al, 2nd Int. Conf. Knowl. Discov. Data Min, 1996.
  6. "OPTICS: Ordering Points To Identify the Clustering Structure", Ankerst et al, ACM Sigmod Rec., 1999.
  7. "A highly conserved program of neuronal microexons is misregulated in autistic brains", Irimia et al, Cell, 2014
  8. "RBFOX and PTBP1 proteins regulate the alternative splicing of micro-exons in human brain transcripts", Yi et al, Genome Research, 2015.