Skip to content

SUPPA2 tutorial

Eileen Xue edited this page Jul 12, 2023 · 13 revisions

In this tutorial, we describe the steps for analyzing the PSI changes between 2 conditions in a publicly available dataset. We hope that this tutorial will give some kind of advice to users at the time of analyzing alternative splicing (AS) on their own datasets. For more details about the software or their performance check our paper: https://www.ncbi.nlm.nih.gov/pubmed/29571299

We are going to use the data included in Best et al, 2015¹. In this work, they analyze RNA-seq data for the knockdown of Tra2α and Tra2β in MDA-MB-231 cells and controls with 3 replicates per condition (GSE59335). These two RNA binding proteins have both been implicated in the regulation of AS. In this tutorial, we will show how to study the impact on the transcriptome associated with the depletion of TRA2 using SUPPA2.

Transcript quantification

SUPPA2 obtains the inclusion levels of AS events exploiting transcript quantification. There are different measures for isoform quantification (counts, TMM, FPKM...). We strongly recommend the use of TPM (Transcripts Per Million), because is normalized for gene length first, and then for sequencing depth. This allows direct comparison between samples (check the following video if you want to know more).

There are several tools for transcript quantification (this is an interesting review² on this matter). Each one has different pros and cons and the user is free to use the one that fits better to his requirements. We recommend using Salmon³ because is one of the fastest methods for quantification without compromising accuracy and it also offers some other interesting options like GC bias correction or bootstrapping among others. The commands indicated here are oriented for using these tools.

Index

Like many tools for alignment or mapping, Salmon needs to generate first an index with the transcript annotation (IMPORTANT: the index has to be created on the transcriptome, not on the genome. If you see that the index creation is taking too much time or memory, maybe you are using a genome instead). Again, there are different annotations (UCSC, Gencode, RefSeq...) and different versions. The user should think about which one fits better for his analysis. In this example, we are going to use Ensembl hg19. Once it is downloaded and unzipped, we can now generate the index with Salmon:

# Sometimes is necessary to explicitly export the libraries to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=~/salmon/lib:$LD_LIBRARY_PATH
~/salmon/bin/salmon index -t hg19_EnsenmblGenes_sequence_ensenmbl.fasta -i Ensembl_hg19_salmon_index

Quantification

Once it has finished, we are ready to run the quantification. Since our reads are in FASTQ files, we are going to make use of the quasi-mapping-based mode. To get details about the different parameters, you have an extended tutorial about Salmon usage.

# Example for SRR1513329
~/salmon/bin/salmon quant -i Ensembl_hg19_salmon_index -l ISF --gcBias -1 SRR1513329_1.fastq -2 SRR1513329_2.fastq -p 4 -o SRR1513329

As we mentioned at the beginning, we want the TPM values of the isoforms. SUPPA2 includes a script for the extraction of this column of the Salmon output (4th columns from quant.sf files)

# It requires python3 and pandas library (pip install pandas)
# -k indicates the row used as the index
# -f indicates the column to be extracted from the Salmon output
python ~/SUPPA/multipleFieldSelection.py -i ~/Salmon_output/*/quant.sf -k 1 -f 4 -o iso_tpm.txt

For the calculation of the events, SUPPA2 requires a GTF file. Here we provide one from Ensembl. The transcripts ids in our iso_tpm file should be equal to the ids in the GTF. We have a small script in R for formatting these ids

~/Rscript ~/scripts/format_Ensembl_ids.R iso_tpm.txt

Here is the file already formatted with the TPM of the isoforms.

Event calculation

Before performing the PSI calculation of the events, we need to extract from the annotation the event coordinates. SUPPA2 generates the different AS events reading transcript and gene information from the "exon" lines and creates an ioe file.

In this analysis, we are interested in testing all the different types of events:

#Generate the ioe files: 
python ~/SUPPA/suppa.py generateEvents -i annotation/Homo_sapiens.GRCh37.75.formatted.gtf -o ensembl_hg19.events -e SE SS MX RI FL -f ioe
#Put all the ioe events in the same file:
awk '
    FNR==1 && NR!=1 { while (/^<header>/) getline; }
    1 {print}
' *.ioe > ensembl_hg19.events.ioe

IMPORTANT: On this occasion, we have used Ensembl. If the user decides to use RefSeq annotation, we suggest using the --pooled-genes option. RefSeq genes are labeled according to the mRNA sequences they contain, not by the gene locus. This could cause that the same id is mapped to two different places in the genome. Also, sometimes two overlapping isoforms that share some exon or splice-sites are labeled as different genes. The pooled-genes option redefines a gene as a locus with transcripts on the same strand that overlap each other in genome extent and share at least one splice-site pairwise. These issues are a big deal to calculate events and estimate PSIs, since you would be assigning wrong contributions to the events because those transcripts should be considered together as part of the same gene. There are also cases like that in Gencode and Ensembl annotation, although it is less severe.

With the ioe file generated with the event coordinates, we are ready to obtain their PSI values:

python ~/SUPPA/suppa.py psiPerEvent -i ensembl_hg19.events.ioe -e iso_tpm_formatted.txt -o TRA2_events

Note: Transcript_IDs in the expression file must be the same as the transcript_IDs in the .ioe file, including the version numbers, otherwise it won't work.

The generated file (TRA2_events.psi) contains the PSI values per event and sample. In Best et al, they identify and validated 53 human splicing factors which AS has been after Tra2 protein depletion:

53 splicing factors

Let's choose one of them: CHEK1 (ENSG00000149554), which encodes a key DNA damage response protein. They describe a decrease in the inclusion of exon chr11:125497502-125497725 of this gene. Taking into account that SRR1513329, SRR1513330, and SRR1513331 are negative controls (NC) and SRR1513332, SRR1513333 and SRR1513334 are the ones with the depletion of TRA2 proteins (KD), we look for the SE event associated with this exon in our file:

Event_ID	SRR1513329	SRR1513330	SRR1513331	SRR1513332	SRR1513333	SRR1513334
ENSG00000149554;SE:chr11:125496728-125497502:125497725-125499127:+	0.7471194793	0.7892328982	0.781078454	0.4554597402	0.4493233894	0.4237912839

We provide an auxiliary script to generate boxplots with the PSI values of an event (the user should indicate which samples belong to one condition or another):

python ~/scripts/generate_boxplot_event.py -i events.psi -e "ENSG00000149554;SE:chr11:125496728-125497502:125497725-125499127:+" -g 1-3,4-6 -c NC,KD -o /home/juanluis/Desktop/Work/Master_class/

CHEK1 Boxplot PSI

Indeed, we see that this decrease is also occurring in our PSI values generated with SUPPA2.

Differential splicing with local events

SUPPA2 provides another tool for performing a differential splicing analysis between multiple conditions. With our PSI file, we just need to provide the TPM and the PSI of our samples in separate files according to the condition. The user can easily perform this with the following:

# Split the PSI and TPM files between the 2 conditions:
~/scripts/split_file.R iso_tpm_formatted.txt SRR1513329,SRR1513330,SRR1513331 SRR1513332,SRR1513333,SRR1513334 TRA2_NC_iso.tpm TRA2_KD_iso.tpm -i
~/scripts/split_file.R events.psi SRR1513329,SRR1513330,SRR1513331 SRR1513332,SRR1513333,SRR1513334 TRA2_NC_events.psi TRA2_KD_events.psi -e

This creates a separate file (PSI and TPM per condition). Now, we are ready to perform the differential splicing analysis:

python ~/SUPPA/suppa.py diffSplice -m empirical -gc -i ./annotation/ensembl_hg19.events.ioe -p TRA2_KD_events.psi TRA2_NC_events.psi -e TRA2_KD_iso.tpm TRA2_NC_iso.tpm -o TRA2_diffSplice

This is the output. The file TRA2_diffSplice.dpsi gives the DeltaPSI as the difference of the mean PSI between conditions, and the p_value of this difference. We can check how is the distribution of the PSI along with the expression of the events (this is the script used for generating the following plot):

MA_plot

We observe that at low expression, the deltaPSI must be greater to be detected as significant. There is a total of 979 significant events (p_value < 0.05). Our event of CHEK1 is included in our cloud of significant blue dots. We can check the associated SE events to this gene on TRA2_diffSplice.dpsi:

>grep ENSG00000149554 TRA2_diffSplice.dpsi | grep SE
ENSG00000149554;SE:chr11:125495924-125496312:125496470-125496644:+	nan	1.0
ENSG00000149554;SE:chr11:125496715-125497502:125497725-125499127:+	0.0	0.37795537799999995
**ENSG00000149554;SE:chr11:125496728-125497502:125497725-125499127:+	0.32961880600000004	0.0**
ENSG00000149554;SE:chr11:125514163-125514407:125514538-125523641:+	0.0021020729000000003	0.3806193806
ENSG00000149554;SE:chr11:125514538-125523641:125523742-125525120:+	-0.0078827407	0.37795537799999995
ENSG00000149554;SE:chr11:125525242-125526101:125526230-125545823:+	-0.2614825116	0.0774225774

We observe that the only significant event is the third one, which matches with the event measured in Best et al. So, we can confirm that this event is differentially spliced between these conditions.

Cluster analysis

Using the PSI values in all samples, and the information of which events change significantly in at least one comparison, SUPPA2 calculates clusters of events using a density-based clustering. Density-based clustering has the advantage that it can put together into the same cluster events that even though they might not have similar PSI values, their behavior is similar across conditions. This functionality is especially useful for analyzing experiments with multiple conditions.

We are gonna use SUPPA2 clustering on the following dataset from Busskamp et al 2014. In this work, a 4-day time-course of differentiation of human-induced pluripotent stem cell (iPSCs) into bipolar neurons is analyzed. SUPPA2 offers two density-based clustering methods: DBSCAN [5] and OPTICS [6]. OPTICS is an optimization of DBSCAN and we expect to obtain better clustering, with more computing time. Since our data is not so big, we are gonna use OPTICS.

First, we need to run the differential splicing step on this data. We learned before how to do this, so we provide the .dpsi and .psivec files (just using the skipping exon (SE) events). We run the following command:

python ~/SUPPA/suppa.py clusterEvents --dpsi ~/additional_files/Busskamp_multicondition.SE.dpsi --psivec ~/additional_files/Busskamp_multicondition.SE.psivec --sig-threshold 0.05 --eps 0.2 --separation 0.11 -dt 0.2 --min-pts 10 --groups 1-3,4-6,7-9,10-12 -c OPTICS -o ~/Busskamp_OPTICS

This command generates two files: a .log file with the information of the clusters found and a .clustvec, with the events, their mean psi values per condition, and the clusters association. If an event has no associated cluster it will be assigned to -1.

event_id  cluster_id  cond1_PSI_avg cond2_PSI_avg cond3_PSI_avg cond4_PSI_avg
ENSG00000108176;SE:chr10:69583150-69587300:69587441-69597692:-	2	0.245977589185	0.0242424332922	0.0522646864829	0.0450961485589
ENSG00000110906;SE:chr12:109895487-109895797:109895886-109898441:-	2	0.371902642559	0.0807209159074	0.116974207413	0.117733712549
ENSG00000129946;SE:chr19:439030-439233:439491-440862:-	2	0.356220270706	0.0483798341422	0.0356880417646	0.0212254475686
ENSG00000140386;SE:chr15:77067458-77085574:77085732-77087621:-	2	0.36711447746	0.0542151802375	0.0414659102421	0.0412346322595
ENSG00000140386;SE:chr15:77134272-77148151:77148217-77150150:-	2	0.343608286307	0.0542151802375	0.0548847391885	0.0383696931693

We obtained 3 well-differentiated clusters (silhouette score = 0.572). Cluster 0 increased inclusion at late steps of differentiation and showed enrichment in microexons (32 out of 115 events). In contrast, clusters 1 and 2 decreased inclusion with differentiation and contained 2 out of 20 events, and no microexons, respectively. These results are in agreement with the previously observed enrichment of microexon inclusion in differentiated neurons in other works [7,8]. Extracted from SUPPA2 paper, here we have how the different psi values in the clusters look like. In blue, we show the SE events involving microexons (shorter than 28 nucleotides):

Cluster_plots

In this work, we further validated these clusters performing motif enrichment analysis between the regulated clusters in the events compared to non-regulated ones using another software developed in our lab MoSEA. We show that this functionality is useful for identifying groups of events changing consistently and could open new ways of finding the biological meaning of alternative splicing changes

Differential transcript usage

We have developed a new option in SUPPA2 to test differential usage of transcripts. Transcript events are important as they can describe splicing variations that are complex and cannot be encapsulated in a simple event. It's based on the same method for differential splicing with events. We can perform the same analysis on our TRA2 data.

We are going to come back again to our TRA2 data. First, it's necessary to obtain an ioi file from the annotation. The ioi file provides for each transcript in a gene, the set of all transcripts from that gene from which the transcript relative abundance is calculated.

#Run SUPPA2 for obtaining the ioi file from the annotation
python ~/SUPPA/suppa.py generateEvents -f ioi -i ~/annotation/Homo_sapiens.GRCh37.75.formatted.gtf -o ./ensembl_hg19.isoforms

We need also to compute the PSI values of the isoforms:

python ~/SUPPA/suppa.py psiPerIsoform -g ~/annotation/Homo_sapiens.GRCh37.75.formatted.gtf -e ./iso_tpm_formatted.txt -o ./iso

Finally, we can run the analysis of the differential transcript usage:

#Split the PSI files between 2 conditions:
/usr/bin/Rscript ~/scripts/split_file.R ./iso_isoform.psi SRR1513329,SRR1513330,SRR1513331 SRR1513332,SRR1513333,SRR1513334 TRA2_NC_iso.psi TRA2_KD_iso.psi -i
#Run SUPPA2
python ~/SUPPA/suppa.py diffSplice -m empirical -gc -i ./ensembl_hg19.isoforms.ioi -p ./TRA2_KD_iso.psi ./TRA2_NC_iso.psi -e ./TRA2_KD_iso.tpm ./TRA2_NC_iso.tpm -o ./TRA2_diffSplice_iso

This is the output. If we take a look at the transcript associated with our gene (CHEK1), we will see that there, on this occasion, there are 3 transcripts significant. Two of them overlap with the SEs previously observed (ENST00000526937, ENST00000534070). Maybe that's the reason why now the dPSI is smaller on these transcripts because it has been split among them (you can take a look at the transcripts associated with CHEK1 on the UCSC Genome Browser:

>grep ENSG00000149554 TRA2_diffSplice_iso.dpsi 
ENSG00000149554;ENST00000278916	-0.0012407707	0.3686313686
ENSG00000149554;ENST00000427383	-0.0139009432	0.3647067218
ENSG00000149554;ENST00000428830	0.0055951176	0.3686313686
ENSG00000149554;ENST00000438015	-0.0056191302	0.3686313686
ENSG00000149554;ENST00000498122	-0.0519374527	0.18743756239999998
ENSG00000149554;ENST00000524737	0.0061450515	0.3686313686
ENSG00000149554;ENST00000525396	0.0409130748	0.2068931069
**ENSG00000149554;ENST00000526937	0.10000802240000001	0.0474525475**
ENSG00000149554;ENST00000527013	0.0015880468	0.3686313686
ENSG00000149554;ENST00000528276	-0.0004250792	0.3686313686
ENSG00000149554;ENST00000528761	0.0	0.3686313686
ENSG00000149554;ENST00000531062	-0.0006220255000000001	0.3686313686
ENSG00000149554;ENST00000531607	0.036123104700000006	0.20720945719999997
**ENSG00000149554;ENST00000532449	-0.23419656989999998	0.0094905095**
ENSG00000149554;ENST00000532669	0.0	0.3686313686
ENSG00000149554;ENST00000533778	-0.0047879849	0.3686313686
**ENSG00000149554;ENST00000534070	0.1200552391	0.0474525475**
ENSG00000149554;ENST00000534685	0.0	0.3686313686
ENSG00000149554;ENST00000544373	0.0023022994	0.3686313686

If we count the number of transcripts with differential usage there is a total of 1.822, double the number of events. This provides evidence that both analyses (events and transcripts) are complementary and do not necessarily give the same information.

References

  1. "Human Tra2 proteins jointly control a CHEK1 splicing switch among alternative and constitutive target exons", Best et al, Nature Communications, 2014.
  2. "A survey of best practices for RNA-seq data analysis", Conesa et al, Genome Biology, 2016.
  3. "Salmon provides fast and bias-aware quantification of transcript expression", Patro et al, Nature Methods, 2017.
  4. "Rapid neurogenesis through transcriptional activation in human stem cells", Busskamp et al, Mol. Syst. Biol, 2014.
  5. "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Ester et al, 2nd Int. Conf. Knowl. Discov. Data Min, 1996.
  6. "OPTICS: Ordering Points To Identify the Clustering Structure", Ankerst et al, ACM Sigmod Rec., 1999.
  7. "A highly conserved program of neuronal microexons is misregulated in autistic brains", Irimia et al, Cell, 2014
  8. "RBFOX and PTBP1 proteins regulate the alternative splicing of micro-exons in human brain transcripts", Yi et al, Genome Research, 2015.