The inputs of DELONGSEQ are aligned long-read RNA-seq data in BAM format and a reference isoform annotation file (Ensembl/Refseq). User needs to specify -task
to perform in each step:
delongseq -task <task>:
refgene: preprocess reference file
quantify: quantify isoform expression
diff: detect differential splicing gene/isoform
DELONGSEQ accepts two formats of reference annotation file. User can download the example reference file and data from the example folder:
For example example.gtf
:
chr21 HAVANA gene 25639258 25717562 . + . gene_id "ENSG00000154721.15"; gene_type "protein_coding"; gene_name "JAM2"; level 2; hgnc_id "HGNC:14686"; tag "overlapping_locus"; havana_gene "OTTHUMG00000
078441.4";
chr21 HAVANA transcript 25639258 25717562 . + . gene_id "ENSG00000154721.15"; transcript_id "ENST00000480456.6"; gene_type "protein_coding"; gene_name "JAM2"; transcript_type "protein_coding"; tran
script_name "JAM2-206"; level 2; protein_id "ENSP00000420419.1"; transcript_support_level "1"; hgnc_id "HGNC:14686"; tag "basic"; tag "MANE_Select"; tag "appris_principal_3"; tag "CCDS"; ccdsid "CCDS42911.1"; havana_gene "OTTHUMG00000078
441.4"; havana_transcript "OTTHUMT00000171347.2";
chr21 HAVANA exon 25639258 25639888 . + . gene_id "ENSG00000154721.15"; transcript_id "ENST00000480456.6"; gene_type "protein_coding"; gene_name "JAM2"; transcript_type "protein_coding"; transcript_n
ame "JAM2-206"; exon_number 1; exon_id "ENSE00003843052.1"; level 2; protein_id "ENSP00000420419.1"; transcript_support_level "1"; hgnc_id "HGNC:14686"; tag "basic"; tag "MANE_Select"; tag "appris_principal_3"; tag "CCDS"; ccdsid "CCDS42
911.1"; havana_gene "OTTHUMG00000078441.4"; havana_transcript "OTTHUMT00000171347.2";
chr21 HAVANA CDS 25639822 25639888 . + 0 gene_id "ENSG00000154721.15"; transcript_id "ENST00000480456.6"; gene_type "protein_coding"; gene_name "JAM2"; transcript_type "protein_coding"; transcript_n
ame "JAM2-206"; exon_number 1; exon_id "ENSE00003843052.1"; level 2; protein_id "ENSP00000420419.1"; transcript_support_level "1"; hgnc_id "HGNC:14686"; tag "basic"; tag "MANE_Select"; tag "appris_principal_3"; tag "CCDS"; ccdsid "CCDS42
911.1"; havana_gene "OTTHUMG00000078441.4"; havana_transcript "OTTHUMT00000171347.2";
For example example.ucsc
:
947 ENST00000397748.5 chr21 - 47556175 47575481 47556367 47575437 15 47556175,47556828,47557152,47558421,47558793,47565330,47565731,47566179,47570032,47570301,47571471,47571805,47572820,47574062,47575383, 47556408,47556967,47557248,47558560,47558837,47565492,47565861,47566241,47570164,47570439,47571651,47571894,47572949,47574246,47575481, 0 FTCD cmpl cmpl 1,0,0,2,0,0,2,0,0,0,0,1,1,0,0,
947 ENST00000397743.1 chr21 - 47556692 47575481 47557159 47575437 13 47556692,47557152,47558421,47565330,47565731,47566179,47570032,47570301,47571471,47571805,47572820,47574062,47575383,47556987,47557248,47558560,47565492,47565861,47566241,47570164,47570439,47571651,47571894,47572949,47574246,47575481, 0 FTCD cmpl cmpl -1,1,0,0,2,0,0,0,0,1,1,0,0,
947 ENST00000397746.7 chr21 - 47556692 47575481 47556900 47575437 14 47556692,47557152,47558421,47558793,47565330,47565731,47566179,47570032,47570301,47571471,47571805,47572820,47574062,47575383, 47556987,47557248,47558560,47558837,47565492,47565861,47566241,47570164,47570439,47571651,47571894,47572949,47574246,47575481, 0 FTCD cmpl cmpl 0,0,2,0,0,2,0,0,0,0,1,1,0,0,
This reference file can be downloaded at UCSC by selecting "all fields from selected table" in output format.
Then, we preprocess reference file by using delongseq -task refgene
. An example is given below.
- For gtf format
delongseq -task refgene -ref example.gtf -format gtf -out example.refgene
- For ucsc all fields:
delongseq -task refgene -ref example.ucsc -format ucsc -out example.refgene
The output isoform compatible matrix file example.refgene
should be in following format:
RCAN3 chr1 + 24828840 24863510 NM_001251983,NM_001251978,NM_013441,NM_001251984,NM_001251979,NM_001251981,NM_001251977,NM_001251985,NM_001251982,NM_001251980,
RCAN3 chr1 + 24828840 24828953 0,0,0,1,1,0,0,0,0,0,
RCAN3 chr1 + 24829386 24829607 1,0,1,0,0,1,1,0,0,0,
RCAN3 chr1 + 24829608 24829640 1,0,1,0,0,0,0,0,0,0,
RCAN3 chr1 + 24834083 24834218 0,1,0,0,0,0,0,0,0,0,
RCAN3 chr1 + 24840803 24841057 0,1,1,0,1,1,1,1,1,1,
RCAN3 chr1 + 24857707 24857881 1,1,1,1,1,0,1,0,1,1,
RCAN3 chr1 + 24859572 24859601 1,1,1,1,1,1,1,0,0,0,
RCAN3 chr1 + 24859602 24859744 1,1,1,1,1,1,1,0,0,1,
RCAN3 chr1 + 24861582 24863510 1,1,1,1,1,1,1,1,1,1,
Note: user needs to specify correct reference file format (gtf or ucsc) in this step.
In this step, user is first suggested to perform reads filtering using samtools:
samtools view bam_file -F 2308 -q 50 -O BAM -o bam_filtered
Then, user needs to sort and index the bam_file
, and give refgene_File
, bam_file
to DELONGSEQ to estimate isoform expression using long-read RNA-seq data:
delongseq -task quantify -refgene <refgene_file> -bam <bam_file> -out <output_file> -max_distance <max distance> -f_weight <weight of F function>
where
<bam>_file: A bam file.
<refgene_file>: A reference file obtained from step 1.
<max distance>: The maximum length of an alignment error at exon boundary. Recommend: 10.
<weight of F function>: The weight for bias correction in isoform usage estimation (between 0 and 10). Recommend: 1
For example:
delongseq -task quantify -refgene example.refgene -bam example.bam -out isoform_expression_estimates -max_distance 20 -f_weight 1
In this step, user needs to give two lists of isoform expression estimates files for condition 1 and 2 to use DELONGSEQ for differential splicing gene/isoform detection.
delongseq -task diff
-condition_1 <list_of_isoform_expression_estimation_file_for_condition1>
-condition_2 <list_of_isoform_expression_estimation_file_for_condition2>
-out <test_results_file>
Here are example format for input lists:
List_Condition_1
/home/huy4/tmp/delongseq_das_tmp/isoform_estimates_condition1_sample1
/home/huy4/tmp/delongseq_das_tmp/isoform_estimates_condition1_sample2
/home/huy4/tmp/delongseq_das_tmp/isoform_estimates_condition1_sample3
List_Condition_2
/home/huy4/tmp/delongseq_das_tmp/isoform_estimates_condition2_sample1
/home/huy4/tmp/delongseq_das_tmp/isoform_estimates_condition2_sample2
/home/huy4/tmp/delongseq_das_tmp/isoform_estimates_condition2_sample3
Note: user needs to provide the absolute path for each DELONGSEQ isoform estimates file
User can download the DAS detection example data from the release page and follow the command in Demo and Examples.
Output:
"geneDASResults" file: Gene-based test results file with 3 columns:
• Column 1: gene name
• Column 2: test P-Value
• Column 3: Hellinger distance which measures the splicing difference between two conditions for a gene in terms of isoform relative abundances