Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Link to correct companion fasta file #2

Open
hiraksarkar opened this issue Jan 3, 2017 · 8 comments
Open

Link to correct companion fasta file #2

hiraksarkar opened this issue Jan 3, 2017 · 8 comments

Comments

@hiraksarkar
Copy link

hiraksarkar commented Jan 3, 2017

To simulate the reads, i.e to execute this command,
~/RSEM-1.2.29/bin/rsem-prepare-reference --bowtie2 --bowtie2-path ~/bowtie2-2.2.4 ~/annotation/refseq_hg19_unique.fa ~/tra2_simulated_reads/rsem_reference

If I ran the above with an added --gtf option and the gtf given here , then it failed with an error According to the GTF file given, transcript NR_073460 has exons from different orientations.

I tried to generate a fasta file with the refseq gtf file given here and hg19 genome file curated from ucsc site. But I am unable to reproduce the results.

Also the tpm files corresponding to simulated dataset contains 48606 transcripts, where as the gtf file contains 56443 transcripts.

The link to the correct pair (genome fasta, gtf) or link to the transcript file would be great.

@JLTrincado
Copy link

JLTrincado commented Jan 9, 2017

Hi Hirak,

thanks again for your interest.

We didn't try to run RSEM with the gtf annotation, so we didn't realize this error.
We obtained the RefSeq annotation files (fasta and gtf) from the UCSC genome browser: https://genome.ucsc.edu/cgi-bin/hgTables

Since Salmon/Sailfish needs the fasta transcriptome for the alignment, we filter from the fasta all the duplicated ids, leaving just unique ids, and just transcripts falling in chr1-22, X and Y. The resulting file is refseq_hg19_unique.fa. So, for downstream analysis we have used this same annotation. I have uploaded it here (https://github.com/comprna/SUPPA_supplementary_data/blob/master/annotation/refseq_hg19_unique.fa.gz)

I tried today to generate this same table downloading again the original info from UCSC, but now there are more transcripts. It is possible that, since the last time I created this table, they have updated this info. I have uploaded the code for generating this file, if you are interested (https://github.com/comprna/SUPPA/blob/master/scripts/format_unique_fasta_RefSeq_annotation.py). In spite of the difference in the number, I would hope that the significant changes that we have generated for the 317 cassete exons for the benchmark will be similar no matter which version of the hg19 refseq annotation are you using. But, for the sake of reproducibility, I would use the initial ones (refseq_hg19_unique.fa).

Let me know if you are able to find the same results with this table,
Thanks again.

@hiraksarkar
Copy link
Author

hiraksarkar commented Jan 10, 2017

Hi Juanlu,

Thanks a lot for the file. I completely understand the frustration with reference change, which is really haphazard.

I posted an error message a while ago. That was something I did wrong.. so never mind, thanks again for the clean fasta.

I ran the whole pipeline with the clean fasta file. The pipeline is uploaded in this gist. I only experimented with 60M reads. But unfortunately the final dpsi file does not contain the 317 cassette events with significant pvalue (i.e. less than 0.05).

Can you please take a look at the gist file, and let me know I am not doing any mistake.

@JLTrincado
Copy link

Hi Hirak,

I've checked your pipeline and it seems to be ok.

We have generated a theoretical change in the relative abundance of the transcripts for 317 cassette events, but neither SUPPA nor any of the other methods are able to detect significance for all the events. With 60M reads, we observed that SUPPA generated a dPSI for all of them, but only significant for 302. I hope that you find the same number.

If you want to know the numbers for the rest of the executions, it's available in the supplementary data on our preprint (Table S3): http://biorxiv.org/content/early/2016/11/10/086876.

Thanks again

@hiraksarkar
Copy link
Author

hiraksarkar commented Jan 10, 2017

Hi Juanlu

we observed that SUPPA generated a dPSI for all of them, but only significant for 302. I hope that you find the same number.

What is the criteria for statistical significance here ? Is it pvalue < 0.05 !

@JLTrincado
Copy link

Yes, you're right. pvalue < 0.05.

@hiraksarkar
Copy link
Author

Unfortunately none of my p-values are below 0.05

@hiraksarkar
Copy link
Author

hiraksarkar commented Jan 10, 2017

I emailed with the exact doubt and created a repo with the files. Hope that would help to figure out any mistakes I might be committing while running diffSplice.

@EduEyras
Copy link
Member

Hi Hirak,

thanks for your message. The GTF for the RefSeq annotation is in general useless because the given the same ID, e.g. NR_073460, to different gene loci. This happens in general with non-standard chromosomes (DNA not mapped to chr1-22, chrX,Y), but it also happens in standard chromosomes. This is because RefSeq is originally a database of mRNA/protein sequences for genes, rather than a database of gene loci (like for instance Ensembl). As a consequence, when they map RefSeq sequences to the genome, they have multimappers, due to paralogy.

Removing duplicates from the Fasta file is a solution for that.
When working with the GTF, we also usually run the option -p "pool genes" with suppa.py generateEvents. This will first identify gene loci independently of the RefSeq label, and then will generate events per gene loci, and will solve this issue.

The script make_genes.pl that I put here https://github.com/comprna/humming does the same clustering but in a stand alone way. It will give you a GTF that should not give you the same type of error.

I hope this helps
best

Eduardo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants