-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Link to correct companion fasta file #2
Comments
Hi Hirak, thanks again for your interest. We didn't try to run RSEM with the gtf annotation, so we didn't realize this error. Since Salmon/Sailfish needs the fasta transcriptome for the alignment, we filter from the fasta all the duplicated ids, leaving just unique ids, and just transcripts falling in chr1-22, X and Y. The resulting file is refseq_hg19_unique.fa. So, for downstream analysis we have used this same annotation. I have uploaded it here (https://github.com/comprna/SUPPA_supplementary_data/blob/master/annotation/refseq_hg19_unique.fa.gz) I tried today to generate this same table downloading again the original info from UCSC, but now there are more transcripts. It is possible that, since the last time I created this table, they have updated this info. I have uploaded the code for generating this file, if you are interested (https://github.com/comprna/SUPPA/blob/master/scripts/format_unique_fasta_RefSeq_annotation.py). In spite of the difference in the number, I would hope that the significant changes that we have generated for the 317 cassete exons for the benchmark will be similar no matter which version of the hg19 refseq annotation are you using. But, for the sake of reproducibility, I would use the initial ones (refseq_hg19_unique.fa). Let me know if you are able to find the same results with this table, |
Hi Juanlu, Thanks a lot for the file. I completely understand the frustration with reference change, which is really haphazard. I posted an error message a while ago. That was something I did wrong.. so never mind, thanks again for the clean fasta. I ran the whole pipeline with the clean fasta file. The pipeline is uploaded in this gist. I only experimented with 60M reads. But unfortunately the final dpsi file does not contain the 317 cassette events with significant pvalue (i.e. less than 0.05). Can you please take a look at the gist file, and let me know I am not doing any mistake. |
Hi Hirak, I've checked your pipeline and it seems to be ok. We have generated a theoretical change in the relative abundance of the transcripts for 317 cassette events, but neither SUPPA nor any of the other methods are able to detect significance for all the events. With 60M reads, we observed that SUPPA generated a dPSI for all of them, but only significant for 302. I hope that you find the same number. If you want to know the numbers for the rest of the executions, it's available in the supplementary data on our preprint (Table S3): http://biorxiv.org/content/early/2016/11/10/086876. Thanks again |
Hi Juanlu
What is the criteria for statistical significance here ? Is it pvalue < 0.05 ! |
Yes, you're right. pvalue < 0.05. |
Unfortunately none of my p-values are below 0.05 |
I emailed with the exact doubt and created a repo with the files. Hope that would help to figure out any mistakes I might be committing while running diffSplice. |
Hi Hirak, thanks for your message. The GTF for the RefSeq annotation is in general useless because the given the same ID, e.g. NR_073460, to different gene loci. This happens in general with non-standard chromosomes (DNA not mapped to chr1-22, chrX,Y), but it also happens in standard chromosomes. This is because RefSeq is originally a database of mRNA/protein sequences for genes, rather than a database of gene loci (like for instance Ensembl). As a consequence, when they map RefSeq sequences to the genome, they have multimappers, due to paralogy. Removing duplicates from the Fasta file is a solution for that. The script make_genes.pl that I put here https://github.com/comprna/humming does the same clustering but in a stand alone way. It will give you a GTF that should not give you the same type of error. I hope this helps Eduardo |
To simulate the reads, i.e to execute this command,
~/RSEM-1.2.29/bin/rsem-prepare-reference --bowtie2 --bowtie2-path ~/bowtie2-2.2.4 ~/annotation/refseq_hg19_unique.fa ~/tra2_simulated_reads/rsem_reference
If I ran the above with an added
--gtf
option and the gtf given here , then it failed with an errorAccording to the GTF file given, transcript NR_073460 has exons from different orientations
.I tried to generate a fasta file with the refseq gtf file given here and hg19 genome file curated from ucsc site. But I am unable to reproduce the results.
Also the tpm files corresponding to simulated dataset contains 48606 transcripts, where as the gtf file contains 56443 transcripts.
The link to the correct pair (genome fasta, gtf) or link to the transcript file would be great.
The text was updated successfully, but these errors were encountered: