Converting Xena datasets to standard identifiers rather than gene symbols #6

Closed
dhimmel opened this issue Jul 17, 2016 · 7 comments

@dhimmel
Member

dhimmel commented Jul 17, 2016

Xena datasets (as retrieved in #1) identify genes by symbol rather than by standardized identifiers such as Entrez GeneIDs, Ensembl gene IDs, HGNC IDs, or UCSC gene IDs. This has led to upstream data quality issues such as #4. Hence, I think it makes sense to code our databases using standardized identifiers.

Currently, we use the HiSeqV2 and TCGA.PANCAN.sampleMap datasets which both use symbols. Does anyone have a preferred identifier? I like Entrez GeneIDs.

@dhimmel dhimmel added the task label Jul 17, 2016
@cgreene
Member

cgreene commented Jul 17, 2016

I'm pretty sure django-genes uses Entrez. I agree that they are generally a bit nicer/more stable than symbols.

We might actually be able to use django-genes here if needed in the models.

@dhimmel
Member Author

dhimmel commented Jul 17, 2016

For cancer-data, I think it probably makes sense to work with tables for converting genes, rather than greenelab/django-genes, as we would otherwise end up making lots of API calls.

For conversions such as this, I think it's best to use the mapping that is the inverse of whatever was used by the data creators, due to the ambiguity of gene symbols. I'm waiting to hear back from Xena Browser regarding gene mapping. It's possible they have actually created mapping files specific to their symbols.

I may be misunderstanding what django-genes does, but it seems like it may be of the most utility for providing django-cognoma or the javascript app with additional gene metadata for a small set of identifiers?
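The table-based conversion suggested above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the toy expression values, and the `AMBIGUOUS_SYMBOL` rows are all hypothetical, and a real mapping table would be read from a file, ideally one supplied by the data creators. Symbols that map to more than one GeneID are dropped rather than guessed.

```python
def build_symbol_lookup(rows):
    """Map each symbol to its Entrez GeneID, dropping ambiguous symbols."""
    candidates = {}
    for symbol, entrez_id in rows:
        candidates.setdefault(symbol, set()).add(entrez_id)
    # Keep only symbols that point at exactly one GeneID.
    return {s: next(iter(ids)) for s, ids in candidates.items() if len(ids) == 1}


def convert_symbols(expression, lookup):
    """Re-key a {symbol: values} dict by GeneID; report what could not be mapped."""
    converted, unmapped = {}, []
    for symbol, values in expression.items():
        if symbol in lookup:
            converted[lookup[symbol]] = values
        else:
            unmapped.append(symbol)
    return converted, unmapped


# Hypothetical mapping table. The first three GeneIDs are real;
# AMBIGUOUS_SYMBOL and its negative IDs are made up to show the ambiguous case.
mapping_rows = [
    ("TP53", 7157),
    ("BRCA1", 672),
    ("EGFR", 1956),
    ("AMBIGUOUS_SYMBOL", -1),
    ("AMBIGUOUS_SYMBOL", -2),
]

# Toy expression data keyed by symbol (values are made up).
expression = {
    "TP53": [1.0, 5.0],
    "BRCA1": [2.0, 6.0],
    "AMBIGUOUS_SYMBOL": [3.0, 7.0],
    "UNKNOWN_SYMBOL": [4.0, 8.0],
}

lookup = build_symbol_lookup(mapping_rows)
converted, unmapped = convert_symbols(expression, lookup)
```

Returning the unmapped symbols alongside the converted data makes it easy to audit how much of a dataset is lost in conversion.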

@dhimmel
Member Author

dhimmel commented Jul 17, 2016

I noticed the HiSeqV2 metadata includes a probeMap attribute with the value /probeMap/hugo_gencode_v24_gtf. There's a file located at https://tcga.xenahubs.net/download/probeMap/hugo_gencode_v24_gtf (metadata) whose head is:

id gene chrom chromStart chromEnd strand
DDX11L1 DDX11L1 chr1 11869 14409 +
WASH7P WASH7P chr1 14404 29570 -
MIR6859-1 MIR6859-1 chr1 17369 17436 -
RP11-34P13.3 RP11-34P13.3 chr1 29554 31109 +
MIR1302-2 MIR1302-2 chr1 30366 30503 +
FAM138A FAM138A chr1 34554 36081 -

Therefore, one possibility for converting symbols in HiSeqV2 to standardized IDs would be to use the genomic location information available in probeMap/hugo_gencode_v24_gtf. I noticed this file also contains the date-naming issue discussed in #4. Therefore, the corruption is potentially reversible.
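As a rough sketch of the location-based approach, the snippet below parses the first two rows of the probeMap head shown above and joins them to an annotation keyed by (chrom, start, end, strand). The `annotation` dict is a hypothetical stand-in for a parsed GENCODE v24 GTF, the function names are illustrative, and exact coordinate matching is an assumption; a real join might need to tolerate small coordinate differences.

```python
PROBEMAP_HEAD = """\
id gene chrom chromStart chromEnd strand
DDX11L1 DDX11L1 chr1 11869 14409 +
WASH7P WASH7P chr1 14404 29570 -
"""


def parse_probemap(text):
    """Parse whitespace-delimited probeMap text into a list of dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    records = []
    for line in lines[1:]:
        record = dict(zip(header, line.split()))
        record["chromStart"] = int(record["chromStart"])
        record["chromEnd"] = int(record["chromEnd"])
        records.append(record)
    return records


def symbols_to_stable_ids(records, annotation):
    """Map each probeMap id to a stable ID by exact genomic location."""
    result = {}
    for r in records:
        key = (r["chrom"], r["chromStart"], r["chromEnd"], r["strand"])
        if key in annotation:
            result[r["id"]] = annotation[key]
    return result


# Hypothetical annotation keyed by location, standing in for a parsed GTF.
annotation = {
    ("chr1", 11869, 14409, "+"): "ENSG00000223972",
    ("chr1", 14404, 29570, "-"): "ENSG00000227232",
}

records = parse_probemap(PROBEMAP_HEAD)
stable_ids = symbols_to_stable_ids(records, annotation)
```

Because the join keys on location rather than on the symbol text, a row whose symbol was corrupted would still resolve to the right gene, provided its coordinates are intact.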

@clairemcleod
Member

clairemcleod commented Jul 18, 2016

I've just played around with trying to reproduce the gene names available in PANCAN_mutation by mapping locations to the corresponding hugo_gencode_v24_gtf Ensembl IDs. For many observations, the Ensembl gene_id data aren't matching the original gene names in the PANCAN dataset. It seems like this difference may be due to the update from genome assembly GRCh37 to GRCh38 (i.e. mutation data was potentially labeled using GRCh37, but the gtf file seems based on GRCh38; see example below).

As we ultimately try to integrate the data sets, it seems like it will be important to ensure that we are using a standard reference genome version (ideally whatever version HiSeqV2 was mapped against).


For example:

| sample_id | chromosome | gene (PANCAN_mutation) | gene_id (gtf file) | corresponding gene (via Ensembl) | start (PANCAN_mutation) | start (gtf) | end (gtf) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TCGA-D8-A1J8-01 | chr10 | A1CF | ENSG00000228651.1 | RP11-556E13.1 | 52,587,953 | 52,556,702 | 52,755,409 |

A1CF location GRCh37p13: Chromosome 10: 52,559,169-52,645,435 reverse strand.
A1CF location GRCh38p5: Chromosome 10: 50,799,409-50,885,675 reverse strand.
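The mismatch in this example can be checked with a few interval tests. The sketch below uses only the coordinates quoted above; the helper name `contains` is illustrative.

```python
def contains(interval, position):
    """Return True if position lies within the closed interval (start, end)."""
    start, end = interval
    return start <= position <= end


mutation_start = 52_587_953          # start (PANCAN_mutation)

a1cf_grch37 = (52_559_169, 52_645_435)   # A1CF on GRCh37p13
a1cf_grch38 = (50_799_409, 50_885_675)   # A1CF on GRCh38p5
gtf_gene = (52_556_702, 52_755_409)      # ENSG00000228651.1 (RP11-556E13.1) in the gtf file

in_a1cf_grch37 = contains(a1cf_grch37, mutation_start)
in_a1cf_grch38 = contains(a1cf_grch38, mutation_start)
in_gtf_gene = contains(gtf_gene, mutation_start)
```

The mutation position falls inside A1CF only under the GRCh37 coordinates. Against the gtf file it lands in RP11-556E13.1 instead, which is consistent with the mutation data having been labeled using GRCh37.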

@gwaybio
Member

gwaybio commented Jul 19, 2016

@clairemcleod good call. It looks like HiSeqV2 is mapped to hg38 while PANCAN_mutation is mapped to hg19.

We can easily update the mutation file to hg38 using a liftover tool, but making the two builds consistent is definitely important.
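To illustrate what a liftover does conceptually, here is a toy sketch. It is not a real liftover tool: actual conversions rely on chain files containing many aligned blocks, whereas this uses a single hypothetical block whose offset is inferred from the two A1CF coordinates quoted earlier in the thread. All names (`Block`, `TOY_CHAIN`, `lift`) are made up for the example.

```python
from typing import NamedTuple, Optional


class Block(NamedTuple):
    chrom: str
    source_start: int
    source_end: int
    offset: int  # target position = source position + offset


# Offset inferred from the A1CF start on GRCh38 minus its start on GRCh37.
# Assumes the shift is constant across the gene, which a real chain file may not guarantee.
OFFSET = 50_799_409 - 52_559_169
TOY_CHAIN = [Block("chr10", 52_559_169, 52_645_435, OFFSET)]


def lift(chrom: str, position: int, chain) -> Optional[int]:
    """Shift a position into the target build, or return None if no block covers it."""
    for block in chain:
        if block.chrom == chrom and block.source_start <= position <= block.source_end:
            return position + block.offset
    return None


lifted = lift("chr10", 52_587_953, TOY_CHAIN)   # the mutation from the example
unmapped = lift("chr10", 1_000, TOY_CHAIN)      # outside every block
```

Under these assumptions the lifted position lands inside the GRCh38 A1CF interval. Positions that no block covers come back as `None`, and those should be counted and inspected rather than silently dropped.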

@ypar

ypar commented Jul 28, 2016

re: gene symbols or else
I am not aware of a 1-to-1 conversion between Ensembl IDs and either gene symbols or Entrez IDs.
Also, I am not aware of correct conversions between hg19 and hg38. A lot of contigs and other previously ambiguous regions have been resolved in hg38. hg38 is definitely recommended for new assemblies or alignments, but for annotation I'd recommend that we be more careful and make sure liftover is doing the right thing.

re: gencode annotation
If you are merely matching IDs for preliminary checks, GENCODE v19 is the latest update for GRCh37/hg19.

@dhimmel
Member Author

dhimmel commented Aug 26, 2016

AFAIC #10 and #12 have addressed this issue. We're now operating entirely using Entrez GeneIDs.

@dhimmel dhimmel closed this as completed Aug 26, 2016