Converting Xena datasets to standard identifiers rather than gene symbols #6

dhimmel · 2016-07-17T01:41:21Z

Xena datasets (as retrieved in #1) use symbols to identify genes rather than standardized identifiers, such as Entrez GeneIDs, Ensembl gene IDs, HGNC IDs, or UCSC gene IDs. This has led to upstream data quality issues such as #4. Hence, I think it makes sense to code our databases using standardized identifiers.

Currently, we use the HiSeqV2 and TCGA.PANCAN.sampleMap datasets which both use symbols. Does anyone have a preferred identifier? I like Entrez GeneIDs.

cgreene · 2016-07-17T01:44:35Z

I'm pretty sure django-genes uses Entrez. I agree that they are generally a bit nicer/more stable than symbols.

We might actually be able to use django-genes here if needed in the models.

dhimmel · 2016-07-17T02:10:13Z

For cancer-data, I think it probably makes sense to work with tables for converting genes, rather than greenelab/django-genes, as we would otherwise end up making lots of API calls.
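A minimal sketch of what table-based conversion could look like, using only the Python standard library. The two-column layout, the column names `symbol` and `entrez_gene_id`, and the helper names are assumptions made for illustration, not the format of any file Xena provides. The three example rows are hand-entered and should be checked against whichever mapping resource is eventually chosen.

```python
import csv
import io

# Illustrative mapping table (hypothetical layout and contents). A real table
# would come from whichever resource matches the data creators' symbols.
MAPPING_TSV = (
    "symbol\tentrez_gene_id\n"
    "TP53\t7157\n"
    "BRCA1\t672\n"
    "EGFR\t1956\n"
)


def load_symbol_map(handle):
    """Read a two-column TSV into a dict of symbol -> Entrez GeneID."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["symbol"]: int(row["entrez_gene_id"]) for row in reader}


def convert_symbols(symbols, symbol_to_entrez):
    """Split symbols into converted IDs and symbols that have no mapping."""
    converted = {}
    unmapped = []
    for symbol in symbols:
        if symbol in symbol_to_entrez:
            converted[symbol] = symbol_to_entrez[symbol]
        else:
            unmapped.append(symbol)
    return converted, unmapped


symbol_to_entrez = load_symbol_map(io.StringIO(MAPPING_TSV))
# "1-Mar" stands in for a symbol that was corrupted into a date (see #4).
converted, unmapped = convert_symbols(["TP53", "EGFR", "1-Mar"], symbol_to_entrez)
print(converted)
print(unmapped)
```

Returning the unmapped symbols separately, rather than silently dropping them, makes it easier to see how many genes are lost to ambiguous or corrupted symbols.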

For conversions such as this, I think it's best to use the mapping that is the inverse of whatever was used by the data creators, due to the ambiguity of gene symbols. I'm waiting to hear back from Xena Browser regarding gene mapping. It's possible they have actually created mapping files specific to their symbols.

I may be misunderstanding what django-genes does, but it seems like it may be of the most utility for providing django-cognoma or the javascript app with additional gene metadata for a small set of identifiers?

dhimmel · 2016-07-17T18:01:34Z

I noticed the HiSeqV2 metadata includes a probeMap attribute with the value /probeMap/hugo_gencode_v24_gtf. There's a file located at https://tcga.xenahubs.net/download/probeMap/hugo_gencode_v24_gtf (metadata) whose head is:

id	gene	chrom	chromStart	chromEnd	strand
DDX11L1	DDX11L1	chr1	11869	14409	+
WASH7P	WASH7P	chr1	14404	29570	-
MIR6859-1	MIR6859-1	chr1	17369	17436	-
RP11-34P13.3	RP11-34P13.3	chr1	29554	31109	+
MIR1302-2	MIR1302-2	chr1	30366	30503	+
FAM138A	FAM138A	chr1	34554	36081	-

Therefore, one possibility for converting symbols in HiSeqV2 to standardized IDs would be to use the genomic location information available in probeMap/hugo_gencode_v24_gtf. I noticed this file also contains the date-naming issue discussed in #4. Since each row still carries its genomic location, the corruption is potentially reversible.
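As a rough sketch of the location-based idea, the snippet below parses the first two probeMap rows shown above and joins them to a reference annotation on exact `(chrom, start, end, strand)` matches. The reference dictionary is hand-entered for illustration and is an assumption, as are the function names. A real implementation would load the full GENCODE v24 annotation and would need a policy for locations that match no gene or several genes.

```python
import csv
import io

# First rows of probeMap/hugo_gencode_v24_gtf, as quoted in this thread.
PROBEMAP_HEAD = (
    "id\tgene\tchrom\tchromStart\tchromEnd\tstrand\n"
    "DDX11L1\tDDX11L1\tchr1\t11869\t14409\t+\n"
    "WASH7P\tWASH7P\tchr1\t14404\t29570\t-\n"
)

# Hypothetical reference annotation keyed by location (hand-entered for
# illustration; verify against the actual GENCODE release before relying on it).
REFERENCE_BY_LOCATION = {
    ("chr1", 11869, 14409, "+"): "ENSG00000223972",
    ("chr1", 14404, 29570, "-"): "ENSG00000227232",
}


def read_probemap(handle):
    """Return a dict of probe id -> (chrom, start, end, strand)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["id"]: (
            row["chrom"],
            int(row["chromStart"]),
            int(row["chromEnd"]),
            row["strand"],
        )
        for row in reader
    }


def map_by_location(probe_locations, reference_by_location):
    """Look up each probe's location; None means no exact match was found."""
    return {
        probe_id: reference_by_location.get(location)
        for probe_id, location in probe_locations.items()
    }


probe_locations = read_probemap(io.StringIO(PROBEMAP_HEAD))
probe_to_reference = map_by_location(probe_locations, REFERENCE_BY_LOCATION)
print(probe_to_reference)
```

Because the lookup never consults the symbol itself, a row whose `id` was turned into a date would still resolve as long as its coordinates are intact.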

clairemcleod · 2016-07-18T22:47:33Z

I've just played around with trying to reproduce the gene names available in PANCAN_mutation by mapping locations to the corresponding hugo_gencode_v24_gtf Ensembl IDs. For many observations, the Ensembl gene_id data aren't matching the original gene names in the PANCAN dataset. It seems like this difference may be due to the update from genome assembly GRCh37 to GRCh38 (i.e. mutation data was potentially labeled using GRCh37, but the gtf file seems based on GRCh38; see example below).

As we ultimately try to integrate the data sets, it seems like it will be important to ensure that we are using a standard reference genome version (ideally whatever version HiSeqV2 was mapped against).

For example:

| sample_id | chromosome | gene (`PANCAN_mutation`) | gene_id (gtf file) | corresponding gene (via Ensembl) | start (`PANCAN_mutation`) | start (gtf) | end (gtf) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TCGA-D8-A1J8-01 | chr10 | A1CF | ENSG00000228651.1 | RP11-556E13.1 | 52,587,953 | 52,556,702 | 52,755,409 |

A1CF location GRCh37p13: Chromosome 10: 52,559,169-52,645,435 reverse strand.
A1CF location GRCh38p5: Chromosome 10: 50,799,409-50,885,675 reverse strand.
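The mismatch can be checked with a few lines of Python. This is only a sketch: the `Interval` helper is a hypothetical name introduced here, and all coordinates are copied from the example above. It tests whether the mutation start position falls inside each candidate gene interval.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A closed genomic interval on one chromosome."""

    chrom: str
    start: int
    end: int

    def contains(self, chrom, position):
        """True if the position lies on this chromosome within [start, end]."""
        return chrom == self.chrom and self.start <= position <= self.end


# Mutation start for TCGA-D8-A1J8-01 as reported in PANCAN_mutation.
MUTATION = ("chr10", 52_587_953)

A1CF_GRCH37 = Interval("chr10", 52_559_169, 52_645_435)
A1CF_GRCH38 = Interval("chr10", 50_799_409, 50_885_675)
# Interval of the gtf entry that the mutation location matched (RP11-556E13.1).
GTF_ENSG00000228651 = Interval("chr10", 52_556_702, 52_755_409)

for label, interval in [
    ("A1CF on GRCh37", A1CF_GRCH37),
    ("A1CF on GRCh38", A1CF_GRCH38),
    ("ENSG00000228651.1 in gtf", GTF_ENSG00000228651),
]:
    print(label, interval.contains(*MUTATION))
```

The position lies inside the GRCh37 interval for A1CF and inside the gtf interval for ENSG00000228651.1, but outside the GRCh38 interval for A1CF, which is consistent with the mutation data having been labeled against GRCh37.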

gwaybio · 2016-07-19T11:14:22Z

@clairemcleod good call. It looks like HiSeqV2 is mapped to hg38 while PANCAN_mutation is mapped to hg19.

We can easily update the mutation file to hg38 using a liftover tool, but it is definitely important that we do so.

ypar · 2016-07-28T04:34:44Z

re: gene symbols or else
I am not aware of a 1-to-1 conversion between Ensembl IDs and either gene symbols or Entrez IDs.
Also, I am not aware of correct conversions between hg19 and hg38. A lot of contigs and other previously ambiguous regions have been resolved in hg38. hg38 is definitely recommended for new assemblies or alignments, but for annotation, I'd recommend that we be more careful and make sure liftover is doing the right thing.

re: gencode annotation
If you are merely matching IDs for preliminary checks, GENCODE v19 is the latest update for GRCh37/hg19.

dhimmel · 2016-08-26T22:16:25Z

AFAIC #10 and #12 have addressed this issue. We're now operating entirely using Entrez GeneIDs.

dhimmel added the task label Jul 17, 2016

gwaybio mentioned this issue Jul 20, 2016

Status Chooser - Options cognoma/frontend#1

Closed

This was referenced Jul 23, 2016

Map HiSeqV2 symbols to entrez gene IDs #8

Merged

Gene names converted to dates in Xena's PANCAN_mutation dataset #4

Open

clairemcleod self-assigned this Jul 26, 2016

clairemcleod mentioned this issue Jul 27, 2016

Map mutation gene symbols to Entrez IDs #12

Merged

dhimmel closed this as completed Aug 26, 2016
