Converting Xena datasets to standard identifiers rather than gene symbols #6
Comments
I'm pretty sure django-genes uses Entrez. I agree that they are generally a bit nicer/more stable than symbols. We might actually be able to use django-genes here if needed in the models. |
For conversions such as this, I think it's best to use the mapping that is the inverse of whatever was used by the data creators, due to the ambiguity of gene symbols. I'm waiting to hear back from Xena Browser regarding gene mapping. It's possible they have actually created mapping files specific to their symbols. I may be misunderstanding what |
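To make the "inverse mapping" idea concrete, here is a minimal sketch, assuming the data creators' mapping is available as a plain dict from gene ID to symbol. The function names and the placeholder IDs and symbols are hypothetical and not taken from any Xena file. It inverts the mapping and separates symbols that resolve to one ID from symbols that are ambiguous.

```python
from collections import defaultdict


def invert_mapping(id_to_symbol):
    """Invert a gene ID -> symbol mapping into symbol -> sorted list of gene IDs."""
    symbol_to_ids = defaultdict(set)
    for gene_id, symbol in id_to_symbol.items():
        symbol_to_ids[symbol].add(gene_id)
    return {symbol: sorted(ids) for symbol, ids in symbol_to_ids.items()}


def split_by_ambiguity(symbol_to_ids):
    """Separate symbols with exactly one gene ID from symbols with several."""
    unique = {s: ids[0] for s, ids in symbol_to_ids.items() if len(ids) == 1}
    ambiguous = {s: ids for s, ids in symbol_to_ids.items() if len(ids) > 1}
    return unique, ambiguous


# Placeholder forward mapping (hypothetical IDs and symbols, for illustration only).
creator_mapping = {"id1": "GENE_A", "id2": "GENE_B", "id3": "GENE_B"}

unique, ambiguous = split_by_ambiguity(invert_mapping(creator_mapping))
print("unambiguous:", unique)
print("ambiguous:", ambiguous)
```

Symbols that land in the ambiguous group would need a manual decision or would have to be dropped, which is exactly the case where knowing the creators' original mapping helps.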
I noticed the
Therefore, one possibility for converting symbols in |
I've just played around with trying to reproduce the gene names available in As we ultimately try to integrate the data sets, it seems like it will be important to ensure that we are using a standard reference genome version (ideally whatever version HiSeqV2 was mapped against). For example:
A1CF location GRCh37.p13: Chromosome 10: 52,559,169-52,645,435 reverse strand. |
@clairemcleod good call. It looks like HiSeqV2 is mapped to hg38 while PANCAN_mutation is mapped to hg19. We can easily update the mutation file to hg38 using a liftover tool but it is definitely important. |
re: gene symbols or else re: gencode annotation |
Xena datasets (as retrieved in #1) use symbols to identify genes rather than standardized identifiers, such as Entrez GeneIDs, Ensembl gene IDs, HGNC IDs, or UCSC gene IDs. This has led to upstream data quality issues such as #4. Hence, I think it makes sense to code our databases using standardized identifiers.
Currently, we use the HiSeqV2 and TCGA.PANCAN.sampleMap datasets which both use symbols. Does anyone have a preferred identifier? I like Entrez GeneIDs.
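As a rough illustration of what the conversion could look like, here is a minimal sketch. It assumes a tab-separated matrix with gene symbols in the first column and one column per sample, plus a symbol-to-Entrez lookup that we trust. The function name `rekey_matrix`, the sample names, and the expression values are made up for the example; the three GeneIDs shown are the NCBI Gene IDs for those symbols. Unmapped symbols are collected rather than silently dropped so they can be reviewed.

```python
import csv
import io


def rekey_matrix(tsv_text, symbol_to_entrez):
    """Replace the gene-symbol column of a TSV matrix with Entrez GeneIDs.

    Returns (converted_rows, unmapped_symbols). Rows whose symbol has no
    entry in symbol_to_entrez are left out of converted_rows.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    converted = [["entrez_gene_id"] + header[1:]]
    unmapped = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        symbol = row[0]
        gene_id = symbol_to_entrez.get(symbol)
        if gene_id is None:
            unmapped.append(symbol)
        else:
            converted.append([str(gene_id)] + row[1:])
    return converted, unmapped


# Small hand-written lookup (assumption: a real one would come from a mapping file).
symbol_to_entrez = {"A1BG": 1, "A1CF": 29974, "TP53": 7157}

# Made-up matrix in the assumed layout; values are not real expression data.
example_tsv = (
    "sample\tsample_1\tsample_2\n"
    "A1BG\t1.0\t2.0\n"
    "NOT_A_GENE\t3.0\t4.0\n"
    "TP53\t5.0\t6.0\n"
)

rows, missing = rekey_matrix(example_tsv, symbol_to_entrez)
for row in rows:
    print("\t".join(row))
print("unmapped symbols:", missing)
```

Keeping the list of unmapped symbols makes it easy to see how much data a given mapping would lose before committing to an identifier.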