Genes with multiple chromosomes #2

Closed
dhimmel opened this issue Oct 7, 2016 · 8 comments

Comments

dhimmel commented Oct 7, 2016

What does it mean for a gene to have multiple chromosomes? Here are all the genes from genes.tsv that exhibited multiple chromosomes:

entrez_gene_id symbol description chromosome gene_type synonyms
263 AMD1P2 adenosylmethionine decarboxylase 1 pseudogene 2 X Y pseudo
293 SLC25A6 solute carrier family 25 member 6 X Y protein-coding
438 ASMT acetylserotonin O-methyltransferase X Y protein-coding
1438 CSF2RA colony stimulating factor 2 receptor alpha subunit X Y protein-coding
3563 IL3RA interleukin 3 receptor subunit alpha X Y protein-coding
3581 IL9R interleukin 9 receptor X Y protein-coding
4267 CD99 CD99 molecule X Y protein-coding
6473 SHOX short stature homeobox X Y protein-coding
6845 VAMP7 vesicle associated membrane protein 7 X Y protein-coding
7501 XGR XG and CD99 regulator X Y other
8225 GTPBP6 GTP binding protein 6 (putative) X Y protein-coding
8227 AKAP17A A-kinase anchoring protein 17A X Y protein-coding
8623 ASMTL acetylserotonin O-methyltransferase-like X Y protein-coding
9189 ZBED1 zinc finger BED-type containing 1 X Y protein-coding
10251 SPRY3 sprouty RTK signaling antagonist 3 X Y protein-coding
28227 PPP2R3B protein phosphatase 2 regulatory subunit B''beta X Y protein-coding
55344 PLCXD1 phosphatidylinositol specific phospholipase C X domain containing 1 X Y protein-coding
64109 CRLF2 cytokine receptor-like factor 2 X Y protein-coding
80161 ASMTL-AS1 ASMTL antisense RNA 1 X Y ncRNA
207063 DHRSX dehydrogenase/reductase X-linked X Y protein-coding
283981 LINC00685 long intergenic non-protein coding RNA 685 X Y ncRNA
286530 P2RY8 purinergic receptor P2Y8 X Y protein-coding
401577 CD99P1 CD99 molecule pseudogene 1 X Y pseudo
442442 RPL14P5 ribosomal protein L14 pseudogene 5 X Y pseudo
619538 OMS otitis media, susceptibility to 10 19 3
644218 TRPC6P transient receptor potential cation channel subfamily C member 6, pseudogene X Y pseudo
652608 LOC652608 60S ribosomal protein L6-like X Y pseudo
653440 WASH6P WAS protein family homolog 6 pseudogene X Y pseudo
727856 DDX11L16 DEAD/H-box helicase 11 like 16 X Y pseudo
751580 LINC00106 long intergenic non-protein coding RNA 106 X Y ncRNA
100128260 WASIR1 WASH and IL9R antisense RNA 1 X Y ncRNA
100287692 TCEB1P24 transcription elongation factor B subunit 1 pseudogene 24 X Y pseudo
100359394 LINC00102 long intergenic non-protein coding RNA 102 X Y ncRNA
100418703 LOC100418703 repetin pseudogene X Y pseudo
100500894 MIR3690 microRNA 3690 X Y ncRNA
101928032 LOC101928032 uncharacterized LOC101928032 X Y ncRNA
101928055 LOC101928055 uncharacterized LOC101928055 X Y ncRNA
101928070 LOC101928070 uncharacterized LOC101928070 X Y ncRNA
101928092 LOC101928092 uncharacterized LOC101928092 X Y ncRNA
102464837 MIR6089 microRNA 6089 X Y ncRNA
102724521 LOC102724521 uncharacterized LOC102724521 X Y ncRNA
102725051 LOC102725051 uncharacterized LOC102725051 1 Un ncRNA
105373102 LOC105373102 uncharacterized LOC105373102 X Y protein-coding
105373105 LOC105373105 uncharacterized LOC105373105 X Y ncRNA
105379413 LOC105379413 uncharacterized LOC105379413 X Y ncRNA
105379414 LOC105379414 uncharacterized LOC105379414 X Y ncRNA
105379561 LOC105379561 uncharacterized LOC105379561 8 Un protein-coding
106478924 DHRSX-IT1 DHRSX intronic transcript 1 X Y ncRNA
106478926 DPH3P2 diphthamide biosynthesis 3 pseudogene 2 X Y pseudo
106480712 FABP5P13 fatty acid binding protein 5 pseudogene 13 X Y pseudo
106480770 RNA5SP498 RNA, 5S ribosomal pseudogene 498 X Y pseudo
107985637 LOC107985637 uncharacterized LOC107985637 X Y ncRNA
107985677 LOC107985677 uncharacterized LOC107985677 X Y ncRNA
107985697 LOC107985697 uncharacterized LOC107985697 X Y ncRNA
107985706 LOC107985706 uncharacterized LOC107985706 X Y ncRNA
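
For reference, a minimal sketch of how rows like the ones above could be pulled out of genes.tsv. It assumes the file is tab-separated and that multiple chromosomes are joined with "|" in the chromosome column (the "X|Y" notation used later in this thread). The inline three-row excerpt and the function name `multi_chromosome_genes` are illustrative, not taken from the repository.

```python
import csv
import io

# Illustrative excerpt in the assumed layout of genes.tsv (tab-separated).
# The "|" separator between chromosomes is an assumption.
GENES_TSV = (
    "entrez_gene_id\tsymbol\tchromosome\n"
    "293\tSLC25A6\tX|Y\n"
    "7157\tTP53\t17\n"
    "619538\tOMS\t10|19|3\n"
)


def multi_chromosome_genes(tsv_text, sep="|"):
    """Return the rows whose chromosome field lists more than one chromosome."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if sep in row["chromosome"]]


multi = multi_chromosome_genes(GENES_TSV)
print([row["symbol"] for row in multi])
```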

dhimmel commented Oct 7, 2016

Do we want to split these into multiple records when creating chromosome-symbol-mapper.tsv?

cgreene commented Oct 7, 2016

The X|Y ones are in the pseudoautosomal regions of the X and Y chromosomes. I would not be worried about those and would not split them. These should be retained.

OMS looks like a susceptibility "gene." It's not really a molecular entity, just a set of association signal regions: https://www.ncbi.nlm.nih.gov/gene/?term=619538 . This could be dropped.

The others appear to be on unplaced scaffolds: https://www.ncbi.nlm.nih.gov/gene/?term=105379561
For now, I would probably drop these too, though it's not as clear that these should be dropped as it is for something like OMS.

By the way - if someone picks a gene on the X or Y chromosomes other than those in the X|Y set, you may want to automatically detect it and build separate male and female classifiers. This is a strong signal in expression data, even for unsupervised learning.
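
One way the automatic detection suggested here might look, as a rough sketch: it flags a gene only when its chromosome value is exactly X or exactly Y, so pseudoautosomal "X|Y" entries are not flagged. The "|" separator and the function name `is_sex_linked` are assumptions; how separate male and female classifiers would then be built is not shown.

```python
def is_sex_linked(chromosome, sep="|"):
    """True when the gene sits on X only or Y only.

    Pseudoautosomal entries such as "X|Y" return False, as do autosomes.
    """
    parts = set(chromosome.split(sep))
    return parts == {"X"} or parts == {"Y"}


for value in ["X", "Y", "X|Y", "17"]:
    print(value, is_sex_linked(value))
```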

dhimmel commented Oct 7, 2016

> The X|Y ones are in the pseudoautosomal regions of the X and Y chromosomes. I would not be worried about those and would not split them. These should be retained.

@cgreene, we're including this file to map PANCAN_mutation (as has been done by cognoma/cancer-data#12). Therefore, I looked at how several of the pseudoautosomal genes were coded in that dataset:

sample chr start end reference alt gene effect DNA_VAF RNA_VAF Amino_Acid_Change
TCGA-BH-A18P-01 chrX 1508405 1508405 G A SLC25A6 Silent p.F109
TCGA-06-5416-01 chrX 1746629 1746629 C T ASMT Silent 0.276457883369
TCGA-CD-A4MI-01 chrX 1413254 1413254 G A CSF2RA Missense_Mutation p.R227H

So unless we split chromosomes, these genes will not map. I propose splitting with an optional step to include the unsplit rows. Therefore the top row would yield:

entrez_gene_id symbol description chromosome gene_type synonyms
263 AMD1P2 adenosylmethionine decarboxylase 1 pseudogene 2 X pseudo AMD
263 AMD1P2 adenosylmethionine decarboxylase 1 pseudogene 2 Y pseudo AMD
263 AMD1P2 adenosylmethionine decarboxylase 1 pseudogene 2 X Y pseudo

Do you think we should even keep the last row?
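
The proposed split could be sketched roughly as follows. This is not the code from the repository: the function name `split_chromosomes`, the `keep_unsplit` flag, and the "|" separator are assumptions made for illustration.

```python
def split_chromosomes(rows, sep="|", keep_unsplit=True):
    """Return one row per chromosome for multi-chromosome genes.

    When keep_unsplit is True, the original multi-chromosome row is kept
    after its per-chromosome copies. Single-chromosome rows pass through.
    """
    out = []
    for row in rows:
        chromosomes = row["chromosome"].split(sep)
        if len(chromosomes) == 1:
            out.append(dict(row))
            continue
        for chromosome in chromosomes:
            out.append({**row, "chromosome": chromosome})
        if keep_unsplit:
            out.append(dict(row))
    return out


rows = [{"entrez_gene_id": "263", "symbol": "AMD1P2", "chromosome": "X|Y"}]
expanded = split_chromosomes(rows)
print([row["chromosome"] for row in expanded])
```

Passing `keep_unsplit=False` would drop the last row, which is the choice being asked about here.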

dhimmel commented Oct 7, 2016

> By the way - if someone picks a gene on the X or Y chromosomes other than those in the X|Y set, you may want to automatically detect it and build separate male and female classifiers. This is a strong signal in expression data, even for unsupervised learning.

May want to open an issue in machine-learning.

dhimmel commented Oct 7, 2016

> The others appear to be on unplaced scaffolds. For now, I would probably drop these too, though it's not as clear that these should be dropped as it is for something like OMS.

Okay, leaving these in will in effect drop them, because the resource being mapped won't have that symbol-chromosome combination. No need to explicitly filter.
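
To illustrate why no explicit filter is needed, here is a small sketch of a lookup keyed on the (chromosome, symbol) pair. The dictionary layout, the helper names, and the stripping of a "chr" prefix are assumptions for illustration; only the sample, chr, and gene values come from the mutation rows quoted earlier in the thread.

```python
def normalize_chromosome(label):
    """Strip a leading 'chr' so 'chrX' matches the 'X' used in the gene table."""
    return label[3:] if label.startswith("chr") else label


def build_mapper(gene_rows):
    """Index Entrez gene IDs by (chromosome, symbol)."""
    return {
        (row["chromosome"], row["symbol"]): row["entrez_gene_id"]
        for row in gene_rows
    }


def map_mutation(mutation, mapper):
    """Return the Entrez gene ID for a mutation record, or None if unmapped."""
    key = (normalize_chromosome(mutation["chr"]), mutation["gene"])
    return mapper.get(key)


gene_rows = [
    {"entrez_gene_id": "293", "symbol": "SLC25A6", "chromosome": "X"},
    {"entrez_gene_id": "293", "symbol": "SLC25A6", "chromosome": "Y"},
    {"entrez_gene_id": "293", "symbol": "SLC25A6", "chromosome": "X|Y"},
    # Unplaced-scaffold entry: harmless, because no mutation row uses this key.
    {"entrez_gene_id": "105379561", "symbol": "LOC105379561", "chromosome": "Un"},
]
mapper = build_mapper(gene_rows)

mutation = {"sample": "TCGA-BH-A18P-01", "chr": "chrX", "gene": "SLC25A6"}
entrez_id = map_mutation(mutation, mapper)
print(entrez_id)
```

Entries that never match a record in the mapped resource simply go unused, so they drop out of the result without a filtering step.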

cgreene commented Oct 7, 2016

@dhimmel : for the purposes of having a resource to connect potential symbols with chromosomes, I think that retaining at least the first two lines would make the most sense. Maybe the third - I don't know how many resources use X|Y for these regions. I don't see the harm in it, so I guess my inclination would be to leave it as well.

dhimmel added a commit to dhimmel/genes that referenced this issue Oct 7, 2016
Genes with multiple chromosomes now receive multiple rows for each chromosome as
well as retaining the multi-chromosome value.
cognoma#2 (comment)

Genes with a missing value for chromosome are removed.

dhimmel commented Oct 7, 2016

@cgreene in b64fcb4 I retained all three lines.

However, there is another issue -- some genes have no chromosome. For example:

These genes all have type unknown, so I'm guessing the inability to map them will not be a big deal. In fact they most likely won't be in our datasets?
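
Removing such genes, as the commit message above describes, might look like the sketch below. How a missing chromosome is encoded in the source data is an assumption here (an empty string or "-"), and the second row is a made-up placeholder rather than a real gene record.

```python
# Assumed encodings for a missing chromosome value.
MISSING = {"", "-"}


def drop_missing_chromosome(rows):
    """Keep only the rows that carry a chromosome value."""
    return [row for row in rows if row["chromosome"].strip() not in MISSING]


rows = [
    {"symbol": "SLC25A6", "chromosome": "X|Y"},
    {"symbol": "PLACEHOLDER_NO_CHROMOSOME", "chromosome": "-"},  # hypothetical row
]
kept = drop_missing_chromosome(rows)
print([row["symbol"] for row in kept])
```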

cgreene commented Oct 7, 2016

These - to my knowledge - come from the expectation that there exists a gene for the disease but nobody has found it. They aren't really meaningful molecular entities, and I expect that you won't see them in practice.

dhimmel closed this as completed in 7212040 Oct 9, 2016
dhimmel added a commit to dhimmel/genes that referenced this issue Apr 7, 2018
Genes with multiple chromosomes now receive multiple rows for each chromosome as
well as retaining the multi-chromosome value.
cognoma#2 (comment)
Genes with a missing value for chromosome are removed.
dhimmel added a commit to dhimmel/genes that referenced this issue Apr 7, 2018
Download and process Entrez Gene.
Create gene identification guidelines for Project Cognoma.
Closes cognoma#2.