Genes with multiple chromosomes #2

dhimmel · 2016-10-07T01:50:56Z

What does it mean for a gene to have multiple chromosomes? Here are all the genes from genes.tsv that exhibited multiple chromosomes:

entrez_gene_id	symbol	description	chromosome	gene_type	synonyms
263	AMD1P2	adenosylmethionine decarboxylase 1 pseudogene 2	X|Y	pseudo
293	SLC25A6	solute carrier family 25 member 6	X|Y	protein-coding
438	ASMT	acetylserotonin O-methyltransferase	X|Y	protein-coding
1438	CSF2RA	colony stimulating factor 2 receptor alpha subunit	X|Y	protein-coding
3563	IL3RA	interleukin 3 receptor subunit alpha	X|Y	protein-coding
3581	IL9R	interleukin 9 receptor	X|Y	protein-coding
4267	CD99	CD99 molecule	X|Y	protein-coding
6473	SHOX	short stature homeobox	X|Y	protein-coding
6845	VAMP7	vesicle associated membrane protein 7	X|Y	protein-coding
7501	XGR	XG and CD99 regulator	X|Y	other
8225	GTPBP6	GTP binding protein 6 (putative)	X|Y	protein-coding
8227	AKAP17A	A-kinase anchoring protein 17A	X|Y	protein-coding
8623	ASMTL	acetylserotonin O-methyltransferase-like	X|Y	protein-coding
9189	ZBED1	zinc finger BED-type containing 1	X|Y	protein-coding
10251	SPRY3	sprouty RTK signaling antagonist 3	X|Y	protein-coding
28227	PPP2R3B	protein phosphatase 2 regulatory subunit B''beta	X|Y	protein-coding
55344	PLCXD1	phosphatidylinositol specific phospholipase C X domain containing 1	X|Y	protein-coding
64109	CRLF2	cytokine receptor-like factor 2	X|Y	protein-coding
80161	ASMTL-AS1	ASMTL antisense RNA 1	X|Y	ncRNA
207063	DHRSX	dehydrogenase/reductase X-linked	X|Y	protein-coding
283981	LINC00685	long intergenic non-protein coding RNA 685	X|Y	ncRNA
286530	P2RY8	purinergic receptor P2Y8	X|Y	protein-coding
401577	CD99P1	CD99 molecule pseudogene 1	X|Y	pseudo
442442	RPL14P5	ribosomal protein L14 pseudogene 5	X|Y	pseudo
619538	OMS	otitis media, susceptibility to	10|19|3
644218	TRPC6P	transient receptor potential cation channel subfamily C member 6, pseudogene	X|Y	pseudo
652608	LOC652608	60S ribosomal protein L6-like	X|Y	pseudo
653440	WASH6P	WAS protein family homolog 6 pseudogene	X|Y	pseudo
727856	DDX11L16	DEAD/H-box helicase 11 like 16	X|Y	pseudo
751580	LINC00106	long intergenic non-protein coding RNA 106	X|Y	ncRNA
100128260	WASIR1	WASH and IL9R antisense RNA 1	X|Y	ncRNA
100287692	TCEB1P24	transcription elongation factor B subunit 1 pseudogene 24	X|Y	pseudo
100359394	LINC00102	long intergenic non-protein coding RNA 102	X|Y	ncRNA
100418703	LOC100418703	repetin pseudogene	X|Y	pseudo
100500894	MIR3690	microRNA 3690	X|Y	ncRNA
101928032	LOC101928032	uncharacterized LOC101928032	X|Y	ncRNA
101928055	LOC101928055	uncharacterized LOC101928055	X|Y	ncRNA
101928070	LOC101928070	uncharacterized LOC101928070	X|Y	ncRNA
101928092	LOC101928092	uncharacterized LOC101928092	X|Y	ncRNA
102464837	MIR6089	microRNA 6089	X|Y	ncRNA
102724521	LOC102724521	uncharacterized LOC102724521	X|Y	ncRNA
102725051	LOC102725051	uncharacterized LOC102725051	1|Un	ncRNA
105373102	LOC105373102	uncharacterized LOC105373102	X|Y	protein-coding
105373105	LOC105373105	uncharacterized LOC105373105	X|Y	ncRNA
105379413	LOC105379413	uncharacterized LOC105379413	X|Y	ncRNA
105379414	LOC105379414	uncharacterized LOC105379414	X|Y	ncRNA
105379561	LOC105379561	uncharacterized LOC105379561	8|Un	protein-coding
106478924	DHRSX-IT1	DHRSX intronic transcript 1	X|Y	ncRNA
106478926	DPH3P2	diphthamide biosynthesis 3 pseudogene 2	X|Y	pseudo
106480712	FABP5P13	fatty acid binding protein 5 pseudogene 13	X|Y	pseudo
106480770	RNA5SP498	RNA, 5S ribosomal pseudogene 498	X|Y	pseudo
107985637	LOC107985637	uncharacterized LOC107985637	X|Y	ncRNA
107985677	LOC107985677	uncharacterized LOC107985677	X|Y	ncRNA
107985697	LOC107985697	uncharacterized LOC107985697	X|Y	ncRNA
107985706	LOC107985706	uncharacterized LOC107985706	X|Y	ncRNA
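For anyone wanting to reproduce a listing like this, here is a minimal sketch of the filter, assuming the chromosome field joins multiple values with `|` (as in `X|Y`). It uses only the standard library. The inline `GENES_TSV` text is a tiny stand-in for genes.tsv, and the TP53 row was added purely as a single-chromosome control; the function name is hypothetical.

```python
import csv
import io

# Tiny stand-in for genes.tsv; only the columns used here are included.
GENES_TSV = (
    "entrez_gene_id\tsymbol\tchromosome\n"
    "263\tAMD1P2\tX|Y\n"
    "619538\tOMS\t10|19|3\n"
    "102725051\tLOC102725051\t1|Un\n"
    "7157\tTP53\t17\n"  # control row, not part of the listing above
)


def multi_chromosome_genes(tsv_text, sep="|"):
    """Return the rows whose chromosome field lists more than one value."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if sep in row["chromosome"]]


multi = multi_chromosome_genes(GENES_TSV)
for row in multi:
    print(row["symbol"], row["chromosome"].split("|"))
```

On the real file, the same function could be given the file contents read with `open(path).read()`.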

dhimmel · 2016-10-07T01:52:13Z

Do we want to split these into multiple records when creating chromosome-symbol-mapper.tsv?

cgreene · 2016-10-07T12:37:33Z

The X|Y ones are in the pseudoautosomal regions of the X and Y chromosomes. I would not be worried about those and would not split them. These should be retained.

OMS looks like a susceptibility "gene." It's not really a molecular entity, just a set of association signal regions: https://www.ncbi.nlm.nih.gov/gene/?term=619538 . This could be dropped.

The others appear to be on unplaced scaffolds: https://www.ncbi.nlm.nih.gov/gene/?term=105379561
For now, I would probably drop these too, though it's not as clear that these should be dropped as it is for something like OMS.

By the way - if someone picks a gene on the X or Y chromosomes other than those in the X|Y set, you may want to automatically detect it and build separate male and female classifiers. This is a strong signal in expression data, even for unsupervised learning.
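A rough sketch of what that automatic detection could look like, assuming it works from the unsplit chromosome annotation (so that `X|Y` can be told apart from `X` or `Y` alone). The function name is hypothetical and not part of any existing Cognoma code.

```python
def needs_sex_stratification(chromosome, sep="|"):
    """True for genes annotated to X or Y alone; False for X|Y and autosomes."""
    return set(chromosome.split(sep)) in ({"X"}, {"Y"})


for value in ["X", "Y", "X|Y", "17"]:
    print(value, needs_sex_stratification(value))
```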

dhimmel · 2016-10-07T14:34:33Z

> The X|Y ones are in the pseudoautosomal regions of the X and Y chromosomes. I would not be worried about those and would not split them. These should be retained.

@cgreene, we're including this file to map PANCAN_mutation (as has been done by cognoma/cancer-data#12). Therefore, I looked at how several of the pseudoautosomal genes were coded in that dataset:

sample	chr	start	end	reference	alt	gene	effect	DNA_VAF	RNA_VAF	Amino_Acid_Change
TCGA-BH-A18P-01	chrX	1508405	1508405	G	A	SLC25A6	Silent			p.F109
TCGA-06-5416-01	chrX	1746629	1746629	C	T	ASMT	Silent	0.276457883369
TCGA-CD-A4MI-01	chrX	1413254	1413254	G	A	CSF2RA	Missense_Mutation			p.R227H

So unless we split chromosomes, these genes will not map. I propose splitting with an optional step to include the unsplit rows. Therefore the top row would yield:

entrez_gene_id	symbol	description	chromosome	gene_type	synonyms
263	AMD1P2	adenosylmethionine decarboxylase 1 pseudogene 2	X	pseudo	AMD
263	AMD1P2	adenosylmethionine decarboxylase 1 pseudogene 2	Y	pseudo	AMD
263	AMD1P2	adenosylmethionine decarboxylase 1 pseudogene 2	X|Y	pseudo	AMD

Do you think we should even keep the last row?
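As a minimal sketch of the proposed split (not the code actually committed), the function below expands each multi-chromosome row into one row per chromosome and optionally keeps the unsplit row. The function name, the `keep_unsplit` flag, and the `|` separator are assumptions; skipping rows with an empty chromosome is also an assumption of this sketch.

```python
def split_chromosomes(rows, keep_unsplit=True, sep="|"):
    """Expand multi-chromosome rows into one row per chromosome.

    rows: iterable of dicts with a "chromosome" key.
    keep_unsplit: also emit the original multi-chromosome row.
    Rows with an empty or missing chromosome are skipped (assumption).
    """
    out = []
    for row in rows:
        chrom = row.get("chromosome") or ""
        if not chrom:
            continue
        parts = chrom.split(sep)
        for part in parts:
            out.append({**row, "chromosome": part})
        if keep_unsplit and len(parts) > 1:
            out.append(dict(row))
    return out


amd1p2 = {"entrez_gene_id": "263", "symbol": "AMD1P2", "chromosome": "X|Y"}
for row in split_chromosomes([amd1p2]):
    print(row["entrez_gene_id"], row["symbol"], row["chromosome"])
```

Calling it with `keep_unsplit=False` yields only the X and Y rows, which is the variant in question here.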

dhimmel · 2016-10-07T14:35:50Z

> By the way - if someone picks a gene on the X or Y chromosomes other than those in the X|Y set, you may want to automatically detect it and build separate male and female classifiers. This is a strong signal in expression data, even for unsupervised learning.

May want to open an issue in machine-learning.

dhimmel · 2016-10-07T14:37:19Z

> The others appear to be on unplaced scaffolds. For now, I would probably drop these too, though it's not as clear that these should be dropped as it is for something like OMS.

Okay, leaving these in will in effect drop them, because the resource being mapped won't have that symbol-chromosome combination. No need to filter them explicitly.
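To illustrate why no explicit filter is needed, here is a small sketch of a lookup keyed on (symbol, chromosome). The gene rows follow the split layout discussed above, and the mutation row comes from the PANCAN_mutation excerpt. The function names and the stripping of the `chr` prefix are assumptions made for this example.

```python
GENE_ROWS = [
    {"entrez_gene_id": "293", "symbol": "SLC25A6", "chromosome": "X"},
    {"entrez_gene_id": "293", "symbol": "SLC25A6", "chromosome": "Y"},
    {"entrez_gene_id": "293", "symbol": "SLC25A6", "chromosome": "X|Y"},
    {"entrez_gene_id": "105379561", "symbol": "LOC105379561", "chromosome": "8"},
    {"entrez_gene_id": "105379561", "symbol": "LOC105379561", "chromosome": "Un"},
    {"entrez_gene_id": "105379561", "symbol": "LOC105379561", "chromosome": "8|Un"},
]
MUTATIONS = [{"sample": "TCGA-BH-A18P-01", "chr": "chrX", "gene": "SLC25A6"}]


def build_mapper(gene_rows):
    """Index Entrez Gene IDs by (symbol, chromosome)."""
    return {(r["symbol"], r["chromosome"]): r["entrez_gene_id"] for r in gene_rows}


def map_mutations(mutations, mapper):
    """Attach entrez_gene_id to each mutation; None when no key matches."""
    mapped = []
    for m in mutations:
        # Assumption: the dataset writes 'chrX' where the gene table writes 'X'.
        chrom = m["chr"][3:] if m["chr"].startswith("chr") else m["chr"]
        mapped.append({**m, "entrez_gene_id": mapper.get((m["gene"], chrom))})
    return mapped


mapper = build_mapper(GENE_ROWS)
mapped = map_mutations(MUTATIONS, mapper)
print(mapped[0]["gene"], mapped[0]["entrez_gene_id"])
```

Keys such as `("LOC105379561", "Un")` stay in the mapper but are never looked up unless the mapped resource uses that exact combination, so they have no effect on the result.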

cgreene · 2016-10-07T18:08:00Z

@dhimmel : for the purposes of having a resource to connect potential symbols with chromosomes, I think that retaining at least the first two lines would make the most sense. Maybe the third - I don't know how many resources use X|Y for these regions. I don't see the harm in it, so I guess my inclination would be to leave it as well.

Genes with multiple chromosomes now receive one row per chromosome and also retain the row with the multi-chromosome value (cognoma#2). Genes with a missing value for chromosome are removed.

dhimmel · 2016-10-07T20:34:20Z

@cgreene in b64fcb4 I retained all three lines.

However, there is another issue -- some genes have no chromosome. For example:

These genes all have type unknown, so I'm guessing the inability to map them will not be a big deal. In fact they most likely won't be in our datasets?

cgreene · 2016-10-07T21:06:20Z

These - to my knowledge - come from the expectation that there exists a gene for the disease but nobody has found it. They aren't really meaningful molecular entities, and I expect that you won't see them in practice.

Download and process Entrez Gene. Create gene identification guidelines for Project Cognoma. Closes cognoma#2.

dhimmel closed this as completed in 7212040 Oct 9, 2016

dhimmel added a commit to dhimmel/genes that referenced this issue Apr 7, 2018

Merge pull request cognoma#1 from dhimmel/entrez

ffc3ddb
