Map mutation gene symbols to Entrez IDs #12

clairemcleod · 2016-07-27T15:32:16Z

This pull request addresses Issues #4 and #6. It extends 2.TCGA-process.ipynb to map mutation gene symbols to Entrez IDs as part of the processing workflow.

The mapping is conducted in two stages. First, gene symbols are mapped based on the combination of chromosome # and gene symbol of record. This maps ~95% of observed mutations. Next, yet-unmapped gene symbols are mapped based on the combination of chromosome # and alternate gene symbols. Following the second mapping, ~98% of observations are mapped. The remaining ~2% were either ambiguous mappings or un-mappable; this 2% is currently discarded before writing the data out.
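A minimal sketch of the two-stage mapping described above, using toy data. The column names (`chr`, `gene`, `entrez_gene_id`) and the two lookup tables are assumptions for illustration, not the notebook's actual variables.

```python
import pandas as pd

# Toy inputs; table and column names are hypothetical.
mutation_df = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "chr": ["17", "7", "12", "1"],
    "gene": ["TP53", "HER1", "KRAS", "UNKNOWN1"],
})
primary_map = pd.DataFrame({
    "chr": ["17", "7", "12"],
    "gene": ["TP53", "EGFR", "KRAS"],
    "entrez_gene_id": [7157, 1956, 3845],
})
synonym_map = pd.DataFrame({
    "chr": ["7"],
    "gene": ["HER1"],
    "entrez_gene_id": [1956],
})

# Stage 1: match on (chromosome, primary symbol).
stage1 = mutation_df.merge(primary_map, on=["chr", "gene"], how="left")
mapped1 = stage1[stage1.entrez_gene_id.notnull()]
unmapped = stage1[stage1.entrez_gene_id.isnull()].drop(columns="entrez_gene_id")

# Stage 2: match what is left on (chromosome, alternate symbol).
# Rows that still fail to match are dropped by the inner merge.
mapped2 = unmapped.merge(synonym_map, on=["chr", "gene"], how="inner")

mapped_df = pd.concat([mapped1, mapped2], ignore_index=True)
mapped_df["entrez_gene_id"] = mapped_df["entrez_gene_id"].astype(int)
print(mapped_df)
```

In this toy case TP53 and KRAS map in stage 1, HER1 maps in stage 2 through the synonym table, and UNKNOWN1 is discarded.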


dhimmel · 2016-07-27T22:33:04Z

Can you export the notebooks to scripts using:

jupyter nbconvert --to=script --FilesWriter.build_directory=scripts *.ipynb

This will make it easier to review the changes.

dhimmel · 2016-07-27T22:35:17Z

Note that the email configured in your git doesn't match your GitHub account email, so your commits aren't attributed to your profile. See more info here.


clairemcleod · 2016-07-27T23:48:11Z

@dhimmel Here is the exported script. Good catch with the email - and thanks for pointing it out. I think it is rectified now?

dhimmel · 2016-07-28T14:33:44Z

Yep, your new commits are associated with your GitHub account.

dhimmel · 2016-07-28T17:29:28Z

data/subset/expression-matrix-all-samples.tsv

@@ -4256,7 +4256,6 @@ TCGA-E8-A3X7-01	0	8.05	8.19	9.49	5.06	10.9	8.27	10.8	7.01	9.11	3.23	6.07	0	5.62
 TCGA-E8-A413-01	0	8.15	8.44	9.42	6	10.6	9.58	10.3	7.56	8.67	2.42	5.49	0	6.32	13.6
 TCGA-E8-A414-01	0	7.97	8.4	9.7	5.71	10.5	9.63	10.8	7.28	8.57	2.99	5.66	0	6.97	12.8
 TCGA-E8-A415-01	0	7.94	7.22	9.44	5.22	10.4	9.28	10.6	7.2	8.99	2	5.65	0	6.91	13.2
-TCGA-E8-A416-01	0	7.81	6.63	9.39	4.26	10.5	9.77	8.77	7.11	9.2	2.25	6.27	0	7	13.2


Interesting -- one sample was removed, meaning it must have had only a few mutations, none of which mapped.
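One way to check which samples disappear after the mapping filter is to compare the sample sets before and after. This is a sketch on made-up data; the sample IDs, column names, and the `mappable_genes` set are hypothetical.

```python
import pandas as pd

# Hypothetical mutation table; sample IDs and genes are made up.
mutation_df = pd.DataFrame({
    "sample_id": ["sample-1", "sample-1", "sample-2"],
    "gene": ["TP53", "KRAS", "UNMAPPED1"],
})
mappable_genes = {"TP53", "KRAS"}

# Keep only mutations whose gene could be mapped.
mapped_df = mutation_df[mutation_df.gene.isin(mappable_genes)]

# Samples present before the filter but absent afterwards.
lost_samples = set(mutation_df.sample_id) - set(mapped_df.sample_id)
print(sorted(lost_samples))
```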

dhimmel · 2016-07-28T17:58:05Z

General comments

Great work with this pull request!

I think you should separate the Entrez gene processing into its own notebook, for example 2-entrez-gene-extract.ipynb. This notebook should export one file for now (we will probably have it export more in the future) named entrez-gene-symbol-map.tsv or similar. It should have three columns: entrez_gene_id, symbol, chromosome. There should only be rows for unambiguous mappings. For example, run drop_duplicates with keep=False.

In 3.TCGA-process.ipynb, we could then use the merge command with how='inner' (as you're doing, but with no need to combine symbol and chromosome into a single column).
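A sketch of how the suggested map and merge could look. The gene IDs and symbols are made up, and passing `subset=["symbol", "chromosome"]` to `drop_duplicates` is an assumption about how ambiguity would be defined here.

```python
import pandas as pd

# Hypothetical gene table: BBB on chromosome 2 points at two different genes.
gene_df = pd.DataFrame({
    "entrez_gene_id": [1, 2, 3],
    "symbol": ["AAA", "BBB", "BBB"],
    "chromosome": ["1", "2", "2"],
})

# keep=False removes every row of an ambiguous (symbol, chromosome) pair.
symbol_map = gene_df.drop_duplicates(subset=["symbol", "chromosome"], keep=False)

# Hypothetical mutation table with separate gene and chromosome columns.
mutation_df = pd.DataFrame({
    "sample_id": ["S1", "S2"],
    "chr": ["1", "2"],
    "gene": ["AAA", "BBB"],
})

# Merge on both columns directly; no concatenated key column is needed.
merged_df = mutation_df.merge(
    symbol_map,
    left_on=["gene", "chr"],
    right_on=["symbol", "chromosome"],
    how="inner",
)
print(merged_df[["sample_id", "entrez_gene_id"]])
```

Only the AAA mutation survives, because BBB on chromosome 2 is ambiguous and was removed from the map.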

I also think we may want to consider the following approach:

1. Construct entrez_gene_id, symbol, chromosome dataframe from only primary symbols.
2. Construct entrez_gene_id, symbol, chromosome dataframe from only synonyms and run drop_duplicates with keep=False.
3. Concatenate the dataframes from 1 and 2 and drop_duplicates with keep='first'.

This approach gives primacy to official symbols (i.e. we don't blacklist official symbols because there's a colliding synonym on the same chromosome), but we still obliterate colliding synonyms. Does that make sense?
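The three steps above can be sketched as follows on toy data. All gene IDs and symbols are invented, and deduplicating on the `(symbol, chromosome)` pair is an assumption about the intended key.

```python
import pandas as pd

key = ["symbol", "chromosome"]

# Step 1: primary (official) symbols only. Values are made up.
primary_df = pd.DataFrame({
    "entrez_gene_id": [10, 20],
    "symbol": ["GENEA", "GENEB"],
    "chromosome": ["1", "2"],
})

# Step 2: synonyms only, dropping synonyms that collide with each other.
synonym_df = pd.DataFrame({
    "entrez_gene_id": [30, 40, 50, 20],
    "symbol": ["GENEA", "ALT1", "ALT1", "ALT2"],
    "chromosome": ["1", "3", "3", "2"],
})
synonym_df = synonym_df.drop_duplicates(subset=key, keep=False)

# Step 3: primary rows come first, so keep='first' lets an official
# symbol win over a synonym with the same (symbol, chromosome).
symbol_map = pd.concat([primary_df, synonym_df], ignore_index=True)
symbol_map = symbol_map.drop_duplicates(subset=key, keep="first")
print(symbol_map)
```

Here GENEA keeps its official gene (10) despite the colliding synonym for gene 30, both ALT1 rows are removed as mutually ambiguous, and ALT2 is kept as a synonym for gene 20.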

dhimmel · 2016-07-28T18:02:51Z

Make sure to subset for tax_id = 9606 (Homo sapiens) from the get go. It's a real gotcha with the Homo_sapiens.gene_info.gz file.
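A small sketch of the filter, using an in-memory table rather than the real file. The column names and the non-human row are assumptions chosen for illustration; the actual header of Homo_sapiens.gene_info.gz should be checked before relying on them.

```python
import io

import pandas as pd

# Toy stand-in for the gene info table; column names are assumed.
toy_tsv = (
    "tax_id\tGeneID\tSymbol\tchromosome\n"
    "9606\t7157\tTP53\t17\n"
    "9606\t1956\tEGFR\t7\n"
    "63221\t999999\tEXAMPLE\tMT\n"  # illustrative non-9606 row
)
gene_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t", dtype={"chromosome": str})

# Subset to Homo sapiens before doing anything else.
human_df = gene_df[gene_df.tax_id == 9606]
print(human_df)
```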

clairemcleod · 2016-07-29T18:29:26Z

@dhimmel These are all great points - thanks for the feedback. Would it be best to cancel/close this pull request and resubmit once the changes are made, or to keep the pull request open while I make the changes?

dhimmel · 2016-07-29T18:36:03Z

Would it be best to cancel/close this pull request and resubmit once the changes are made, or to keep the pull request open while I make the changes?

I suggest keeping the pull request open. Any commits you make to your master branch will get added to this pull request.

clairemcleod · 2016-08-09T21:51:55Z

@dhimmel Sorry for the delay - I think I've addressed all of these points but let me know if I missed any or new ones have popped up.

edit: also tagging @Inquisitive-Geek

dhimmel · 2016-08-09T22:20:18Z

@clairemcleod awesome.

@gwaygenomics would you like to spend ~15 minutes with the cancer-data group tonight reviewing this pull request?

gwaybio · 2016-08-09T22:57:07Z

mapping/PANCAN-mutation/map-mutations.ipynb

+   "metadata": {},
+   "source": [
+    "### Convert mutation gene symbol labels to Entrez IDs  \n",
+    "Goal: Relabel the mutation data frame with Entrez IDs instead of gene names, by mapping a combination of chromosome and gene symbol to Entrez ID. The NCBI file downloaded and read in the next cell contains the Entrez ID - gene symbol pairs we will use to do so."


Can you clarify what you mean in the last sentence?

dhimmel · 2016-08-10T13:59:31Z

mapping/PANCAN-mutation/scripts/map-mutations.py

+
+# In[9]:
+
+failed_mappings = (set(mutation_df.chr + '|' + mutation_df.gene) - 


You can avoid the string concatenation by using tuples:

set(zip(mutation_df.chr, mutation_df.gene))
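A sketch of the tuple-based set difference on toy data. The right-hand operand of the original expression is truncated in the diff, so the `symbol_map` table used here is a hypothetical stand-in.

```python
import pandas as pd

# Toy tables; symbol_map is a hypothetical stand-in for the lookup table.
mutation_df = pd.DataFrame({
    "chr": ["17", "7", "1"],
    "gene": ["TP53", "HER1", "UNKNOWN1"],
})
symbol_map = pd.DataFrame({
    "chromosome": ["17"],
    "symbol": ["TP53"],
})

# Sets of (chromosome, symbol) tuples instead of 'chr|gene' strings.
observed = set(zip(mutation_df.chr, mutation_df.gene))
mappable = set(zip(symbol_map.chromosome, symbol_map.symbol))

failed_mappings = observed - mappable
print(sorted(failed_mappings))
```

Tuples also avoid any ambiguity if a symbol ever contained the separator character.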

dhimmel · 2016-08-10T14:02:40Z

Looks like there were only a few small comments and then this will be ready to merge.

I may be AFK, so @gwaygenomics you can do the merge when ready. I recommend a squash commit here.

gwaybio · 2016-08-10T18:33:36Z

@clairemcleod @dhimmel - Looks great to me!

Claire McLeod and others added 2 commits July 27, 2016 11:19

In mutation data, map gene symbol to Entrez ID based on combination o…

4dc8cac

…f chromosome and gene symbol.

In mutation data, map gene symbol to Entrez ID based on combination o…

939ec6e

…f chromosome and gene symbol.

clairemcleod force-pushed the master branch from 4dc8cac to 939ec6e Compare July 27, 2016 23:37

clairemcleod added 2 commits July 27, 2016 19:40

Create python script that includes gene symbol -> entrez id conversio…

0c80c2a

…n for mutation data.

Merge branch 'master' of https://github.com/clairemcleod/cancer-data

050929d

dhimmel reviewed Jul 28, 2016

clairemcleod added 3 commits August 9, 2016 17:43

Move mutation gene -> entrez mapping to separate notebook.

d8d0d5a

Include mutation mapping .tsv file

e9e16c5

Include small data matrices.

ac1da8c

gwaybio reviewed Aug 9, 2016

dhimmel reviewed Aug 10, 2016

Added comments and removed some string concatenation.

0d8c752

gwaybio merged commit e6a7fcf into cognoma:master Aug 10, 2016

dhimmel mentioned this pull request Aug 26, 2016

Converting Xena datasets to standard identifiers rather than gene symbols #6

Closed

dhimmel mentioned this pull request Sep 7, 2016

Export a gene information table #23

Closed

This was referenced Oct 7, 2016

Download and process Entrez Gene cognoma/genes#1

Merged

Genes with multiple chromosomes cognoma/genes#2

Closed

jjc2718 mentioned this pull request Mar 6, 2022

GO functional enrichment analysis of cancer gene sets greenelab/mpmp#82

Merged
