Map mutation gene symbols to Entrez IDs #12
…f chromosome and gene symbol.
Can you export the notebooks to scripts using `jupyter nbconvert --to=script --FilesWriter.build_directory=scripts *.ipynb`? This will make it easier to review the changes.
Note that the email configured in your git doesn't match your GitHub account email, so your commits aren't attributed to your profile. See more info here.
Yep, your new commits are associated with your GitHub account.
@@ -4256,7 +4256,6 @@ TCGA-E8-A3X7-01 0 8.05 8.19 9.49 5.06 10.9 8.27 10.8 7.01 9.11 3.23 6.07 0 5.62
TCGA-E8-A413-01 0 8.15 8.44 9.42 6 10.6 9.58 10.3 7.56 8.67 2.42 5.49 0 6.32 13.6
TCGA-E8-A414-01 0 7.97 8.4 9.7 5.71 10.5 9.63 10.8 7.28 8.57 2.99 5.66 0 6.97 12.8
TCGA-E8-A415-01 0 7.94 7.22 9.44 5.22 10.4 9.28 10.6 7.2 8.99 2 5.65 0 6.91 13.2
TCGA-E8-A416-01 0 7.81 6.63 9.39 4.26 10.5 9.77 8.77 7.11 9.2 2.25 6.27 0 7 13.2
Interesting -- one sample was removed, meaning it must have had only a few mutations, none of which mapped.
General comments

Great work with this pull request! I think you should separate the Entrez gene processing into its own notebook. For example, in … I also think we may want to consider the following approach:
This approach gives primacy to official symbols (i.e. we don't blacklist official symbols because there's a colliding synonym on the same chromosome), but we still obliterate colliding synonyms. Does that make sense?
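One possible reading of this precedence rule, sketched in plain Python rather than the notebook's pandas. The function name `build_lookup`, the record layout, and the toy symbols and IDs are all hypothetical and only illustrate the idea: official symbols always win, and a synonym is kept only if it points to exactly one gene on that chromosome.

```python
from collections import defaultdict


def build_lookup(genes):
    """Map (chromosome, symbol) -> Entrez ID, giving official symbols primacy.

    `genes` is an iterable of (entrez_id, chromosome, symbol, synonyms)
    tuples. This layout is an assumption made for illustration only.
    """
    official = {}
    synonym_hits = defaultdict(set)
    for entrez_id, chrom, symbol, synonyms in genes:
        official[(chrom, symbol)] = entrez_id
        for synonym in synonyms:
            synonym_hits[(chrom, synonym)].add(entrez_id)

    lookup = dict(official)
    for key, entrez_ids in synonym_hits.items():
        if key in official:
            # An official symbol is never overridden by a colliding synonym.
            continue
        if len(entrez_ids) == 1:
            lookup[key] = next(iter(entrez_ids))
        # Synonyms shared by several genes on a chromosome are dropped.
    return lookup


# Made-up records: (entrez_id, chromosome, official symbol, synonyms).
genes = [
    (1, "1", "AAA", ["X1", "SHARED"]),
    (2, "1", "BBB", ["SHARED"]),
    (3, "1", "X1", []),
    (4, "2", "CCC", ["SHARED"]),
]
lookup = build_lookup(genes)
```

Here `("1", "X1")` resolves to gene 3 because the official symbol takes precedence over gene 1's synonym, `("1", "SHARED")` is absent because two genes on chromosome 1 claim it, and `("2", "SHARED")` still maps to gene 4.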
Make sure to subset for
@dhimmel These are all great points - thanks for the feedback. Would it be best to cancel/close this pull request and resubmit once the changes are made, or to keep the pull request open while I make the changes?
I suggest keeping the pull request open. Any commits you make to your master branch will get added to this pull request.
@dhimmel Sorry for the delay - I think I've addressed all of these points but let me know if I missed any or new ones have popped up. edit: also tagging @Inquisitive-Geek
@clairemcleod awesome. @gwaygenomics would you like to spend ~15 minutes with the
"metadata": {},
"source": [
"### Convert mutation gene symbol labels to Entrez IDs \n",
"Goal: Relabel the mutation data frame with Entrez IDs instead of gene names, by mapping a combination of chromosome and gene symbol to Entrez ID. The NCBI file downloaded and read in the next cell contains the Entrez ID - gene symbol pairs we will use to do so."
Can you clarify what you mean in the last sentence?
# In[9]:

failed_mappings = (set(mutation_df.chr + '|' + mutation_df.gene) -
You can avoid the string concatenation by using tuples:
set(zip(mutation_df.chr, mutation_df.gene))
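A minimal sketch of the tuple-based set difference, using plain lists as hypothetical stand-ins for the `mutation_df.chr` and `mutation_df.gene` columns (`zip` accepts pandas Series in the same way). The `mapped` set and all values are made up for illustration.

```python
# Hypothetical stand-ins for the mutation_df.chr and mutation_df.gene columns.
chromosomes = ["1", "1", "17"]
genes = ["AAA", "BBB", "TP53"]

# Made-up set of (chromosome, symbol) keys that mapped successfully.
mapped = {("1", "AAA"), ("17", "TP53")}

# Pair the columns row by row and keep only the pairs that did not map.
observed = set(zip(chromosomes, genes))
failed_mappings = observed - mapped

# Tuples also stay distinct where joined strings would collide on the separator.
joined_collide = ("a|b" + "|" + "c") == ("a" + "|" + "b|c")
tuples_collide = ("a|b", "c") == ("a", "b|c")
```

Besides skipping the string building, tuple keys cannot be confused by a value that happens to contain the separator character, which is what the last two lines illustrate.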
Looks like there were only a few small comments and then this will be ready to merge. I may be AFK, so @gwaygenomics you can do the merge when ready. I recommend a squash commit here.
@clairemcleod @dhimmel - Looks great to me! |
This pull request addresses issues #4 and #6. It adds to `2.TCGA-process.ipynb` and includes a mapping of mutation gene symbols to Entrez IDs as part of the processing workflow.

The mapping is conducted in two stages. First, gene symbols are mapped based on the combination of chromosome # and gene symbol of record. This maps ~95% of observed mutations. Next, yet-unmapped gene symbols are mapped based on the combination of chromosome # and alternate gene symbols. Following the second mapping, ~98% of observations are mapped. The remaining ~2% are either ambiguous or unmappable and are currently discarded before the data is written out.
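The two stages can be sketched roughly as follows. This is a plain-Python illustration, not the notebook's pandas code: `map_mutations`, the two lookup dictionaries, and the toy records are hypothetical, and the resulting fraction reflects only the toy data, not the percentages reported above.

```python
def map_mutations(mutations, symbol_map, synonym_map):
    """Split (chromosome, symbol) records into mapped and unmapped lists.

    Stage 1 looks up the symbol of record; stage 2 falls back to alternate
    symbols for anything still unmapped. Both maps are assumed to be keyed
    by (chromosome, symbol) and to hold Entrez IDs as values.
    """
    mapped, unmapped = [], []
    for key in mutations:
        entrez_id = symbol_map.get(key)  # stage 1: symbol of record
        if entrez_id is None:
            entrez_id = synonym_map.get(key)  # stage 2: alternate symbols
        if entrez_id is None:
            unmapped.append(key)  # discarded before writing out
        else:
            mapped.append((key, entrez_id))
    return mapped, unmapped


# Made-up lookups and mutation records for illustration.
symbol_map = {("1", "AAA"): 1, ("2", "CCC"): 4}
synonym_map = {("2", "OLDNAME"): 4}
mutations = [("1", "AAA"), ("2", "OLDNAME"), ("2", "CCC"), ("3", "ZZZ")]

mapped, unmapped = map_mutations(mutations, symbol_map, synonym_map)
fraction_mapped = len(mapped) / len(mutations)
```

Tracking the unmapped keys separately makes it easy to report how many observations each stage recovers and to inspect what is being discarded.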