Machine Learning Pathway Classifier Example #39

gwaybio · 2016-08-26T18:57:01Z

This pull request adds an example of predicting a biological pathway signature using PanCancer data. In the example, I predict Hippo signaling, and prediction performance varies across tissues. The pull request makes two contributions:

1. Demonstrates that we can predict whole pathways using the Cognoma approach.
2. Drives home the need for tissue-specific visualizations and the importance of selecting appropriate tissues for training the classifier.

This commit creates an example pipeline to build a classifier to detect a Hippo signaling gene expression signature. It queries hetnet pathways and generates a Y matrix that marks any sample with a mutation in any gene in the Hippo pathway.

gwaybio · 2016-08-26T19:00:04Z

tagging @stephenshank here too as the pull request is closely related to cognoma/cancer-data#21

gwaybio · 2016-08-26T19:18:20Z

also tagging @allaway - as a cancer biologist, he may have some insight into why predicting Hippo signaling works well in some tissues, but not in others and may also have insight into interpreting some of the top genes.

dhimmel · 2016-08-26T19:27:30Z

scripts/2.TCGA-MLexample_Pathway.py

+
+# In[5]:
+
+def pathway_query(path_id, node_type='BiologicalProcess', particip='GpBP'):


dhimmel: You can use Neo4j/Cypher parameters and avoid having to do any string formatting in Python.

gwaybio: What do you mean?

dhimmel: https://neo4j.com/docs/developer-manual/3.0/cypher/#cypher-parameters

It may be a bit difficult to put the rel_type into a parameter. I will help you if you would like.
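
A minimal sketch of the parameter approach suggested above. The query shape is an assumption for illustration: the `identifier` property, the `Gene` label, the `PARTICIPATES_` prefix, and the helper name `build_pathway_query` are hypothetical and not taken from the script. The pathway ID is passed as a Cypher parameter. Node labels and relationship types cannot be supplied as Cypher parameters, so they are validated against a strict pattern before being formatted into the query text.

```python
import re

# Labels and relationship types cannot be Cypher parameters, so only plain
# identifier-like names are allowed to be formatted into the query string.
_SAFE_NAME = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')


def build_pathway_query(path_id, node_type='BiologicalProcess', particip='GpBP'):
    """Return (query, parameters) for fetching the genes in one pathway.

    The graph schema used here is hypothetical.
    """
    for name in (node_type, particip):
        if not _SAFE_NAME.match(name):
            raise ValueError('unsafe label or relationship name: %r' % name)

    query = (
        'MATCH (pathway:{node_type})-[:PARTICIPATES_{particip}]-(gene:Gene) '
        'WHERE pathway.identifier = $path_id '
        'RETURN gene.identifier AS gene_id'
    ).format(node_type=node_type, particip=particip)

    # The pathway ID never touches the query text; the driver sends it separately.
    return query, {'path_id': path_id}


query, params = build_pathway_query('PATHWAY_ID_PLACEHOLDER')
```

With the official Python driver, the pair would be passed along as `session.run(query, params)`. Older Neo4j servers may expect the `{path_id}` parameter form rather than `$path_id`.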

yl565 · 2016-08-27T20:36:42Z

scripts/2.TCGA-MLexample_Pathway.py

+
+# Make sure the splits have equal tissue partitions
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0,
+                                                    stratify=clinical['disease'])


yl565: I think y should also be stratified, considering that the sample size could be small for some disease types.

yl565: Some disease types also seem to have a very imbalanced mutated/not-mutated ratio. Together with a small sample size, a random split may leave very few mutated samples in the test set.

gwaybio: Nice point, I will change this in the next commit.
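
One way to stratify on disease and mutation status at the same time is to combine them into a single label and pass that to `stratify`. The sketch below uses synthetic data, and the names `strata` and `test_counts` are illustrative. It assumes `train_test_split` is imported from `sklearn.model_selection` (older scikit-learn releases expose it under `sklearn.cross_validation`).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the real data: two diseases with different
# mutated/not-mutated ratios.
clinical = pd.DataFrame({'disease': ['A'] * 200 + ['B'] * 200})
y = pd.Series([0] * 180 + [1] * 20 + [0] * 150 + [1] * 50, name='status')
X = pd.DataFrame(np.random.RandomState(0).normal(size=(400, 5)))

# Combine disease and mutation status into one stratification label,
# e.g. 'A_0', 'A_1', 'B_0', 'B_1'.
strata = clinical['disease'].str.cat(y.astype(str), sep='_')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=strata)

# Every disease/status combination is represented in the test set.
test_counts = strata.loc[y_test.index].value_counts()
print(test_counts)
```

Note that `stratify` raises an error if any combined label has fewer than two samples, so very rare disease/status combinations would need to be merged or dropped first.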

yl565 · 2016-08-27T21:15:25Z

Very interesting problem. Some thoughts here:
If I understand correctly, multiple genes involved here may be mutated, but the classification model does not differentiate between them. Would it be better if we build one model for each gene and combine the models in the decision fusion stage? We could also do a multivariate analysis to group genes with similar expression-mutation interactions (i.e. similar joint distribution p(X, y)) and build a model for each group.
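
A rough sketch of the per-gene idea on synthetic data: fit one logistic regression per gene, then fuse the per-gene probabilities into a single pathway score. The noisy-OR fusion rule used here is only one possible choice and assumes the per-gene predictions are independent. It is not something proposed in this thread.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n_samples = 300

# Synthetic expression matrix and per-gene mutation status (one column per gene).
X = rng.normal(size=(n_samples, 10))
Y = np.column_stack([
    (X[:, j] + rng.normal(size=n_samples) > 1).astype(int)
    for j in range(3)
])

# One model per gene.
gene_models = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

# Predicted probability of mutation for each gene (column 1 is the positive class).
gene_probs = np.column_stack([m.predict_proba(X)[:, 1] for m in gene_models])

# Decision fusion with a noisy-OR: probability that at least one gene is mutated.
pathway_prob = 1 - np.prod(1 - gene_probs, axis=1)
```

The fused score is never lower than the largest per-gene probability, so a single confident per-gene call is enough to flag the pathway.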

yl565 · 2016-08-27T21:40:53Z

It seems to me the disease type could also be an important factor to consider. That is, p(X, y | disease) could differ from one disease to another. Including the (predicted) disease in the model may further improve classification performance.

dhimmel · 2016-08-29T17:44:24Z

> Would it be better if we build one model for each gene and combine the models in the decision fusion stage?

Interesting, I've never thought about a decision fusion stage. I think one assumption behind building the pathway model is that mutations anywhere in the pathway will produce a cohesive effect. In some ways, it may help you identify this cohesive effect to have all mutations grouped into a single outcome class.

> It seems to me the disease type could also be an important factor to consider.

Yeah, we probably should include disease type as a covariate. Will we have to do something like get_dummies or OneHotEncoder? Note that this topic would probably be best discussed in its own issue.
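
A small sketch of the `get_dummies` route, using made-up sample IDs, gene columns, and disease labels. The disease column is expanded into indicator columns and appended to the expression features.

```python
import pandas as pd

# Toy expression matrix and clinical table sharing the same sample index.
expression = pd.DataFrame(
    {'GENE_A': [0.1, 1.2, -0.3, 0.8],
     'GENE_B': [2.0, 0.4, 1.1, -0.7]},
    index=['s1', 's2', 's3', 's4'])
clinical = pd.DataFrame(
    {'disease': ['BRCA', 'LUAD', 'BRCA', 'GBM']},
    index=expression.index)

# One indicator column per disease type.
disease_dummies = pd.get_dummies(clinical['disease'], prefix='disease')

# Expression features plus disease covariates.
X = pd.concat([expression, disease_dummies], axis=1)
print(X.columns.tolist())
```

`get_dummies` is convenient for a single fixed table. `OneHotEncoder` may be the better fit inside a scikit-learn pipeline, where the same encoding must be reapplied to new samples.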

yl565 · 2016-08-30T12:40:31Z

@dhimmel It seems to me this is a multi-label classification problem. We have gene expression X, and we want to predict from X the disease type and the mutation probability for each gene in the pathway (i.e. a vector of y for each sample in X). We can then fuse the predictions in a decision-making stage to produce whatever results the cancer experts may be interested in. I haven't worked with a multi-label classification problem before, but this seems like an interesting thing to try. I'm not sure whether the approach would be of interest from a biological point of view, though. @gwaygenomics What do you think?

cgreene · 2016-08-30T12:50:42Z

From my perspective, it's not really clear if this is a multi-label classification problem. If we think that the mutations are all doing something called 'phenocopying' each other (essentially: having more or less the same effect), then really this isn't multi-label and modeling it as such could compromise performance. Greg talked about some examples of this with NF1/Ras in the first cognoma gathering. On the other hand, if they do have distinct effects, then multi-label could be the way to go.

In summary - I don't think there's a clear answer. Particularly if you are using a multi-label approach, then one that allows much of the structure of the solution to be transferred between solutions is probably critical if you're focusing on modeling the effect in one pathway. Because the tissue-background is so different and the number of observations is low, this could compromise predictions a bit - but this is definitely a research question 👍

dhimmel · 2016-08-30T14:14:36Z

Just so we're all clear, here is the definition of multilabel classification in sklearn:

> Multilabel classification assigns to each sample a set of target labels. This can be thought as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time or none of these.

I agree with @cgreene that multilabel may not be what we want. The motivation for using gene sets to construct the outcome is to increase the number of positives. The multilabel approach will suffer from the low prevalence of mutation for most genes.

However, @yl565 definitely feel free to give multilabel a try if you would like to play around.
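
The prevalence argument above can be illustrated with a toy mutation matrix. The gene names and values are invented for the illustration. Pooling the pathway genes into one outcome yields more positives than any single gene provides.

```python
import pandas as pd

# Toy sample-by-gene mutation matrix (1 = mutated), ten samples.
mutation = pd.DataFrame({
    'GENE_A': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'GENE_B': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    'GENE_C': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    'OTHER':  [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})
pathway_genes = ['GENE_A', 'GENE_B', 'GENE_C']

# Multilabel view: one outcome per gene, each with few positives.
per_gene_prevalence = mutation[pathway_genes].mean()

# Gene-set view: a sample is positive if any pathway gene is mutated.
y = mutation[pathway_genes].any(axis=1).astype(int)
pooled_prevalence = y.mean()

print(per_gene_prevalence)
print(pooled_prevalence)
```

In this toy case each gene is mutated in one sample out of ten, while the pooled outcome is positive in three.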

I address all pull request comments in the update

nbconvert updated pathway notebook

gwaybio · 2016-08-31T17:34:40Z

@yl565 @dhimmel @cgreene I'm jumping into this discussion a bit late - but I like how we're approaching these research questions.

My naive guess would be that a classifier built on individual tumors with some sort of fusion model would perform slightly better than a pan-cancer classifier (but also worse than an ensemble pan-cancer classifier!). Of course, this depends on how the fusion is done. Could the fusion model be some sort of weighted vote of individual tissue classifiers applied to all other tissues?

However, like @dhimmel mentioned, these types of conversations are best placed in new issues with some sort of label or milestone called something like: Standing Research Question.

dhimmel · 2016-09-01T13:47:02Z

scripts/3.TCGA-MLexample_Pathway.py

+tissue_decision
+
+
+# This filters out:


dhimmel: This list seems redundant with the output, and because it is hardcoded it will be hard to keep up to date. Would it make sense to just print out the list of filtered tissues?

gwaybio: Yeah, I think that would be better as well. I'll update now.
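
A sketch of printing the filtered tissues instead of hardcoding them. The filtering criteria in the script are not shown in this thread, so the table `tissue_summary`, its columns, and both thresholds below are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical per-tissue summary; the real script derives its own table.
tissue_summary = pd.DataFrame(
    {'n_samples': [120, 45, 300, 18],
     'n_mutated': [30, 2, 75, 1]},
    index=['TISSUE_A', 'TISSUE_B', 'TISSUE_C', 'TISSUE_D'])

# Hypothetical thresholds for keeping a tissue.
MIN_SAMPLES = 20
MIN_MUTATED = 5

keep = ((tissue_summary['n_samples'] >= MIN_SAMPLES) &
        (tissue_summary['n_mutated'] >= MIN_MUTATED))

# Derive the filtered list from the data so it always matches the output.
filtered_tissues = sorted(tissue_summary[~keep].index.tolist())
print('Tissues filtered out:', ', '.join(filtered_tissues))
```

Because the list is computed from the same table as the filter, it stays correct when the data or thresholds change.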

dhimmel · 2016-09-01T13:49:20Z

Besides one last comment on hardcoded tissue list, LGTM. Squash merge at will.

gwaybio added 6 commits July 25, 2016 17:00

adding machine learning example

61899e7

updating ml example for pull request comments

ee121d3

Merge remote-tracking branch 'upstream/master'

f020ef4

updating ml example 1

1904209

add neo4j-driver to environment

9360853

add neo4j pathway example

f93ec0e

This commit creates an example pipeline to build a classifier to detect a Hippo signaling gene expression signature. It queries hetnet pathways and generates a Y matrix that marks any sample with a mutation in any gene in the Hippo pathway.

gwaybio assigned yl565 and unassigned yl565 Aug 26, 2016

nbconvert example to .py

48dc594

dhimmel reviewed Aug 26, 2016

dhimmel mentioned this pull request Aug 26, 2016

Precomputing a sample × mutation-in-gene-set matrix cognoma/cancer-data#21

Open

yl565 reviewed Aug 27, 2016

gwaybio added 2 commits August 31, 2016 13:15

update pathway script

96000e1

I address all pull request comments in the update

update pathway example py

cca1826

nbconvert updated pathway notebook

dhimmel mentioned this pull request Aug 31, 2016

Automate cancer-data download from figshare #42

Merged

gwaybio added 2 commits August 31, 2016 16:42

Merge remote-tracking branch 'upstream/master' into pathway

1d94598

remove download logic and rename

deac6a7

dhimmel reviewed Sep 1, 2016

remove hardcode tissue filter output

c245db3

gwaybio merged commit 5bc4315 into cognoma:master Sep 1, 2016

gwaybio deleted the pathway branch October 19, 2016 13:51
