Machine Learning Pathway Classifier Example #39

Merged
merged 12 commits into from
Sep 1, 2016

gwaybio
Member

@gwaybio gwaybio commented Aug 26, 2016

This pull request adds an example of predicting a biological pathway signature using PanCancer data. In the example, I predict Hippo signaling, and prediction performance varies across tissues. The pull request makes two contributions:

  1. Demonstrates that we can predict whole pathways using the Cognoma approach.
  2. Drives home the need for tissue-specific visualizations and the importance of selecting appropriate tissues for training the classifier.

gwaybio added 6 commits July 25, 2016 17:00
This commit creates an example pipeline to build a classifier to detect a Hippo signaling gene expression signature. It queries hetnet pathways and generates a Y matrix marking any sample with a mutation in any gene in the Hippo pathway.
@gwaybio gwaybio assigned yl565 and unassigned yl565 Aug 26, 2016
@gwaybio
Member Author

gwaybio commented Aug 26, 2016

tagging @stephenshank here too as the pull request is closely related to cognoma/cancer-data#21

@gwaybio
Member Author

gwaybio commented Aug 26, 2016

also tagging @allaway - as a cancer biologist, he may have some insight into why predicting Hippo signaling works well in some tissues, but not in others and may also have insight into interpreting some of the top genes.


# In[5]:

def pathway_query(path_id, node_type='BiologicalProcess', particip='GpBP'):
Member

You can use Neo4j/Cypher parameters and avoid doing any string formatting in Python.

Member Author

what do you mean?

Member

https://neo4j.com/docs/developer-manual/3.0/cypher/#cypher-parameters

It may be a bit difficult to put the rel_type into a parameter. I will help you if you would like.
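A minimal sketch of the suggestion above, not the notebook's actual query. It assumes the `$name` parameter syntax of newer Neo4j releases (the 3.0 manual linked above writes parameters as `{name}`), and the `Gene` label, the `identifier` property, and the `build_pathway_query` helper are hypothetical names chosen for illustration. Cypher has traditionally not accepted parameters in place of labels or relationship types, which is why the relationship type is validated and interpolated as text while the pathway ID travels as a parameter.

```python
import re

# Labels and relationship types are interpolated as text, so restrict them
# to plain identifiers before they reach the query string.
_IDENTIFIER = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')


def build_pathway_query(path_id, node_type='BiologicalProcess', particip='GpBP'):
    """Return (query, parameters) for the genes linked to one pathway node.

    Hypothetical helper: the graph schema used here is an assumption.
    """
    for name in (node_type, particip):
        if not _IDENTIFIER.match(name):
            raise ValueError('unsafe label or relationship type: {!r}'.format(name))
    query = (
        'MATCH (pathway:{node_type})-[:{rel_type}]-(gene:Gene) '
        'WHERE pathway.identifier = $path_id '
        'RETURN gene.identifier AS gene_id'
    ).format(node_type=node_type, rel_type=particip)
    # The pathway ID is never formatted into the query text.
    return query, {'path_id': path_id}


# 'GO:0000000' is a placeholder ID, not a real pathway.
query, parameters = build_pathway_query('GO:0000000')
print(query)
print(parameters)
```

The returned pair would then be handed to whichever Neo4j client the notebook uses, which substitutes the parameter on the server side.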


# Make sure the splits have equal tissue partitions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0,
                                                    stratify=clinical['disease'])
Contributor

I think y should also be stratified, considering the sample size could be small for some disease types.

Contributor

It seems some disease types also have a very imbalanced mutated/not-mutated ratio. Together with a small sample size, you may get very few mutated testing samples with the random split.

Member Author

> It seems some disease types also have a very imbalanced mutated/not-mutated ratio. Together with a small sample size, you may get very few mutated testing samples with the random split.

Nice point, I will change this in the next commit.
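One way to act on this review thread, sketched under assumptions rather than taken from the updated notebook: build a combined label from disease type and mutation status and pass it to `stratify`. The data below are synthetic, the disease codes are only illustrative, and the `strat` variable is a name introduced here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 4 disease types with 100 samples each,
# and 20% of the samples in every disease type mutated.
diseases = ['BRCA', 'LUAD', 'GBM', 'OV']
disease = np.repeat(diseases, 100)
y = np.tile([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], 40)
X = np.random.RandomState(0).normal(size=(400, 5))

# One label per sample that encodes both disease type and mutation status.
strat = np.array(['{}_{}'.format(d, status) for d, status in zip(disease, y)])

# Stratifying on the combined label keeps the tissue proportions and the
# mutated/not-mutated ratio within each tissue the same in both partitions.
X_train, X_test, y_train, y_test, strat_train, strat_test = train_test_split(
    X, y, strat, test_size=0.1, random_state=0, stratify=strat)

for d in diseases:
    n_mutated = int(np.sum(strat_test == d + '_1'))
    print(d, 'mutated test samples:', n_mutated)
```

Note that `train_test_split` raises an error when any combined class has fewer than two members, so very rare disease/mutation combinations would need to be merged or dropped first.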

@yl565
Contributor

yl565 commented Aug 27, 2016

Very interesting problem. Some thoughts here:
If I understand right, there are multiple genes involved here that may be mutated, but the classification model does not differentiate between them. Would it be better if we build one model for each gene and combine the models in the decision fusion stage? We may also do a multivariate analysis to group genes with similar expression-mutation interaction (i.e. similar joint distribution p(X, y)) and build a model for each group.
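To make the two options concrete, here is a rough sketch on synthetic data. The pooled model mirrors the pull request's single outcome (any mutation in the pathway), while the per-gene models are combined with a noisy-OR rule. The noisy-OR rule is one possible fusion choice picked for illustration, not something proposed in this thread, and the gene names are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
genes = ['GENE_A', 'GENE_B', 'GENE_C']  # placeholder pathway genes

X = rng.normal(size=(300, 10))                   # synthetic expression matrix
Y = rng.binomial(1, 0.15, size=(300, len(genes)))  # Y[i, j] = 1 if gene j is mutated in sample i

# Option 1: pooled outcome, positive when any pathway gene is mutated.
y_pooled = Y.any(axis=1).astype(int)
pooled_prob = LogisticRegression().fit(X, y_pooled).predict_proba(X)[:, 1]

# Option 2: one model per gene, then fuse the per-gene probabilities.
per_gene_prob = np.column_stack([
    LogisticRegression().fit(X, Y[:, j]).predict_proba(X)[:, 1]
    for j in range(len(genes))
])
# Noisy-OR fusion: probability that at least one gene is mutated,
# treating the per-gene predictions as independent (an assumption).
fused_prob = 1 - np.prod(1 - per_gene_prob, axis=1)

# Both scores are computed on the training data only to keep the sketch short.
print(pooled_prob[:3], fused_prob[:3])
```

A real comparison would score both approaches on held-out samples, ideally per tissue.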

@yl565
Contributor

yl565 commented Aug 27, 2016

It seems to me the disease type could also be an important factor to consider. That is, p(X, y | disease) could differ by disease. Including the (predicted) disease in the model may further improve classification performance.

@dhimmel
Member

dhimmel commented Aug 29, 2016

> Would it be better if we build one model for each gene and combine the models in the decision fusion stage?

Interesting, I've never thought about a decision fusion stage. I think one assumption behind building the pathway model is that mutations anywhere in the pathway will produce a cohesive effect. In some ways, it may help you identify this cohesive effect to have all mutations grouped into a single outcome class.

> It seems to me the disease type could also be an important factor to consider.

Yeah, we probably should include disease type as a covariate. Will we have to do something like get_dummies or OneHotEncoder? Note that this topic would probably be best discussed in its own issue.
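For reference, a small sketch of the `get_dummies` route on toy data. The `expression` frame, its column names, and the sample IDs are made up for illustration; only the `clinical['disease']` column name comes from the notebook code quoted earlier in this thread.

```python
import pandas as pd

# Toy expression matrix (samples x genes) and matching clinical table.
expression = pd.DataFrame(
    {'gene_1': [0.1, 1.2, -0.3, 0.8], 'gene_2': [2.0, 0.5, 1.1, -0.7]},
    index=['s1', 's2', 's3', 's4'])
clinical = pd.DataFrame(
    {'disease': ['BRCA', 'LUAD', 'BRCA', 'GBM']},
    index=expression.index)

# One indicator column per disease type, aligned on the sample index.
disease_dummies = pd.get_dummies(clinical['disease'], prefix='disease', dtype=int)
X_with_disease = expression.join(disease_dummies)

print(X_with_disease)
```

`OneHotEncoder` would be the alternative when the encoding has to live inside an sklearn pipeline and be reapplied to new samples.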

@yl565
Contributor

yl565 commented Aug 30, 2016

@dhimmel It seems to me this is a multi-label classification problem. We have gene expression X, and we want to predict from X the disease type and the mutation probability for each gene in the pathway (i.e. a vector of y for each single sample in X). We can then fuse the predictions in a decision-making stage for whatever results the cancer experts may be interested in. I haven't worked with a multi-label classification problem before, but this seems to be an interesting thing to try. I'm not sure if the approach could be of interest from a biology point of view though. @gwaygenomics What do you think?

@cgreene
Member

cgreene commented Aug 30, 2016

From my perspective, it's not really clear if this is a multi-label classification problem. If we think that the mutations are all doing something called 'phenocopying' each other (essentially: having more or less the same effect), then really this isn't multi-label and modeling it as such could compromise performance. Greg talked about some examples of this with NF1/Ras in the first cognoma gathering. On the other hand, if they do have distinct effects, then multi-label could be the way to go.

In summary - I don't think there's a clear answer. Particularly if you are using a multi-label approach, then one that allows much of the structure of the solution to be transferred between solutions is probably critical if you're focusing on modeling the effect in one pathway. Because the tissue-background is so different and the number of observations is low, this could compromise predictions a bit - but this is definitely a research question 👍

@dhimmel
Member

dhimmel commented Aug 30, 2016

Just so we're all clear, here is the definition of multilabel classification in sklearn:

> Multilabel classification assigns to each sample a set of target labels. This can be thought as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time or none of these.

I agree with @cgreene that multilabel may not be what we want. The motivation for using gene sets to construct the outcome is to increase the number of positives. The multilabel approach will suffer from the low prevalence of mutation for most genes.

However, @yl565 definitely feel free to give multilabel a try if you would like to play around.
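For anyone who wants to try it, a minimal multilabel sketch on synthetic data, using `OneVsRestClassifier` as one of several sklearn options. It also prints the prevalence comparison behind the point above: the pooled "any gene mutated" outcome always has at least as many positives as the most frequently mutated single gene. The mutation rates are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 10))            # synthetic expression matrix
Y = rng.binomial(1, 0.15, size=(300, 3))  # label indicator matrix: samples x genes

# Multilabel: one binary classifier per gene, trained on the same features.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
Y_pred = clf.predict(X)  # one 0/1 prediction per sample and gene

# Prevalence of positives per gene versus the pooled pathway outcome.
per_gene_prevalence = Y.mean(axis=0)
pooled_prevalence = Y.any(axis=1).mean()
print('per gene:', per_gene_prevalence)
print('pooled:  ', pooled_prevalence)
```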

I address all pull request comments in the update
nbconvert updated pathway notebook
@gwaybio
Member Author

gwaybio commented Aug 31, 2016

@yl565 @dhimmel @cgreene I'm jumping into this discussion a bit late - but I like how we're approaching these research questions.

My naive guess would be that a classifier built on individual tumors with some sort of fusion model would perform slightly better compared to a pan cancer classifier (but also worse than an ensemble pan cancer classifier!). Of course, this depends on how the fusion is done. Could the fusion model be some sort of weighted vote of individual tissue classifiers applied to all other tissues?

However, like @dhimmel mentioned, these types of conversations are best placed in new issues with some sort of label or milestone called something like: Standing Research Question.

[Image: tissue_decision]


# This filters out:
Member

This list seems redundant with the output, and because it is hardcoded it will be hard to keep up to date. Would it make sense to just print out this list of filtered tissues?

Member Author

Yeah, I think that would be better as well - I'll update now
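A sketch of what printing the list could look like. The filtering rule (a minimum number of mutated samples per tissue), the `MIN_POSITIVES` threshold, and the `mutated` column are assumptions made for this example, since the notebook's actual criterion is not shown in this thread.

```python
import pandas as pd

MIN_POSITIVES = 2  # hypothetical threshold, not taken from the notebook

# Toy clinical table: one row per sample.
clinical = pd.DataFrame({
    'disease': ['BRCA'] * 4 + ['LUAD'] * 4 + ['GBM'] * 4,
    'mutated': [1, 1, 0, 0,   1, 0, 0, 0,   0, 0, 0, 0],
})

# Count mutated samples per tissue and derive the filtered list from the data
# instead of maintaining it by hand in a comment.
positives = clinical.groupby('disease')['mutated'].sum()
filtered_tissues = sorted(positives[positives < MIN_POSITIVES].index)
kept = clinical[~clinical['disease'].isin(filtered_tissues)]

print('Filtered tissues:', ', '.join(filtered_tissues))
```

Because the list is computed, it stays correct when the data or the threshold changes.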

@dhimmel
Member

dhimmel commented Sep 1, 2016

Besides one last comment on hardcoded tissue list, LGTM. Squash merge at will.

@gwaybio gwaybio merged commit 5bc4315 into cognoma:master Sep 1, 2016
@gwaybio gwaybio deleted the pathway branch October 19, 2016 13:51