Machine Learning Pathway Classifier Example #39
This commit creates an example pipeline that builds a classifier to detect a Hippo signaling gene expression signature. It queries hetnet pathways and generates a Y matrix that flags any sample with a mutation in any gene in the Hippo pathway.
Tagging @stephenshank here too, as the pull request is closely related to cognoma/cancer-data#21.
Also tagging @allaway - as a cancer biologist, he may have some insight into why predicting Hippo signaling works well in some tissues but not in others, and into interpreting some of the top genes.
# In[5]:

def pathway_query(path_id, node_type='BiologicalProcess', particip='GpBP'):
You can use Neo4j/Cypher parameters and avoid having to do any string formatting in Python.
What do you mean?
https://neo4j.com/docs/developer-manual/3.0/cypher/#cypher-parameters
Putting the rel_type into a parameter may be a bit difficult. I can help if you would like.
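As a rough sketch of the suggestion above (not the notebook's actual code): Cypher accepts values such as the pathway identifier as parameters, but node labels and relationship types cannot be parameterized, which is why `rel_type` is awkward. The helper name `build_pathway_query`, the `identifier` and `name` properties, the `PARTICIPATES_` prefix, and the placeholder identifier below are all assumptions for illustration. The query uses the `{param}` syntax from the Neo4j 3.0 manual linked above; newer Neo4j versions write parameters as `$param`.

```python
import re

# Labels and relationship types cannot be Cypher parameters, so they are
# validated as plain identifiers before being formatted into the query text.
_IDENTIFIER = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')


def build_pathway_query(path_id, node_type='BiologicalProcess', particip='GpBP'):
    """Return (query, parameters) for a pathway lookup (hypothetical schema)."""
    for name in (node_type, particip):
        if not _IDENTIFIER.match(name):
            raise ValueError('unsafe label or relationship type: %r' % name)
    query = (
        'MATCH (p:{node_type})-[:PARTICIPATES_{particip}]-(g:Gene) '
        'WHERE p.identifier = {{path_id}} '  # doubled braces leave a literal {path_id}
        'RETURN g.identifier AS entrez_gene_id, g.name AS symbol'
    ).format(node_type=node_type, particip=particip)
    # The pathway identifier travels separately as a parameter, never as text.
    return query, {'path_id': path_id}


query, params = build_pathway_query('GO:0000000')  # placeholder identifier
print(query)
print(params)
```

The returned pair can be handed to whichever Neo4j client the notebook uses, so only the label and relationship type are ever formatted in Python.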
# Make sure the splits have equal tissue partitions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0,
                                                    stratify=clinical['disease'])
I think y should also be stratified, considering that the sample size could be small for some disease types.
Some disease types also seem to have a very imbalanced mutated/not-mutated ratio; combined with a small sample size, a random split may leave very few mutated test samples.
Nice point, I will change this in the next commit.
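One possible way to stratify on both disease and mutation status is to pass a combined label to `stratify`. This is a minimal sketch on synthetic stand-in data; the `strata` name and the toy tissue counts are assumptions, and the notebook's eventual fix may differ.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 4 tissues, each with 80 wild-type and 20 mutated samples
diseases = ['BRCA', 'GBM', 'LUAD', 'SKCM']
clinical = pd.DataFrame({'disease': np.repeat(diseases, 100)})
y = pd.Series(np.tile(np.repeat([0, 1], [80, 20]), len(diseases)), name='status')
X = pd.DataFrame(np.random.RandomState(0).normal(size=(len(y), 5)))

# Combine tissue and mutation status into one label, e.g. 'BRCA_1'
strata = clinical['disease'].str.cat(y.astype(str), sep='_')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=strata)

# Mutated/not-mutated counts per tissue in the held-out set
test_counts = pd.crosstab(clinical.loc[y_test.index, 'disease'], y_test)
print(test_counts)
```

Note that `train_test_split` raises a `ValueError` when any stratum has fewer than two members, so very rare tissue/status combinations would need to be merged or filtered first.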
Very interesting problem. Some thoughts here:
It seems to me the disease type could also be an important factor to consider. I mean, p(X, y | disease) could differ by disease. Including (predicted) disease in the model may further improve classification performance.
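A minimal sketch of one way to include disease type in the model, assuming the known disease label (rather than a predicted one) is one-hot encoded and appended to the expression matrix. The gene names, sample IDs, and the `X_with_disease` name are hypothetical.

```python
import pandas as pd

# Toy expression matrix (samples x genes) and matching clinical table
X = pd.DataFrame({'GENE_A': [0.1, 1.2, -0.3, 0.8],
                  'GENE_B': [2.0, 0.4, 1.1, -0.7]},
                 index=['s1', 's2', 's3', 's4'])
clinical = pd.DataFrame({'disease': ['BRCA', 'LUAD', 'BRCA', 'LUAD']},
                        index=X.index)

# One indicator column per disease type, e.g. 'disease_BRCA'
disease_dummies = pd.get_dummies(clinical['disease'], prefix='disease', dtype=int)

# Append the indicators to the expression features
X_with_disease = X.join(disease_dummies)
print(X_with_disease)
```

With a regularized linear model all indicator columns can stay in; an unregularized model would usually drop one to avoid collinearity with the intercept.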
Interesting, I've never thought about a decision fusion stage. I think one assumption behind building the pathway model is that mutations anywhere in the pathway will produce a cohesive effect. In some ways, it may help you identify this cohesive effect to have all mutations grouped into a single outcome class.
Yeah, we probably should include disease type as a covariate. Will we have to do something like
@dhimmel It seems to me this is a multi-label classification problem. We have gene expression X, and we want to predict from X the disease type and the mutation probability for each gene in the pathway (i.e. a vector of y for each sample in X). We can then fuse the predictions in a decision-making stage for whatever results the cancer experts may be interested in. I haven't worked with a multi-label classification problem before, but this seems to be an interesting thing to try. I'm not sure if the approach would be of interest from a biology point of view though. @gwaygenomics What do you think?
From my perspective, it's not really clear if this is a multi-label classification problem. If we think that the mutations are all doing something called 'phenocopying' each other (essentially: having more or less the same effect), then really this isn't multi-label and modeling it as such could compromise performance. Greg talked about some examples of this with NF1/Ras in the first cognoma gathering. On the other hand, if they do have distinct effects, then multi-label could be the way to go. In summary - I don't think there's a clear answer. Particularly if you are using a multi-label approach, then one that allows much of the structure of the solution to be transferred between solutions is probably critical if you're focusing on modeling the effect in one pathway. Because the tissue-background is so different and the number of observations is low, this could compromise predictions a bit - but this is definitely a research question 👍
Just so we're all clear, here is the definition of multilabel classification in sklearn:
I agree with @cgreene that multilabel may not be what we want. The motivation for using gene sets to construct the outcome is to increase the number of positives. The multilabel approach will suffer from the low prevalence of mutation for most genes. However, @yl565 definitely feel free to give multilabel a try if you would like to play around.
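To make the two framings concrete, here is a hedged sketch on random data: a multilabel target is a binary indicator matrix with one column per gene, while the pathway target collapses it to "any gene mutated". The three-gene setup, the 10% mutation rate, and all variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))  # stand-in expression features

# Multilabel target: one binary column per (hypothetical) pathway gene
Y_genes = (rng.uniform(size=(200, 3)) < 0.1).astype(int)

# Pathway target: positive if any gene in the pathway is mutated
y_pathway = Y_genes.any(axis=1).astype(int)

# One binary classifier per gene versus a single pathway-level classifier
multilabel_clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y_genes)
pathway_clf = LogisticRegression().fit(X, y_pathway)

print('positives per gene:', Y_genes.sum(axis=0))
print('pathway positives: ', y_pathway.sum())
```

The pathway target always has at least as many positives as the most frequently mutated single gene, which is the prevalence argument made above.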
I address all pull request comments in the update
nbconvert updated pathway notebook
@yl565 @dhimmel @cgreene I'm jumping into this discussion a bit late, but I like how we're approaching these research questions. My naive guess would be that a classifier built on individual tumors with some sort of fusion model would perform slightly better than a pan-cancer classifier (but also worse than an ensemble pan-cancer classifier!). Of course, this depends on how the fusion is done. Could the fusion model be some sort of weighted vote of individual tissue classifiers applied to all other tissues? However, as @dhimmel mentioned, these types of conversations are best placed in new issues with some sort of label or milestone called something like: tissue_decision
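The weighted-vote idea could look something like the sketch below, assuming each tissue classifier outputs a mutation probability per sample and the fused score is their weighted average. The `weighted_vote` helper, the example probabilities, and the choice of weights are hypothetical; how to set the weights is exactly the open question.

```python
import numpy as np


def weighted_vote(probabilities, weights):
    """Fuse per-classifier probabilities (classifiers x samples) by weighted average."""
    probabilities = np.asarray(probabilities, dtype=float)
    return np.average(probabilities, axis=0, weights=weights)


# Rows: two tissue-specific classifiers; columns: two samples to score
tissue_probs = [[0.9, 0.2],
                [0.5, 0.4]]
tissue_weights = [3, 1]  # hypothetical weights, e.g. from held-out performance

fused = weighted_vote(tissue_probs, tissue_weights)
print(fused)
```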
# This filters out:
This list seems redundant with the output and, being hardcoded, is awkward to keep up to date. Would it make sense to just print out the list of filtered tissues?
Yeah, I think that would be better as well - I'll update now
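A sketch of what printing the filtered tissues might look like, assuming tissues are dropped when they have too few mutated samples. The actual filtering rule is not shown in this thread, so the `MIN_MUTATED` cutoff, the toy data, and the variable names are assumptions.

```python
import pandas as pd

MIN_MUTATED = 2  # hypothetical cutoff; the notebook's real criterion may differ

# Toy clinical table and mutation status sharing the same index
clinical = pd.DataFrame({'disease': ['BRCA', 'BRCA', 'BRCA', 'GBM',
                                     'GBM', 'LUAD', 'LUAD', 'LUAD']})
y = pd.Series([1, 1, 0, 0, 0, 1, 0, 1])

# Count mutated samples per tissue and derive the filtered list from the data
mutated_per_tissue = y.groupby(clinical['disease']).sum()
filtered_tissues = sorted(mutated_per_tissue[mutated_per_tissue < MIN_MUTATED].index)

# Report the filtered tissues instead of maintaining a hardcoded comment
print('Filtered out tissues:', ', '.join(filtered_tissues))

keep = ~clinical['disease'].isin(filtered_tissues)
y_kept = y[keep]
```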
Besides one last comment on the hardcoded tissue list, LGTM. Squash merge at will.
The pull request adds an example of predicting a biological pathway signature using PanCancer data. In the example, I predict Hippo signaling, with prediction performance varying across tissues. The pull request makes two contributions: