Precomputing a sample × mutation-in-gene-set matrix #21
Comments
I think this would be the next logical step for the cancer data group - and like @stephenshank mentioned, would require some communication with the ML group. I did some work on this issue today and am shooting to file a pull request in the ML group tomorrow afternoon.
From my perspective, you can think of this matrix as very similar to the gene-based mutation matrix, except that the columns will be pathways instead of gene names.
The tweaking required for the actual classifier is extremely minimal. The algorithm will simply take in a Y matrix of {0,1} where 1 means a mutation in any gene in the pathway. The visualizations of input data and classifier performance on a per-tissue basis are where this approach is likely to make the most difference.
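To make the description above concrete, here is a minimal sketch of collapsing a binary sample × gene mutation matrix into a sample × pathway matrix with pandas. The function name `sample_by_pathway`, the toy gene matrix, and the pathway names are all hypothetical and only for illustration; the real gene sets would come from the hetnet.

```python
import pandas as pd


def sample_by_pathway(gene_matrix: pd.DataFrame, pathways: dict) -> pd.DataFrame:
    """Collapse a binary sample x gene matrix into a binary sample x pathway matrix.

    `pathways` maps a pathway name to a list of gene names (hypothetical format).
    """
    columns = {}
    for name, genes in pathways.items():
        # Keep only the genes that actually appear in the mutation matrix
        present = [gene for gene in genes if gene in gene_matrix.columns]
        # 1 if the sample has a mutation in any gene of the pathway, else 0
        columns[name] = gene_matrix[present].any(axis=1).astype(int)
    return pd.DataFrame(columns, index=gene_matrix.index)


# Toy data, made up for illustration only
gene_matrix = pd.DataFrame(
    {"TP53": [1, 0, 0], "KRAS": [0, 0, 1], "BRAF": [0, 0, 0]},
    index=["sample_a", "sample_b", "sample_c"],
)
pathways = {
    "pathway_1": ["TP53", "BRAF"],
    "pathway_2": ["KRAS", "NOT_IN_DATA"],
}
pathway_matrix = sample_by_pathway(gene_matrix, pathways)
print(pathway_matrix)
```

Any single column of `pathway_matrix` could then serve as the {0,1} outcome for the classifier, assuming the classifier accepts one binary vector per run.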
I think that the long-term aim of this part is to run queries against the live hetnet database to return a gene set. This way, whenever the hetnets get updated, we automatically get the improved versions. It may be best to start there (queries against the live hetnets) instead of a downloaded version.
Agreed, but I think there is an R&D argument for generating a sample by pathway matrix. For example, we will want to know the distribution of positive prevalence across all pathways. @stephenshank, if you're still interested in this task, I recommend it. It will be convenient to have a cached mutation matrix for gene sets rather than genes. You can still work with Hetionet Cypher queries to construct this dataset, as @gwaygenomics started in cognoma/machine-learning#39.
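As a rough illustration of the "distribution of positive prevalence" mentioned above: with a cached binary sample × pathway matrix, the prevalence of each pathway is just the column mean. The matrix below is toy data with hypothetical names, not results from the actual dataset.

```python
import pandas as pd

# Toy binary sample x pathway matrix, made up for illustration only
pathway_matrix = pd.DataFrame(
    {
        "pathway_1": [1, 0, 0, 1],
        "pathway_2": [0, 0, 1, 0],
        "pathway_3": [0, 0, 0, 0],
    },
    index=["sample_a", "sample_b", "sample_c", "sample_d"],
)

# Fraction of samples that are positive (mutated) for each pathway
prevalence = pathway_matrix.mean(axis=0)

# Summary statistics of prevalence across all pathways
summary = prevalence.describe()
print(prevalence.sort_values(ascending=False))
print(summary)
```

Pathways with a prevalence of 0 or close to 1 would presumably be poor classification targets, which is one reason to look at this distribution before training.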
Also interesting is how often Hetionet returns genes that aren't in our mutation dataset.
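One way to check this is a set difference between the genes returned for each gene set and the columns of the mutation matrix. This is only a sketch: the function names, the gene set format (pathway name mapped to a list of gene names), and the toy values are assumptions for illustration.

```python
def genes_missing_from_dataset(gene_sets, dataset_genes):
    """For each gene set, list the returned genes absent from the mutation dataset."""
    known = set(dataset_genes)
    return {name: sorted(set(genes) - known) for name, genes in gene_sets.items()}


def fraction_missing(gene_sets, dataset_genes):
    """Fraction of distinct returned genes that are absent from the mutation dataset."""
    known = set(dataset_genes)
    returned = set().union(*gene_sets.values())
    if not returned:
        return 0.0
    return len(returned - known) / len(returned)


# Toy values, made up for illustration only
gene_sets = {
    "pathway_1": ["TP53", "BRAF"],
    "pathway_2": ["KRAS", "GENE_X"],
}
dataset_genes = ["TP53", "KRAS", "BRAF"]

missing = genes_missing_from_dataset(gene_sets, dataset_genes)
overall = fraction_missing(gene_sets, dataset_genes)
print(missing)
print(overall)
```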
@dhimmel I believe I'm ready to submit a PR for this, but had one quick question. The resulting sample-pathway matrix is about 26 MB uncompressed. I wasn't sure how big was too big to track, or if we want to track compressed files. Any suggestions would be most appreciated.
Can you |
See #25.
At the 8/23 meetup, @dhimmel expressed interest in incorporating metabolic pathway information by combining the dataset that we have and the hetnet database that was described at the first meetup. The hetnet has information on what pathways the mutated genes in the current dataset participate in.
I figured I'd open this issue to get the conversation started. Initially, I am wondering what this dataset would look like, and do we envision it being created from what we already have? And how much tweaking will the classifier of the machine learning group (for instance, that provided by @gwaygenomics) require?