interface with SAMM or bypass Galton branching #104
Comments
I'm not very familiar with the workings of SAMM, but hopefully I can help answer your questions about gctree.
It is possible to change how trees are ranked, by passing three ranking coefficients to the
The inference pipeline outputs top-ranked trees as pickled If you want access to all the parsimony trees that are being ranked, not just the best ones with respect to ranking criteria, you can extract trees from the pickled Depending on your data, the pickled forest may contain too many trees to iterate over. If this is the case, there are ways to filter them efficiently according to criteria you care about (although this is getting into internals and won't be straightforward), or you can call
Unfortunately it's not possible to run
Yes. Gctree uses something we're calling mutability parsimony: a sum over tree branches of a sum over sites of the negative log of the mutability of the central base in its 5-mer in the parent sequence, times the targeting probability of the child base given that 5-mer. Sites without mutations contribute no weight to this score, but it attempts to give higher weight to mutations that are less likely according to a mutation model. It isn't really a likelihood, though; what SAMM does is more principled. I hope this helps, but let me know if you have more questions.
Oh, and would you mind describing your use case, @Lan-h?
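To make the description above concrete, here is a minimal sketch of such a score. It is not gctree's implementation: the function name `mutability_parsimony`, the dictionary layout of the `mutability` and `targeting` tables, and the toy values are all assumptions for illustration, and it reads the description as the negative log of the product of mutability and targeting probability. Parent and child sequences are assumed to be aligned and of equal length.

```python
import math

def mutability_parsimony(branches, mutability, targeting, pad="N"):
    """Sum -log(mutability * targeting) over mutated sites on every branch.

    branches:   iterable of (parent_seq, child_seq) pairs, equal length
    mutability: dict mapping a parent 5-mer to its mutability (hypothetical layout)
    targeting:  dict mapping a parent 5-mer to {child_base: probability}
    """
    total = 0.0
    for parent, child in branches:
        # Pad so that every site has a full 5-mer context in the parent.
        padded = pad * 2 + parent + pad * 2
        for i, (p, c) in enumerate(zip(parent, child)):
            if p == c:
                continue  # unmutated sites contribute no weight
            fivemer = padded[i:i + 5]  # 5-mer centered on site i of the parent
            # A toy table only covers the 5-mers listed below; a real model
            # would also need entries for padded (N-containing) contexts.
            total += -math.log(mutability[fivemer] * targeting[fivemer][c])
    return total

# Toy model with a single 5-mer (made-up numbers, not a real mutation model).
toy_mutability = {"ACGTA": 0.5}
toy_targeting = {"ACGTA": {"A": 0.25, "C": 0.25, "T": 0.5}}

# One branch with one G->T mutation at the central site.
score = mutability_parsimony([("ACGTA", "ACTTA")], toy_mutability, toy_targeting)
```

Rarer mutations (smaller product) add a larger penalty, which is why this behaves like a weighted parsimony score rather than a likelihood.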
Thank you very much for these quick replies! @matsen, I chose mutability-based ranking of trees based on the BCR trees benchmarking paper published in JI in 2018, but maybe the methods have been updated since then... For the moment, I'm preprocessing the sequences and generating the parsimony forest with gctree, which has convenient functions. @willdumm, I tried both ranking coefficients within gctree and switching to SAMM.
I only get one tree (gctree.out.inference.1.nk), even though several of them are mentioned in the --tree_stats output newranking.tree_stats.csv. Also, the --tree_stats output is incompatible with the generation of single tree files (.nk, .svg, .p ...). I do not know whether this is some kind of bug. Concerning the switch to SAMM, I tried to read the .p file as in the post above:
Here is the code that I'm using within a python script:
This was not enough to access the content of the .p file; I got pickle protocol error messages. It might be due to incompatibilities between the Python versions used by SAMM and gctree. I'll try to figure out a way to make it work. I remain at your disposal for any further information.
@Lan-h Sorry for the delay in responding. I suspect you were indeed unable to unpickle the parsimony forest due to Python version issues. I'm curious if you have any trouble loading the pickled forest in a Python 3.9 environment with the latest version of gctree installed? Gctree is Python 3 only, and SAMM is Python 2 only, so it's possible that to export a tree from gctree to SAMM you'll need to write it to newick and re-load it in the Python 2 environment in which you're running SAMM. The CollapsedTree object contains an attribute
I'm not sure what you mean by incompatible, but the output from Let me know if you have any other questions. I'll be more responsive next time. |
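As a quick diagnostic for the pickle protocol errors discussed above, the following sketch checks which protocol a pickle declares. It uses only the standard library; the helper name `pickle_protocol` and the toy object are assumptions for illustration. Python 2 can read protocols 0 to 2 only, while Python 3 writes a higher protocol by default.

```python
import pickle

def pickle_protocol(data):
    """Return the protocol number declared in a pickle's header, or None.

    Pickles written with protocol 2 or higher start with the PROTO opcode
    (byte 0x80) followed by one byte holding the protocol number.
    Protocols 0 and 1 have no such header, so None is returned for them.
    """
    if len(data) >= 2 and data[0] == 0x80:
        return data[1]
    return None

toy_object = {"name": "toy", "abundance": 3}  # stand-in for real content

default_blob = pickle.dumps(toy_object)           # Python 3 default protocol
py2_blob = pickle.dumps(toy_object, protocol=2)   # highest protocol Python 2 reads

print(pickle_protocol(default_blob), pickle_protocol(py2_blob))
```

Note that re-saving with `protocol=2` only helps for plain built-in objects. Unpickling a class instance also requires that class to be importable in the reading interpreter, which is why exporting to newick is the more practical route between a Python 3 only and a Python 2 only package.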
Thank you for your answer. I'm running the script either from the Docker image from the tutorial https://matsengrp.github.io/gctree/install.html or from a custom Docker image with both Python versions, SAMM, gctree ... What I meant by incompatible is that when I add the --tree_stats option, I only get the tree_stats.csv output (and no gctree.out.inference files .p, .svg, .nk, .fasta), even if it mentions 40 trees with similar scores. I have to launch it again without the --tree_stats option to get the other outputs. This is not my main issue though... I am really surprised to get only one tree, since many of them are equally likely according to tree_stats.csv. Also, I am running some tests on a dataset on which a colleague ran gctree 2 years ago and obtained 15 different gctree.out.inference files. I wonder if there has been a major change in the gctree package since then. According to the tutorial, "If more than one parsimony tree is optimal, then up to ten optimal trees will be sampled randomly". Is there a way to extend this list, and what score range is allowed for these trees to be sampled?
This is definitely a bug. I'll be sure to fix it in the next release.
Ah yes, I see now what you mean. Since version 4, gctree uses a new internal representation of collections of trees, which makes it very efficient to find the optimal trees. This representation also sometimes finds many more maximally parsimonious trees than dnapars does, so there can be thousands of trees that all maximize the likelihood. To avoid outputting possibly thousands of equally optimal trees, we switched to sampling one tree from each optimal topology class. In your case, this means that although If you'd like to recover the previous gctree behavior immediately, just revert to a version before v4.0.0.
Only trees that optimize the branching process likelihood (or whatever ranking criteria you're using) will be output at the end of inference. If you want to retain trees which have suboptimal branching process likelihood, you will need to filter them one-by-one by iterating through the pickled forest object. I hope this is all clear, let me know if you have any more questions! |
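The one-by-one filtering mentioned above can be sketched generically as follows. This does not use gctree's forest API: the generator `toy_forest`, the `(name, log_likelihood)` tuples, and the helper `top_k_by_score` are hypothetical stand-ins for whatever iteration and scoring you apply to the unpickled forest. The point is only that a streaming top-k keeps memory proportional to k rather than to the number of trees.

```python
import heapq

def top_k_by_score(trees, score, k):
    """Return the k highest-scoring items from an iterable of trees.

    heapq.nlargest consumes the iterable one item at a time and keeps
    only k candidates in memory, so the forest is never materialized.
    """
    return heapq.nlargest(k, trees, key=score)

def toy_forest():
    # Hypothetical stand-in: yields (tree_name, log_likelihood) pairs lazily.
    yield from [("t1", -10.2), ("t2", -9.7), ("t3", -9.7), ("t4", -12.0)]

# Keep the three trees with the highest log-likelihood, including a
# suboptimal one that the default output would drop.
top = top_k_by_score(toy_forest(), score=lambda t: t[1], k=3)
```

With a real forest, `score` would be replaced by whichever ranking criterion you care about, for example a mutability-based score negated so that larger is better.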
Hi,
Thank you for your tool.
I am more interested in the mutability likelihood than in the one computed by default after inferring the parameters of the Galton branching process.
I tried to use the .p file (either returned by phylip_parse or gc infer) but it seems the other tool from your lab, SAMM (https://github.com/matsengrp/samm), cannot take as an input a whole parsimony forest.
Is there a way to parse the .p file into single trees that could be processed by SAMM, or to use the gc infer mutability options without computing the 2 Galton parameters (my trees are quite large, and this step can be time-intensive)?
By the way, is there a difference between the mutability likelihoods computed by SAMM and gctree?
Thank you in advance for your help!