Choosing the optimal way to do exploratory analysis #1770
Dear @Malaevleo, Couple of things.
If you use the results of Generally, the best strategy to boost power is to use some non-sequence-based information to define your hypothesis as precisely as possible: select a subset of candidate genes, or a subset of candidate branches, a priori. HTH,
Thank you very much for the great advice! A pipeline of running BUSTED-E and then feeding the identified genes into aBSREL sounds like the best option for me. I guess there is no point in also running MEME, as these methods are sufficient. I actually expect to find only a few genes, so everything sounds fairly reasonable.

I could choose as test branches only those of the long-lived bats (there are only 6 of them out of 18 species overall), which would make the hypothesis stricter and boost the statistical power of the methods. But I am not sure that other bats do not experience positive selection in these genes. Thus, for aBSREL I am more comfortable testing all branches and then choosing genes that experience positive selection only in the branches of interest and nowhere else.

I've read this article (https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-023-09554-4#Sec12), in which the researchers simply remove the test branches and run M8 vs M8a on the alignments without long-lived species to determine which genes experience positive selection in the non-long-lived species. I am not confident in this approach. In the article from the Bat1K consortium (https://www.nature.com/articles/s41586-020-2486-3#Sec10), the researchers ran aBSREL with all branches as test branches and then chose genes that experience positive selection only in the bat common ancestor. This seems fairer in my opinion, but maybe I am wrong.

To summarize: while it is possible to define only a subset of branches as test branches, I cannot be certain that these genes do not experience positive selection in other branches. I am therefore more comfortable doing an exploratory analysis (even though power will be lost) with all branches tested, and then choosing only those genes that experience positive selection in the branches I am interested in. Is my approach reasonable, or am I wrong?
Also, on the note of BUSTED-E, just to clarify: am I right that to run it I have to, firstly, use Once again, thank you for your attention, and I am really grateful for the great advice!
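The post-hoc filter proposed in this comment (test all branches with aBSREL, then keep genes selected only on the branches of interest) can be sketched in a few lines of Python. This is a minimal illustration, not part of HyPhy: the function names, gene names, and branch names are hypothetical, and the input is assumed to be a per-gene set of branches that were already called significant after whatever multiple-testing correction was applied.

```python
def selected_only_in_foreground(selected_branches, foreground):
    """True if at least one branch is selected and every selected branch is a branch of interest."""
    selected = set(selected_branches)
    return bool(selected) and selected <= set(foreground)


def filter_genes(results, foreground):
    """Return the sorted gene names whose selected branches all lie in the foreground set."""
    return sorted(
        gene
        for gene, branches in results.items()
        if selected_only_in_foreground(branches, foreground)
    )


# Hypothetical branches of interest (e.g. long-lived species).
foreground = {"Myotis_brandtii", "Myotis_lucifugus"}

# Hypothetical per-gene sets of branches called significant by a branch-level test.
results = {
    "geneA": {"Myotis_brandtii"},                       # foreground only: kept
    "geneB": {"Myotis_brandtii", "Pteropus_vampyrus"},  # also selected elsewhere: dropped
    "geneC": set(),                                     # no selection detected: dropped
}

print(filter_genes(results, foreground))
```

The subset test `selected <= foreground` is what encodes "nowhere else"; the `bool(selected)` check excludes genes with no significant branch at all.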
Dear @Malaevleo, For the BUSTED-E pipeline, I would suggest the following. Here I assume that you have a separate gene file and tree file.
There are a lot of papers that use ad hoc statistical procedures, which may or may not be sensible. It really depends on the data and the question. I think what you want for your analysis is something like BUSTED-PH (https://github.com/veg/hyphy-analyses/tree/master/BUSTED-PH). Since you know which branches are "long-lived", designate them as foreground (see https://github.com/veg/hyphy-analyses/tree/master/LabelTrees, which may be useful) and the rest of the branches as background. Then, in addition to BUSTED-E as described above, run two additional analyses.
Step 1 will tell you if you have selection on ANY of the foreground branches. You want a positive result for (1) and a negative result for (2). This is similar to what the first paper you cited does, but without throwing away any data, and it is statistically sound. You won't have branch-level resolution, i.e., you won't know which branches in the FOREGROUND are selected, but I think all you really need is a dichotomous label: selected in the species/clades of interest, not selected outside those entities. Best,
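The dichotomous label described above (a positive result on the foreground test, a negative result on the background test) can be expressed as a small decision rule. The sketch below is only an illustration: the function name, the category strings, the 0.05 threshold, and the example p-values are all assumptions, and it does not read HyPhy output. Note also that a non-significant background test is a failure to detect selection, not proof of its absence.

```python
def classify_gene(p_foreground, p_background, alpha=0.05):
    """Label a gene from two test p-values (hypothetical inputs, assumed threshold)."""
    fg_selected = p_foreground <= alpha
    bg_selected = p_background <= alpha
    if fg_selected and not bg_selected:
        return "foreground-only"   # the pattern of interest
    if fg_selected and bg_selected:
        return "both"
    if bg_selected:
        return "background-only"
    return "none"


# Hypothetical (foreground p-value, background p-value) pairs per gene.
p_values = {
    "geneA": (0.001, 0.60),
    "geneB": (0.010, 0.002),
    "geneC": (0.400, 0.030),
    "geneD": (0.700, 0.900),
}

for gene, (p_fg, p_bg) in sorted(p_values.items()):
    print(gene, classify_gene(p_fg, p_bg))
```

When many genes are screened, the p-values passed in would normally be corrected for multiple testing first; that step is deliberately left out of this sketch.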
Thank you very much! This approach sounds great! After conducting the analysis with BUSTED and aBSREL, I want to use PGLS to figure out whether ω values are significantly associated with the lifespans of bats, basically something like what is done in the previously mentioned article: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-023-09554-4#Sec12 I wanted to ask for advice about obtaining 'root-to-tip' ω values using HyPhy. In the paper, the researchers use the free-ratio CODEML model from PAML, but I would really like to stick to HyPhy if possible. Thank you very much for your help!
Dear @Malaevleo, My understanding of what "free ratio" means is that you simply get a different ω per branch. You could consider a model where, for each tip, you draw a path to the root, like so. Then you fit a model in which there's a single ω on all the path branches, and (nuisance) free ω parameters for the background branches. Finally, repeat the process for all tips. Two issues to consider
Is that something you had in mind? Currently, there is no analysis that does this exact procedure, but something like Best, PS Take a look at this paper where my group helped marine biologists with a similar analysis
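The root-to-tip partitioning described in this comment (for each tip, the branches on the path to the root form one class, and all other branches are background) can be sketched as follows. This is a minimal illustration under stated assumptions: the tree is given as a hypothetical child-to-parent map rather than parsed from a Newick file, each branch is named after its child node, and the "Foreground"/"Background" label strings are placeholders. It only builds the branch partition; fitting the ω model is left to HyPhy.

```python
def root_to_tip_path(parent, tip):
    """Return the branches (named by child node) on the path from the root down to `tip`."""
    path = []
    node = tip
    while node in parent:          # the root has no entry in `parent`
        path.append(node)
        node = parent[node]
    return list(reversed(path))


def label_branches(parent, tip, fg="Foreground", bg="Background"):
    """Label every branch as foreground (on the root-to-tip path) or background."""
    on_path = set(root_to_tip_path(parent, tip))
    return {node: (fg if node in on_path else bg) for node in parent}


# Hypothetical four-taxon tree: ((A,B)n1,(C,D)n2)root
parent = {"n1": "root", "n2": "root", "A": "n1", "B": "n1", "C": "n2", "D": "n2"}

# Tips are the nodes that never appear as a parent.
tips = sorted(set(parent) - set(parent.values()))

# One partition per tip, mirroring "repeat the process for all tips".
for tip in tips:
    print(tip, root_to_tip_path(parent, tip))
```

Each call to `label_branches` yields one partition, so a tree with 18 tips would give 18 labelled trees and 18 separate model fits.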
Thank you for your help! So, just to clarify, for such a procedure I will need to make trees where the foreground branches are those on the path from the root to the tip, while the others are considered background. For foreground branches I use To sum up, I believe that a possible pipeline should look like this (correct me if I am wrong, of course):
Once again, thanks for the great advice! Your help has been amazing and I am extremely grateful for it!
Dear @Malaevleo, Yes, that all sounds good. For Step 5 I just added the Best,
This is great, I cannot thank you enough for your help! It's been a pleasure consulting with you about the pipeline. Once again, thank you very much!
Hello!
First of all, I wanted to thank you for such a great software. I am amazed by the variety of tools present in it!
I wanted to check whether I understand correctly how it works. My goal is to detect which aging-associated genes in 18 different bats experience positive selection (for each gene I have an alignment of the 18 bat proteins). Therefore, I first have to conduct a branch-level analysis. I don't know which branches should experience this phenomenon, so I will probably have to test all branches. Then, for branches that experienced positive selection, I want to look at site-level selection.
Is it a good idea to try out aBSREL or BUSTED to detect which branches experience positive selection? If it is, then which tool is probably better for obtaining the omega values for each branch? Or should I just use FitMG94.bf for identifying branches as it is the most streamlined approach? After identifying branches, I plan to use MEME for site-specific selection analysis. Is this an adequate idea or am I missing the point?
Sorry for asking something that was asked here a lot, but I just wanted to make sure I understood everything correctly and there is no better way to do so than to ask you directly.
Thank you for your attention! Once again, thank you for your outstanding work!