Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Choosing the optimal way to do exploratory analysis #1770

Open
Malaevleo opened this issue Nov 29, 2024 · 8 comments
Open

Choosing the optimal way to do exploratory analysis #1770

Malaevleo opened this issue Nov 29, 2024 · 8 comments

Comments

@Malaevleo
Copy link

Hello!

First of all, I wanted to thank you for such a great software. I am amazed by the variety of tools present in it!

I wanted to clarify, whether I understand the principles of how it works correctly. I have a goal of detecting which aging-associated genes in 18 different bats experience positive selection (for each gene I have an alignment of the 18 bat proteins). Therefore, firstly, I have to conduct branch-level analysis. I don't know which branches should experience this phenomena, thus I will probably have to test all branches. Then, for branches that experienced positive selection, I want to look at the site-level selection.

Is it a good idea to try out aBSREL or BUSTED to detect which branches experience positive selection? If it is, then which tool is probably better for obtaining the omega values for each branch? Or should I just use FitMG94.bf for identifying branches as it is the most streamlined approach? After identifying branches, I plan to use MEME for site-specific selection analysis. Is this an adequate idea or am I missing the point?

Sorry for asking something that was asked here a lot, but I just wanted to make sure I understood everything correctly and there is no better way to do so than to ask you directly.

Thank you for your attention! Once again, thank you for your outstanding work!

@spond
Copy link
Member

spond commented Dec 3, 2024

Dear @Malaevleo,

Couple of things.

  1. Unfortunately, none of our methods provide the ultimate resolution you seek (specific branches at specific sites, see https://pmc.ncbi.nlm.nih.gov/articles/PMC3395634/)
image
  1. I would suggest you started with error-correction enabled BUSTED-E method (https://www.biorxiv.org/content/10.1101/2024.11.13.620707v1) to identify which of your genes (as evolutionary units) may be subject to episodic diversifying selection. I would suggest using false discovery correction (FDR) on the p-values returned by BUSTED.

  2. You could use the output of the previous step to filter the alignments for possible errors (see hyphy error-filter --help), and then run them through aBSREL to find (gene-by-gene) which branches may be subject to selection. Except to find relatively few.

  3. You could also run MEME on the individual genes (all branches) to find sites possibly subject to selection.

If you use the results of aBSREL to then define branches to test wtih MEME, you are likely going to have very biased results. It's like getting a cloud of points on the plane, running some method to pick the ones that follow a linear trend, and then fitting a linear regression to the selected points. You are going to get a lot of false positive results because of cherry-picking.

Generally, the best strategy to boost power is to use some non-sequence based information to define your hypothesis as precisely as possible: select a subject of candidate genes, or a subset of candidate branches a priori.

HTH,
Sergei

@Malaevleo
Copy link
Author

Thank you very much for great advice!

Pipeline of using BUSTED-E and then inputting the identified genes into aBSREL sounds like the best possible option for me. I guess there is no point in also running MEME as these methods are sufficient.

I actually expect to find only a few genes, so everything sounds fairly reasonable. I can choose as test branches only those that are for long-living bats (there are only 6 of them out of 18 overall species), therefore making hypothesis more strict and boosting statistical power of methods. But I am not sure that other bats do not experience positive selection in these genes. Thus, for aBSREL I am more confident in testing all branches and then choosing genes that experience positive selection only in the branches of interest and nowhere else. I've read this article (https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-023-09554-4#Sec12) and here the researchers straight up remove test branches and run M8 vs M8a on the alignments without long-living species to determine genes that experience positive selection in the non long-living species. I am not exactly confident in this approach. In the article from Bat1K consortium (https://www.nature.com/articles/s41586-020-2486-3#Sec10) upon doing aBSREL researchers used all brances as test ones and then chose genes that experience positive selection only in bat common ancestor. This seems more fair in my opinion but maybe I am wrong. To summarize, while it is possible to define only a subset of branches as test ones, I can not be certain that this genes do not experience positive selection in other branches, therefore I am more confident in doing exploratory analysis (even though power will be lost) with all branches being tested and then choosing only those genes that experience positive selection in branches that I am interested in. Is my approach reasonable or am I wrong?

Also, on the note of BUSTED-E just to clarify: am I right that to run it I have to, firstly, use hyphy busted --eror-sink Yes and then I need to filter out the errors using hyphy error-filter and then I can run aBSREL on the filtered MSAs?

Once again, thank you for your attention and I am really grateful for great advice!

@spond
Copy link
Member

spond commented Dec 3, 2024

Dear @Malaevleo,

For the BUSTED-E pipeline, I would suggest the following. Here I assume that you have a gene file and a tree file separate.

### run standard BUSTED to get initial parameter estimates and improve convergence

hyphy busted --alignment file.fas --tree tree.nwk --starting-points 5 --intermediate-fits file.cache --output file.BUSTED.json

### run BUSTED-E (this is the gene-level result step; take the p-value from here)

hyphy busted --alignment file.fas --tree tree.nwk --starting-points 5 --intermediate-fits file.cache  --error-sink Yes --output file.BUSTED-E.json

### run filter

hyphy error-filter --json file.BUSTED-E.json --output file.filtered.fas --output-json file.filtered-info.json

### run aBSREL on the filtered dataset

hyphy absrel --alignment file.filtered.fas ...

There are a lot of papers which do ad hoc statisical procedures which may or may not be sensible. It really depends on the data/question.

I think for your analysis, what you want is something like BUSTED-PH (https://github.com/veg/hyphy-analyses/tree/master/BUSTED-PH)

Since you know which branches are "long-lived", designate them as foreground (see https://github.com/veg/hyphy-analyses/tree/master/LabelTrees, may be useful), and the rest of the branches as foreground.

Then, in addition to BUSTED-E as described above, run two additional analyses

  1. BUSTED-E on the foregound (--branches FOREGROUND, changing the label if needed)
  2. BUSTED-E on the background (--branches "Unlabeled branches", changing the label if needed)

Step 1 will tell you if you have selection on ANY of the foregound.
Step 2 will tell you if you have selection on ANY of the background.

You want a positive result for (1) and a negative result for (2). This is kind of similar to what the first paper you cited does, but without throwing away any data, and statistically sound

You won't have branch-level resolution, i.e., won't know which branches in the FOREGROUND are selected, but I think all you really need is a dichotomous label : selected in species/clades of interest, not selection outside the those entities.

Best,
Sergei

@Malaevleo
Copy link
Author

Thank you very much! This approach sounds great!

After conducting analysis with BUSTED and aBSREL I want to use PGLS to figure out whether ω values are significantly associated with lifespans of bats. Basically do something like in this article previously mentioned: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-023-09554-4#Sec12

I wanted to ask for a advice about obtaining 'root-to-tip' omega values using HYPHY. In the paper researchers use free-ratio CODEML model from PAML but I really want to stick to the HYPHY if it is possible of course.
image
What can be considered an optimal approach for this task? While it is possible to extract mean ω for each branch from aBSREL (I remember that doing aBSREL for obtaining mean ω is considered an 'overkill' but since I already need aBSREL for my branch-level positive selection analysis, it sounds about right), from what I understand they won't be exactly 'root-to-tip', therefore, their usage for PGLS won't exactly be justified. For example, if I observe high mean ω value for branch leading to common ancestor of Myotis genus but not for terminal branches leading from Myotis common ancestor to the different Myotis species, it won't be justified to use values for terminal branches for PGLS as they won't show positive selection for those species while I know that they experienced it before separating into different Myotis species. So, for PGLS I am interested in using ω that is from root to the tip of the tree. Is there a convenient way to obtain such values using HYPHY? I hope that I managed to explain idea in a comperhensive way. If concept of this analysis sounds odd and there are obvious issues with it, please, criticize it.

Thank you very much for your help!

@spond
Copy link
Member

spond commented Dec 4, 2024

Dear @Malaevleo,

My understanding of what "free ratio" means is that you simply get a different ω per branch. You could consider a model where, for each tip, you draw a path to the root, like so

image

Then you fit a model in which there's a single ω on all the path branches, and (nuisance) free ω parameters for the background branches. Finally, repeat the process for all tips. Two issues to consider

  1. Rooting is kind of a misnomer for HyPhy (and PAML), since all models are time reversible. You can definitely root on an outgroup, but be aware that things like branch lengths for the children of the root are ill conditioned (you can only estimate the sum)

  2. Obviously, ω estimates between closely related tips which share much of their ancestry will be correlated.

Is that something you had in mind? Currently, there is no analysis that does this exact procedure, but something like FitMG94 (https://github.com/veg/hyphy-analyses/tree/master/FitMG94) can be modified for that purpose pretty easily.

Best,
Sergei

PS Take a look at this paper where my group helped marine biologists with a similar analysis

@Malaevleo
Copy link
Author

Thank you for your help!

So, just to clarify, for such a procedure I will need to make trees where the the foreground branches are those that are on the path from root to the tip, while others are considered background. For foreground branches I use --type global of FitMG94, since I want them to have a single ω. Thus, since I have 18 bats, I will have 18 trees for each gene with different foreground branches corresponding to each path to the tip? If I understood everything correctly, this sounds fitting for the PGLS idea that I have.

To sum up, I believe that possible pipeline should look like this (correct me if I am wrong of course):

  1. Run BUSTED to get initial parameter estimates and improve convergence
  2. Run BUSTED-E firstly (since from the article you have provided it is quite clear that this method probably avoids positive selection overestimation issue and overall is the best suit for initial analysis) on trees where long-living bats are foreground, then on trees where long-living bats are background. For genes that experience positive selection in long-living bats proceed to the next steps
  3. Run BUSTED-PH to see if positive selection is associated with longevity trait
  4. Run aBSREL to look at which branches the positive selection might occur
  5. Run FitMG94 with --type global option where for each gene I have 18 trees where foreground is specified as path to the chosen species. Obtained ω values then might be used for PGLS analysis

Once again, thanks for great advice! Your help has been amazing and I am extremely grateful for it!

@spond
Copy link
Member

spond commented Dec 6, 2024

Dear @Malaevleo,

Yes, that all sounds good. For Step 5 I just added the --type lineage option to v0.4 of FitMG94.bf, so it does the labeling for you automatically and also treats the branches not on the root-to-tip path as nuisance parameters (with each branch having its own ω).

Best,
Sergei

@Malaevleo
Copy link
Author

This is great, I can not thank you enough for your help! It's been a pleasure consulting with you about the pipeline. Once again, thank you really much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants