-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Estimation of alternative allele frequencies with metafounders #182
base: feat_metafounders
Are you sure you want to change the base?
Estimation of alternative allele frequencies with metafounders #182
Conversation
- New command -update_alt_allele_prob - New function updateMafAfterPeeling() triggered with new command - Set all alternative allele frequencies to be between 0.01 and 0.99 (including user-inputs). - Updates to documentation - Updates to functional testing. - New file for metafounder simulation (new code replacing previous approach). - New simulation for metafounder accuracy testing - Updates to accuracy tests for metafounders - Updates to terminal summary of accuracy tests to include metafounders
…phaPeel into feat_metafounders
Merge pull request AlphaGenes#175 from RosCraddock/feat_metafounders
…phaPeel into feat_metafounders
docs/source/usage.rst
Outdated
@@ -133,7 +136,7 @@ For hybrid peeling, where a large amount (millions of segregating sites) of sequ | |||
|
|||
The ``-geno_error_prob``, ``-seq_error_prob`` and ``-rec_length`` arguments control some of the model parameters used in the model. ``-seq_error_prob`` must not be zero. |Software| is robust to deviations in genotyping error rate and sequencing error rate so it is not recommended to use these options unless large deviations from the default are known. Changing the ``-length`` argument to match the genetic map length can increase accuracy in some situations. | |||
|
|||
The ``-est_geno_error_prob`` and ``-est_seq_error_prob`` options estimate the genotyping error rate and the sequencing error rate based on miss-match between observed and inferred states. This option is generally not necessary and can increase runtime. ``-est_alt_allele_prob`` estimates the alternative allele probabilities after each peeling cycle. This option can be useful if there are a large number of non-genotyped founders. If both ``-alt_allele_prob_file`` and ``-est_alt_allele_prob`` are used, the inputted alternative allele probabilities are used as a starting point for alternative allele probabilities estimation. | |||
The ``-est_geno_error_prob`` and ``-est_seq_error_prob`` options estimate the genotyping error rate and the sequencing error rate based on miss-match between observed and inferred states. This option is generally not necessary and can increase runtime. ``-est_alt_allele_prob`` estimates the alternative allele frequencies before peeling using all available observed genotypes. This option can be useful if there are a large number of non-genotyped founders. ``-update_alt_allele_prob`` re-estimates the alternative allele frequencies per metafounder after each peeling cycle using the inferred genotype probabilities of the founders. For implementation of metafounders (**without** ``-alt_allele_prob_file``), both ``-est_alt_allele_prob`` and ``-update_alt_allele_prob`` should be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last sentence on providing alternative allele prob file - I don’t fully follow what you mean here. Do you need to say more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will rephrase this. Here, I wanted to highlight that -est_alt_allele_prob
estimates one alternative allele frequency for the base population using all available genotype data (so it ignores any genetic grouping). Thus, to utilise the metafounders/genetic grouping in the peeling, -update_alt_allele_prob
is required as well.
@@ -342,7 +345,7 @@ Example: | |||
Model parameter files | |||
===================== | |||
|
|||
|Software| outputs four model parameter files: *.alt_allele_prob.txt*, *.seq_error_prob.txt*, *.geno_error_prob.txt*, *.rec_prob.txt*. These give the alternative allele frequency, sequencing error rates, genotyping error rates and the recombination rates used. In the *.alt_allele_prob.txt*, there is a column per metafounder with an alternative allele frequency for each marker. The other three files contain a single column with an entry for each marker. By default, |Software| will output *.seq_error_prob.txt*, *.geno_error_prob.txt* and *.rec_prob.txt*. The *.alt_allele_prob.txt* will only be outputted with the argument ``-alt_allele_prob``. | |||
|Software| outputs four model parameter files: *.alt_allele_prob.txt*, *.seq_error_prob.txt*, *.geno_error_prob.txt*, *.rec_prob.txt*. These give the alternative allele frequency, sequencing error rates, genotyping error rates and the recombination rates used. In the *.alt_allele_prob.txt*, there is a column per metafounder with an alternative allele frequency for each marker. Here, all values will range from 0.01 to 0.99. The other three files contain a single column with an entry for each marker. By default, |Software| will output *.seq_error_prob.txt*, *.geno_error_prob.txt* and *.rec_prob.txt*. The *.alt_allele_prob.txt* will only be outputted with the argument ``-alt_allele_prob``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@XingerTang we limit ourselves to 1%-99% to avoid getting “sucked” into 0, where then everything becomes “fixed”. Should we have a bit more lax range? Say, 0.001-0.999 or similar? How to do this well from numerical perspective?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could enforce a more laxed range. I did a quick test with inputted alternative allele frequencies and the estimation via the Newton method, and it seems to be fine. But it would be good to know what @XingerTang thinks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RosCraddock I was asking this because some disease allele frequencies in a base population might be very low, much lower than 0.01, but we do want to avoid the "0 trap", which can occur during iteration/peeling cycles. The question now is what limits should we put. @XingerTang I would appreciate your math's input here;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RosCraddock one way to check this is to see in your real datasets you work with - if you get allele frequency that is 0.01 for any of the metafounders then we know that we have likely hit this imposed range-limit. That will give us a good test for how to manage these range-limits.
@@ -111,6 +111,34 @@ def addIndividualToUpdate(d, p, LLp, LLpp): | |||
return LLp, LLpp | |||
|
|||
|
|||
def updateMafAfterPeeling(pedigree, peelingInfo): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RosCraddock @XingerTang the code loops over metafounders and for each metafounder loops over all pedigree members twice - once to calculate and then to assign anterior probabilities. I am thinking out loud if we can speed this up in any way if we flip iteration around - over individuals and then over metafounders, but we would be doing the same amount of work, so it seems we can not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I wondered the same when writing this!
src/tinypeel/tinypeel.py
Outdated
@@ -48,6 +56,10 @@ def runPeelingCycles(pedigree, peelingInfo, args, singleLocusMode=False): | |||
peelingInfo.nLoci, 0.5, dtype=np.float32 | |||
) | |||
if args.est_alt_allele_prob: | |||
if args.alt_allele_prob_file is not None: | |||
warnings.warn( | |||
"-est_alt_allele_prob will update the inputted alternative allele frequencies using all available observed genotypes. Therefore, will overwrite any differences between metafounders." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RosCraddock say that current implementation overwrites differences. We might improve upon this later …
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only true if a user uses both —alt_allele_prob_file
and —est_alt_allele_prob
. For now, I think I will rewrite this warning to recommend using —update_alt_allele_prob
instead.
tests/accuracy_tests/sim_for_alphapeel_accu_test/sim_for_metafounder_accu.R
Show resolved
Hide resolved
pullIbdHaplo(pop = pop)) | ||
data$genoIBS <- rbind(data$genoIBS, | ||
pullSegSiteGeno(pop = pop)) | ||
listLoci <- seq.int(from = 1, to = sum(pop@nLoci), by = 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RosCraddock do you need this listLoci bit of code? Also, isn't your Genotype the same as genoIBS? I got the later via pullSegSiteGeno() and there is also pullSegSiteHaplo() which will give you haploIBS. That's all you need I think,
listLoci <- seq.int(from = 1, to = sum(pop@nLoci), by = 1) | ||
marker <- "" | ||
i <- 1 | ||
for (i in 1:length(listLoci)){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is an example where doing just paste(“1_”, 1:pop@nLoci, sep=“0”) would be a simpler and faster vectorised way of doing this, but you don’t need it at all - just use pullSegSiteHaplo(‘)
# Get pedigree and assign metafounders | ||
pedigree <- data.frame(data$pedigree$id, data$pedigree$mid, data$pedigree$fid, data$pedigree$population, data$pedigree$generation) | ||
colnames(pedigree) <- c("id", "mid", "fid", "population", "generation") | ||
pedigree <- pedigree[c(401:1400),] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have hardcoded numbers here but nGen comes into this script externally - will that clash with these hardcoded numbers here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good work @RosCraddock. I left some minor comments.
- Added detail to usage.rst for metafounders. - Change of warning to user where they have multiple metafounders in an alternative allele probability file and use -est_alt_allele_prob as well. - Removal of redundant code and tidying in accuracy simulation for metafounders.
- Correct ordering of metafounders in alt_allele_prob output when there are 10 or more. - Added a functional test to check this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @RosCraddock
Quite a long list of updates here!
-update_alt_allele_prob
for re-estimating the alternative allele frequency on each peeling@gregorgorjanc @XingerTang - when you have time, please let me know if you have any comments/questions/changes.