Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Estimation of alternative allele frequencies with metafounders #182

Open
wants to merge 8 commits into
base: feat_metafounders
Choose a base branch
from

Conversation

RosCraddock
Copy link

Quite a long list of updates here!

  • New command -update_alt_allele_prob for re-estimating the alternative allele frequency on each peeling
  • New function updateMafAfterPeeling() to re-estimate the alternative allele frequencies based on the inferred genotype probabilities of the founders (triggered with new command
  • Restricted all alternative allele frequencies to be between 0.01 and 0.99 (including user-inputs).
  • Updates to documentation for -update_alt_allele_prob and noted the alternative allele frequency range (0.01 to 0.99).
  • Updates to functional testing for alternative allele frequency estimation.
  • New R script file for metafounder simulation (new code replacing the previous approach).
  • New simulation for accuracy testing of metafounders: one pedigree, one true genotype (dosage, phased, unphased), set of obsv genotypes with ~50% missing rate.
  • Five additional accuracy tests for metafounders. All only with the single method. I can add more (e.g multi, hybrid), but cautious since the summary table is already quite large.
  • Additions to the accuracy summary table printed in the terminal to include metafounder (as separate entry as uses different pedigree simulation).

@gregorgorjanc @XingerTang - when you have time, please let me know if you have any comments/questions/changes.

- New command -update_alt_allele_prob
- New function updateMafAfterPeeling() triggered with new command
- Set all alternative allele frequencies to be between 0.01 and 0.99 (including user-inputs).
- Updates to documentation
- Updates to functional testing.
- New file for metafounder simulation (new code replacing previous approach).
- New simulation for metafounder accuracy testing
- Updates to accuracy tests for metafounders
- Updates to terminal summary of accuracy tests to include metafounders
Merge pull request AlphaGenes#175 from RosCraddock/feat_metafounders
@@ -133,7 +136,7 @@ For hybrid peeling, where a large amount (millions of segregating sites) of sequ

The ``-geno_error_prob``, ``-seq_error_prob`` and ``-rec_length`` arguments control some of the model parameters used in the model. ``-seq_error_prob`` must not be zero. |Software| is robust to deviations in genotyping error rate and sequencing error rate so it is not recommended to use these options unless large deviations from the default are known. Changing the ``-length`` argument to match the genetic map length can increase accuracy in some situations.

The ``-est_geno_error_prob`` and ``-est_seq_error_prob`` options estimate the genotyping error rate and the sequencing error rate based on miss-match between observed and inferred states. This option is generally not necessary and can increase runtime. ``-est_alt_allele_prob`` estimates the alternative allele probabilities after each peeling cycle. This option can be useful if there are a large number of non-genotyped founders. If both ``-alt_allele_prob_file`` and ``-est_alt_allele_prob`` are used, the inputted alternative allele probabilities are used as a starting point for alternative allele probabilities estimation.
The ``-est_geno_error_prob`` and ``-est_seq_error_prob`` options estimate the genotyping error rate and the sequencing error rate based on miss-match between observed and inferred states. This option is generally not necessary and can increase runtime. ``-est_alt_allele_prob`` estimates the alternative allele frequencies before peeling using all available observed genotypes. This option can be useful if there are a large number of non-genotyped founders. ``-update_alt_allele_prob`` re-estimates the alternative allele frequencies per metafounder after each peeling cycle using the inferred genotype probabilities of the founders. For implementation of metafounders (**without** ``-alt_allele_prob_file``), both ``-est_alt_allele_prob`` and ``-update_alt_allele_prob`` should be used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last sentence on providing alternative allele prob file - I don’t fully follow what you mean here. Do you need to say more?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will rephrase this. Here, I wanted to highlight that -est_alt_allele_prob estimates one alternative allele frequency for the base population using all available genotype data (so it ignores any genetic grouping). Thus, to utilise the metafounders/genetic grouping in the peeling, -update_alt_allele_prob is required as well.

@@ -342,7 +345,7 @@ Example:
Model parameter files
=====================

|Software| outputs four model parameter files: *.alt_allele_prob.txt*, *.seq_error_prob.txt*, *.geno_error_prob.txt*, *.rec_prob.txt*. These give the alternative allele frequency, sequencing error rates, genotyping error rates and the recombination rates used. In the *.alt_allele_prob.txt*, there is a column per metafounder with an alternative allele frequency for each marker. The other three files contain a single column with an entry for each marker. By default, |Software| will output *.seq_error_prob.txt*, *.geno_error_prob.txt* and *.rec_prob.txt*. The *.alt_allele_prob.txt* will only be outputted with the argument ``-alt_allele_prob``.
|Software| outputs four model parameter files: *.alt_allele_prob.txt*, *.seq_error_prob.txt*, *.geno_error_prob.txt*, *.rec_prob.txt*. These give the alternative allele frequency, sequencing error rates, genotyping error rates and the recombination rates used. In the *.alt_allele_prob.txt*, there is a column per metafounder with an alternative allele frequency for each marker. Here, all values will range from 0.01 to 0.99. The other three files contain a single column with an entry for each marker. By default, |Software| will output *.seq_error_prob.txt*, *.geno_error_prob.txt* and *.rec_prob.txt*. The *.alt_allele_prob.txt* will only be outputted with the argument ``-alt_allele_prob``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XingerTang we limit ourselves to 1%-99% to avoid getting “sucked” into 0, where then everything becomes “fixed”. Should we have a bit more lax range? Say, 0.001-0.999 or similar? How to do this well from numerical perspective?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could enforce a more laxed range. I did a quick test with inputted alternative allele frequencies and the estimation via the Newton method, and it seems to be fine. But it would be good to know what @XingerTang thinks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RosCraddock I was asking this because some disease allele frequencies in a base population might be very low, much lower than 0.01, but we do want to avoid the "0 trap", which can occur during iteration/peeling cycles. The question now is what limits should we put. @XingerTang I would appreciate your math's input here;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RosCraddock one way to check this is to see in your real datasets you work with - if you get allele frequency that is 0.01 for any of the metafounders then we know that we have likely hit this imposed range-limit. That will give us a good test for how to manage these range-limits.

@@ -111,6 +111,34 @@ def addIndividualToUpdate(d, p, LLp, LLpp):
return LLp, LLpp


def updateMafAfterPeeling(pedigree, peelingInfo):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RosCraddock @XingerTang the code loops over metafounders and for each metafounder loops over all pedigree members twice - once to calculate and then to assign anterior probabilities. I am thinking out loud if we can speed this up in any way if we flip iteration around - over individuals and then over metafounders, but we would be doing the same amount of work, so it seems we can not.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I wondered the same when writing this!

@@ -48,6 +56,10 @@ def runPeelingCycles(pedigree, peelingInfo, args, singleLocusMode=False):
peelingInfo.nLoci, 0.5, dtype=np.float32
)
if args.est_alt_allele_prob:
if args.alt_allele_prob_file is not None:
warnings.warn(
"-est_alt_allele_prob will update the inputted alternative allele frequencies using all available observed genotypes. Therefore, will overwrite any differences between metafounders."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RosCraddock say that current implementation overwrites differences. We might improve upon this later …

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only true if a user uses both —alt_allele_prob_file and —est_alt_allele_prob. For now, I think I will rewrite this warning to recommend using —update_alt_allele_prob instead.

pullIbdHaplo(pop = pop))
data$genoIBS <- rbind(data$genoIBS,
pullSegSiteGeno(pop = pop))
listLoci <- seq.int(from = 1, to = sum(pop@nLoci), by = 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RosCraddock do you need this listLoci bit of code? Also, isn't your Genotype the same as genoIBS? I got the later via pullSegSiteGeno() and there is also pullSegSiteHaplo() which will give you haploIBS. That's all you need I think,

listLoci <- seq.int(from = 1, to = sum(pop@nLoci), by = 1)
marker <- ""
i <- 1
for (i in 1:length(listLoci)){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is an example where doing just paste(“1_”, 1:pop@nLoci, sep=“0”) would be a simpler and faster vectorised way of doing this, but you don’t need it at all - just use pullSegSiteHaplo(‘)

# Get pedigree and assign metafounders
pedigree <- data.frame(data$pedigree$id, data$pedigree$mid, data$pedigree$fid, data$pedigree$population, data$pedigree$generation)
colnames(pedigree) <- c("id", "mid", "fid", "population", "generation")
pedigree <- pedigree[c(401:1400),]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have hardcoded numbers here but nGen comes into this script externally - will that clash with these hardcoded numbers here?

Copy link
Member

@gregorgorjanc gregorgorjanc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good work @RosCraddock. I left some minor comments.

- Added detail to usage.rst for metafounders.
- Change of warning to user where they have multiple metafounders in an alternative allele probability file and use -est_alt_allele_prob as well.
- Removal of redundant code and tidying in accuracy simulation for metafounders.
- Correct ordering of metafounders in alt_allele_prob output when there are 10 or more.
- Added a functional test to check this.
Copy link
Member

@gregorgorjanc gregorgorjanc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @RosCraddock

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants