Estimation of alternative allele frequencies with metafounders #182

RosCraddock · 2024-11-12T16:54:42Z

Quite a long list of updates here!

New command -update_alt_allele_prob for re-estimating the alternative allele frequencies after each peeling cycle.
New function updateMafAfterPeeling() to re-estimate the alternative allele frequencies based on the inferred genotype probabilities of the founders (triggered with the new command -update_alt_allele_prob).
Restricted all alternative allele frequencies to be between 0.01 and 0.99 (including user-inputs).
Updates to documentation for -update_alt_allele_prob and noted the alternative allele frequency range (0.01 to 0.99).
Updates to functional testing for alternative allele frequency estimation.
New R script file for metafounder simulation (new code replacing the previous approach).
New simulation for accuracy testing of metafounders: one pedigree, one true genotype (dosage, phased, unphased), and a set of observed genotypes with a ~50% missing rate.
Five additional accuracy tests for metafounders, all using only the single method. I can add more (e.g. multi, hybrid), but I am cautious since the summary table is already quite large.
Additions to the accuracy summary table printed in the terminal to include metafounders (as a separate entry, since it uses a different pedigree simulation).

@gregorgorjanc @XingerTang - when you have time, please let me know if you have any comments/questions/changes.

- New command -update_alt_allele_prob
- New function updateMafAfterPeeling() triggered with new command
- Set all alternative allele frequencies to be between 0.01 and 0.99 (including user-inputs).
- Updates to documentation
- Updates to functional testing.
- New file for metafounder simulation (new code replacing previous approach).
- New simulation for metafounder accuracy testing
- Updates to accuracy tests for metafounders
- Updates to terminal summary of accuracy tests to include metafounders

gregorgorjanc · 2024-11-13T06:26:43Z

docs/source/usage.rst

@@ -133,7 +136,7 @@ For hybrid peeling, where a large amount (millions of segregating sites) of sequ

 The ``-geno_error_prob``, ``-seq_error_prob`` and ``-rec_length`` arguments control some of the model parameters used in the model. ``-seq_error_prob`` must not be zero. |Software| is robust to deviations in genotyping error rate and sequencing error rate so it is not recommended to use these options unless large deviations from the default are known. Changing the ``-length`` argument to match the genetic map length can increase accuracy in some situations.

-The ``-est_geno_error_prob`` and ``-est_seq_error_prob`` options estimate the genotyping error rate and the sequencing error rate based on miss-match between observed and inferred states. This option is generally not necessary and can increase runtime. ``-est_alt_allele_prob`` estimates the alternative allele probabilities after each peeling cycle. This option can be useful if there are a large number of non-genotyped founders. If both ``-alt_allele_prob_file`` and ``-est_alt_allele_prob`` are used, the inputted alternative allele probabilities are used as a starting point for alternative allele probabilities estimation.
+The ``-est_geno_error_prob`` and ``-est_seq_error_prob`` options estimate the genotyping error rate and the sequencing error rate based on miss-match between observed and inferred states. This option is generally not necessary and can increase runtime. ``-est_alt_allele_prob`` estimates the alternative allele frequencies before peeling using all available observed genotypes. This option can be useful if there are a large number of non-genotyped founders. ``-update_alt_allele_prob`` re-estimates the alternative allele frequencies per metafounder after each peeling cycle using the inferred genotype probabilities of the founders. For implementation of metafounders (**without** ``-alt_allele_prob_file``), both ``-est_alt_allele_prob`` and ``-update_alt_allele_prob`` should be used.


The last sentence on providing alternative allele prob file - I don’t fully follow what you mean here. Do you need to say more?

I will rephrase this. Here, I wanted to highlight that -est_alt_allele_prob estimates one alternative allele frequency for the base population using all available genotype data (so it ignores any genetic grouping). Thus, to utilise the metafounders/genetic grouping in the peeling, -update_alt_allele_prob is required as well.
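The distinction drawn in this reply can be illustrated with a small sketch. This is not AlphaPeel code: the function names, the `groups` labels, the `MF_` label format, and the use of 9 as the missing-genotype code are assumptions made for illustration only. It contrasts a single pooled frequency per locus (ignoring genetic grouping) with a separate frequency per group.

```python
import numpy as np

MISSING = 9  # assumed missing-genotype code, for illustration only


def pooled_alt_allele_freq(genotypes):
    """One frequency per locus from all observed genotypes, ignoring groups."""
    geno = np.array(genotypes, dtype=float)  # copy so the input is not modified
    geno[geno == MISSING] = np.nan
    return np.nanmean(geno, axis=0) / 2.0


def grouped_alt_allele_freq(genotypes, groups):
    """One frequency per locus for each (hypothetical) group label."""
    geno = np.array(genotypes, dtype=float)
    geno[geno == MISSING] = np.nan
    groups = np.asarray(groups)
    return {
        str(g): np.nanmean(geno[groups == g], axis=0) / 2.0
        for g in np.unique(groups)
    }


# Four individuals, two loci; genotypes coded as alternative allele dosage.
genotypes = [[0, 0], [0, 1], [2, 2], [2, 9]]
groups = ["MF_1", "MF_1", "MF_2", "MF_2"]

pooled = pooled_alt_allele_freq(genotypes)
grouped = grouped_alt_allele_freq(genotypes, groups)
```

In this toy data the pooled estimate is 0.5 at both loci, while the two groups differ strongly, which is the information a single base-population frequency discards.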

gregorgorjanc · 2024-11-13T06:29:03Z

docs/source/usage.rst

@@ -342,7 +345,7 @@ Example:
 Model parameter files
 =====================

-|Software| outputs four model parameter files: *.alt_allele_prob.txt*, *.seq_error_prob.txt*, *.geno_error_prob.txt*, *.rec_prob.txt*. These give the alternative allele frequency, sequencing error rates, genotyping error rates and the recombination rates used. In the *.alt_allele_prob.txt*, there is a column per metafounder with an alternative allele frequency for each marker. The other three files contain a single column with an entry for each marker. By default, |Software| will output *.seq_error_prob.txt*, *.geno_error_prob.txt* and *.rec_prob.txt*. The *.alt_allele_prob.txt* will only be outputted with the argument ``-alt_allele_prob``.
+|Software| outputs four model parameter files: *.alt_allele_prob.txt*, *.seq_error_prob.txt*, *.geno_error_prob.txt*, *.rec_prob.txt*. These give the alternative allele frequency, sequencing error rates, genotyping error rates and the recombination rates used. In the *.alt_allele_prob.txt*, there is a column per metafounder with an alternative allele frequency for each marker. Here, all values will range from 0.01 to 0.99. The other three files contain a single column with an entry for each marker. By default, |Software| will output *.seq_error_prob.txt*, *.geno_error_prob.txt* and *.rec_prob.txt*. The *.alt_allele_prob.txt* will only be outputted with the argument ``-alt_allele_prob``.


@XingerTang we limit ourselves to 1%-99% to avoid getting “sucked” into 0, where everything then becomes “fixed”. Should we allow a slightly laxer range, say 0.001-0.999 or similar? How can we do this well from a numerical perspective?

I think we could enforce a laxer range. I did a quick test with inputted alternative allele frequencies and the estimation via the Newton method, and it seems to be fine. But it would be good to know what @XingerTang thinks!

@RosCraddock I was asking this because some disease allele frequencies in a base population might be very low, much lower than 0.01, but we do want to avoid the "0 trap", which can occur during iteration/peeling cycles. The question now is what limits we should set. @XingerTang I would appreciate your mathematical input here ;)

@RosCraddock one way to check this is to look at the real datasets you work with: if you get an allele frequency of 0.01 for any of the metafounders, then we know that we have likely hit the imposed range limit. That will give us a good test for how to manage these range limits.
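The check suggested here could be sketched roughly as follows. This is a hypothetical helper, not part of the PR: it assumes the frequencies are held as an array with one row per marker and one column per metafounder (the layout described for *.alt_allele_prob.txt*), clamps them to the range, and counts how many markers sit on each bound per metafounder.

```python
import numpy as np


def clamp_alt_allele_freq(freq, lower=0.01, upper=0.99):
    """Restrict frequencies to [lower, upper]; bounds are parameters so a laxer range can be tried."""
    return np.clip(np.asarray(freq, dtype=float), lower, upper)


def count_bound_hits(freq, lower=0.01, upper=0.99, tol=1e-9):
    """Count, per metafounder (column), the markers sitting on either bound."""
    freq = np.asarray(freq, dtype=float)
    at_lower = np.isclose(freq, lower, rtol=0.0, atol=tol)
    at_upper = np.isclose(freq, upper, rtol=0.0, atol=tol)
    return at_lower.sum(axis=0), at_upper.sum(axis=0)


# Three markers (rows) by two metafounders (columns); values are made up.
estimates = np.array([[0.0, 0.30], [0.004, 0.995], [0.5, 1.0]])
clamped = clamp_alt_allele_freq(estimates)
lower_hits, upper_hits = count_bound_hits(clamped)
```

A large count on either bound in real data would suggest the range limit, rather than the data, is determining those estimates.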

gregorgorjanc · 2024-11-13T06:36:48Z

src/tinypeel/Peeling/PeelingUpdates.py

@@ -111,6 +111,34 @@ def addIndividualToUpdate(d, p, LLp, LLpp):
    return LLp, LLpp


+def updateMafAfterPeeling(pedigree, peelingInfo):


@RosCraddock @XingerTang the code loops over metafounders and, for each metafounder, loops over all pedigree members twice: once to calculate and once to assign anterior probabilities. I am thinking out loud about whether we could speed this up by flipping the iteration around (over individuals and then over metafounders), but we would be doing the same amount of work, so it seems we cannot.

Yes, I wondered the same when writing this!
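For reference, the "flipped" iteration can be sketched as a single pass over founders that accumulates into per-metafounder totals. This is not the PR's updateMafAfterPeeling(): the array layout (founders by four genotype states aa, aA, Aa, AA by loci), the integer metafounder index, and the function name are all assumptions made for illustration, and the sketch makes no claim about speed.

```python
import numpy as np


def maf_per_metafounder(founder_probs, founder_mf, n_mf):
    """Expected alternative allele frequency per metafounder and locus.

    founder_probs: (nFounders, 4, nLoci) genotype probabilities (assumed layout).
    founder_mf:    (nFounders,) integer metafounder index of each founder.
    Every metafounder is assumed to have at least one founder.
    """
    probs = np.asarray(founder_probs, dtype=float)
    mf = np.asarray(founder_mf)
    # Expected alternative allele dosage: P(aA) + P(Aa) + 2 * P(AA).
    dosage = probs[:, 1, :] + probs[:, 2, :] + 2.0 * probs[:, 3, :]
    sums = np.zeros((n_mf, probs.shape[2]))
    np.add.at(sums, mf, dosage)  # accumulate each founder into its metafounder row
    counts = np.bincount(mf, minlength=n_mf)
    return sums / (2.0 * counts[:, None])


# Three founders, two loci; founders 0 and 1 belong to metafounder 0.
founder_probs = [
    [[0, 1], [0, 0], [0, 0], [1, 0]],
    [[1, 0], [0, 0.5], [0, 0.5], [0, 0]],
    [[0.25, 0], [0.25, 0], [0.25, 0], [0.25, 1]],
]
freq = maf_per_metafounder(founder_probs, [0, 0, 1], n_mf=2)
```

The sketch does not apply the 0.01 to 0.99 restriction introduced in this PR; a value such as 1.0 would still need to be clamped afterwards.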

gregorgorjanc · 2024-11-13T06:40:38Z

src/tinypeel/tinypeel.py

@@ -48,6 +56,10 @@ def runPeelingCycles(pedigree, peelingInfo, args, singleLocusMode=False):
                        peelingInfo.nLoci, 0.5, dtype=np.float32
                    )
    if args.est_alt_allele_prob:
+        if args.alt_allele_prob_file is not None:
+            warnings.warn(
+                "-est_alt_allele_prob will update the inputted alternative allele frequencies using all available observed genotypes. Therefore, will overwrite any differences between metafounders."


@RosCraddock say that the current implementation overwrites differences. We might improve upon this later …

This is only true if a user uses both -alt_allele_prob_file and -est_alt_allele_prob. For now, I think I will rewrite this warning to recommend using -update_alt_allele_prob instead.

gregorgorjanc · 2024-11-13T06:48:38Z

tests/accuracy_tests/sim_for_alphapeel_accu_test/sim_for_metafounder_accu.R

+                         pullIbdHaplo(pop = pop))
+  data$genoIBS <- rbind(data$genoIBS,
+                        pullSegSiteGeno(pop = pop))
+  listLoci <- seq.int(from = 1, to = sum(pop@nLoci), by = 1)


@RosCraddock do you need this listLoci bit of code? Also, isn't your Genotype the same as genoIBS? I got the latter via pullSegSiteGeno(), and there is also pullSegSiteHaplo(), which will give you haploIBS. That's all you need, I think.

gregorgorjanc · 2024-11-13T06:49:57Z

tests/accuracy_tests/sim_for_alphapeel_accu_test/sim_for_metafounder_accu.R

+  listLoci <- seq.int(from = 1, to = sum(pop@nLoci), by = 1)
+  marker <- ""
+  i <- 1
+  for (i in 1:length(listLoci)){


Here is an example where doing just paste("1_", 1:pop@nLoci, sep = "0") would be a simpler and faster vectorised way of doing this, but you don't need it at all - just use pullSegSiteHaplo().

gregorgorjanc · 2024-11-13T06:52:24Z

tests/accuracy_tests/sim_for_alphapeel_accu_test/sim_for_metafounder_accu.R

+# Get pedigree and assign metafounders
+pedigree <- data.frame(data$pedigree$id, data$pedigree$mid, data$pedigree$fid, data$pedigree$population, data$pedigree$generation)
+colnames(pedigree) <- c("id", "mid", "fid", "population", "generation")
+pedigree <- pedigree[c(401:1400),]


We have hardcoded numbers here but nGen comes into this script externally - will that clash with these hardcoded numbers here?

gregorgorjanc

Very good work @RosCraddock. I left some minor comments.

- Added detail to usage.rst for metafounders.
- Change of warning to user where they have multiple metafounders in an alternative allele probability file and use -est_alt_allele_prob as well.
- Removal of redundant code and tidying in accuracy simulation for metafounders.

- Correct ordering of metafounders in alt_allele_prob output when there are 10 or more.
- Added a functional test to check this.
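The thread does not say what caused the ordering problem. One common cause of mis-ordering once there are 10 or more labelled items is plain lexicographic sorting of the labels, so the sketch below assumes that cause and assumes labels of the form `MF_<number>`; both are assumptions, and `natural_key` is a hypothetical helper rather than code from this PR.

```python
import re


def natural_key(label):
    """Sort key that compares embedded digit runs as integers, not as text."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", label)]


# Assumed label format, for illustration only.
labels = [f"MF_{i}" for i in range(1, 12)]

lexicographic = sorted(labels)               # "MF_10" sorts before "MF_2"
natural = sorted(labels, key=natural_key)    # MF_1, MF_2, ..., MF_11
```

A functional test along these lines would need at least 10 metafounders, since the two orderings only diverge from the tenth label onwards.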

gregorgorjanc

Looks good @RosCraddock

RosCraddock added 5 commits November 12, 2024 16:25

Merge branch 'feat_metafounders' of https://github.com/RosCraddock/Al…

2f4f4d3

…phaPeel into feat_metafounders

Merge pull request #5 from AlphaGenes/feat_metafounders

689fd29

Merge pull request AlphaGenes#175 from RosCraddock/feat_metafounders

Pre-commit Corrections

b347466

Merge branch 'feat_metafounders' of https://github.com/RosCraddock/Al…

f778cc2

…phaPeel into feat_metafounders

RosCraddock added 3 commits November 14, 2024 18:37

Corrections

ac219eb

Minor correction

b1160db

Small correction to alt_allele_prob output

6ddead0

gregorgorjanc approved these changes Nov 27, 2024
