From 10eabc7129f3f092eb84ac1b3b54a79a14cbfd0d Mon Sep 17 00:00:00 2001 From: Will Dumm Date: Thu, 8 Aug 2024 13:09:26 -0400 Subject: [PATCH] Add filtering by reversions to naive sequence, generalize ranking strategy (#130) * add reversion func and rewrite ranking again * update tests * working ranking expression parsing * cleanup, format, and lint * fix forest summary * more fixes and cleanup * finish fixing output for zero coefficients * WIP * remove reversion counting from default ordering * fix tree counting with pickled forest input * add nucleotide annotations * change to default minimize all criteria * simplify ranking assuming all criteria are minimized * WIP add linear combination to tree_stats pairplot * format and lint * fix quickstart and docs * restore backward compatibility and tests --- docs/index.rst | 2 - docs/quickstart.rst | 78 +++-- gctree/branching_processes.py | 593 +++++++++++++++++++++++++--------- gctree/cli.py | 86 +++-- gctree/isotyping.py | 2 +- gctree/mutation_model.py | 8 +- tests/smalltest.sh | 36 ++- 7 files changed, 582 insertions(+), 223 deletions(-) diff --git a/docs/index.rst b/docs/index.rst index b1e8fd56..ea31d15c 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -34,8 +34,6 @@ gctree documentation :maxdepth: 1 :caption: Notes - CHANGELOG - faq Indices and tables diff --git a/docs/quickstart.rst b/docs/quickstart.rst index fdf1fcdc..735f4fb5 100644 --- a/docs/quickstart.rst +++ b/docs/quickstart.rst @@ -67,8 +67,8 @@ You now have two new files, ``outfile`` and ``outtree``. Note: if you want to rerun the above ``dnapars`` command, you must delete these two files first! -gctree -====== +Gctree Ranking +============== We're now ready to run ``gctree infer`` to use abundance data (in ``abundances.csv``) to rank the equally parsimonious trees (in ``outfile``). We can use the optional argument ``--frame`` to indicate the coding frame of the sequence start, so that amino acid substitutions can be annotated on our trees. @@ -106,48 +106,82 @@ This file may be manipulated using ``gctree infer``, instead of providing a dnapars ``outfile``. .. note:: - Although described below, using context likelihood, mutability parsimony, or isotype parsimony - as ranking criteria is experimental, and has not yet been shown in a careful - validation to improve tree inference. Only the default branching process - likelihood is recommended for tree ranking! + Although described below, using context likelihood log loss, mutability parsimony, or isotype parsimony + as ranking criteria is experimental, and has not yet been shown in a careful + validation to improve tree inference. Only the default branching process + likelihood is recommended for tree ranking! Criteria other than branching process likelihoods can be used to break ties between trees. Providing arguments ``--isotype_mapfile`` and ``--idmapfile`` will allow trees to be ranked by isotype parsimony. Providing arguments ``--mutability`` and ``--substitution`` allows trees to be ranked -according to a context-sensitive mutation model. By default, trees are ranked -lexicographically, first maximizing likelihood, then minimizing isotype -parsimony, and finally maximizing a context-based poisson likelihood, if such information is provided. -Ranking priorities can be adjusted using the argument ``--ranking_coeffs``. - -For example, to find the optimal tree -according to a linear combination of likelihood, isotype parsimony, -mutabilities, and alleles: - -.. command-output:: gctree infer gctree.out.inference.parsimony_forest.p --frame 1 --idmap idmap.txt --isotype_mapfile ../example/isotypemap.txt --mutability ../HS5F_Mutability.csv --substitution ../HS5F_Substitution.csv --ranking_coeffs 1 0.1 0 --outbase newranking --summarize_forest --tree_stats --verbose +according to a context-sensitive Poisson model. By default, trees are ranked +lexicographically, first minimizing branching process log loss (negative +log-likelihood), then minimizing isotype +parsimony, minimizing context-based poisson log loss, and finally minimizing number of alleles, +if required information for all criteria is provided. Criteria for which +required information is not provided will be skipped. + +Ranking priorities can be adjusted using the argument ``--ranking_strategy``. +This argument accepts a string describing either an alternative lexicographic +ordering, or a linear combination of criteria to be minimized. The following +identifiers are used to specify available ranking criteria: + +* ``B`` - branching process log loss +* ``I`` - isotype parsimony +* ``C`` - context-based Poisson log loss +* ``M`` - old mutability parsimony +* ``A`` - number of alleles +* ``R`` - sitewise reversions to naive sequence + +An alternative lexicographic ordering can be specified using a comma-separated +list of identifiers. For example by passing ``--ranking_strategy "C, B, R"`` to +``gctree infer``, we will minimize context-based Poisson log loss, then +branching process log-loss, and finally naive reversions. If for some reason +the user wishes to maximize instead of minimizing a criterion, a negative +coefficient can be provided, as in ``"C, B, -R"``. To compute the value of +a criterion without using it for ranking, a coefficient of zero can be +prepended to its identifier. + +Alternatively, the ``--ranking_strategy`` option can be used to rank trees to +minimize a linear combination of criteria. +For example, to find the optimal tree according to a linear combination of +branching process log loss, isotype parsimony, context-based Poisson log loss, +and alleles: + +.. command-output:: gctree infer gctree.out.inference.parsimony_forest.p \ + --frame 1 \ + --idmap idmap.txt \ + --isotype_mapfile ../example/isotypemap.txt \ + --mutability ../HS5F_Mutability.csv \ + --substitution ../HS5F_Substitution.csv \ + --ranking_strategy "B + I + 0.1C + 0.01A" \ + --outbase newranking --summarize_forest \ + --tree_stats \ + --verbose :shell: The files ``HS5F_Mutability.csv`` and ``HS5F_Substitution.csv`` are a context -sensitive mutation model which can be downloaded from the `Shazam Project _. +sensitive mutation model which can be downloaded from the `Shazam Project `_, and are required here to compute context-based Poisson log loss. By default, only the files listed above will be generated, with the optional argument ``--outbase`` specifying how the output files should be named. .. image:: newranking.inference.1.svg :width: 1000 -For detailed information about each tree used for ranking, as well as a pairplot like the one below comparing the highest ranked tree to all other ranked trees,use the argument ``--tree_stats``. +For detailed information about each tree used for ranking, as well as a pairplot like the one below comparing the highest ranked tree to all other ranked trees, use the argument ``--tree_stats``. -.. image:: newranking.tree_stats.pairplot.png +.. image:: newranking.tree_stats.pairplot.svg :width: 1000 -Sometimes ranked trees are too numerous, and generating the output of ``--tree_stats`` would require too many resources. For a summary of the collection of trees used for ranking, the argument ``--summarize_forest`` is provided. Most importantly, this option summarizes how much less likely the top ranked tree is, compared to the most likely tree being ranked, for example to validate coefficients passed to ``--ranking_coeffs``. +Sometimes ranked trees are too numerous to generate the output of ``--tree_stats`` in a reasonable amount of time. For a summary of the collection of trees used for ranking, the option ``--summarize_forest`` is provided. Most importantly, this option summarizes how much less likely the top ranked tree is, compared to the most likely tree being ranked, for example to validate coefficients passed to ``--ranking_strategy``. .. command-output:: cat newranking.forest_summary.log :shell: -isotype -======= +Isotypes +======== If we would like to add observed isotype data to trees output by gctree inference, we can now do so. diff --git a/gctree/branching_processes.py b/gctree/branching_processes.py index 08cd2b6d..b382b90e 100755 --- a/gctree/branching_processes.py +++ b/gctree/branching_processes.py @@ -45,6 +45,10 @@ ) +class RankingCriterionError(Exception): + pass + + class CollapsedTree: r"""A collapsed tree, modeled as an infinite type Galton-Watson process run to extinction. @@ -461,6 +465,7 @@ def render( frame2: Optional[int] = None, position_map2: Optional[List] = None, show_support: bool = False, + show_nuc_muts: bool = False, ): r"""Render to tree image file. @@ -477,6 +482,8 @@ def render( frame2: coding frame for 2nd sequence when using ``chain_split`` position_map2: like ``position_map``, but for 2nd sequence when using ``chain_split`` show_support: annotate bootstrap support if available + show_nuc_muts: If True, annotate branches with nucleotide mutations. If False, and frame is provided, then branches + will be annotated with amino acid mutations. """ if frame is not None and frame not in (1, 2, 3): raise RuntimeError("frame must be 1, 2, or 3") @@ -560,7 +567,28 @@ def my_layout(node): if "sequence" in tree_copy.features and set( node.sequence.upper() ) == set("ACGT"): - if frame is not None: + if show_nuc_muts: + mutations = [ + f"{pn}{idx + 1}{cn}" + for idx, (pn, cn) in enumerate( + zip(node.up.sequence, node.sequence) + ) + if pn != cn + ] + if mutations: + T = ete3.TextFace( + "\n".join(mutations), + fsize=6, + tight_text=False, + ) + T.margin_top = 6 + T.rotation = -90 + node.add_face( + T, + 0, + position=("branch-bottom"), + ) + elif frame is not None: if chain_split is not None and frame2 is None: raise ValueError( "must define frame2 when using chain_split" @@ -622,6 +650,7 @@ def my_layout(node): ) if "*" in aa: nstyle["hz_line_color"] = "red" + node.set_style(nstyle) ts = ete3.TreeStyle() @@ -1135,7 +1164,7 @@ def mle(self, **kwargs) -> Tuple[np.float64, np.float64]: @_requires_dag def filter_trees( # noqa: C901 self, - ranking_coeffs: Optional[Sequence[float]] = None, + ranking_strategy: Optional[str] = None, mutability_file: Optional[str] = None, substitution_file: Optional[str] = None, ignore_isotype: bool = False, @@ -1144,26 +1173,23 @@ def filter_trees( # noqa: C901 outbase: str = "gctree.out", summarize_forest: bool = False, tree_stats: bool = False, + img_type: str = "svg", + ranking_coeffs: Optional[Sequence[float]] = None, branching_process_ranking_coeff: float = -1, use_old_mut_parsimony: bool = False, ) -> CollapsedForest: """Filter trees according to specified criteria. - Trim the forest to minimize a linear - combination of branching process likelihood, isotype parsimony score, - context/mutability-based Poisson likelihood, and number of alleles, with coefficients - provided in the argument ``ranking_coeffs`, in that order. + By default, the forest will be trimmed to maximize branching process likelihood, + then minimize isotype parsimony, then maximize context-based Poisson likelihood, + and finally minimize number of alleles. Any criteria for which the necessary arguments + are not provided will be automatically ignored. + + For other ranking strategies, see the `ranking_strategy` argument. Args: - ranking_coeffs: A list or tuple of coefficients for prioritizing tree weights. - The order of coefficients is: isotype parsimony score, context poisson likelihood, - and number of alleles. A coefficient of ``-1`` will be applied to branching process - likelihood by default, unless a different value is provided to the keyword argument - `branching_process_ranking_coeff`. Trees are chosen to minimize this linear combination - of tree weights, so weights for which larger values are more optimal (such as - likelihoods) should have negative coefficients. - If ranking_coeffs is not provided, trees will be ranked lexicographically - by likelihood, then by other traits, in the same order. + ranking_strategy: A string expression describing how to rank trees. See docs for command line + argument `--ranking_strategy` for description. mutability_file: A mutability model substitution_file: A substitution model ignore_isotype: Ignore isotype parsimony when ranking. By default, isotype information added with @@ -1174,9 +1200,20 @@ def filter_trees( # noqa: C901 outbase: file name stem for a file with information for each tree in the DAG. summarize_forest: whether to write a summary of the forest to file `[outbase].forest_summary.log` tree_stats: whether to write stats for each tree in the forest to file `[outbase].tree_stats.log` - branching_process_ranking_coeff: Ranking coefficient to use for branching process likelihood. Value + img_type: format for output plots. + ranking_coeffs: (Deprecated. Use ``ranking_strategy`` instead) A list or tuple of coefficients + for prioritizing tree weights. + The order of coefficients is: isotype parsimony score, context poisson likelihood, + and number of alleles. A coefficient of ``-1`` will be applied to branching process + likelihood by default, unless a different value is provided to the keyword argument + `branching_process_ranking_coeff`. Trees are chosen to minimize this linear combination + of tree weights, so weights for which larger values are more optimal (such as + likelihoods) should have negative coefficients. + branching_process_ranking_coeff: (Deprecated. Use ``ranking_strategy`` instead) + Ranking coefficient to use for branching process likelihood. Value is ignored unless `ranking_coeffs` argument is provided. - use_old_mut_parsimony: Whether to use the deprecated 'mutability parsimony' instead of + use_old_mut_parsimony: (Deprecated. Use ``ranking_strategy`` instead) + Whether to use the deprecated 'mutability parsimony' instead of context-based poisson likelihood (only applicable if mutability and substitution files are provided. @@ -1185,40 +1222,35 @@ def filter_trees( # noqa: C901 of data about the trees in that forest, with format (branching process likelihood, isotype parsimony, context-based Poisson likelihood, alleles). """ - dag = self._forest - if ranking_coeffs: - if len(ranking_coeffs) != 3: - raise ValueError( - "If ranking_coeffs are provided to `filter_trees` method, a list of three values is expected." - ) - coeffs = [branching_process_ranking_coeff] + list(ranking_coeffs) - if sum(abs(c) for c in coeffs) == 0: - raise ValueError( - "At least one value provided to ranking_coeffs or the value of branching_process_ranking_coeff must be nonzero." - ) - else: - coeffs = [1] * 4 - - ( - nz_coeff_bplikelihood, - nz_coeff_isotype_pars, - nz_coeff_context, - _, - ) = [val != 0 for val in coeffs] - coeff_bplikelihood, coeff_isotype_pars, coeff_context, coeff_alleles = coeffs - - dag_filters = [] - if nz_coeff_bplikelihood: + # To add a new ranking criterion: + # * Add a prepare_func below which checks for prerequisites + # and throws a RankingCriterionError if not met + # * Add a key, value pair to ranking_function_keys so that a user + # can use the new criterion + # * Add the prepare_func to the list of default ranking + # constructors, if desired. + # * Modify docs in cli.py to describe new criterion + # * All ranking criteria should be minimized by default. + + def prepare_bp_likelihood_funcs(): if self.parameters is None: self.mle(marginal=True) p, q = self.parameters ll_dagfuncs = _ll_genotype_dagfuncs(p, q) - dag_filters.append((ll_dagfuncs, coeff_bplikelihood)) if verbose: print(f"Branching process parameters to be used for ranking: {(p, q)}") - if nz_coeff_isotype_pars and self.is_isotyped and (not ignore_isotype): - # Check for missing isotype data in all but root node, and fake root-adjacent leaf node + return ll_dagfuncs + + def prepare_isotype_parsimony_funcs(): + if not self.is_isotyped: + raise RankingCriterionError( + "Isotypes have not been added to this CollapsedForest." + ) + if ignore_isotype: + raise RankingCriterionError( + "Isotype ranking is disabled by passed parameters." + ) rootname = list(self._forest.dagroot.children())[0].attr["name"] if any( not node.attr["isotype"] @@ -1229,66 +1261,227 @@ def filter_trees( # noqa: C901 "Some isotype data seems to be missing. Isotype parsimony scores may be incorrect." ) - iso_funcs = _isotype_dagfuncs() - dag_filters.append((iso_funcs, coeff_isotype_pars)) - if nz_coeff_context and mutability_file and substitution_file: - if use_old_mut_parsimony: - mut_funcs = _mutability_dagfuncs( - mutability_file=mutability_file, - substitution_file=substitution_file, - splits=[] if chain_split is None else [chain_split], + return _isotype_dagfuncs() + + def prepare_context_poisson_funcs(): + if not (mutability_file and substitution_file): + raise RankingCriterionError( + "Poisson context likelihood loss requires mutability and substitution files." + ) + return _context_poisson_likelihood_dagfuncs( + mutability_file=mutability_file, + substitution_file=substitution_file, + splits=[] if chain_split is None else [chain_split], + ) + + def prepare_mutability_parsimony_funcs(): + if not (mutability_file and substitution_file): + raise RankingCriterionError( + "Mutability parsimony requires mutability and substitution files." + ) + return _mutability_dagfuncs( + mutability_file=mutability_file, + substitution_file=substitution_file, + splits=[] if chain_split is None else [chain_split], + ) + + def prepare_reversions_funcs(): + return _naive_reversion_dagfuncs(self._validation_stats["root_seq"]) + + ranking_function_keys = { + "B": prepare_bp_likelihood_funcs, + "I": prepare_isotype_parsimony_funcs, + "C": prepare_context_poisson_funcs, + "M": prepare_mutability_parsimony_funcs, + "A": _allele_dagfuncs, + "R": prepare_reversions_funcs, + } + + default_ranking_constructors = [ + prepare_bp_likelihood_funcs, + prepare_isotype_parsimony_funcs, + prepare_context_poisson_funcs, + _allele_dagfuncs, + ] + + # ####### BEGIN backward compatibility adjustments with old ways of specifying ranking + # ####### strategy: + if ( + (ranking_coeffs is not None) + or (branching_process_ranking_coeff != -1) + or use_old_mut_parsimony + ): + if ranking_strategy is None: + # Then we'll try to recover old behavior. + + warnings.warn( + "Arguments `branching_process_ranking_coeff`, `use_old_mut_parsimony`, and `ranking_coeffs` " + "are deprecated! Use `ranking_strategy` argument instead." + ) + if ranking_coeffs is not None: + if use_old_mut_parsimony: + criteria = ("B", "I", "M", "A") + max_to_min = (-1, 1, 1, 1) + else: + criteria = ("B", "I", "C", "A") + max_to_min = (-1, 1, -1, 1) + + coefficients = (branching_process_ranking_coeff,) + tuple( + ranking_coeffs + ) + ranking_strategy = "+".join( + [ + f"{coeff * mtm}{crit}" + for coeff, mtm, crit in zip( + coefficients, max_to_min, criteria + ) + if coeff != 0 + ] + ) + elif use_old_mut_parsimony: + default_ranking_constructors = [ + prepare_bp_likelihood_funcs, + prepare_isotype_parsimony_funcs, + prepare_mutability_parsimony_funcs, + _allele_dagfuncs, + ] + # in absence of ranking_coeffs, branching_process_ranking_coeff + # was ignored. + else: + warnings.warn( + "Arguments `branching_process_ranking_coeff`, `use_old_mut_parsimony`, and `ranking_coeffs` " + "are deprecated! They will be ignored, since a `ranking_strategy` was provided." ) + # ####### END backward compatibility for old ways of specifying ranking strategy + + lexicographic = True + # Parsing ranking_strategy, if provided: + if ranking_strategy: + ranking_strategy = ranking_strategy.replace(" ", "") + if ( + "," in ranking_strategy + or sum(key in ranking_strategy for key in ranking_function_keys) == 1 + ): + # Then we're doing lexicographic ranking + criteria = ranking_strategy.split(",") + + else: + lexicographic = False + criteria = ranking_strategy.replace("-", "+-").split("+") + # If there's a leading '-', or a '+-' already, then we'll get an empty + # string somewhere + criteria = [cr for cr in criteria if cr != ""] + + def expand_criterion(c): + if c[0] == "-": + fac = -1 + c = c[1:] + else: + fac = 1 + if len(c) == 1: + return c, fac + else: + return c[-1], fac * float(c[:-1]) + + try: + expanded_criteria = [ + expand_criterion(criterion) for criterion in criteria + ] + except Exception as e: + raise RankingCriterionError( + f"Error parsing ranking strategy string: '{ranking_strategy}'" + ) from e + + ranking_funcs_needed = set(name for name, _ in expanded_criteria) + initialized_ranking_funcs = { + name: ranking_function_keys[name]() for name in ranking_funcs_needed + } + all_filters = [ + (initialized_ranking_funcs[name], coeff) + for name, coeff in expanded_criteria + ] + # Account for lexicographic filters with coefficient of 0, which + # will be computed on trees but not used for ranking. + if lexicographic: + # Want all the observer filters to be at the end, so it's easy + # to deal with them in the tree_stats section. + dag_filters = [ + (filt, coeff) for filt, coeff in all_filters if coeff != 0 + ] + zero_filters = [ + (filt, coeff) for filt, coeff in all_filters if coeff == 0 + ] + observer_filters = dag_filters + zero_filters else: - mut_funcs = _context_poisson_likelihood_dagfuncs( - mutability_file=mutability_file, - substitution_file=substitution_file, - splits=[] if chain_split is None else [chain_split], + dag_filters = all_filters.copy() + observer_filters = all_filters.copy() + + else: + # Then we need to do defaults + dag_filters = [] + for cons in default_ranking_constructors: + try: + dagfunc = cons() + dag_filters.append((dagfunc, 1)) + except RankingCriterionError: + # We're just seeing what we have info to handle, so failing + # silently is fine + pass + observer_filters = dag_filters.copy() + + # Switch optimal functions based on signs of coefficients + adjusted_filters = [] + for dag_filter, coeff in dag_filters: + if coeff < 0: + warnings.warn( + f"Lower values for {dag_filter.weight_funcs.name} are generally better, but the " + "provided ranking coefficient is negative, so trees with higher values will be preferred." ) - dag_filters.append((mut_funcs, coeff_context)) + if lexicographic and dag_filter.optimal_func != min: + warnings.warn( + f"Unrecognized optimality function for {dag_filter.weight_funcs.name} may not support negation." + ) + dag_filter = hdag.utils.HistoryDagFilter(dag_filter.weight_funcs, max) + adjusted_filters.append((dag_filter, coeff)) + dag_filters = adjusted_filters + + dag = self._forest - # add allele funcs no matter what, for logging - allele_funcs = _allele_dagfuncs() - dag_filters.append((allele_funcs, coeff_alleles)) # add 0-returning functions so dagfuncs return tuples, even if allele funcs are the only ones used for filtering - dag_filters.append( - ( - hdag.utils.HistoryDagFilter( - hdag.utils.AddFuncDict( - { - "start_func": lambda n: 0, - "edge_weight_func": lambda n1, n2: 0, - "accum_func": lambda ls: 0, - }, - name="", - ), - min, - ordering_name="", - ), - 0, - ) + null_filter = hdag.utils.HistoryDagFilter( + hdag.utils.AddFuncDict( + { + "start_func": lambda n: 0, + "edge_weight_func": lambda n1, n2: 0, + "accum_func": lambda ls: 0, + }, + name="", + ), + min, + ordering_name="", ) + dag_filters.append((null_filter, 0)) + observer_filters.append((null_filter, 0)) combined_dag_filter = functools.reduce( lambda x, y: x + y, (dag_filter for dag_filter, _ in dag_filters) ) - if ranking_coeffs: - if len(ranking_coeffs) != 3: - raise ValueError( - "If ranking_coeffs are provided to `filter_trees` method, a list of three values is expected." - ) - for dag_filter, coeff in dag_filters: - if dag_filter.optimal_func == max and coeff > 0: - warnings.warn( - f"Higher values for {dag_filter.weight_funcs.name} are generally better, but the " - "provided ranking coefficient is positive, so trees with lower values will be preferred." - ) - if dag_filter.optimal_func == min and coeff < 0: - warnings.warn( - f"Lower values for {dag_filter.weight_funcs.name} are generally better, but the " - "provided ranking coefficient is negative, so trees with higher values will be preferred." - ) + combined_observer_filter = functools.reduce( + lambda x, y: x + y, (dag_filter for dag_filter, _ in observer_filters) + ) + if lexicographic: + ranking_dag_filter = combined_dag_filter + ranking_description = "Ranking trees to " + " then ".join( + opt_name[:3] + "imize " + ord_name + for (opt_name, _), ord_name in zip( + ranking_dag_filter.ordering_names, + ranking_dag_filter.weight_funcs.names, + ) + if ord_name != "" + ) + else: filtered_coefficients = [coeff for _, coeff in dag_filters] def linear_combinator(weighttuple): @@ -1321,65 +1514,101 @@ def linear_combinator(weighttuple): if coeff != 0 ) ) - else: - ranking_dag_filter = combined_dag_filter - ranking_description = "Ranking trees to " + " then ".join( - opt_name[:3] + "imize " + ord_name - for (opt_name, _), ord_name in zip( - ranking_dag_filter.ordering_names, - ranking_dag_filter.weight_funcs.names, - ) - if ord_name != "" - ) + if verbose: print(ranking_description) - def print_stats(statlist, title, file=None, suppress_score=False): - show_score = ranking_coeffs and not suppress_score + def print_stats( + statlist, + title, + file=None, + suppress_score=False, + filter_list=observer_filters, + ): + df_dict = {"Tree": list(range(1, len(statlist) + 1))} + df_dict.update( + { + dfilter.weight_funcs.name: vals + for (dfilter, _), vals in zip(filter_list, zip(*statlist)) + } + ) - def reformat(field, n=10): - if isinstance(field, int): - return format(field, "<" + str(n)) - else: - return f"{field:{n}.{n}}" + if (not lexicographic) and (not suppress_score): + df_dict["TreeScore"] = [ + linear_combinator(stattup) for stattup in statlist + ] - print("\n" + title + ":", file=file) - statstring = "\t".join( - tuple( - reformat(dfilter.weight_funcs.name, n=15) - for dfilter, _ in dag_filters[:-1] - ) - ) - print( - f"tree \t{statstring}" + ("\ttreescore" if show_score else ""), - file=file, - ) - for j, best_weighttuple in enumerate(statlist, 1): - # ignore always-0 entry at end: - statstring = "\t".join(reformat(it) for it in best_weighttuple[:-1]) + df_dict.pop("") + printing_df = pd.DataFrame(df_dict) + # Hide index + printing_df.index = [""] * len(statlist) + # Set display options without side-effects + with pd.option_context( + "display.max_rows", + None, + "display.max_columns", + 100, + "display.width", + 1000, + "display.max_colwidth", + None, + "display.show_dimensions", + False, + "display.colheader_justify", + "right", + ): + print(printing_df, file=file) + + trimdag = dag[ranking_dag_filter] + + trimmed_forest = self._trimmed_self(trimdag) + + if verbose: + if trimmed_forest.n_trees > 1: + n_topologies = trimmed_forest.n_topologies() print( - f"{j:<10}\t{statstring}" - + ( - f"\t{reformat(linear_combinator(best_weighttuple))}" - if show_score - else "" - ), - file=file, + "Degenerate ranking criteria: filtered forest contains " + f"{trimmed_forest.n_trees} unique trees, with {n_topologies} unique collapsed topologies." ) + if n_topologies > 10: + print( + "A representative from each of the ten most abundant topologies will be sampled randomly for rendering." + ) + else: + print( + "A representative of each topology will be sampled randomly for rendering." + ) - trimdag = dag[ranking_dag_filter] + random.seed(trimmed_forest.n_trees) + topoclasses = list(trimmed_forest.iter_topology_classes()) - best_weighttuple = trimdag.optimal_weight_annotate( - **combined_dag_filter, - ) + sampled_histories = [topoclass._forest.sample() for topoclass in topoclasses] + + weighttuples = [ + history.optimal_weight_annotate(**combined_observer_filter) + for history in sampled_histories + ] + + first_tree_weighttuple = weighttuples[0] + + ctrees = [ + topoclass._clade_tree_to_ctree(history) + for topoclass, history in zip(topoclasses, sampled_histories) + ] if verbose: - print_stats([best_weighttuple], "Stats for optimal trees") + if len(observer_filters) == len(dag_filters): + # Then all selected tree stats are the same, no need to print + # them multiple times: + print_stats(weighttuples[:1], "Stats for optimal trees") + else: + print_stats(weighttuples, "Stats for optimal trees") if summarize_forest: with open(outbase + ".forest_summary.log", "w") as fh: independent_best = [] - for dfilter, _ in dag_filters: + + for dfilter, _ in observer_filters[:-1]: tempdag = dag.copy() min_val, max_val = tempdag.weight_range_annotate(**dfilter) opt_weight = tempdag.trim_optimal_weight(**dfilter) @@ -1388,7 +1617,7 @@ def reformat(field, n=10): f"\nOverall {dfilter.weight_funcs.name} range {min_val} to {max_val}." f"\nAmong trees with {dfilter.optimal_func.__name__} {dfilter.weight_funcs.name} of: {opt_weight}\n" ) - for indfilter, _ in dag_filters: + for indfilter, _ in observer_filters[:-1]: if indfilter.weight_funcs.name != dfilter.weight_funcs.name: minval, maxval = tempdag.weight_range_annotate( **indfilter.weight_funcs @@ -1401,8 +1630,11 @@ def reformat(field, n=10): [ [ stat - best - for stat, best in zip(best_weighttuple, independent_best) + for stat, best in zip( + first_tree_weighttuple, independent_best + ) ] + + [0] ], "Highest ranked tree: loss from best value", file=fh, @@ -1410,33 +1642,49 @@ def reformat(field, n=10): ) if tree_stats: - dag_ls = list(dag.weight_count(**combined_dag_filter).elements()) + dag_ls = list(dag.weight_count(**combined_observer_filter).elements()) # To clear _dp_data fields of their large cargo dag.optimal_weight_annotate(edge_weight_func=lambda n1, n2: 0) - if ranking_coeffs: - minfunckey = linear_combinator + col_names = combined_observer_filter.weight_funcs.names + + if not lexicographic: + col_names += (ranking_strategy.replace(" ", ""),) + dag_ls = [tup + (linear_combinator(tup),) for tup in dag_ls] + best_tree_tup = first_tree_weighttuple + ( + linear_combinator(first_tree_weighttuple), + ) + + def minfunckey(tup): + return tup[-1] + else: - minfunckey = ranking_dag_filter.optimal_func + best_tree_tup = first_tree_weighttuple + _tuple_coeffs = [ + -1 if filt.optimal_func == max else 1 for filt, _ in dag_filters + ] + + def minfunckey(tup): + return tuple(coeff * it for coeff, it in zip(_tuple_coeffs, tup)) + dag_ls.sort(key=minfunckey) - df = pd.DataFrame( - dag_ls, columns=combined_dag_filter.weight_funcs.names - ).drop(columns=[""]) + df = pd.DataFrame(dag_ls, columns=col_names).drop(columns=[""]) df.to_csv(outbase + ".tree_stats.csv") df["set"] = ["all_trees"] * len(df) bestdf = pd.DataFrame( - [best_weighttuple], columns=combined_dag_filter.weight_funcs.names + [best_tree_tup], + columns=col_names, ) bestdf["set"] = ["best_tree"] toplot_df = pd.concat([df, bestdf], ignore_index=True) pplot = sns.pairplot( - toplot_df.drop(columns=["Alleles", ""], errors="ignore"), + toplot_df.drop(columns=[""], errors="ignore"), hue="set", diag_kind="hist", ) - pplot.savefig(outbase + ".tree_stats.pairplot.pdf") + pplot.savefig(outbase + f".tree_stats.pairplot.{img_type}") - return (self._trimmed_self(trimdag), best_weighttuple) + return (ctrees, trimmed_forest, weighttuples) def likelihood_rankplot(self, outbase, p, q, img_type="svg"): """Save a rank plot of likelihoods to the file @@ -1945,8 +2193,8 @@ def accum_func(cmsetlist: List[multiset.FrozenMultiset]): def _ll_genotype_dagfuncs(p: np.float64, q: np.float64) -> hdag.utils.HistoryDagFilter: - """Return functions for counting tree log branching process likelihood on - the history DAG. + """Return functions for counting tree negative log branching process + likelihood on the history DAG. For numerical consistency, we resort to the use of ``decimal.Decimal``. This is exactly for the purpose of solving the problem that float sum is @@ -1986,7 +2234,7 @@ def edge_weight_ll_genotype(n1: hdag.HistoryDagNode, n2: hdag.HistoryDagNode): if n1.is_root() and c == 0 and m == 1: # Add pseudocount for unobserved root unifurcation c = 1 - res = Decimal(CollapsedTree._ll_genotype(c, m, p, q)[0]) + res = Decimal(-CollapsedTree._ll_genotype(c, m, p, q)[0]) return hdag.utils.FloatState(float(round(res, 8)), state=res) def accum_func(weightlist): @@ -2000,9 +2248,9 @@ def accum_func(weightlist): "edge_weight_func": edge_weight_ll_genotype, "accum_func": accum_func, }, - name="LogBPLikelihood", + name="BPLikelihoodLogLoss", ), - max, + min, ) @@ -2029,3 +2277,40 @@ def _allele_dagfuncs() -> hdag.utils.HistoryDagFilter: ), min, ) + + +def _naive_reversion_dagfuncs(naive_seq: str) -> hdag.utils.HistoryDagFilter: + """Return functions for filtering trees in a history DAG by the number of + reversions to the naive nucleotide sequence. + + Args: + naive_seq: The naive sequence. Sitewise reversions to this sequence will be counted. + Returns: + A :meth:`historydag.utils.AddFuncDict` which may be passed as keyword arguments + to :meth:`historydag.HistoryDag.weight_count`, :meth:`historydag.HistoryDag.trim_optimal_weight`, + or :meth:`historydag.HistoryDag.optimal_weight_annotate` + methods to trim or annotate a :meth:`historydag.HistoryDag` according to allele count. + Weight format is ``int``. + """ + + def edge_weight_func(n1, n2): + if n1.is_ua_node(): + return 0 + else: + return sum( + 1 + for p, c, n in zip(n1.label.sequence, n2.label.sequence, naive_seq) + if (p != n and c == n) + ) + + return hdag.utils.HistoryDagFilter( + hdag.utils.AddFuncDict( + { + "start_func": lambda n: 0, + "edge_weight_func": edge_weight_func, + "accum_func": sum, + }, + name="NaiveReversions", + ), + min, + ) diff --git a/gctree/cli.py b/gctree/cli.py index 46f0ee72..904ef95b 100644 --- a/gctree/cli.py +++ b/gctree/cli.py @@ -171,9 +171,6 @@ def isotype_add(forest): if forest.n_trees == 1: warnings.warn("only one parsimony tree reported from dnapars") - if args.verbose: - print("number of trees with integer branch lengths:", forest.n_trees) - forest.mle(marginal=True) # Add isotypes to forest if args.isotype_mapfile: @@ -198,10 +195,13 @@ def isotype_add(forest): "The filename of a pickled history DAG object, or a phylipfile and abundance file, are required." ) + if args.verbose: + print("number of trees with integer branch lengths:", forest.n_trees) + # Filter the forest according to specified criteria, and along the way, # write a log file containing stats for all trees in the forest: - trimmed_forest, _ = forest.filter_trees( - ranking_coeffs=args.ranking_coeffs, + ctrees, _, _ = forest.filter_trees( + ranking_strategy=args.ranking_strategy, verbose=args.verbose, outbase=args.outbase, summarize_forest=args.summarize_forest, @@ -209,32 +209,12 @@ def isotype_add(forest): mutability_file=args.mutability, substitution_file=args.substitution, chain_split=args.chain_split, + img_type=args.img_type, + ranking_coeffs=args.ranking_coeffs, branching_process_ranking_coeff=args.branching_process_ranking_coeff, use_old_mut_parsimony=args.use_old_mut_parsimony, ) - if args.verbose: - if trimmed_forest.n_trees > 1: - n_topologies = trimmed_forest.n_topologies() - print( - "Degenerate ranking criteria: trimmed history DAG contains " - f"{trimmed_forest.n_trees} unique trees, with {n_topologies} unique collapsed topologies." - ) - if n_topologies > 10: - print( - "A representative from each of the ten most abundant topologies will be sampled randomly for rendering." - ) - else: - print( - "A representative of each topology will be sampled randomly for rendering." - ) - - random.seed(forest.n_trees) - ctrees = [ - topoclass.sample_tree() - for _, topoclass in zip(range(10), trimmed_forest.iter_topology_classes()) - ] - if args.colormapfile is not None: with open(args.colormapfile, "r") as f: colormap = {} @@ -273,6 +253,7 @@ def isotype_add(forest): chain_split=args.chain_split, frame2=args.frame2, position_map2=position_map2, + show_nuc_muts=args.show_nucleotide_mutations, ) collapsed_tree.newick(f"{args.outbase}.inference.{j}.nk") with open(f"{args.outbase}.inference.{j}.p", "wb") as f: @@ -288,6 +269,8 @@ def isotype_add(forest): plt.xlabel("genotype") plt.ylabel("abundance") plt.savefig(args.outbase + ".inference.abundance_rank." + args.img_type) + if args.verbose: + print() def simulate(args): @@ -482,7 +465,7 @@ def simulate(args): def get_parser(): parser = argparse.ArgumentParser( - description="genotype collapsed tree inference and simulation" + description="genotype collapsed tree inference and simulation", ) subparsers = parser.add_subparsers( title="subcommands", @@ -617,8 +600,9 @@ def get_parser(): type=float, default=-1, help=( + "This argument is deprecated. Use ``--ranking_strategy`` instead. " "Coefficient used for branching process likelihood, when ranking trees by a linear " - "combination of traits. This value will be ignored if `--ranking_coeffs` argument is not " + "combination of traits. This value will be ignored if ``--ranking_coeffs`` argument is not " "also provided." ), ) @@ -628,6 +612,7 @@ def get_parser(): nargs=3, default=None, help=( + "This argument is deprecated. Use ``--ranking_strategy`` instead. " "List of coefficients for ranking trees by a linear combination of traits. " "Coefficients are in order: isotype parsimony, mutation model parsimony, number of alleles. " "A coefficient of -1 will be applied to branching process likelihood. " @@ -635,28 +620,65 @@ def get_parser(): "isotype parsimony, and context-based Poisson likelihood in that order." ), ) + parser_infer.add_argument( + "--ranking_strategy", + type=str, + default=None, + help=( + "Expression describing tree ranking strategy. If provided, takes precedence over all other ranking arguments. " + "Two types of expressions are permitted: First are those describing lexicographic orderings, like ``B,C,A``, which means " + "choose trees to minimize branching process log loss, then minimize context log loss, then minimize number " + "of alleles. Next are expressions describing linear combinations of criteria, like ``B+2C-1.1A``, which means choose " + "trees to minimize the specified linear combination of criteria. " + "If linear combination expression has leading ``-``, use ``=`` instead of space to separate argument, " + "e.g. ``--ranking_strategy=-B+R``. " + "These two methods of ranking cannot be combined. For example, ``B+C,A`` is not a valid ranking strategy expression. " + "Ranking criteria are specified using the following identifiers. All are by default minimized:\n" + "B - branching process log loss,\n" + "I - isotype parsimony,\n" + "C - context-based Poisson log loss,\n" + "M - old mutability parsimony,\n" + "A - number of alleles,\n" + "R - sitewise reversions to naive sequence.\n" + "To compute the value of a criterion on ranked trees without affecting the ranking, include that ranking criterion " + "with a coefficient of zero, as in ``B+2C+0A``, or ``B,C,0A``.\n" + "To maximize instead of minimizing a criterion in lexicographic ranking, provide a negative coefficient. " + "For example, ``B,-A`` will first minimize branching process log loss, then maximize the number of alleles. \n" + "A ranking strategy string containing a single ranking criterion identifier will be interpreted as a lexicographic ordering. " + "``gctree infer --verbose`` will describe the ranking strategy used. Examine this output to make sure it's as expected." + ), + ) parser_infer.add_argument( "--use_old_mut_parsimony", action="store_true", help=( + "This argument is deprecated. Use the identifier 'M' with the " + "argument ``--ranking_strategy`` instead. " "Use old mutability parsimony instead of poisson context likelihood. Not recommended " "unless attempting to reproduce results from older versions of gctree. " "This argument will have no effect unless an S5F model is provided with the arguments " - "`--mutability` and `--substitution`." + "``--mutability`` and ``--substitution``." + ), + ) + parser_infer.add_argument( + "--show_nucleotide_mutations", + action="store_true", + help=( + "If provided, branches in rendered tree will be annotated with nucleotide mutations." ), ) parser_infer.add_argument( "--summarize_forest", action="store_true", help=( - "write a file `[outbase].forest_summary.log` with a summary of traits for trees in the forest." + "write a file ``[outbase].forest_summary.log`` with a summary of traits for trees in the forest." ), ) parser_infer.add_argument( "--tree_stats", action="store_true", help=( - "write a file `[outbase].tree_stats.log` with stats for all trees in the forest. " + "write a file ``[outbase].tree_stats.log`` with stats for all trees in the forest. " "For large forests, this is slow and memory intensive." ), ) diff --git a/gctree/isotyping.py b/gctree/isotyping.py index 15d25a1a..37d9193a 100644 --- a/gctree/isotyping.py +++ b/gctree/isotyping.py @@ -442,7 +442,7 @@ def edge_weight_func(n1: hdag.HistoryDagNode, n2: hdag.HistoryDagNode): "edge_weight_func": edge_weight_func, "accum_func": sum, }, - name="Isotype Pars.", + name="IsotypeParsimony", ), min, ) diff --git a/gctree/mutation_model.py b/gctree/mutation_model.py index 2a29ae38..6fdad98b 100644 --- a/gctree/mutation_model.py +++ b/gctree/mutation_model.py @@ -492,7 +492,7 @@ def distance(node1, node2): "edge_weight_func": distance, "accum_func": sum, }, - name="Mut. Pars.", + name="MutabilityParsimony", ), min, ) @@ -624,11 +624,11 @@ def _context_poisson_likelihood_dagfuncs(*args, splits: List[int] = [], **kwargs "edge_weight_func": lambda n1, n2: ( 0 if n1.is_ua_node() - else distance(n1.label.sequence, n2.label.sequence) + else -distance(n1.label.sequence, n2.label.sequence) ), "accum_func": sum, }, - name="LogContextLikelihood", + name="ContextLikelihoodLogLoss", ), - max, + min, ) diff --git a/tests/smalltest.sh b/tests/smalltest.sh index 5bbc6565..9011f4a8 100755 --- a/tests/smalltest.sh +++ b/tests/smalltest.sh @@ -5,20 +5,40 @@ export QT_QPA_PLATFORM=offscreen export XDG_RUNTIME_DIR=/tmp/runtime-runner export MPLBACKEND=agg mkdir -p tests/smalltest_output -wget -O HS5F_Mutability.csv https://bitbucket.org/kleinstein/shazam/raw/ba4b30fc6791e2cfd5712e9024803c53b136e664/data-raw/HS5F_Mutability.csv -wget -O HS5F_Substitution.csv https://bitbucket.org/kleinstein/shazam/raw/ba4b30fc6791e2cfd5712e9024803c53b136e664/data-raw/HS5F_Substitution.csv -gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel +mutabilities=HS5F_Mutability.csv +substitutions=HS5F_Substitution.csv -gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt --mutability HS5F_Mutability.csv --substitution HS5F_Substitution.csv --ranking_coeffs 0 1 0 --use_old_mut_parsimony --branching_process_ranking_coeff 0 +wget -O $mutabilities https://bitbucket.org/kleinstein/shazam/raw/ba4b30fc6791e2cfd5712e9024803c53b136e664/data-raw/HS5F_Mutability.csv +wget -O $substitutions https://bitbucket.org/kleinstein/shazam/raw/ba4b30fc6791e2cfd5712e9024803c53b136e664/data-raw/HS5F_Substitution.csv + +# testing backward compatibility: gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt --mutability HS5F_Mutability.csv --substitution HS5F_Substitution.csv --ranking_coeffs 1 1 0 --use_old_mut_parsimony --branching_process_ranking_coeff 0 -gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt --mutability HS5F_Mutability.csv --substitution HS5F_Substitution.csv --ranking_coeffs .01 -1 0 --branching_process_ranking_coeff -1 --summarize_forest --tree_stats +gctree infer tests/smalltest_output/gctree.infer.inference.parsimony_forest.p --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt --mutability HS5F_Mutability.csv --substitution HS5F_Substitution.csv --ranking_coeffs 1 -1 0 --use_old_mut_parsimony + +gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt --mutability HS5F_Mutability.csv --substitution HS5F_Substitution.csv --use_old_mut_parsimony + +gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt --mutability HS5F_Mutability.csv --substitution HS5F_Substitution.csv --ranking_coeffs -1 1 0 --use_old_mut_parsimony --branching_process_ranking_coeff 0 --ranking_strategy "M" + +# End testing backward compatibility ^^ + +gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer1 --root GL --frame 1 --verbose --idlabel + +gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer2 --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt --mutability $mutabilities --substitution $substitutions --ranking_strategy "M" + +gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer3 --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt --mutability $mutabilities --substitution $substitutions --ranking_strategy "I+M" + +gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer4 --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt --mutability $mutabilities --substitution $substitutions --ranking_strategy="-B+0.01I+C" --summarize_forest --tree_stats + + +gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer5 --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt +gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer6 --root GL --frame 1 --verbose --idlabel --mutability $mutabilities --substitution $substitutions -gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt +gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer7 --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt --mutability $mutabilities --substitution $substitutions -gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel --mutability HS5F_Mutability.csv --substitution HS5F_Substitution.csv +gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer8 --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt --mutability $mutabilities --substitution $substitutions --ranking_strategy "R,-1.0B" -gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt --mutability HS5F_Mutability.csv --substitution HS5F_Substitution.csv +gctree infer tests/smalltest_output/gctree.infer.inference.parsimony_forest.p --outbase tests/smalltest_output/gctree.infer9 --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt --mutability $mutabilities --substitution $substitutions --ranking_strategy "A, 0B, 0R" --tree_stats --summarize_forest