Comments on the package #27

llrs · 2019-07-10T13:34:58Z

I came here because of this post.

I don't know where (if) you plan to submit this to CRAN or Bioconductor. I would recommend Bioconductor for the topic of the package. In either case, you'll get a more thorough review if you submit to one of these repositories.

In any case, it seems that the package doesn't work well with other packages like phyloseq or metagenomeSeq, or with other useful classes like SummarizedExperiment (used in Bioconductor to store data about a sequencing experiment). Supporting these would make it easier to use the package in existing pipelines/scripts.

Some functions need more documentation of their parameters, as well as some examples (at least, that is a requirement for Bioconductor packages).

For the error matrix, it would be ideal if we could distinguish which type of NA an entry is: a 0/0 (which, imho, should be 0 for the purposes of the error matrix) or a 500/0.

The vignette clearly explains how the package works. It would be interesting to know how to use this information in downstream analyses. Also, it focuses a lot on tidy data frames, which might reduce the memory footprint of the data if it is very sparse, but there are other solutions like data.table or Matrix, so I'm not sure if such extensive space should be given to it in the tutorial.
The vignette focuses on the error matrix and estimating bias, but I couldn't find any function to do it.

I've looked at the tests, and they should be more minimal: include just the data and the tests (you can create and store data just for tests). At the same time, they should test more than just the center function.

Many thanks for making the effort to create this nice package. I'm sure it will be very well received by the community.

mikemc (Maintainer) · 2019-08-24T20:09:35Z

Thanks for your comments! (And sorry for the delayed response.) I'm going to record my thoughts here about the points you raised for future reference, and will open a separate issue for each point as I get around to working on it.

First, a very high-level comment: This package is currently serving to perform the analysis for our manuscript, and the vignette explains how this analysis is done and how it could be done on a new dataset. Our manuscript is a basic-research paper rather than a software or methods paper, which is why we have not developed user-friendly high-level interfaces that, e.g., work with phyloseq, and why we haven't submitted to a repository. We still have some work to do to develop more robust statistical inference methods and practical guidance on how to estimate bias and perform calibration in the wider range of experiments that microbiologists face in practice. As we work on those, I will be experimenting with user interfaces such as an estimate_bias() function that can take phyloseq and/or SummarizedExperiment objects.

> I don't know where (if) you plan to submit this to CRAN or Bioconductor. I would recommend Bioconductor for the topic of the package.

I'm also imagining that Bioconductor would be ideal once we meet the above goals and add the required integration and documentation.

> In any case, it seems that the package doesn't work well with other packages like phyloseq or metagenomeSeq, or with other useful classes like SummarizedExperiment (used in Bioconductor to store data about a sequencing experiment). Supporting these would make it easier to use the package in existing pipelines/scripts.

I agree. To begin with, I plan to add a phyloseq interface to a bias estimation function, since I'm most familiar with phyloseq and I expect most of our target users would be as well. A downside of phyloseq is that a single phyloseq object cannot hold both the observed abundances and the known abundances for the control samples in a natural way, so it is necessary to have two phyloseq and/or otu_table objects: one for the observed abundances and one for the actual abundances. In contrast, I think a single SummarizedExperiment object could include both tables.

> Some functions need more documentation of their parameters, as well as some examples (at least, that is a requirement for Bioconductor packages).

Agreed. This wasn't necessary for the paper (since the manuscript ultimately is the documentation) but I will be expanding the documentation over the next 1-2 months.

> For the error matrix, it would be ideal if we could distinguish which type of NA an entry is: a 0/0 (which, imho, should be 0 for the purposes of the error matrix) or a 500/0.

Once added, the bias-estimation function will remove the need for the user to understand the error matrix. But to answer your question, the error matrix arises from dividing the observed abundances by the actual abundances. If a taxon is not in the sample and is not observed, the result is 0/0, which is NaN in R, and hence NaN entries of this kind are allowed to appear in the error matrix. These entries are left as NaN rather than set to 0 because the value is undefined, not 0. More explanation is given in Appendix 2 in the second version of our preprint.
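As a rough illustration of the distinction discussed above, the following sketch builds a toy error matrix and separates the 0/0 case from the 500/0 case. It is written in Python rather than R and is not the package's code: the helper names `ratio` and `error_matrix` and the toy abundances are hypothetical, and the division rule is written out by hand to mirror R's arithmetic, where 0/0 is NaN and a positive number divided by 0 is Inf.

```python
import math

def ratio(observed: float, actual: float) -> float:
    """Divide the way R does: 0/0 gives NaN, x/0 with x > 0 gives +Inf."""
    if actual == 0:
        return math.nan if observed == 0 else math.inf
    return observed / actual

def error_matrix(observed, actual):
    """Elementwise observed / actual; rows are samples, columns are taxa."""
    return [
        [ratio(o, a) for o, a in zip(obs_row, act_row)]
        for obs_row, act_row in zip(observed, actual)
    ]

# Hypothetical toy data, invented for illustration only.
observed = [[120.0, 30.0, 0.0],
            [0.0, 500.0, 50.0]]
actual = [[100.0, 100.0, 0.0],
          [0.0, 0.0, 100.0]]

errors = error_matrix(observed, actual)

# A taxon absent from the sample and not observed (0/0) is NaN, while a
# taxon observed but not actually present (500/0) is Inf.
undefined_entries = [(i, j) for i, row in enumerate(errors)
                     for j, e in enumerate(row) if math.isnan(e)]
infinite_entries = [(i, j) for i, row in enumerate(errors)
                    for j, e in enumerate(row) if math.isinf(e)]
```

Under these assumptions the two cases never collide: `undefined_entries` collects the 0/0 positions and `infinite_entries` collects the 500/0 positions, so they can be handled separately.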

> The vignette clearly explains how the package works. It would be interesting to know how to use this information in downstream analyses.

Agreed. The section "Bias measurement as a means of evaluating and improving protocols" of our manuscript illustrates some ways in which bias estimation can be fruitfully applied to "quality control" experiments, and it would be useful to have a more step-by-step guide to the analysis of that section. More generally I hope to add vignettes and/or blog posts illustrating various other applications (many of which we outline in the Discussion) as we develop and apply them in our work and learn more from microbiologists about their needs.

> Also, it focuses a lot on tidy data frames...I'm not sure if such extensive space should be given to it in the tutorial. The vignette focuses on the error matrix and estimating bias, but I couldn't find any function to do it.

The reason for this is a bit historical (for lack of a better word). I patterned this first version of the tutorial on what I actually do in the manuscript, so that someone looking at the analysis could follow along (including future me). E.g., the center() function is used to estimate bias, since we define the estimate of bias as the "center" or compositional mean of the error matrix. Once I add higher-level "estimate bias" and "calibration" functions I will make a new tutorial that uses these and skips low-level details of how the calculation of bias and calibration are done.
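To make the idea of a "center" concrete, here is a minimal sketch in Python (not R, and not the package's center() implementation). It assumes that "compositional mean" means the elementwise geometric mean across samples, rescaled so the result has geometric mean 1, and it assumes every taxon is present in every sample so the error matrix contains no NaN entries. The function names and the toy matrix are hypothetical.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def compositional_center(error_rows):
    """Per-taxon geometric mean across samples, rescaled to geometric mean 1.

    Assumes all entries are finite and positive (no NaN from absent taxa).
    """
    columns = list(zip(*error_rows))  # one tuple of errors per taxon
    center = [geometric_mean(col) for col in columns]
    scale = geometric_mean(center)    # compositions are defined up to scaling
    return [c / scale for c in center]

# Hypothetical toy error matrix: rows are samples, columns are taxa.
# The second sample is the first multiplied by 2, so both samples carry
# the same compositional information.
errors = [[2.0, 1.0, 0.5],
          [4.0, 2.0, 1.0]]

bias = compositional_center(errors)
# bias is approximately [2.0, 1.0, 0.5]
```

Because a compositional vector is only meaningful up to a constant factor, rescaling any one sample leaves the result unchanged, which is why the two rows above give the same center as either row alone.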

> I've looked at the tests, and they should be more minimal: include just the data and the tests (you can create and store data just for tests). At the same time, they should test more than just the center function.

I think the speed penalty for creating the simulated datasets for the tests is insignificant, so I'm not sure about the need to store them externally. But I would like to add tests for more functions.

mikemc (Maintainer) · 2020-10-22T11:53:02Z

Quick update on some of the above issues: I've finally gotten around to adding an easier-to-use set of functions for estimation and calibration, which works with matrices or phyloseq objects rather than tidy data frames. These functions still need better documentation but I've updated the tutorial to show how to use them: https://mikemc.github.io/metacal/articles/tutorial.html
