This is an important part of the project, as it aims to determine the quality of the output data products (in this case there should be only one, AFAIK: the background model).
This issue should be used to discuss and design a first set of benchmarks that we can use to judge whether the output of pybkgmodel is healthy.
Definition and context
For me a benchmark is a qualitative (a plot) or quantitative (a metric) validation tool applied to the output data.
This is different from a unit or integration test, which instead aims to verify that the software producing the output doesn't crash or behave abnormally (see #16 for that part).
It is of course hard to define a single set of benchmarks that accommodates every use case, so I would propose dividing them into two groups: common and method-specific.
(Proposed) Technical implementation
The steps below can be applied to each benchmark in parallel by the team to make the work quicker.
Input test data
This should be stored somewhere accessible to the GitHub CI. Depending on the size of the input data, it might be necessary to use something other than the GitHub-hosted runners; if someone has a machine available, we can install a self-hosted runner on it, store the input data there, and execute the CI there.
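As a rough illustration, a CI job could fetch and verify the input data before running the benchmarks. This is only a sketch: the URL, checksum and file name below are placeholders, and the actual location will depend on where we decide to host the data.

```python
# Minimal sketch of a data-fetch step for the CI (URL and checksum are placeholders).
import hashlib
import urllib.request
from pathlib import Path

TEST_DATA_URL = "https://example.org/pybkgmodel/test_data/empty_field_dl3.tar.gz"  # placeholder
TEST_DATA_SHA256 = "<expected sha256 of the archive>"  # placeholder


def fetch_test_data(destination="test_data.tar.gz"):
    """Download the benchmark input data (if not cached) and verify its checksum."""
    path = Path(destination)
    if not path.exists():
        urllib.request.urlretrieve(TEST_DATA_URL, path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest != TEST_DATA_SHA256:
        raise RuntimeError(f"Checksum mismatch for {path}: got {digest}")
    return path


if __name__ == "__main__":
    fetch_test_data()
```

On a self-hosted runner the data could instead live permanently on disk, in which case this step reduces to a path check.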
define the benchmark
gather its definition from existing published work (e.g. Berge et al. 2007, Vovk et al. 2018, etc.) or propose it in a new issue in this repository
define a plotting function
Such a function should be as input-agnostic as possible, in order to avoid changing it if the input data or format changes: e.g. take direct physical quantities as input rather than containers.
To collect such functions we can add a plotting module to the package.
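To make the idea concrete, here is a minimal sketch of what such an input-agnostic helper could look like for the empty-field significance check discussed further down. The function name and signature are placeholders; the point is that it takes a plain array of significances rather than any pybkgmodel container.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm


def plot_significance_distribution(significances, ax=None, bins=50):
    """Plot a histogram of significance values with a standard normal overlay.

    Parameters
    ----------
    significances : array-like
        Flat array of significance values (plain numbers, no containers).
    ax : matplotlib.axes.Axes, optional
        Axis to draw on; a new figure is created if omitted.
    bins : int
        Number of histogram bins.
    """
    significances = np.asarray(significances, dtype=float)
    if ax is None:
        _, ax = plt.subplots()
    ax.hist(significances, bins=bins, density=True, histtype="step", label="data")
    x = np.linspace(significances.min(), significances.max(), 200)
    ax.plot(x, norm.pdf(x), label="standard normal")
    ax.set_xlabel("significance")
    ax.set_ylabel("normalised counts")
    ax.set_yscale("log")
    ax.legend()
    return ax
```

Because it only depends on numpy/matplotlib/scipy, the same function can be reused unchanged if the event or map format evolves.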
implement the benchmark
I would propose a small set of Jupyter notebooks (one for the common benchmarks plus one per background-generation method), synced to Python scripts with Jupytext and run in a parametrized way with e.g. papermill.
Each notebook should load the required input, reduce the data if necessary and execute the validation tool.
Since the input data is fixed, we should define some reasonably conservative benchmark values to test against (e.g. if the significance distribution on an empty field deviates by more than 1% from the expected normal distribution, the benchmark triggers a failure).
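As a sketch of how this could be wired together (notebook paths, parameter names and the helper below are all hypothetical), papermill can inject the threshold into the notebook, and the notebook raises if the benchmark value is exceeded, which fails the CI job:

```python
# Sketch: parametrised execution of a benchmark notebook, plus the kind of
# check the notebook itself could run. Names and thresholds are placeholders.
import numpy as np
import papermill as pm


def check_empty_field_significance(significances, tolerance=0.01):
    """Raise if the significance distribution on an empty field deviates from
    the expected standard normal by more than `tolerance` in mean or width."""
    mean = float(np.mean(significances))
    std = float(np.std(significances))
    if abs(mean) > tolerance or abs(std - 1.0) > tolerance:
        raise AssertionError(
            f"Empty-field significances not compatible with N(0, 1): "
            f"mean={mean:.3f}, std={std:.3f}, tolerance={tolerance}"
        )


if __name__ == "__main__":
    # papermill injects the parameters into the tagged parameters cell
    # of the notebook and saves an executed copy for inspection.
    pm.execute_notebook(
        "benchmarks/common.ipynb",
        "benchmarks/output/common.ipynb",
        parameters={
            "input_data_path": "test_data/empty_field",
            "tolerance": 0.01,
        },
    )
```

The executed notebooks double as a human-readable report (plots included), while the raised exceptions give the CI a pass/fail signal.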
Continuous Integration
The benchmark suite should run on each PR, unless the changes are irrelevant to it (e.g. a README update).
Unit and integration tests have priority over benchmarks, of course: if the modifications to the software trigger a crash, there is no point in trying to run it on tens of hours of data.