Benchmarking #21

Open

HealthyPear opened this issue Jul 23, 2024 · 0 comments
Labels: benchmarking, enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@HealthyPear (Member)

Benchmarking is an important part of the project, as it aims at determining the quality of the output data products (in this case there should be only one, AFAIK: the background model).

This issue should be used to discuss and design a first set of benchmarks to judge whether the output of pybkgmodel is healthy.

Definition and context

For me, a benchmark is a qualitative (a plot) or quantitative (a metric) validation tool applied to the output data.
This differs from a unit or integration test, which instead aims at verifying that the software producing the output doesn't crash or behave in an abnormal way (see #16 for that part).

It is of course hard to define a unique set of benchmarks that accommodates every use case, so I would propose to divide them into two groups: common and method-specific.

(Proposed) Technical implementation

The following steps can be applied to each benchmark in parallel by the team to make the process quicker.

  1. Input test data

This should be stored somewhere accessible to the GitHub CI. Depending on the size of the input data, it might be necessary to use something other than the GitHub-hosted runners; if someone has a machine available, we can install a self-hosted runner on it, store the input data there, and execute the CI.
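
A minimal sketch of how the benchmark input could be fetched and cached on the runner, assuming the data ends up at some downloadable location (the URL, cache path, and checksum below are placeholders, not decided yet):

```python
# Hypothetical helper to fetch the benchmark input data in CI.
# URL, cache directory and checksum are placeholders; the real dataset
# location is still to be decided.
import hashlib
from pathlib import Path
from urllib.request import urlretrieve

TEST_DATA_URL = "https://example.org/pybkgmodel-test-data/empty_field_dl3.tar.gz"  # placeholder
CACHE_DIR = Path.home() / ".cache" / "pybkgmodel-benchmarks"
EXPECTED_SHA256 = None  # fill in once the test dataset is frozen


def fetch_test_data(url: str = TEST_DATA_URL, cache_dir: Path = CACHE_DIR) -> Path:
    """Download the benchmark input data once and reuse it on subsequent CI runs."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / Path(url).name
    if not target.exists():
        urlretrieve(url, str(target))
    if EXPECTED_SHA256 is not None:
        digest = hashlib.sha256(target.read_bytes()).hexdigest()
        if digest != EXPECTED_SHA256:
            raise RuntimeError(f"Checksum mismatch for {target}")
    return target
```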

  2. Define the benchmark

Gather its definition from existing published work (e.g. Berge et al. 2007, Vovk et al. 2018, etc.) or propose it in a new issue in this repository.

  3. Define a plotting function

Such a function should be as input-agnostic as possible, to avoid having to change it if the input data or its format changes: e.g. it should take direct physical quantities as input, not containers.

To collect such functions we can add a plotting module to the package.
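
As an illustration, such a function could look like the sketch below; the module and function names are just a proposal, not existing pybkgmodel API, and the inputs are plain arrays of physical quantities rather than any specific container.

```python
# Sketch of an input-agnostic plotting function for a hypothetical
# pybkgmodel.plotting module.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm


def plot_significance_distribution(significances, bins=50, ax=None):
    """Histogram of per-bin significances with the expected standard normal overlaid."""
    significances = np.asarray(significances)
    if ax is None:
        _, ax = plt.subplots()
    ax.hist(significances, bins=bins, density=True, histtype="step", label="data")
    grid = np.linspace(significances.min(), significances.max(), 200)
    ax.plot(grid, norm.pdf(grid), label="standard normal")
    ax.set_xlabel("Significance")
    ax.set_ylabel("Normalized counts")
    ax.legend()
    return ax
```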

  4. Implement the benchmark

I would propose to have a small set of Jupyter notebooks (1 for the common benchmarks + 1 per background-generation method) synced to Python scripts with JupyText and run in a parametrized way with e.g. papermill.
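
A minimal sketch of the parametrized execution with papermill; the notebook names, paths, parameters, and the placeholder method names are assumptions to be replaced by whatever we agree on.

```python
# Run the common benchmark notebook plus one notebook per
# background-generation method, parametrized with papermill.
from pathlib import Path

import papermill as pm

methods = ["method_a", "method_b"]  # placeholders for the actual background-generation methods

Path("benchmarks/output").mkdir(parents=True, exist_ok=True)

pm.execute_notebook(
    "benchmarks/common.ipynb",
    "benchmarks/output/common.ipynb",
    parameters={"input_data": "test_data/empty_field_dl3"},
)

for method in methods:
    pm.execute_notebook(
        f"benchmarks/{method}.ipynb",
        f"benchmarks/output/{method}.ipynb",
        parameters={"input_data": "test_data/empty_field_dl3", "method": method},
    )
```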

Each notebook should load the required input, reduce the data if necessary, and execute the validation tool.
Since the input data is fixed, we should define more or less conservative benchmark values to test against (e.g. if the significance distribution on an empty field deviates by more than 1% from the expected normal distribution, the benchmark triggers a failure).
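
For instance, a minimal sketch of such a check; the 1% tolerance and the comparison via a fitted Gaussian are only one way to quantify the deviation, not an agreed definition.

```python
# Sketch of one possible quantitative check, assuming the common notebook
# has already produced an array of per-bin significances for an empty field.
import numpy as np
from scipy.stats import norm


def check_empty_field_significances(significances, tolerance=0.01):
    """Fail if the fitted mean/width deviate from the standard normal by more than `tolerance`."""
    mu, sigma = norm.fit(np.asarray(significances))
    assert abs(mu - 0.0) < tolerance, f"mean {mu:.3f} deviates from 0 by more than {tolerance}"
    assert abs(sigma - 1.0) < tolerance, f"width {sigma:.3f} deviates from 1 by more than {tolerance}"
```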

  5. Continuous Integration

The benchmark suite should run on each PR unless the changes are irrelevant to it (e.g. a README update).

Unit and integration tests of course have priority over benchmarks: if the modifications to the software trigger a crash, there is no point in trying to run it on tens of hours of data.

@HealthyPear added the enhancement (New feature or request), help wanted (Extra attention is needed), and benchmarking labels on Jul 23, 2024