This is an important part of the project, as it aims to determine the quality of the output data products (in this case there should be only one, AFAIK: the background model).
This issue should be used to discuss and design a first set of benchmarks that we can use to judge whether the output of pybkgmodel is healthy.
Definition and context
For me a benchmark is a qualitative (a plot) or quantitative (a metric) validation tool applied to the output data.
This is different from a unit or integration test, which instead aims to verify that the software producing the output doesn't crash or behave abnormally (see #16 for that part).
It is of course hard to define a single set of benchmarks that accommodates every use case, so I would propose dividing them into two groups: common and method-specific.
(Proposed) Technical implementation
The steps below can be applied to each benchmark in parallel by the team to make the work quicker.
Input test data
This should be stored somewhere accessible to the GitHub CI. Depending on the size of the input data, it might be necessary to use something other than the GitHub-hosted runners; if someone has a machine available, we can install a self-hosted runner on it, store the input data there, and execute the CI there.
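As a rough illustration, a CI job could fetch and verify the input data before running the benchmarks. This is only a sketch: the URL, checksum and file name below are placeholders, and the actual location will depend on where we decide to host the data.

```python
# Minimal sketch of a data-fetch step for the CI (URL and checksum are placeholders).
import hashlib
import urllib.request
from pathlib import Path

TEST_DATA_URL = "https://example.org/pybkgmodel/test_data/empty_field_dl3.tar.gz"  # placeholder
TEST_DATA_SHA256 = "<expected sha256 of the archive>"  # placeholder


def fetch_test_data(destination="test_data.tar.gz"):
    """Download the benchmark input data (if not cached) and verify its checksum."""
    path = Path(destination)
    if not path.exists():
        urllib.request.urlretrieve(TEST_DATA_URL, path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest != TEST_DATA_SHA256:
        raise RuntimeError(f"Checksum mismatch for {path}: got {digest}")
    return path


if __name__ == "__main__":
    fetch_test_data()
```

On a self-hosted runner the data could instead live permanently on disk, in which case this step reduces to a path check.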
define the benchmark
gather its definition from existing published work (e.g. Berge et al. 2007, Vovk et al. 2018, etc.) or propose it in a new issue in this repository
define a plotting function
Such a function should be as input-agnostic as possible, in order to avoid changing it if the input data or format changes: e.g. take direct physical quantities as input rather than containers.
To collect such functions we can add a plotting module to the package.
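To make the idea concrete, here is a minimal sketch of what such an input-agnostic helper could look like for the empty-field significance check discussed further down. The function name and signature are placeholders; the point is that it takes a plain array of significances rather than any pybkgmodel container.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm


def plot_significance_distribution(significances, ax=None, bins=50):
    """Plot a histogram of significance values with a standard normal overlay.

    Parameters
    ----------
    significances : array-like
        Flat array of significance values (plain numbers, no containers).
    ax : matplotlib.axes.Axes, optional
        Axis to draw on; a new figure is created if omitted.
    bins : int
        Number of histogram bins.
    """
    significances = np.asarray(significances, dtype=float)
    if ax is None:
        _, ax = plt.subplots()
    ax.hist(significances, bins=bins, density=True, histtype="step", label="data")
    x = np.linspace(significances.min(), significances.max(), 200)
    ax.plot(x, norm.pdf(x), label="standard normal")
    ax.set_xlabel("significance")
    ax.set_ylabel("normalised counts")
    ax.set_yscale("log")
    ax.legend()
    return ax
```

Because it only depends on numpy/matplotlib/scipy, the same function can be reused unchanged if the event or map format evolves.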
implement the benchmark
I would propose a small set of Jupyter notebooks (one for the common benchmarks plus one per background-generation method), synced to Python scripts with Jupytext and run in a parametrized way with e.g. papermill.
Each notebook should load the required input, reduce the data if necessary and execute the validation tool.
Since the input data is fixed, we should define some reasonably conservative benchmark values to test against (e.g. if the significance distribution on an empty field deviates by more than 1% from the expected normal distribution, the benchmark triggers a failure).
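As a sketch of how this could be wired together (notebook paths, parameter names and the helper below are all hypothetical), papermill can inject the threshold into the notebook, and the notebook raises if the benchmark value is exceeded, which fails the CI job:

```python
# Sketch: parametrised execution of a benchmark notebook, plus the kind of
# check the notebook itself could run. Names and thresholds are placeholders.
import numpy as np
import papermill as pm


def check_empty_field_significance(significances, tolerance=0.01):
    """Raise if the significance distribution on an empty field deviates from
    the expected standard normal by more than `tolerance` in mean or width."""
    mean = float(np.mean(significances))
    std = float(np.std(significances))
    if abs(mean) > tolerance or abs(std - 1.0) > tolerance:
        raise AssertionError(
            f"Empty-field significances not compatible with N(0, 1): "
            f"mean={mean:.3f}, std={std:.3f}, tolerance={tolerance}"
        )


if __name__ == "__main__":
    # papermill injects the parameters into the tagged parameters cell
    # of the notebook and saves an executed copy for inspection.
    pm.execute_notebook(
        "benchmarks/common.ipynb",
        "benchmarks/output/common.ipynb",
        parameters={
            "input_data_path": "test_data/empty_field",
            "tolerance": 0.01,
        },
    )
```

The executed notebooks double as a human-readable report (plots included), while the raised exceptions give the CI a pass/fail signal.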
Continuous Integration
The benchmark suite should run on each PR, unless the changes are irrelevant to it (e.g. a README update).
Unit and integration tests have priority over benchmarks, of course: if the modifications to the software trigger a crash, there is no point in trying to run it on tens of hours of data.