Refine eval tests #5

Open · jerch opened this issue Nov 3, 2018 · 0 comments

jerch (Member) commented Nov 3, 2018
Currently the eval test relies on a relative tolerance range for the basic descriptive statistical values (a minimal sketch of such a check follows the list below). This basic statistical approach works as a first start, but it neglects several problems:

  1. systematic errors due to system load fluctuations, and possibly other factors
  2. cloud/CI-based testing
  3. uncertainty calculation/propagation
  4. preconditions of the underlying statistical tests
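
For reference, a minimal sketch of what such a relative-tolerance check could look like - the interface and the 10% threshold are made up for illustration, not taken from the actual test code:

```typescript
// Illustrative only: relative tolerance check on one descriptive value (the mean).
interface Stats {
  mean: number;
  median: number;
  stddev: number;
}

function withinRelativeTolerance(baseline: Stats, evalRun: Stats, tolerance = 0.1): boolean {
  // reject if the eval mean deviates more than `tolerance` (here 10%) from the baseline mean
  return Math.abs(evalRun.mean - baseline.mean) <= tolerance * Math.abs(baseline.mean);
}
```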

1.) Could be somewhat levelled out by recording the system load between runs and either limiting valid tests to a certain load range or treating the load as a covariate. The latter will imho not work, since system load itself is an aggregate of several "health conditions" of the system; in particular, a certain test might not depend on the system load itself but on some other system state that has an unknown impact on the load (io-load, network bandwidth, disk throughput etc.).
Possible solution: Since system load typically has a high impact on performance measurements, it might be enough to introduce it as a ranged precondition (the test gets rejected if the load differs too much), as sketched below. To keep the test setup simple, other impacting system states should be treated as S.E.P. - meaning it is the test creator's responsibility to make sure that a particular condition has not changed too much between the baseline and the eval run.
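
A rough sketch of such a ranged load precondition, based on Node's os.loadavg() - the per-core normalization and the threshold are assumptions for illustration:

```typescript
import * as os from 'os';

// Reject the eval run if the current 1-minute load average per core differs
// too much from the load recorded alongside the baseline run.
function loadPreconditionHolds(baselineLoadPerCore: number, maxDelta = 0.5): boolean {
  const currentLoadPerCore = os.loadavg()[0] / os.cpus().length;
  return Math.abs(currentLoadPerCore - baselineLoadPerCore) <= maxDelta;
}
```

Note that os.loadavg() always reports 0 on Windows, so a Windows setup would need a different load source.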
2.) It also seems questionable whether to do cloud/CI-based perf testing at all, as the tests might compare results from totally different envs/hardware. Not sure yet how to detect those conditions in order to reject the test execution (a coarse fingerprint check is sketched below). Furthermore, if a CI can guarantee the same env/hardware, the test should not be rejected.
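
One possible direction - purely a sketch, with illustrative field choices - would be to store a coarse environment fingerprint with the baseline and reject the eval run if it does not match:

```typescript
import * as os from 'os';

// Coarse environment fingerprint; fields chosen for illustration only.
interface EnvFingerprint {
  platform: string;
  cpuModel: string;
  cpuCount: number;
  totalMemMB: number;
}

function currentFingerprint(): EnvFingerprint {
  const cpus = os.cpus();
  return {
    platform: `${os.platform()} ${os.arch()}`,
    cpuModel: cpus[0].model,
    cpuCount: cpus.length,
    totalMemMB: Math.round(os.totalmem() / (1024 * 1024)),
  };
}

function sameEnv(a: EnvFingerprint, b: EnvFingerprint): boolean {
  return a.platform === b.platform
    && a.cpuModel === b.cpuModel
    && a.cpuCount === b.cpuCount
    && a.totalMemMB === b.totalMemMB;
}
```

On shared cloud runners even an identical CPU model does not guarantee comparable performance (noisy neighbours), so this can only catch the obvious mismatches.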
3.) Not sure yet whether this is feasible; it basically depends on how the value was measured, how often, and on the analysis model. More reliable statistical tests (t-test, F-test and such) need tons of iterations (runtime will explode) and precondition checks (see 4.). On the plus side, once we have ruled out covariates and correlations, we get a highly certain result with a p-value (a rough t-test sketch follows below).
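
As a rough idea of what such a test could look like, a sketch of a Welch t-test on two measurement series - the decision threshold is a simplification, an exact p-value would need the Student-t CDF from a stats library:

```typescript
// Welch t-test sketch for baseline vs eval measurement series.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function variance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1);
}

function welchTTest(baseline: number[], evalRun: number[]): { t: number; df: number } {
  const vA = variance(baseline) / baseline.length;
  const vB = variance(evalRun) / evalRun.length;
  const t = (mean(evalRun) - mean(baseline)) / Math.sqrt(vA + vB);
  // Welch-Satterthwaite degrees of freedom
  const df = (vA + vB) ** 2 /
    (vA ** 2 / (baseline.length - 1) + vB ** 2 / (evalRun.length - 1));
  return { t, df };
}

// Rough decision at alpha ~ 0.05 for large df (|t| > 1.96); a proper implementation
// should turn t/df into an actual p-value via the t-distribution CDF.
function significantlyDifferent(baseline: number[], evalRun: number[]): boolean {
  return Math.abs(welchTTest(baseline, evalRun).t) > 1.96;
}
```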
4.) Some basic descriptive values needed for reliable testing with the right distribution model are still missing - at least skewness. To simplify things we could simply reject the test if the descriptive values do not fit the distribution model used for the analysis (a skewness-based rejection is sketched below).
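
A sketch of sample skewness as an additional descriptive value and a crude precondition on it - the cutoff of 1 is just a placeholder and assumes a roughly symmetric distribution model:

```typescript
// Sample skewness g1 = m3 / m2^1.5 (central moments with 1/n normalization).
function skewness(xs: number[]): number {
  const n = xs.length;
  const m = xs.reduce((a, b) => a + b, 0) / n;
  const m2 = xs.reduce((a, b) => a + (b - m) ** 2, 0) / n;
  const m3 = xs.reduce((a, b) => a + (b - m) ** 3, 0) / n;
  return m3 / Math.pow(m2, 1.5);
}

// Reject the test if the samples are too skewed for the assumed distribution model.
function distributionPreconditionHolds(samples: number[], maxAbsSkew = 1): boolean {
  return Math.abs(skewness(samples)) <= maxAbsSkew;
}
```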
