Currently the eval test relies on a relative tolerance range for the basic descriptive statistical values (see the sketch after the list below). This approach works as a first step, but it neglects several problems:
- systematic errors due to system load fluctuations (and maybe other factors?)
- cloud/CI based testing
- uncertainty calculation/propagation
- preconditions for basic statistical test conditions
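For reference, a minimal sketch of what such a relative tolerance check could look like (the names `baseline_mean`, `eval_mean` and `rel_tol` are illustrative, not the actual implementation):

```python
import math

def within_relative_tolerance(baseline: float, measured: float, rel_tol: float = 0.10) -> bool:
    """Accept the eval value if it deviates from the baseline by at most rel_tol (e.g. 10 %)."""
    return math.isclose(measured, baseline, rel_tol=rel_tol)

# Example: compare the mean runtime of an eval run against the stored baseline mean.
baseline_mean = 1.25   # seconds, from the baseline run (illustrative)
eval_mean = 1.31       # seconds, from the current run (illustrative)
assert within_relative_tolerance(baseline_mean, eval_mean, rel_tol=0.10)
```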
1.) This could be somewhat levelled out by measuring the system load between runs and either restricting valid tests to a load range or treating the load as a covariate. The latter will imho not work, since system load is itself an aggregate of several "health conditions" of the system; a particular test might not depend on the system load but on some other system state that has an unknown impact on the load (I/O load, network bandwidth, disk throughput, etc.).
Possible solution: Since system load typically has a high impact on performance measurements, it might be enough to introduce it as a ranged precondition (the test is rejected if the load differs too much between runs); see the sketch below. To keep the test setup simple, other impacting system states should be treated as S.E.P. - i.e. it is the test creator's responsibility to make sure that a particular condition has not changed too much between the baseline and the eval run.
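A minimal sketch of such a ranged load precondition, assuming a POSIX system; `max_load_delta` and the baseline value are illustrative assumptions, not part of the current test setup:

```python
import os

def load_precondition_ok(baseline_load: float, max_load_delta: float = 0.5) -> bool:
    """Reject the eval run if the 1-minute load average differs too much from the baseline run."""
    current_load, _, _ = os.getloadavg()  # POSIX only; raises OSError on unsupported platforms
    return abs(current_load - baseline_load) <= max_load_delta

# baseline_load would be recorded together with the baseline measurement.
baseline_load = 0.8  # illustrative
if not load_precondition_ok(baseline_load):
    raise RuntimeError("System load differs too much from the baseline run - eval test rejected")
```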
2.) It also seems questionable whether cloud/CI based perf testing makes sense at all, as the tests might compare results from totally different environments/hardware. Not sure yet how to spot those conditions in order to reject the test execution; one option is sketched below. If a CI can guarantee the same environment/hardware, however, the test should not be rejected.
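One hedged option would be to record a coarse hardware/OS fingerprint with the baseline and refuse to compare against a run from a differing environment; which fields actually matter is an assumption here:

```python
import os
import platform

def environment_fingerprint() -> dict:
    """Collect a coarse description of the machine the test runs on."""
    return {
        "machine": platform.machine(),
        "system": platform.system(),
        "release": platform.release(),
        "cpu_count": os.cpu_count(),
    }

def environment_precondition_ok(baseline_fp: dict) -> bool:
    """Reject the eval run if it executes on a different environment than the baseline run."""
    return environment_fingerprint() == baseline_fp

# baseline_fp would be recorded together with the baseline measurement and loaded here.
```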
3.) Not sure yet whether this is feasible; it basically depends on how the value was measured, how often, and on the analysis model. More reliable inferential tests (t-test, F-test and the like) will need tons of iterations (runtime will explode) plus precondition checks (see 4.). On the plus side: once covariates and correlations have been ruled out, we get a highly certain result with a p-value; a sketch follows below.
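A sketch of such an inferential comparison using SciPy's Welch t-test; the sample values, iteration count and significance level are illustrative assumptions:

```python
from scipy import stats

# Many iterations per run are needed for the test to have any power (hence the runtime cost).
baseline_samples = [1.21, 1.25, 1.24, 1.27, 1.22, 1.26, 1.23, 1.25]  # seconds (illustrative)
eval_samples     = [1.30, 1.33, 1.29, 1.31, 1.34, 1.28, 1.32, 1.30]  # seconds (illustrative)

# Welch's t-test does not assume equal variances between baseline and eval run.
t_stat, p_value = stats.ttest_ind(baseline_samples, eval_samples, equal_var=False)

alpha = 0.05  # illustrative significance level
if p_value < alpha:
    print(f"Significant performance change detected (p = {p_value:.4f})")
else:
    print(f"No significant change (p = {p_value:.4f})")
```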
4.) Some basic descriptive values needed for reliable testing with the right distribution model are missing - at least skewness. To keep things simple we could simply reject the test if the descriptive values do not fit the distribution model used for the analysis; see the sketch below.
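A sketch of such a distribution precondition check, using SciPy's skewness estimate; the threshold and sample values are illustrative assumptions:

```python
from scipy import stats

def distribution_precondition_ok(samples, max_abs_skewness: float = 1.0) -> bool:
    """Reject the test if the sample is too skewed for the (roughly normal) analysis model."""
    return abs(stats.skew(samples)) <= max_abs_skewness

samples = [1.21, 1.25, 1.24, 1.27, 1.22, 1.90, 1.23, 1.25]  # one heavy outlier skews the sample (illustrative)
if not distribution_precondition_ok(samples):
    raise RuntimeError("Sample skewness violates the assumed distribution model - test rejected")
```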