[GSK-1279] More rigorous evaluation of significance of performance metrics #1162
Comments
This was mostly addressed in #1193, although the Benjamini–Hochberg procedure is not enabled by default (because statistical tests on metrics like balanced accuracy pose problems).
Not completed yet
Hello, It's KD_A from Reddit. I purged my account recently, so the linked Reddit comment is no longer available. Posting it and the next reply here for posterity:

First reply

Thanks for the response. I realized I misphrased the problem as multiple testing. It's more accurate to categorize it as selection bias: if 100s of slice+metric combinations are examined, then the observed worst n drops from the global average (where n is kind of small) are likely overestimates. The degree of overestimation gets worse as the rank of the drop gets closer to 1. See the intro of this paper (which also contains a bias-corrected estimator):
Given this fact, my main concern as a user would be how much I should trust the alerts. Have Giskard's alerts and estimates been empirically evaluated? For example, for alerts, what's the probability that a drop is practically significant/worrisome given that Giskard alerted on it? One way to answer this question is to split off another large test set, and evaluate Giskard's alerts (from an independent test set) on it.
2 potential concerns:
I'm not advocating for displaying hypothesis test results to users. But I do think that running good testing procedures in the background will help in filtering out false alerts.
In case you end up going down this route again, the Benjamini-Hochberg procedure is a super easy and fast way to control the false discovery/alert rate. It seems more applicable to Giskard than sequential correction procedures.

Second reply
A test for relative difference in (mean) score could work. Assuming higher scores are better:

H0: (complement score - slice score)/(complement score) = 1/5
H1: (complement score - slice score)/(complement score) > 1/5

The null value, 1/5, was chosen assuming that the user only cares about differences where the model performs 80% as well (or worse) on the selected slice as it does on the complement. Feel free to decrease it to e.g., 1/10, b/c there's some tolerance for false positives. Avoid worrying about analytically computing the distribution of the test statistic by running a permutation test. All you have to do is supply a function which computes the relative difference in means as the

Everything else you mentioned makes sense. Thank you for the discussion!
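For reference, the permutation test on the relative difference in means described in the second reply could be sketched roughly as below. This is only an illustration, not Giskard's implementation: the function names (`relative_drop`, `permutation_pvalue`) and parameters are hypothetical. One assumption is not from the thread: to test a non-zero null value such as 1/5, the slice scores are divided by (1 - null value) so that both groups share the same mean under H0 before labels are shuffled. Per-sample scores are assumed to be non-negative with a positive mean.

```python
import random
from statistics import fmean


def relative_drop(slice_scores, complement_scores):
    """Relative difference in mean score: (complement - slice) / complement."""
    comp = fmean(complement_scores)
    return (comp - fmean(slice_scores)) / comp


def permutation_pvalue(slice_scores, complement_scores, null_drop=0.2,
                       n_resamples=2000, seed=0):
    """One-sided p-value for H1: relative drop > null_drop.

    Assumption (not from the thread): slice scores are rescaled by
    1 / (1 - null_drop), so that under H0 both groups have the same mean
    and shuffling group labels is a reasonable approximation.
    """
    rng = random.Random(seed)
    rescaled = [s / (1.0 - null_drop) for s in slice_scores]
    observed = relative_drop(rescaled, complement_scores)

    pooled = rescaled + list(complement_scores)
    n_slice = len(rescaled)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random reassignment of slice/complement labels
        if relative_drop(pooled[:n_slice], pooled[n_slice:]) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Toy per-sample scores (1.0 = correct prediction, 0.0 = incorrect).
complement = [1.0] * 80 + [0.0] * 20   # mean 0.8
bad_slice = [1.0] * 10 + [0.0] * 40    # mean 0.2
ok_slice = [1.0] * 40 + [0.0] * 10     # mean 0.8

p_bad = permutation_pvalue(bad_slice, complement)
p_ok = permutation_pvalue(ok_slice, complement)
```

With these toy inputs, `p_bad` should come out small and `p_ok` large. This sketch only covers metrics that are a mean of per-sample scores; metrics like balanced accuracy would need a different statistic.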
This follows feedback from user KD_A on Reddit, who recommends sounder handling of statistical significance to prevent selection bias, in particular using the Benjamini-Hochberg procedure to control the false discovery rate.
The problem is that we currently test several data slice candidates + metric without accounting for selection bias → this can lead to a high number of false positive detections.
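As a rough illustration of the filtering step, here is a minimal sketch of the Benjamini-Hochberg step-up procedure in plain Python. The function name `benjamini_hochberg` and the example p-values are hypothetical and not taken from the Giskard codebase; it assumes one p-value has already been computed per slice + metric candidate.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a detection is kept at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k (1-based) such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Keep every candidate ranked at or below max_k (step-up rule).
    keep = [False] * m
    for idx in order[:max_k]:
        keep[idx] = True
    return keep


# Hypothetical p-values for ten slice + metric candidates.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
kept = benjamini_hochberg(p_values, alpha=0.05)
```

Here only the first two candidates survive, even though five have p < 0.05 individually. The procedure controls the false discovery rate when the tests are independent or positively dependent, which may not hold exactly for overlapping slices.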
To do
PerformanceBiasDetector
and filter the detections based on their p-value with Benjamini-Hochberg procedure.

From SyncLinear.com | GSK-1279