-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
idea: Margin of error for coverage thresholds #375
Comments
If the result will be the same, i.e. an alert, I would rather lower 0.95 to 0.9 than add a new setting that effectively does the same. I would understand if we wanted to have different levels of alerts. If we want to make 0.9-0.94 a warning and 0.89 and lower an error, then a threshold makes sense to me. Although it could be implemented as an absolute value (e.g. 0.9) rather than a relative one (0.05). |
This is what I did on my jobs.
You still have to calculate this even if you reduced it to a single variable.
I don't agree with this approach. If the job has a high variability, the lower tolerance value should be calculated before the job goes into production. If the variability starts after it was already in production, then the monitors needs to be triggered to alert of the change, so developers can evaluate the new situation and decide if the previous tolerances are still valid of if they need to figure out new ones.
This is a dangeours approach as it would allow for a slow degradation over time that is not picked by any monitor |
Thanks for the replies! Maybe we could have different severity levels for those alerts, and developers could set how each severity level is forwarded to them. But I imagine that would require careful consideration for how to implement it so monitors can support them, as well as some analysis of which other cases could benefit from it. I like that idea though! I wonder if we could approach this threshold problem another way? Maybe we could develop some tools to help calculate and set those thresholds more easily, as it seems to be a common problem among internal projects.
There are ways to mitigate this, like comparing against the average from different timeframes (last job, last month, last year). But it does indeed open the door to missing gradual degradations, but maybe that's a risk some projects might be fine to take. I wouldn't use that approach to replace the current coverage one, but maybe it's a viable alternative that could be offered in parallel to this one. |
I think this can be solved by using a custom monitor that inherits from the default coverage monitor. Different Monitor suites can take care of the different alerting levels. I will close this issue as I don't think we actually need to change Spidermon core just for this, please reopen if you think otherwise. |
Reopening this ticket after our discussions with Victor, in which users might find this setting helpful when they have several coverage thresholds and might not want to update them one by one every time they need to adjust the tolerance margin. |
Background
FieldCoverageMonitor is configured via the
SPIDERMON_FIELD_COVERAGE_RULES
setting. This dictionary defines how much coverage we expect from each field (expressed as a float number between 0 and 1, 1 being 100% coverage).These thresholds are sometimes too strict. For example, if we set up a field coverage of 95% (0.95) on a field, if the actual field coverage ends up at 94.9%, we'll get an alert from Spidermon. Those kinds of close coverage drops aren't usually an issue, and devs tend to ignore them, but they still trigger alerts that get mixed with more severe errors.
Proposal
We could allow a certain margin of error or tolerance over these thresholds. Say, for example, a ±5% allowance, which wouldn't trigger an alert for the case defined before.
This could also be configured via a general setting for all fields, though a field-level defined tolerance could exist.
Alternatives
We've discussed an alternative to solve this: to lower the existing thresholds. This should work, but in practice, sometimes it is hard to calculate precisely how much it should be reduced, and jobs with high coverage variability require constant updates. A new tolerance setting might need to be updated too from time to time but could give more flexibility.
Another solution could be to use past jobs to calculate the expected thresholds dynamically. For example, in one of our projects, we don't have fixed coverage thresholds, but thresholds for how much we allow the coverage to be reduced between jobs. However, this is an entirely different approach compared to FieldCoverageMonitor (it depends on ScrapyCloud, and FieldCoverageMonitor doesn't), so this could co-exist as another monitor.
Let me know what you think! We've been discussing this in the monitoring chapter at Zyte and thought this could be beneficial for a lot of existing projects /cc @Chandral @mrwbarg
The text was updated successfully, but these errors were encountered: