handler not triggering resolve action for an event transitioned from crit -> warn -> ok #103
Comments
Yea, we've "known" about this for a while. We've been "working around" it by simply not sending checks that warn to have You can't short-circuit it too aggressively, otherwise users will get irc and email "resolves" when they never got the original crit? I'm glad you have the test cases. If you can add the warn->ok test case and make it pass, I would accept a PR!
@solarkennedy is right. you cannot always handle the resolve event. so you have to be careful. upstream sensu added
An appropriate thing to do would be implement a a work around would be to just update sensu_handlers filter_repeated to do the another description of the issue as it was in sensu sensu/sensu-extensions#11 (comment) |
on 2016-08-23 fess commented on the fix in sensu-extensions-occurrences:
Yelp#103: the handler should now fire a resolve if a create action was triggered for the event earlier. The handler should not fire a resolve if no create action was triggered for the event earlier.
Note #106.
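The behaviour described in this comment can be sketched roughly as below. This is only an illustration, not the code from the fix: `ResolveGate`, its `pass?` method, and the in-memory hash are hypothetical names and choices, and a real handler would need to keep this state somewhere that survives between events.

```ruby
# Hypothetical sketch: let a resolve through only if a create was passed earlier.
class ResolveGate
  def initialize(initial_failing_occurrences)
    @threshold = initial_failing_occurrences
    @alerted = {} # event key => true once a create has been passed on
  end

  # Returns true when the event should reach the handler.
  def pass?(key, action, occurrences)
    if action == 'resolve'
      # Hash#delete returns the stored value, or nil if no create was recorded.
      !!@alerted.delete(key)
    elsif occurrences - @threshold >= 1
      @alerted[key] = true
      true
    else
      false
    end
  end
end

gate = ResolveGate.new(4)
gate.pass?('client/check', 'create', 5)  # crit past the threshold: passed
gate.pass?('client/check', 'create', 1)  # warn with reset occurrences: held back
gate.pass?('client/check', 'resolve', 1) # resolve is passed, a create went out earlier
```

With a fresh gate that never passed a create, the same resolve is held back, which avoids the unsolicited "resolves" mentioned in the first comment.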
TL;DR
It looks like sensu does not pass the resolve event through to handlers when a check resolves like
crit -> warn -> ok (in some cases, see below)
but it does pass the resolve event through when the check resolves like
crit -> ok
While transitioning from action = create (crit) to action = create (warn), sensu resets occurrences to 1.
https://github.com/sensu/sensu/blob/bb91fea7797d2402349a3e86b7cd3f43b78621c8/lib/sensu/server/process.rb#L549
While transitioning from action = create (crit) to action = resolve (ok), sensu retains the occurrences from the create.
https://github.com/sensu/sensu/blob/bb91fea7797d2402349a3e86b7cd3f43b78621c8/lib/sensu/server/process.rb#L536
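The two rules above can be modelled with a small function. This is a simplified sketch of the behaviour as described here, not the code in process.rb; `next_occurrences` is a hypothetical name.

```ruby
# Hypothetical model of the occurrence counter described above (not sensu's code).
def next_occurrences(previous_status, previous_occurrences, new_status)
  if new_status == 0
    # resolve (ok): the count from the last failing event is retained
    previous_occurrences
  elsif new_status == previous_status
    # same failing status again: keep counting
    previous_occurrences + 1
  else
    # status changed between failing states (e.g. crit -> warn): reset
    1
  end
end

next_occurrences(2, 5, 1) # crit -> warn: 1
next_occurrences(2, 5, 0) # crit -> ok:   5
```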
Given a scenario:
interval: 10
alert_after: 40
initial_failing_occurrence = 4 (from https://github.com/Yelp/sensu_handlers/blob/master/files/base.rb#L231)
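The arithmetic used in the walkthrough can be written out as follows. This is a sketch that assumes the threshold is the integer division alert_after / interval, which matches 40 / 10 = 4 in this scenario; the method names are illustrative and the actual code at the base.rb link may differ.

```ruby
# Assumption: threshold = alert_after / interval (integer division).
def initial_failing_occurrences(interval, alert_after)
  interval > 0 ? alert_after / interval : 0
end

# Occurrences counted beyond the alert_after window.
def number_of_failed_attempts(occurrences, interval, alert_after)
  occurrences - initial_failing_occurrences(interval, alert_after)
end

# True while the event is still inside the alert_after window ("do not trigger").
def below_alert_threshold?(occurrences, interval, alert_after)
  number_of_failed_attempts(occurrences, interval, alert_after) < 1
end

below_alert_threshold?(4, 10, 40) # true:  4 - 4 = 0 < 1
below_alert_threshold?(5, 10, 40) # false: 5 - 4 = 1
```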
simulate trigger incident
occurrences = 1
action = create
status = 2
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger
occurrences = 2
action = create
status = 2
number_of_failed_attempts = 2 - 4 = -2 < 1 -> do not trigger
occurrences = 3
action = create
status = 2
number_of_failed_attempts = 3 - 4 = -1 < 1 -> do not trigger
occurrences = 4
action = create
status = 2
number_of_failed_attempts = 4 - 4 = 0 < 1 -> do not trigger
occurrences = 5
action = create
status = 2
number_of_failed_attempts = 5 - 4 = 1 (!<1) -> trigger (create PD !!)
#crit -> warn
#occurrences set to 1 after status = 1
https://github.com/sensu/sensu/blob/bb91fea7797d2402349a3e86b7cd3f43b78621c8/lib/sensu/server/process.rb#L549
occurrences = 1
action = create
status = 1
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger
simulate resolve incident
#immediately warn -> ok
#occurrences retained from warn after status = 0
occurrences = 1
action = resolve
status = 0
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger
After this the event is deleted. As a result, the sensu handler never passes the resolve event to the PD handler (to resolve the incident),
which leaves orphaned PD incidents.
Whereas (simulating the case where occurrences has not been reset to 1):
simulate trigger incident
occurrences = 1
action = create
status = 2
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger
occurrences = 2
action = create
status = 2
number_of_failed_attempts = 2 - 4 = -2 < 1 -> do not trigger
occurrences = 3
action = create
status = 2
number_of_failed_attempts = 3 - 4 = -1 < 1 -> do not trigger
occurrences = 4
action = create
status = 2
number_of_failed_attempts = 4 - 4 = 0 < 1 -> do not trigger
occurrences = 5
action = create
status = 2
number_of_failed_attempts = 5 - 4 = 1 (!<1) -> trigger (create PD !!)
occurrences = 6
action = create
status = 2
number_of_failed_attempts = 6 - 4 = 2 (!<1) -> do not trigger until next alert_after …and so on.
simulate resolve incident
#crit -> ok
#occurrences retained from crit after status = 0
occurrences = 6
action = resolve
status = 0
number_of_failed_attempts = 6 - 4 = 2 (!<1) -> trigger (resolve PD !!)
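Both walkthroughs can be replayed with a short script. It is a simplified model under the assumptions stated in this issue (occurrences reset on a change between failing statuses, retained on resolve, threshold of 4). It models only the number_of_failed_attempts check, not the re-alert interval, and `replay` is a hypothetical helper.

```ruby
INTERVAL = 10
ALERT_AFTER = 40
THRESHOLD = ALERT_AFTER / INTERVAL # 4, as in the scenario above

# Replays a list of check statuses and returns one row per event:
# [occurrences, action, number_of_failed_attempts, passes_threshold]
def replay(statuses)
  occurrences = 0
  previous = nil
  statuses.map do |status|
    if status == 0
      action = :resolve # occurrences retained from the last failing event
    else
      action = :create
      # reset to 1 when the failing status changes, otherwise keep counting
      occurrences = status == previous ? occurrences + 1 : 1
    end
    previous = status
    failed = occurrences - THRESHOLD
    [occurrences, action, failed, failed >= 1]
  end
end

puts replay([2, 2, 2, 2, 2, 1, 0]).last.inspect # crit -> warn -> ok
# prints [1, :resolve, -3, false]
puts replay([2, 2, 2, 2, 2, 2, 0]).last.inspect # crit -> ok
# prints [6, :resolve, 2, true]
```

The final row of the first run shows the resolve being held back, while in the second run it passes.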
Not sure I was able to explain this clearly; it is mostly code references.
But I guess this is why we have had some cases where PD incidents were not cleared from PagerDuty even though the events had resolved in sensu.
Fix:
https://github.com/Yelp/sensu_handlers/blob/master/files/base.rb#L231
We short-circuit filter_repeated so that events with action: resolve are always passed to the handler.
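A minimal sketch of that short-circuit, assuming the threshold check works as in the walkthrough above. `filtered_out?` is a hypothetical stand-in for the relevant part of filter_repeated, not the actual method in base.rb.

```ruby
# Hypothetical stand-in for the threshold part of filter_repeated.
# Returns true when the event should NOT reach the handler.
def filtered_out?(event, initial_failing_occurrences)
  # Proposed short-circuit: a resolve is never filtered.
  return false if event['action'] == 'resolve'

  (event['occurrences'] - initial_failing_occurrences) < 1
end

# warn -> ok resolve with occurrences reset to 1 now gets through:
filtered_out?({ 'action' => 'resolve', 'occurrences' => 1 }, 4) # false
# creates inside the alert_after window are still held back:
filtered_out?({ 'action' => 'create', 'occurrences' => 1 }, 4)  # true
```

As noted in the comments, an unconditional short-circuit like this can also send resolves for events that never produced an alert, so it trades orphaned incidents for possible unsolicited resolves.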