This repository has been archived by the owner on Nov 20, 2019. It is now read-only.

handler not triggering resolve action for an event transitioned from crit -> warn -> ok #103

Closed
cabecada opened this issue Dec 28, 2016 · 4 comments

Comments

@cabecada
Contributor

TL;DR

It looks like the resolve event does not reach the handlers when a check resolves like
crit -> warn -> ok (in some cases, see below),

but the resolve event does reach the handlers when the check resolves like
crit -> ok

When sensu transitions from action = create (crit) to action = create (warn), it resets occurrences to 1.
https://github.com/sensu/sensu/blob/bb91fea7797d2402349a3e86b7cd3f43b78621c8/lib/sensu/server/process.rb#L549

When sensu transitions from action = create (crit) to action = resolve (ok), it retains the occurrences from create.
https://github.com/sensu/sensu/blob/bb91fea7797d2402349a3e86b7cd3f43b78621c8/lib/sensu/server/process.rb#L536

given a scenario

interval: 10
alert_after: 40
initial_failing_occurrence = 4 (from https://github.com/Yelp/sensu_handlers/blob/master/files/base.rb#L231)
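The arithmetic used in the walkthroughs in this comment can be sketched in a few lines of Ruby. This is only an illustration of the numbers in this scenario, not the handler's actual code: the method names are hypothetical, and the assumption is that the threshold is alert_after divided by interval (40 / 10 = 4 here).

```ruby
# Hypothetical helpers that mirror the scenario's arithmetic.
# Assumption: the threshold is alert_after / interval (integer division).
def initial_failing_occurrences(interval, alert_after)
  interval > 0 ? alert_after / interval : 0
end

# How far past the threshold the event is; values below 1 mean
# "not failing long enough", so the event is filtered out.
def number_of_failed_attempts(occurrences, interval, alert_after)
  occurrences - initial_failing_occurrences(interval, alert_after)
end

def past_threshold?(occurrences, interval, alert_after)
  number_of_failed_attempts(occurrences, interval, alert_after) >= 1
end
```

With interval 10 and alert_after 40, occurrences 1 through 4 stay below the threshold and occurrence 5 is the first one that passes.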

simulate trigger incident

occurrences = 1
action = create
status = 2
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

occurrences = 2
action = create
status = 2
number_of_failed_attempts = 2 - 4 = -2 < 1 -> do not trigger

occurrences = 3
action = create
status = 2
number_of_failed_attempts = 3 - 4 = -1 < 1 -> do not trigger

occurrences = 4
action = create
status = 2
number_of_failed_attempts = 4 - 4 = 0 < 1 -> do not trigger

occurrences = 5
action = create
status = 2
number_of_failed_attempts = 5 - 4 = 1 (!<1) -> trigger (create PD !!)

#crit -> warn
#occurrences set to 1 after status = 1
https://github.com/sensu/sensu/blob/bb91fea7797d2402349a3e86b7cd3f43b78621c8/lib/sensu/server/process.rb#L549

occurrences = 1
action = create
status = 1
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

simulate resolve incident
#immediately warn -> ok
#occurrences retained from warn after status = 0
occurrences = 1
action = resolve
status = 0
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

After this the event is deleted. Because the resolve event was filtered out, it never reaches the PD handler (to resolve the incident),
and the result is orphaned PD incidents.

Compare this with the case where occurrences is not reset to 1 before the resolve:

simulate trigger incident

occurrences = 1
action = create
status = 2
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

occurrences = 2
action = create
status = 2
number_of_failed_attempts = 2 - 4 = -2 < 1 -> do not trigger

occurrences = 3
action = create
status = 2
number_of_failed_attempts = 3 - 4 = -1 < 1 -> do not trigger

occurrences = 4
action = create
status = 2
number_of_failed_attempts = 4 - 4 = 0 < 1 -> do not trigger

occurrences = 5
action = create
status = 2
number_of_failed_attempts = 5 - 4 = 1 (!<1) -> trigger (create PD !!)

occurrences = 6
action = create
status = 2
number_of_failed_attempts = 6 - 4 = 2 (!<1) -> do not trigger until next alert_after …and so on.

simulate resolve incident
#crit -> ok
#occurrences retained from crit after status = 0
occurrences = 6
action = resolve
status = 0
number_of_failed_attempts = 6 - 4 = 2 (!<1) -> trigger (resolve PD !!)
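
Both walkthroughs can be replayed with a small Ruby sketch. It is a simplified model of the behaviour described in this comment, not sensu's code: `replay` and `REPLAY_THRESHOLD` are hypothetical names, occurrences restart at 1 whenever the non-zero status changes, a status of 0 keeps the stored count, and realert/backoff logic is ignored.

```ruby
# Threshold from the scenario: alert_after / interval = 40 / 10.
REPLAY_THRESHOLD = 4

# Replays a list of check statuses (2 = crit, 1 = warn, 0 = ok) and returns
# one hash per event with the occurrences count the handler would see.
def replay(statuses)
  occurrences = 0
  previous = nil
  statuses.map do |status|
    if status.zero?
      action = 'resolve' # occurrences retained from the previous event
    else
      # Same failing status: count up. Different failing status: reset to 1.
      occurrences = status == previous ? occurrences + 1 : 1
      action = 'create'
    end
    previous = status
    {
      'status' => status,
      'action' => action,
      'occurrences' => occurrences,
      'past_threshold' => occurrences - REPLAY_THRESHOLD >= 1
    }
  end
end

crit_warn_ok = replay([2, 2, 2, 2, 2, 1, 0])
crit_ok      = replay([2, 2, 2, 2, 2, 2, 0])
```

In this model the last event of `crit_warn_ok` is a resolve with occurrences 1, so it stays below the threshold, while the last event of `crit_ok` is a resolve with occurrences 6 and passes.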

Not sure I was able to explain this clearly; it is mostly code references.

But I guess this is the reason why we have had some cases where PD incidents were not cleared from PagerDuty even though the underlying checks had resolved.

fix:

https://github.com/Yelp/sensu_handlers/blob/master/files/base.rb#L231
We short-circuit filter_repeated so that the event is passed to the handler when action is resolve.
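
A minimal sketch of what such a short circuit could look like, assuming an event hash with 'action' and 'occurrences' keys. The method name and signature are hypothetical and this is not the code in base.rb; it only shows the idea of skipping the occurrences check for resolve events.

```ruby
# Returns true when the event should be dropped before reaching the handler.
# Hypothetical sketch: resolve events bypass the occurrences threshold.
def filtered_with_short_circuit?(event, initial_failing_occurrences)
  # Short circuit: always let resolve events through.
  return false if event['action'] == 'resolve'

  # Otherwise apply the usual "not failing long enough" check.
  event['occurrences'] - initial_failing_occurrences < 1
end
```

Note the trade-off: with this version a resolve is also passed for an incident that never got past the threshold, so a handler can see a resolve without ever having seen the matching create.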

@solarkennedy
Contributor

Yea, we've "known" about this for a while. We've been "working around" it by simply not setting page => true on checks that can warn.

You can't short-circuit it too aggressively, otherwise users will get IRC and email "resolves" when they never got the original crit.

I'm glad you have the test cases. If you can add the warn->ok test case and make it pass, I would accept a PR!

@fessyfoo
Contributor

fessyfoo commented Dec 29, 2016

@solarkennedy is right. You cannot always handle the resolve event, so you have to be careful,
or you will sometimes handle a resolve event where no create event was handled (i.e. the whole thing was intended to be filtered because there were not enough occurrences).

Upstream sensu added occurrences_watermark to deal with the subtleties and implemented use of it in the occurrences extension. They've also moved filter_repeated out of sensu-plugin, which makes this whole issue harder to track. ;)

Yelp/sensu_handlers implements a different occurrences filtering algorithm than sensu-plugins or now sensu-extensions-occurrences, one that includes exponential backoff and different controlling properties in the events.

An appropriate thing to do would be to implement an occurrences_watermark based filter extension as a counterpart to https://github.com/sensu-extensions/sensu-extensions-occurrences

a work around would be to just update sensu_handlers filter_repeated to do the occurrences_watermark based filtering like sensu-extensions-occurrences
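
A rough sketch of what a watermark-based check for resolve events might look like. The method name is hypothetical, and it assumes that occurrences_watermark holds the highest occurrences count the event reached before it resolved, and that older events may not carry the key at all. It is not the code from sensu-extensions-occurrences.

```ruby
# Decides whether a resolve event should be passed on to the handler.
# Assumption: 'occurrences_watermark' is the highest occurrences count the
# event reached, so it survives the reset that happens on crit -> warn.
def resolve_passes_filter?(event, initial_failing_occurrences)
  return false unless event['action'] == 'resolve'

  # Fall back to plain occurrences when the watermark is not present.
  high_water = event.fetch('occurrences_watermark', event['occurrences'])

  # Pass the resolve only if a create could have passed the threshold earlier.
  high_water - initial_failing_occurrences >= 1
end
```

Under these assumptions a crit -> warn -> ok resolve with occurrences 1 and a watermark of 5 passes a threshold of 4, while a resolve whose watermark never got past the threshold is still filtered, which avoids sending a resolve for an incident that was never created.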

Another description of the issue, as it appeared in sensu: sensu/sensu-extensions#11 (comment)

@fessyfoo
Contributor

Yea, we've "known" about this for a while.

on 2016-08-23 fess commented on the fix in sensu-extensions-occurrences:

the irony here is i've been thinking about this occurrences issue with you, but we're using Yelp/sensu_handlers, so we're going to need to implement a similar solution for deprecating the overridden filter_repeated. ( /cc @solarkennedy fyi )

cabecada pushed a commit to cabecada/sensu_handlers that referenced this issue Mar 15, 2017
Yelp#103
handler should now fire a resolve if the event was triggered
for create action earlier.
handler should not fire a resolve if the event was not triggered
for create action earlier.
@fessyfoo
Contributor

a work around would be to just update sensu_handlers filter_repeated to do the occurrences_watermark based filtering like sensu-extensions-occurrences

note #106
