exometer_proc timeout #99
Comments
Hi Irina! Odd that it's the … Can you reproduce the problem? |
Hmm, I take that back (well, not the "should never" ...). In do_report/2 (https://github.com/Feuerlabs/exometer_core/blob/master/src/exometer_report.erl#L1213) [handle_info(report_batch, ...)], which is called from the handle_info() clauses for report and report_batch, exometer_report calls exometer:get_value/2, which *may* end up doing an exometer_proc call to a probe, which would use exometer_proc:call(). Chances are, then, that one of your probes becomes unresponsive. Of course, the exometer_report process should not crash because of that – it should not even be blocked. I'll have to look into that. |
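To make that failure mode concrete, here is a minimal, self-contained sketch of a process that speaks the same message protocol as exometer_proc:call/2 (whose source is quoted further down in this thread) but answers too slowly. The module name, the 10-second sleep, and the {ok, []} reply term are all illustrative, not exometer code. Calling exometer_proc:call/2 on such a process hits the 5-second after clause and exits with timeout, which is exactly the error that takes the exometer_report gen_server down.
```erlang
-module(slow_probe_sketch).
-export([start/0]).

%% Spawn a fake "probe" that understands the exometer_proc call protocol
%% but takes 10 seconds to answer, longer than the 5-second timeout in
%% exometer_proc:call/2, so the caller exits with `timeout`.
start() ->
    spawn(fun loop/0).

loop() ->
    receive
        {exometer_proc, {From, Ref}, _Req} ->
            timer:sleep(10000),       %% simulate a probe stuck under load
            From ! {Ref, {ok, []}},   %% reply arrives far too late
            loop()
    end.
```
Under high load, a real probe can end up in the same state simply because requests arrive faster than it can sample.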
Hi Ulf,

Apparently I had this message below sitting in this browser window unsent for 9 days and just noticed it (they were crazy 9 days, but still...). Very sorry for not replying sooner!!! :(

***

Thanks so much Ulf! That gives me something to look into on my end as well. Unfortunately I could reproduce it every time I run a high-load test on our server, and pretty soon, right around the time when I need to start examining the stats, exometer_proc crashes and the stats disappear.

What I did as a workaround: I removed the timeout completely from exometer_proc so it receives without any timeout, and now exometer_proc doesn't crash at all (possibly I'm thus sweeping the real cause of the original issue under the rug).
```erlang
%% Modified exometer_proc:call/2: the `after 5000` timeout clause is
%% commented out, so the call blocks until the probe replies (or dies)
%% instead of erroring out with `timeout` after 5 seconds.
call(Pid, Req) ->
    MRef = erlang:monitor(process, Pid),
    Pid ! {exometer_proc, {self(), MRef}, Req},
    receive
        {MRef, Reply} ->
            erlang:demonitor(MRef, [flush]),
            Reply;
        {'DOWN', MRef, _, _, Reason} ->
            error(Reason)
    %% after 5000 ->
    %%     error(timeout)
    end.
```
…On Sat, Oct 21, 2017 at 10:58 AM, Ulf Wiger ***@***.***> wrote:
Hmm, I take that back (well, not the "should never" ...). In do_report/2
<https://github.com/Feuerlabs/exometer_core/blob/master/src/exometer_report.erl#L1213>
[handle_info(report_batch, ...)], which is called from the handle_info()
clauses for report and report_batch, exometer_report calls
exometer:get_value/2, which *may* end up doing an exometer_proc() call to
a probe, which would use exometer_proc:call().
Chances are then, that one of your probes becomes unresponsive. Of course,
the exometer_report process should not crash because of that – it should
not even be blocked. I'll have to look into that.
|
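A side note on the workaround above: with the after clause removed, a dead probe still produces a 'DOWN' message, but a live yet unresponsive probe will now block the caller indefinitely. Purely as an illustration (the three-argument call below is a sketch, not something exometer_proc is known to export), a middle ground in the spirit of the "significantly increasing this timeout in our fork" approach mentioned elsewhere in this issue would be to make the timeout a parameter rather than deleting it:
```erlang
%% Illustrative only: sketch of making the timeout a parameter instead of
%% removing it, keeping 5000 ms as the default so existing call sites
%% behave as before.
call(Pid, Req) ->
    call(Pid, Req, 5000).

call(Pid, Req, Timeout) ->
    MRef = erlang:monitor(process, Pid),
    Pid ! {exometer_proc, {self(), MRef}, Req},
    receive
        {MRef, Reply} ->
            erlang:demonitor(MRef, [flush]),
            Reply;
        {'DOWN', MRef, _, _, Reason} ->
            error(Reason)
    after Timeout ->
        erlang:demonitor(MRef, [flush]),   %% don't leak the monitor if the error is caught
        error(timeout)
    end.
```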
I have observed this problem intermittently. This is an example:

2018 Feb 27 18:02:08 52.23.248.173 erlang: error | gen_server exometer_report terminated with reason: timeout in exometer_proc:call/2 line 131

exometer_report is restarted, and the cycle continues. At the point this begins, I notice the following in general VM statistics:

Please let me know what other information might be helpful in solving this problem. |
@ruttenberg, that drop you observe, is it from stats reported by exometer or the actual numbers presented by the VM? |
That is from stats reported by exometer. |
Ok. I have the flu right now. I don't think I'll be able to do much about this today, but will gladly accept a PR that at least catches calls to … As to the problem of reporting drift if a particular metric keeps timing out, that's perhaps best handled by noticing it in the analytics backend and simply fixing the probe that keeps hanging. |
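For illustration, a guard along those lines might look like the sketch below. safe_get_value/2 is a hypothetical helper name, not existing exometer_report code; the call sites Ulf actually has in mind are listed in the next comment.
```erlang
%% Hypothetical helper: fetch a metric but turn a probe timeout (or any
%% other crash) into an error tuple, so the exometer_report gen_server
%% keeps running and merely skips that metric for this reporting cycle.
safe_get_value(Metric, DataPoints) ->
    try
        exometer:get_value(Metric, DataPoints)
    catch
        error:timeout ->
            {error, timeout};
        Class:Reason ->
            {error, {Class, Reason}}
    end.
```
The reporter can then log or skip that metric for the cycle instead of crashing.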
@uwiger Thanks. --Jon Ruttenberg |
@ruttenberg, for example: https://github.com/Feuerlabs/exometer_core/blob/master/src/exometer_report.erl#L1438. The … More appropriate places to put … would be … and https://github.com/Feuerlabs/exometer_core/blob/master/src/exometer_probe.erl#L592 (as well as #L595). |
@uwiger Some additional information: |
This started happening when we started testing the system under high load. We had never seen this problem until then, and it happens a few minutes into our high-load test. The problem is, exometer never recovers after that and we don't get any stats anymore. I tried significantly increasing this timeout in our fork of exometer_core, but that didn't help.