exometer_proc timeout #99
Comments
Hi Irina! Odd that it's the … Can you reproduce the problem? |
Hmm, I take that back (well, not the "should never" ...). In do_report/2 (https://github.com/Feuerlabs/exometer_core/blob/master/src/exometer_report.erl#L1213) [handle_info(report_batch, ...)], which is called from the handle_info() clauses for report and report_batch, exometer_report calls exometer:get_value/2, which *may* end up doing an exometer_proc call to a probe, which would use exometer_proc:call(). Chances are, then, that one of your probes becomes unresponsive. Of course, the exometer_report process should not crash because of that – it should not even be blocked. I'll have to look into that. |
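To make that failure mode concrete, here is a minimal, self-contained sketch of a process that speaks the same message protocol as exometer_proc:call/2 (whose source is quoted further down in this thread) but answers too slowly. The module name, the 10-second sleep, and the {ok, []} reply term are all illustrative, not exometer code. Calling exometer_proc:call/2 on such a process hits the 5-second after clause and exits with timeout, which is exactly the error that takes the exometer_report gen_server down.
```erlang
-module(slow_probe_sketch).
-export([start/0]).

%% Spawn a fake "probe" that understands the exometer_proc call protocol
%% but takes 10 seconds to answer, longer than the 5-second timeout in
%% exometer_proc:call/2, so the caller exits with `timeout`.
start() ->
    spawn(fun loop/0).

loop() ->
    receive
        {exometer_proc, {From, Ref}, _Req} ->
            timer:sleep(10000),       %% simulate a probe stuck under load
            From ! {Ref, {ok, []}},   %% reply arrives far too late
            loop()
    end.
```
Under high load, a real probe can end up in the same state simply because requests arrive faster than it can sample.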
Hi Ulf,

Apparently I had this message below sitting in this browser window unsent for 9 days and just noticed it (they were crazy 9 days, but still...). Very sorry for not replying sooner!!! :(

***

Thanks so much Ulf! That gives me something to look into on my end as well. Unfortunately I could reproduce it every time I run a high-load test on our server, and pretty soon, right around the time when I need to start examining the stats, exometer_proc crashes and the stats disappear.

What I did as a workaround: I removed the timeout completely from exometer_proc so it receives without any timeout, and now exometer_proc doesn't crash at all (possibly I'm thus sweeping the real cause of the original issue under the rug).
```erlang
%% Modified exometer_proc:call/2: the `after 5000` timeout clause is
%% commented out, so the call blocks until the probe replies (or dies)
%% instead of erroring out with `timeout` after 5 seconds.
call(Pid, Req) ->
    MRef = erlang:monitor(process, Pid),
    Pid ! {exometer_proc, {self(), MRef}, Req},
    receive
        {MRef, Reply} ->
            erlang:demonitor(MRef, [flush]),
            Reply;
        {'DOWN', MRef, _, _, Reason} ->
            error(Reason)
    %% after 5000 ->
    %%     error(timeout)
    end.
```
…On Sat, Oct 21, 2017 at 10:58 AM, Ulf Wiger ***@***.***> wrote:
Hmm, I take that back (well, not the "should never" ...). In do_report/2
<https://github.com/Feuerlabs/exometer_core/blob/master/src/exometer_report.erl#L1213>
[handle_info(report_batch, ...)], which is called from the handle_info()
clauses for report and report_batch, exometer_report calls
exometer:get_value/2, which *may* end up doing an exometer_proc() call to
a probe, which would use exometer_proc:call().
Chances are then, that one of your probes becomes unresponsive. Of course,
the exometer_report process should not crash because of that – it should
not even be blocked. I'll have to look into that.
|
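A side note on the workaround above: with the after clause removed, a dead probe still produces a 'DOWN' message, but a live yet unresponsive probe will now block the caller indefinitely. Purely as an illustration (the three-argument call below is a sketch, not something exometer_proc is known to export), a middle ground in the spirit of the "significantly increasing this timeout in our fork" approach mentioned elsewhere in this issue would be to make the timeout a parameter rather than deleting it:
```erlang
%% Illustrative only: sketch of making the timeout a parameter instead of
%% removing it, keeping 5000 ms as the default so existing call sites
%% behave as before.
call(Pid, Req) ->
    call(Pid, Req, 5000).

call(Pid, Req, Timeout) ->
    MRef = erlang:monitor(process, Pid),
    Pid ! {exometer_proc, {self(), MRef}, Req},
    receive
        {MRef, Reply} ->
            erlang:demonitor(MRef, [flush]),
            Reply;
        {'DOWN', MRef, _, _, Reason} ->
            error(Reason)
    after Timeout ->
        erlang:demonitor(MRef, [flush]),   %% don't leak the monitor if the error is caught
        error(timeout)
    end.
```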
I have observed this problem intermittently. This is an example:

2018 Feb 27 18:02:08 52.23.248.173 erlang: error | gen_server exometer_report terminated with reason: timeout in exometer_proc:call/2 line 131

exometer_report is restarted, and the cycle continues. At the point this begins, I notice the following in general VM statistics:

Please let me know what other information might be helpful in solving this problem. |
@ruttenberg, that drop you observe, is it from stats reported by exometer or the actual numbers presented by the VM? |
That is from stats reported by exometer. |
Ok. I have the flu right now. I don't think I'll be able to do much about this today, but will gladly accept a PR that at least catches calls to … As to the problem of reporting drift if a particular metric keeps timing out, that's perhaps best handled by noticing it in the analytics backend and simply fixing the probe that keeps hanging. |
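For illustration, a guard along those lines might look like the sketch below. safe_get_value/2 is a hypothetical helper name, not existing exometer_report code; the call sites Ulf actually has in mind are listed in the next comment.
```erlang
%% Hypothetical helper: fetch a metric but turn a probe timeout (or any
%% other crash) into an error tuple, so the exometer_report gen_server
%% keeps running and merely skips that metric for this reporting cycle.
safe_get_value(Metric, DataPoints) ->
    try
        exometer:get_value(Metric, DataPoints)
    catch
        error:timeout ->
            {error, timeout};
        Class:Reason ->
            {error, {Class, Reason}}
    end.
```
The reporter can then log or skip that metric for the cycle instead of crashing.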
@uwiger Thanks. --Jon Ruttenberg |
@ruttenberg, for example: https://github.com/Feuerlabs/exometer_core/blob/master/src/exometer_report.erl#L1438. The … More appropriate places to put … would be … and https://github.com/Feuerlabs/exometer_core/blob/master/src/exometer_probe.erl#L592 (as well as #L595). |
@uwiger Some additional information: |
This started happening when we started testing the system under high load. We had never seen this problem until then, and it happens a few minutes into our high-load test. The problem is, exometer never recovers after that and we don't get any stats anymore. I tried significantly increasing this timeout in our fork of exometer_core, but that didn't help.