-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLDSRV-573: Fix crash because of prom-client timeout #5694
Conversation
Hello bourgoismickael,My role is to assist you with the merge of this Available options
Available commands
Status report is not available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor (very minor if this only happens in the CI):
"err":{},"results":{"message":"Error: Operation timed out."}
The results
field is a bit unintuitive I think (I wouldn't grep for results in the logs).
Could we maybe have something like "error":{"message":"Error: Operation timed out."}
?
7a490b8
to
cbaa5d6
Compare
Integration data createdI have created the integration data for the additional destination branches.
The following branches will NOT be impacted:
You can set option
The following options are set: approve |
I haven't seen the issue in production labs but I have in pre-production labs, one of them was FreePro. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need better errors in logs
I want to approve, but as the /approve command is there, it will be merged automatically without control.
@@ -48,7 +56,7 @@ function monitoringHandler(clientIP, req, res, log) { | |||
function monitoringEndHandler(err, results) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth renaming this method to handleMonitoringResponse
Probably out of scope of this PR.
Fix crashes of primary because of prom-client 5s timeout. Mostly to happen at startup when workers are not ready. Should also fix error write EPIPE in workers by preventing primary to crash.
cbaa5d6
to
9981e50
Compare
/create_integration_branches |
/approve |
Integration data createdI have created the integration data for the additional destination branches.
The following branches will NOT be impacted:
You can set option
The following options are set: approve, create_integration_branches |
I have successfully merged the changeset of this pull request
The following branches have NOT changed:
Please check the status of the associated issue CLDSRV-573. Goodbye bourgoismickael. The following options are set: approve, create_integration_branches |
Instead of crashing it will now stay alive, return a 500 with body
{"message":"Error: Operation timed out."}
and logFor other arsenal error we will have the message field:
"err":{"MethodNotAllowed":true,"message":"The specified method is not allowed against this resource."}
This changes will not go into ZENKO as they don't use the cluster module with prom-client
Important
This problem happens often in low resource platform (like CI).
This will fix many flaky CI on Federation that fails because the step
Check if s3 Prometheus exporter is active
retry 3 times ith small delay, crashing s3 multiple time