CLDSRV-573: Fix crash because of prom-client timeout #5694

Merged: 2 commits, Nov 6, 2024

@BourgoisMickael BourgoisMickael commented Nov 6, 2024

Instead of crashing, the server now stays alive, returns a 500 with body {"message":"Error: Operation timed out."}, and logs:

{"name":"S3","clientIP":"::1","clientPort":53518,"httpMethod":"GET","httpURL":"/metrics","err":{"message":"Operation timed out."},"time":1730901149499,"req_id":"31b0c58cb14cad5c9583","elapsed_ms":5002.625237,"level":"warn","message":"monitoring error","hostname":"MDM-RING-46789-store-1","pid":115}

For other Arsenal errors, the log entry also carries the message field: "err":{"MethodNotAllowed":true,"message":"The specified method is not allowed against this resource."}
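
For illustration, the behavior described above can be sketched as follows. This is a minimal sketch and not the actual change in lib/utilities/monitoringHandler.js: `toMonitoringFailure`, `handleMetricsRequest`, and `collectMetrics` are hypothetical names, and `log` is assumed to be any logger exposing a `warn(message, fields)` method.

```javascript
'use strict';

// Hypothetical helper: map an error raised while collecting metrics to the
// HTTP response and log fields described in this comment.
function toMonitoringFailure(err) {
    return {
        statusCode: 500,
        // String(err) on an Error yields "Error: <message>".
        body: JSON.stringify({ message: String(err) }),
        // Error properties are not enumerable, so copy the message explicitly.
        logFields: { err: { message: err.message } },
    };
}

// Hypothetical handler: `collectMetrics` stands in for the asynchronous call
// that gathers metrics from the workers and may reject on timeout.
async function handleMetricsRequest(collectMetrics, res, log) {
    try {
        const metrics = await collectMetrics();
        res.writeHead(200, { 'Content-Type': 'text/plain' });
        res.end(metrics);
    } catch (err) {
        // Answer with an error instead of letting the rejection escape
        // and bring the process down.
        const failure = toMonitoringFailure(err);
        log.warn('monitoring error', failure.logFields);
        res.writeHead(failure.statusCode, { 'Content-Type': 'application/json' });
        res.end(failure.body);
    }
}
```

The point of the sketch is only that the timeout is caught and turned into a response and a warning, so the process keeps running.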

These changes will not go into ZENKO, as it does not use the cluster module with prom-client.

Important

This problem happens often on low-resource platforms (like CI).
This will fix many flaky CI runs on Federation, which fail because the step "Check if s3 Prometheus exporter is active" retries 3 times with a small delay, crashing s3 multiple times.

bert-e commented Nov 6, 2024

Hello bourgoismickael,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@dvasilas dvasilas left a comment

minor (very minor if this only happens in the CI):
"err":{},"results":{"message":"Error: Operation timed out."}

The results field is a bit unintuitive I think (I wouldn't grep for results in the logs).

Could we maybe have something like "error":{"message":"Error: Operation timed out."} ?

@BourgoisMickael BourgoisMickael force-pushed the bugfix/CLDSRV-573-prom-client branch from 7a490b8 to cbaa5d6 Compare November 6, 2024 12:27

bert-e commented Nov 6, 2024

Integration data created

I have created the integration data for the additional destination branches.

The following branches will NOT be impacted:

  • development/7.10
  • development/7.4

You can set option create_pull_requests if you need me to create
integration pull requests in addition to integration branches, with:

@bert-e create_pull_requests

The following options are set: approve

anurag4DSB commented Nov 6, 2024

minor (very minor if this only happens in the CI): "err":{},"results":{"message":"Error: Operation timed out."}

The results field is a bit unintuitive I think (I wouldn't grep for results in the logs).

Could we maybe have something like "error":{"message":"Error: Operation timed out."} ?

I haven't seen the issue in production labs, but I have seen it in pre-production labs; one of them was FreePro.
Edit: It happens during CS startup time.

@anurag4DSB anurag4DSB left a comment

We need better errors in the logs.
I want to approve, but since the /approve option is already set, the pull request would be merged automatically without control.

lib/utilities/monitoringHandler.js
@@ -48,7 +56,7 @@ function monitoringHandler(clientIP, req, res, log) {
function monitoringEndHandler(err, results) {

@anurag4DSB anurag4DSB Nov 6, 2024

Worth renaming this method to handleMonitoringResponse.
Probably out of scope for this PR.

Fix crashes of the primary caused by the prom-client 5s timeout.
Most likely to happen at startup, when workers are not ready.
Should also fix the "write EPIPE" error in workers by preventing
the primary from crashing.
@BourgoisMickael BourgoisMickael force-pushed the bugfix/CLDSRV-573-prom-client branch from cbaa5d6 to 9981e50 Compare November 6, 2024 13:56
@BourgoisMickael (author) commented

/create_integration_branches

@BourgoisMickael (author) commented

/approve

bert-e commented Nov 6, 2024

Integration data created

I have created the integration data for the additional destination branches.

The following branches will NOT be impacted:

  • development/7.10
  • development/7.4

You can set option create_pull_requests if you need me to create
integration pull requests in addition to integration branches, with:

@bert-e create_pull_requests

The following options are set: approve, create_integration_branches

bert-e commented Nov 6, 2024

I have successfully merged the changeset of this pull request
into the targeted development branches:

  • ✔️ development/7.70

  • ✔️ development/8.6

  • ✔️ development/8.7

  • ✔️ development/8.8

The following branches have NOT changed:

  • development/7.10
  • development/7.4

Please check the status of the associated issue CLDSRV-573.

Goodbye bourgoismickael.

The following options are set: approve, create_integration_branches

@bert-e bert-e merged commit 9981e50 into development/7.70 Nov 6, 2024
11 checks passed
@bert-e bert-e deleted the bugfix/CLDSRV-573-prom-client branch November 6, 2024 14:14