Instability of artifact-caching-proxy #4442

Open
darinpope opened this issue Dec 6, 2024 · 5 comments

Comments

@darinpope

darinpope commented Dec 6, 2024

Service(s)

Artifact-caching-proxy

Summary

Bruno had to run the weekly BOM release process five times today (2024-12-06) because of errors like the following:

  • Could not transfer artifact com.google.crypto.tink:tink:jar:1.10.0 from/to azure-aks-internal (http://artifact-caching-proxy.artifact-caching-proxy.svc.cluster.local:8080/): Premature end of Content-Length delimited message body (expected: 2,322,048; received: 1,572,251)
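To make the size of the truncation concrete, the byte counts can be pulled out of the Maven message. This is a minimal sketch; the function name `truncation` is hypothetical, and the pattern is derived only from the single message quoted above.

```python
import re

# Matches the byte counts in the "Premature end of Content-Length" message.
_COUNTS = re.compile(r"expected: ([\d,]+); received: ([\d,]+)")


def truncation(message):
    """Return (expected, received, missing) byte counts, or None if absent."""
    match = _COUNTS.search(message)
    if match is None:
        return None
    # Maven prints the counts with thousands separators; strip them first.
    expected, received = (int(g.replace(",", "")) for g in match.groups())
    return expected, received, expected - received


msg = ("Premature end of Content-Length delimited message body "
       "(expected: 2,322,048; received: 1,572,251)")
print(truncation(msg))
```

For the message above, the transfer stopped 749,797 bytes short of the announced length.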

Here's the issue where he tracked the build numbers so you can see the specific failures:

jenkinsci/bom#4066

I also had similar issues doing a BOM weekly-test against a core RC that I'm working on:

Since I started working on BOM a couple of months ago, this problem seems to be getting worse as the weeks progress.

Reproduction steps

Unfortunately, it is not reproducible on demand.

@darinpope darinpope added the triage Incoming issues that need review label Dec 6, 2024
@darinpope darinpope changed the title from "High number of" to "Instability of artifact-caching-proxy" Dec 6, 2024
@dduportal dduportal added this to the infra-team-sync-2024-12-10 milestone Dec 7, 2024
@dduportal dduportal removed the triage Incoming issues that need review label Dec 9, 2024
@dduportal dduportal self-assigned this Dec 9, 2024
@dduportal
Contributor

Starting to analyse the logs on the ACP side.

@dduportal
Contributor

For each of the failing requests found in the past 15 days (including each one you folks logged), ACP reported an error caused by the upstream, in one of the following categories:

  • upstream prematurely closed connection while reading upstream
  • peer closed connection in SSL handshake (104: Connection reset by peer) while SSL handshaking to upstream
  • upstream timed out (110: Operation timed out) while SSL handshaking to upstream
  • Error HTTP/500 responded by Artifactory

We also had 1 occurrence of "repo.jenkins-ci.org could not be resolved (2: Server failure)", which indicates a local DNS resolution error.
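The categories above can be counted from log lines with a simple substring match. This is a sketch, not the tooling used for the analysis: the labels are hypothetical, and the HTTP/500 category is left out because its log format is not quoted in this thread.

```python
from collections import Counter

# Substrings copied from the messages quoted above; the labels are made up
# for illustration only.
CATEGORIES = [
    ("upstream prematurely closed connection", "upstream-closed"),
    ("peer closed connection in SSL handshake", "tls-reset"),
    ("upstream timed out", "upstream-timeout"),
    ("could not be resolved", "dns"),
]


def classify(line):
    """Return the label of the first matching category, else 'other'."""
    for needle, label in CATEGORIES:
        if needle in line:
            return label
    return "other"


def summarize(lines):
    """Count how many lines fall into each category."""
    return Counter(classify(line) for line in lines)


sample = [
    "upstream prematurely closed connection while reading upstream",
    "peer closed connection in SSL handshake (104: Connection reset by peer) "
    "while SSL handshaking to upstream",
    "upstream timed out (110: Operation timed out) while SSL handshaking to upstream",
    "repo.jenkins-ci.org could not be resolved (2: Server failure)",
]
print(summarize(sample))
```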

@dduportal
Contributor

=> The errors are definitely not caused by a problem in ACP itself. By design, it only "reports" the upstream error.
Possibly, some of the timeouts could be caused by the TCP tuning on the ACP instance: I need to check.

=> We could check whether ACP can "retry" the upstream in case of error. I need to recall which cases can be caught.
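As an illustration of the "retry" idea only, here is a generic sketch of retrying a fetch on transient failures with exponential backoff. It does not describe what ACP can actually do. The set of exceptions treated as transient and the helper name `fetch_with_retry` are assumptions.

```python
import http.client
import time

# Exceptions treated as transient here; this choice is an assumption.
# http.client.IncompleteRead is what Python raises for a truncated
# Content-Length body.
TRANSIENT = (http.client.IncompleteRead, ConnectionResetError, TimeoutError)


def fetch_with_retry(fetch, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch() up to `attempts` times, backing off between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except TRANSIENT:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ... the base delay
```

Any non-transient error is raised immediately, so a real failure (for example a missing artifact) is not hidden behind retries.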

@dduportal
Contributor

@MarkEWaite opened a PR, based on a discussion we had during the previous infra meeting: jenkinsci/bom#4095

The goal is to "pre-heat" the cache to reduce the probability of hitting these issues.
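The general idea of pre-heating can be sketched as requesting each artifact once through the proxy before the real build starts. This is not the content of jenkinsci/bom#4095, which is not reproduced here. The helper names are hypothetical, and `fetch` is any callable that downloads a URL.

```python
import http.client


def artifact_path(coordinates):
    """Map 'group:artifact:extension:version' to its Maven repository path."""
    group, artifact, extension, version = coordinates.split(":")
    return (f"{group.replace('.', '/')}/{artifact}/{version}/"
            f"{artifact}-{version}.{extension}")


def preheat(base_url, coordinates, fetch):
    """Request each artifact once via the proxy; return the ones that failed."""
    failed = []
    for coords in coordinates:
        url = base_url.rstrip("/") + "/" + artifact_path(coords)
        try:
            fetch(url)  # the response body is discarded; the goal is to fill the cache
        except (OSError, http.client.HTTPException):
            failed.append(coords)
    return failed


print(artifact_path("com.google.crypto.tink:tink:jar:1.10.0"))
```

A real run could pass something like `lambda url: urllib.request.urlopen(url).read()` as `fetch`, with the proxy URL from the error message as `base_url`. Failed coordinates can simply be requested again before the build starts.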

@dduportal
Contributor

I haven't heard about any ACP problem with the BOM since the "pre-heat" PR was merged. Of course there might have been some (I have not checked with due diligence).

Were there any issues in the past 3 weeks @basil @darinpope @Poddingue @MarkEWaite @alecharpentier?

For info, this issue is on hold until we have finished migrating ci.jenkins.io to AWS (see #4313), which implies a new ACP instance (in a new infrastructure).
