Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to controller-runtime 0.19.1 / Kube 1.31 #293

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

xrstf
Copy link
Contributor

@xrstf xrstf commented Oct 8, 2024

This PR brings Prow up-to-speed with the latest Kubernetes and controller-runtime dependencies, plus a few more changes to make these new dependencies work.

controller-tools 0.16.4

Without this update, codegen would fail:

k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".PodSpec
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".Affinity
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".Toleration
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".PodSpec
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".Affinity
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".Toleration
sigs.k8s.io/prow/pkg/apis/prowjobs/v1:-: unable to locate schema for type "k8s.io/api/core/v1".PodSpec
sigs.k8s.io/prow/pkg/apis/prowjobs/v1:-: unable to locate schema for type "k8s.io/api/core/v1".ResourceRequirements
sigs.k8s.io/prow/pkg/apis/prowjobs/v1:-: unable to locate schema for type "k8s.io/api/core/v1".ResourceRequirements
sigs.k8s.io/prow/pkg/apis/prowjobs/v1:-: unable to locate schema for type "k8s.io/api/core/v1".ResourceRequirements
sigs.k8s.io/prow/pkg/apis/prowjobs/v1:-: unable to locate schema for type "k8s.io/api/core/v1".ResourceRequirements
sigs.k8s.io/prow/pkg/apis/prowjobs/v1:-: unable to locate schema for type "k8s.io/api/core/v1".Affinity
sigs.k8s.io/prow/pkg/apis/prowjobs/v1:-: unable to locate schema for type "k8s.io/api/core/v1".Toleration
Error: not all generators ran successfully
run `controller-gen crd:crdVersions=v1 paths=./pkg/apis/prowjobs/v1 output:stdout -w` to see all available markers, or `controller-gen crd:crdVersions=v1 paths=./pkg/apis/prowjobs/v1 output:stdout -h` for usage

golangci-lint 1.58.0

After updating code-generator, staticcheck suddenly threw false positives like:

Go version: go version go1.22.3 linux/amd64
Golangci-lint version: golangci-lint has version v1.57.2 built with go1.22.3 from (unknown, modified: ?, mod sum: "h1:NNhxfZyL5He1WWDrIvl1a4n5bvWZBcgAqBwlJAAgLTw=") on (unknown)
pkg/pluginhelp/hook/hook_test.go:146:44: SA5011: possible nil pointer dereference (staticcheck)
        if got, expected := sets.New[string](help.AllRepos...), sets.New[string](expectedAllRepos...); !got.Equal(expected) {
                                                  ^
pkg/pluginhelp/hook/hook_test.go:143:5: SA5011(related information): this check suggests that the pointer can be nil (staticcheck)
        if help == nil {

However looking at the code, the help == nil check is leading to a t.Fatal, which should be recognized by staticcheck. I have no idea why this suddenly happened, but updating to the next highest golangci-lint version fixes the issue.

Flakiness due to rate limitting

I noticed some tests flaking a lot and started digging. It turns out the issue wasn't actually from loops timing out or contexts getting cancelled, but from the client-side rate limitting that is enabled in the kube clients. I think during integration tests it doesn't make much sense to have rate limitting, as this would mean a lot of code potentially has to handle errors arising from it.

I have therefore disabled the rate limiter by setting cfg.RateLimiter = flowcontrol.NewFakeAlwaysRateLimiter() in the integration test utility code.

Deck re-run tests

These tests have been reworked quite a bit, as they were quite flaky. The issue ultimately boiled down to the old code sorting ProwJobs by ResourceVersion, but during testing I found that it happens quite a lot that ProwJobs are created/updated nearly simultaneously. This has been resolved by sorting the ProwJobs by CreationTimestamp instead, which is unaffected by update calls.

However that is nearly the smallest change in the refactoring.

  • Instead of deleting Pods, the code now rotates the underlying Deploymens and waits for all Pods to be rotated as well. This is IMO the superior, cleaner solution and less prone to race conditions.
  • This rotation is now performed every time the job config is updated, to consistently guarantee Pods have gotten the updates. It is unclear to me why the tests for the first job-config update did rotate the Pods, but then later on had a 3*30s waiting loop to just wait for the config updates to be propagated. Now it's IMO clearer: When the config is updated, control plane is rotated and we wait until that's done before proceeding.
  • The old tests leaked ProwJobs because Horologium would at the last moment often create one last Job not caught by the cleanup. This has been fixed now, too, by again simply rotating the Horologium deployment.
  • Re-running a job is now a standalone function, mostly for readability. I also replaced the homegrown retry loop with wait.PollUntilContextTimeout. It's IMO unnecessary to have a back-off mechanism in integration tests like this. It just needlessly slows down the test.

The "rotate Deployment instead of deleting Pods manually"-method has been applied to all other integration tests.

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 8, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrstf
Once this PR has been reviewed and has the lgtm label, please assign krzyzacy for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 8, 2024
Copy link

netlify bot commented Oct 8, 2024

Deploy Preview for k8s-prow ready!

Name Link
🔨 Latest commit 543ebf9
🔍 Latest deploy log https://app.netlify.com/sites/k8s-prow/deploys/675dabe0acec730008096557
😎 Deploy Preview https://deploy-preview-293--k8s-prow.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@xrstf
Copy link
Contributor Author

xrstf commented Oct 8, 2024

/test all

1 similar comment
@xrstf
Copy link
Contributor Author

xrstf commented Oct 8, 2024

/test all

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 8, 2024
@xrstf
Copy link
Contributor Author

xrstf commented Oct 12, 2024

/test all

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 8, 2024
@xrstf xrstf force-pushed the controller-runtime-0.19 branch from 9102251 to 97a1840 Compare November 10, 2024 14:30
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 10, 2024
@xrstf xrstf force-pushed the controller-runtime-0.19 branch from 97a1840 to 2fc8f26 Compare November 10, 2024 14:30
@xrstf xrstf changed the title Update to controller-runtime 0.19.0 / Kube 1.31 Update to controller-runtime 0.19.1 / Kube 1.31 Nov 10, 2024
@xrstf
Copy link
Contributor Author

xrstf commented Nov 10, 2024

/test all

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 10, 2024
@xrstf
Copy link
Contributor Author

xrstf commented Nov 10, 2024

/test all

@xrstf xrstf force-pushed the controller-runtime-0.19 branch from f1a2136 to 68c7dc4 Compare November 10, 2024 15:59
@xrstf
Copy link
Contributor Author

xrstf commented Nov 10, 2024

/test all

@xrstf xrstf marked this pull request as ready for review November 10, 2024 16:19
@petr-muller
Copy link
Contributor

/cc

@@ -61,11 +61,11 @@ func (ft *fakeTracker) Get(gvr schema.GroupVersionResource, ns, name string, opt
return ft.ObjectTracker.Get(gvr, ns, name, opts...)
}

func (ft *fakeTracker) Update(gvr schema.GroupVersionResource, obj runtime.Object, ns string, opts ...metav1.UpdateOptions) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to file this away in my mental bank of "reasons the fake client is not what you want".

// can rerun from.
// Horologium itself is pretty good at handling the configmap update, but
// not kubelet, according to
// https://github.com/kubernetes/kubernetes/issues/30189 kubelet syncs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n.b. the linked issue says some semantically useless annotation update should kick the kubelet

if !passed {
t.Fatal("Expected updated job.")
// Wait for the first job to be created by horologium.
initialJob := getLatestJob(t, jobName, nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: inital = getLatest() is confusing - is it the initial or the latest?

}); err != nil {
t.Logf("ERROR CLEANUP: %v", err)
}
})
ctx := context.Background()

getLatestJob := func(t *testing.T, jobName string, lastRun *v1.Time) *prowjobv1.ProwJob {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we're missing the meaning of what this function was originally written to do - is the issue that the function expected to sort these resources by the resourceVersion at which they were created, but when there are interspersed UPDATE calls, the objects' current resourceVersion no longer sorts them?

Can we sort by the job ID since we know that's monotonically increasing? Creation timestamp is an awkward choice as it can have ties.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long response delay.

Which "job ID" are you referring to exactly? The only numerical ID I can see is the prow.k8s.io/build-id and that is not unique per ProwJob. Is prow.k8s.io/id not just a random UUID, but monotonically increasing?

I don't see any other good value to use besides the creation timestamp when I want to search by, well, creation order:

apiVersion: prow.k8s.io/v1
kind: ProwJob
metadata:
  annotations:
    prow.k8s.io/context: ""
    prow.k8s.io/job: rerun-test-job-3a2c5361172244414edc254fc8d21de5
  creationTimestamp: "2024-12-14T15:54:56Z"
  generation: 3
  labels:
    created-by-prow: "true"
    foo: foo
    prow.k8s.io/build-id: "1867961480935116800"
    prow.k8s.io/context: ""
    prow.k8s.io/id: f48bd245-1335-4ef4-8556-d7b9d232014a
    prow.k8s.io/job: rerun-test-job-3a2c5361172244414edc254fc8d21de5
    prow.k8s.io/type: periodic
  name: f48bd245-1335-4ef4-8556-d7b9d232014a
  namespace: default
  resourceVersion: "9383"
  uid: 56b93b1c-4a53-4ba6-90e6-49ce918115cd
spec:
  agent: kubernetes
  cluster: default
  job: rerun-test-job-3a2c5361172244414edc254fc8d21de5
  namespace: test-pods
  pod_spec:
    containers:
    - args:
      - Hello World!
      command:
      - echo
      image: localhost:5001/alpine
      name: ""
      resources: {}
  prowjob_defaults:
    tenant_id: GlobalDefaultID
  report: true
  type: periodic
status:
  build_id: "1867961480935116800"
  completionTime: "2024-12-14T15:54:58Z"
  description: Job succeeded.
  pendingTime: "2024-12-14T15:54:56Z"
  pod_name: f48bd245-1335-4ef4-8556-d7b9d232014a
  startTime: "2024-12-14T15:54:56Z"
  state: success

}

// Prevent Deck from being too fast and recreating the new job in the same second
// as the previous one.
time.Sleep(1 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the downside of having the second job created in the same second? Can we fix that instead of adding a sleep?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the artifical delay, situations like this can happen:

pj[0] = name=3ee585ca-d93f-4fbc-9a09-b5c3baeeb2b6, created=2024-12-14 16:54:43 +0100 CET, id=1867961304296198144
pj[1] = name=b5578f85-ecb6-4741-b3d4-f9f33f099c59, created=2024-12-14 16:54:14 +0100 CET, id=1867961304296198144
pj[2] = name=094a01cf-782b-427a-b5e1-e02e551f604d, created=2024-12-14 16:54:43 +0100 CET, id=1867961425939402752

This leads to an instable sorting order, making the test flake. :-/

}

ready := true &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in my experience, the moment this does not correctly happen within the timeout, return ready will hide the details from the engineer debugging this, which makes for an unpleasant set of next steps. Could we please format the conditions you're looking for as a string, log it out on state transitions (e.g. do not spam log when nothing has changed), and indicate whether the observed state is as expected or not?

Copy link
Contributor

@stevekuznetsov stevekuznetsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks great! Couple small comments.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 14, 2024
This was referenced Nov 22, 2024
@xrstf xrstf force-pushed the controller-runtime-0.19 branch from 68c7dc4 to 543ebf9 Compare December 14, 2024 16:01
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 14, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 24, 2024
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hashim21223445 hashim21223445 mentioned this pull request Dec 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants