Update to controller-runtime 0.19.1 / Kube 1.31 #293

xrstf · 2024-10-08T16:04:33Z

This PR brings Prow up-to-speed with the latest Kubernetes and controller-runtime dependencies, plus a few more changes to make these new dependencies work.

controller-tools 0.16.4

Without this update, codegen would fail:

k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".PodSpec
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".Affinity
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".Toleration
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".PodSpec
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".ResourceRequirements
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".Affinity
k8s.io/api/core/v1:-: unknown type "k8s.io/api/core/v1".Toleration
sigs.k8s.io/prow/pkg/apis/prowjobs/v1:-: unable to locate schema for type "k8s.io/api/core/v1".PodSpec
sigs.k8s.io/prow/pkg/apis/prowjobs/v1:-: unable to locate schema for type "k8s.io/api/core/v1".ResourceRequirements
sigs.k8s.io/prow/pkg/apis/prowjobs/v1:-: unable to locate schema for type "k8s.io/api/core/v1".ResourceRequirements
sigs.k8s.io/prow/pkg/apis/prowjobs/v1:-: unable to locate schema for type "k8s.io/api/core/v1".ResourceRequirements
sigs.k8s.io/prow/pkg/apis/prowjobs/v1:-: unable to locate schema for type "k8s.io/api/core/v1".ResourceRequirements
sigs.k8s.io/prow/pkg/apis/prowjobs/v1:-: unable to locate schema for type "k8s.io/api/core/v1".Affinity
sigs.k8s.io/prow/pkg/apis/prowjobs/v1:-: unable to locate schema for type "k8s.io/api/core/v1".Toleration
Error: not all generators ran successfully
run `controller-gen crd:crdVersions=v1 paths=./pkg/apis/prowjobs/v1 output:stdout -w` to see all available markers, or `controller-gen crd:crdVersions=v1 paths=./pkg/apis/prowjobs/v1 output:stdout -h` for usage

golangci-lint 1.58.0

After updating code-generator, staticcheck suddenly threw false positives like:

Go version: go version go1.22.3 linux/amd64
Golangci-lint version: golangci-lint has version v1.57.2 built with go1.22.3 from (unknown, modified: ?, mod sum: "h1:NNhxfZyL5He1WWDrIvl1a4n5bvWZBcgAqBwlJAAgLTw=") on (unknown)
pkg/pluginhelp/hook/hook_test.go:146:44: SA5011: possible nil pointer dereference (staticcheck)
        if got, expected := sets.New[string](help.AllRepos...), sets.New[string](expectedAllRepos...); !got.Equal(expected) {
                                                  ^
pkg/pluginhelp/hook/hook_test.go:143:5: SA5011(related information): this check suggests that the pointer can be nil (staticcheck)
        if help == nil {

However looking at the code, the help == nil check is leading to a t.Fatal, which should be recognized by staticcheck. I have no idea why this suddenly happened, but updating to the next highest golangci-lint version fixes the issue.

Flakiness due to rate limitting

I noticed some tests flaking a lot and started digging. It turns out the issue wasn't actually from loops timing out or contexts getting cancelled, but from the client-side rate limitting that is enabled in the kube clients. I think during integration tests it doesn't make much sense to have rate limitting, as this would mean a lot of code potentially has to handle errors arising from it.

I have therefore disabled the rate limiter by setting cfg.RateLimiter = flowcontrol.NewFakeAlwaysRateLimiter() in the integration test utility code.

Deck re-run tests

These tests have been reworked quite a bit, as they were quite flaky. The issue ultimately boiled down to the old code sorting ProwJobs by ResourceVersion, but during testing I found that it happens quite a lot that ProwJobs are created/updated nearly simultaneously. This has been resolved by sorting the ProwJobs by CreationTimestamp instead, which is unaffected by update calls.

However that is nearly the smallest change in the refactoring.

Instead of deleting Pods, the code now rotates the underlying Deploymens and waits for all Pods to be rotated as well. This is IMO the superior, cleaner solution and less prone to race conditions.
This rotation is now performed every time the job config is updated, to consistently guarantee Pods have gotten the updates. It is unclear to me why the tests for the first job-config update did rotate the Pods, but then later on had a 3*30s waiting loop to just wait for the config updates to be propagated. Now it's IMO clearer: When the config is updated, control plane is rotated and we wait until that's done before proceeding.
The old tests leaked ProwJobs because Horologium would at the last moment often create one last Job not caught by the cleanup. This has been fixed now, too, by again simply rotating the Horologium deployment.
Re-running a job is now a standalone function, mostly for readability. I also replaced the homegrown retry loop with wait.PollUntilContextTimeout. It's IMO unnecessary to have a back-off mechanism in integration tests like this. It just needlessly slows down the test.

The "rotate Deployment instead of deleting Pods manually"-method has been applied to all other integration tests.

k8s-ci-robot · 2024-10-08T16:04:36Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

k8s-ci-robot · 2024-10-08T16:04:44Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrstf
Once this PR has been reviewed and has the lgtm label, please assign krzyzacy for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

netlify · 2024-10-08T16:04:51Z

✅ Deploy Preview for k8s-prow ready!

Name	Link
🔨 Latest commit	`543ebf9`
🔍 Latest deploy log	https://app.netlify.com/sites/k8s-prow/deploys/675dabe0acec730008096557
😎 Deploy Preview	https://deploy-preview-293--k8s-prow.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

xrstf · 2024-10-08T16:05:11Z

/test all

xrstf · 2024-10-08T18:52:21Z

/test all

xrstf · 2024-10-12T18:06:42Z

/test all

xrstf · 2024-11-10T14:31:15Z

/test all

xrstf · 2024-11-10T15:28:11Z

/test all

xrstf · 2024-11-10T15:59:31Z

/test all

petr-muller · 2024-11-11T14:32:41Z

/cc

stevekuznetsov · 2024-11-11T15:44:18Z

pkg/scheduler/reconciler_test.go

@@ -61,11 +61,11 @@ func (ft *fakeTracker) Get(gvr schema.GroupVersionResource, ns, name string, opt
 	return ft.ObjectTracker.Get(gvr, ns, name, opts...)
 }

-func (ft *fakeTracker) Update(gvr schema.GroupVersionResource, obj runtime.Object, ns string, opts ...metav1.UpdateOptions) error {


Going to file this away in my mental bank of "reasons the fake client is not what you want".

stevekuznetsov · 2024-11-11T15:46:24Z

test/integration/test/deck_test.go

+		// can rerun from.
+		// Horologium itself is pretty good at handling the configmap update, but
+		// not kubelet, according to
+		// https://github.com/kubernetes/kubernetes/issues/30189 kubelet syncs


n.b. the linked issue says some semantically useless annotation update should kick the kubelet

stevekuznetsov · 2024-11-11T15:47:32Z

test/integration/test/deck_test.go

-	if !passed {
-		t.Fatal("Expected updated job.")
+	// Wait for the first job to be created by horologium.
+	initialJob := getLatestJob(t, jobName, nil)


nit: inital = getLatest() is confusing - is it the initial or the latest?

stevekuznetsov · 2024-11-11T15:50:56Z

test/integration/test/deck_test.go

 		}); err != nil {
 			t.Logf("ERROR CLEANUP: %v", err)
 		}
 	})
-	ctx := context.Background()
+
 	getLatestJob := func(t *testing.T, jobName string, lastRun *v1.Time) *prowjobv1.ProwJob {


It seems like we're missing the meaning of what this function was originally written to do - is the issue that the function expected to sort these resources by the resourceVersion at which they were created, but when there are interspersed UPDATE calls, the objects' current resourceVersion no longer sorts them?

Can we sort by the job ID since we know that's monotonically increasing? Creation timestamp is an awkward choice as it can have ties.

Sorry for the long response delay.

Which "job ID" are you referring to exactly? The only numerical ID I can see is the prow.k8s.io/build-id and that is not unique per ProwJob. Is prow.k8s.io/id not just a random UUID, but monotonically increasing?

I don't see any other good value to use besides the creation timestamp when I want to search by, well, creation order:

apiVersion: prow.k8s.io/v1 kind: ProwJob metadata: annotations: prow.k8s.io/context: "" prow.k8s.io/job: rerun-test-job-3a2c5361172244414edc254fc8d21de5 creationTimestamp: "2024-12-14T15:54:56Z" generation: 3 labels: created-by-prow: "true" foo: foo prow.k8s.io/build-id: "1867961480935116800" prow.k8s.io/context: "" prow.k8s.io/id: f48bd245-1335-4ef4-8556-d7b9d232014a prow.k8s.io/job: rerun-test-job-3a2c5361172244414edc254fc8d21de5 prow.k8s.io/type: periodic name: f48bd245-1335-4ef4-8556-d7b9d232014a namespace: default resourceVersion: "9383" uid: 56b93b1c-4a53-4ba6-90e6-49ce918115cd spec: agent: kubernetes cluster: default job: rerun-test-job-3a2c5361172244414edc254fc8d21de5 namespace: test-pods pod_spec: containers: - args: - Hello World! command: - echo image: localhost:5001/alpine name: "" resources: {} prowjob_defaults: tenant_id: GlobalDefaultID report: true type: periodic status: build_id: "1867961480935116800" completionTime: "2024-12-14T15:54:58Z" description: Job succeeded. pendingTime: "2024-12-14T15:54:56Z" pod_name: f48bd245-1335-4ef4-8556-d7b9d232014a startTime: "2024-12-14T15:54:56Z" state: success

stevekuznetsov · 2024-11-11T15:51:49Z

test/integration/test/deck_test.go

 	}

+	// Prevent Deck from being too fast and recreating the new job in the same second
+	// as the previous one.
+	time.Sleep(1 * time.Second)


What's the downside of having the second job created in the same second? Can we fix that instead of adding a sleep?

Without the artifical delay, situations like this can happen:

pj[0] = name=3ee585ca-d93f-4fbc-9a09-b5c3baeeb2b6, created=2024-12-14 16:54:43 +0100 CET, id=1867961304296198144 pj[1] = name=b5578f85-ecb6-4741-b3d4-f9f33f099c59, created=2024-12-14 16:54:14 +0100 CET, id=1867961304296198144 pj[2] = name=094a01cf-782b-427a-b5e1-e02e551f604d, created=2024-12-14 16:54:43 +0100 CET, id=1867961425939402752

This leads to an instable sorting order, making the test flake. :-/

stevekuznetsov · 2024-11-11T15:53:47Z

test/integration/test/setup.go

 		}
+
+		ready := true &&


nit: in my experience, the moment this does not correctly happen within the timeout, return ready will hide the details from the engineer debugging this, which makes for an unpleasant set of next steps. Could we please format the conditions you're looking for as a string, log it out on state transitions (e.g. do not spam log when nothing has changed), and indicate whether the observed state is as expected or not?

stevekuznetsov

Mostly looks great! Couple small comments.

…heck

k8s-ci-robot · 2024-12-24T19:50:44Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 8, 2024

k8s-ci-robot requested review from cjwagner and stevekuznetsov October 8, 2024 16:04

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 8, 2024

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 8, 2024

This was referenced Oct 9, 2024

Prow issue: ملفات Andoka Cloud والمستندات والقوالب والتعليمات لتطبيقات البرمجة #295

Closed

Prow issue: Andoka Xx #291

Closed

xrstf mentioned this pull request Oct 12, 2024

Update code-generator to 1.30.1 #298

Merged

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 8, 2024

xrstf force-pushed the controller-runtime-0.19 branch from 9102251 to 97a1840 Compare November 10, 2024 14:30

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 10, 2024

xrstf force-pushed the controller-runtime-0.19 branch from 97a1840 to 2fc8f26 Compare November 10, 2024 14:30

xrstf changed the title ~~Update to controller-runtime 0.19.0 / Kube 1.31~~ Update to controller-runtime 0.19.1 / Kube 1.31 Nov 10, 2024

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 10, 2024

xrstf force-pushed the controller-runtime-0.19 branch from f1a2136 to 68c7dc4 Compare November 10, 2024 15:59

xrstf marked this pull request as ready for review November 10, 2024 16:19

k8s-ci-robot requested review from MadhavJivrajani and Priyankasaggu11929 November 10, 2024 16:20

k8s-ci-robot requested a review from petr-muller November 11, 2024 14:32

stevekuznetsov reviewed Nov 11, 2024

View reviewed changes

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 14, 2024

This was referenced Nov 22, 2024

Prow issue: Andoka Cloud pro #301

Closed

Prow issue: #300

Closed

xrstf and others added 11 commits December 14, 2024 15:48

bump controller-runtime to 0.19.1

7bd74be

adjust for new typed work queues and rate limiters

f158f7d

disable unique controller name check in plank unit tests

8865925

fake client does not do UPDATE calls anymore when Patch() is called

94b41b0

prevent rate limiter randomly interfering with tests

97aa5ed

clarify documentation

6e5be7d

fix TestRerun flakiness

533a3ee

bump code-generator to 0.31 and controller-tools to 0.16.4

cf735c3

codegen

1c86b46

bump golangci-lint enough to work around false positives from staticc…

9c981b6

…heck

improve log output when waiting for Deployment rollouts

543ebf9

xrstf force-pushed the controller-runtime-0.19 branch from 68c7dc4 to 543ebf9 Compare December 14, 2024 16:01

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 14, 2024

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 24, 2024

hashim21223445 mentioned this pull request Dec 28, 2024

Prow issue: #297

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update to controller-runtime 0.19.1 / Kube 1.31 #293

Update to controller-runtime 0.19.1 / Kube 1.31 #293

xrstf commented Oct 8, 2024 •

edited

Loading

k8s-ci-robot commented Oct 8, 2024

k8s-ci-robot commented Oct 8, 2024

netlify bot commented Oct 8, 2024 •

edited

Loading

xrstf commented Oct 8, 2024

xrstf commented Oct 8, 2024

xrstf commented Oct 12, 2024

xrstf commented Nov 10, 2024

xrstf commented Nov 10, 2024

xrstf commented Nov 10, 2024

petr-muller commented Nov 11, 2024

stevekuznetsov Nov 11, 2024

stevekuznetsov Nov 11, 2024

stevekuznetsov Nov 11, 2024

stevekuznetsov Nov 11, 2024

xrstf Dec 14, 2024

stevekuznetsov Nov 11, 2024

xrstf Dec 14, 2024

stevekuznetsov Nov 11, 2024

stevekuznetsov left a comment

k8s-ci-robot commented Dec 24, 2024

Update to controller-runtime 0.19.1 / Kube 1.31 #293

Are you sure you want to change the base?

Update to controller-runtime 0.19.1 / Kube 1.31 #293

Conversation

xrstf commented Oct 8, 2024 • edited Loading

controller-tools 0.16.4

golangci-lint 1.58.0

Flakiness due to rate limitting

Deck re-run tests

k8s-ci-robot commented Oct 8, 2024

k8s-ci-robot commented Oct 8, 2024

netlify bot commented Oct 8, 2024 • edited Loading

✅ Deploy Preview for k8s-prow ready!

xrstf commented Oct 8, 2024

xrstf commented Oct 8, 2024

xrstf commented Oct 12, 2024

xrstf commented Nov 10, 2024

xrstf commented Nov 10, 2024

xrstf commented Nov 10, 2024

petr-muller commented Nov 11, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stevekuznetsov left a comment

Choose a reason for hiding this comment

k8s-ci-robot commented Dec 24, 2024

xrstf commented Oct 8, 2024 •

edited

Loading

netlify bot commented Oct 8, 2024 •

edited

Loading