
native support for ArgoRollout CRD to improve efficiency #7517

Open · wants to merge 1 commit into base: master
Conversation

@ZihanJiang96 (Contributor) commented Nov 21, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

We also use Argo Rollouts heavily. We recently enabled VPA in most of our clusters and observed that the VPA controllers get throttled frequently due to the high volume of /scale calls.

We made the same improvement internally in a forked repo and want to contribute it back.

Which issue(s) this PR fixes:

Fixes #7024

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 21, 2024
@k8s-ci-robot (Contributor)

Hi @ZihanJiang96. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ZihanJiang96
Once this PR has been reviewed and has the lgtm label, please assign kgolab for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/vertical-pod-autoscaler size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 21, 2024
@adrianmoisey (Member)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 21, 2024
@adrianmoisey (Member)

/test pull-kubernetes-e2e-autoscaling-vpa-full

@adrianmoisey (Member)

I think a change like this needs to be handled with a little more grace. For example, when running in a cluster without ArgoRollouts setup, the following errors are created:

I1122 11:50:55.800594       1 reflector.go:325] Listing and watching *v1alpha1.Rollout from k8s.io/autoscaler/vertical-pod-autoscaler/pkg/target/fetcher.go:99
W1122 11:50:55.804826       1 reflector.go:535] k8s.io/autoscaler/vertical-pod-autoscaler/pkg/target/fetcher.go:99: failed to list *v1alpha1.Rollout: rollouts.argoproj.io is forbidden: User "system:serviceaccount:kube-system:vpa-admission-controller" cannot list resource "rollouts" in API group "argoproj.io" at the cluster scope
E1122 11:50:55.804965       1 reflector.go:147] k8s.io/autoscaler/vertical-pod-autoscaler/pkg/target/fetcher.go:99: Failed to watch *v1alpha1.Rollout: failed to list *v1alpha1.Rollout: rollouts.argoproj.io is forbidden: User "system:serviceaccount:kube-system:vpa-admission-controller" cannot list resource "rollouts" in API group "argoproj.io" at the cluster scope

(well, I guess technically this is a permission issue).

Is there a way to add a feature like this in a more general manner? I'm a bit worried that this change sets a precedent that more CRs will need to be added, putting more strain on the very few maintainers of this project.
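For context on the forbidden errors in the logs above: granting the VPA components read access to Rollouts would require an RBAC rule along these lines. This is only an illustrative sketch, assuming the standard argoproj.io API group; the ClusterRole name is hypothetical, and the real manifests would also need a matching ClusterRoleBinding for each VPA service account.

```yaml
# Hypothetical ClusterRole sketch; name and scope are illustrative only.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:vpa-argo-rollouts-reader
rules:
  - apiGroups: ["argoproj.io"]
    resources: ["rollouts"]
    verbs: ["get", "list", "watch"]
```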

@k8s-ci-robot (Contributor)

@ZihanJiang96: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-e2e-autoscaling-vpa-full · Commit: 0181922 · Required: false · Rerun command: /test pull-kubernetes-e2e-autoscaling-vpa-full

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@raywainman (Contributor)

We've encountered this a few times now. Thanks for bringing this up.

We discussed a similar issue with @voelzmo in #6431.

I think it would be great to resolve this more generically, so that a native integration with each framework, object, etc. is not necessary.

I wonder if we can change the behavior of the cache (as suggested in #6431 (comment)) to be per Kind rather than per object. I believe that if we can do this, we won't need these custom integrations.

@raywainman (Contributor)

I've been staring at this for a little while and have a few thoughts:

  • We call the current cache for two reasons:
    • To know whether a particular controller is scalable (true/false).
    • To return the owners of a scalable subresource. We need this because we are looking for the top-most scalable subresource. If there is a "well-known" controller in the chain, we stop there despite it possibly having an owner that is also a scale subresource (I would have to dig up the reason why we do this, I don't fully know off the top of my head).
  • To know whether a particular controller is scalable, we don't actually need to call /scale on it. Once we know a certain GroupResource is scalable, we don't need to check every single instance of it. I think there is an opportunity to cache this part. This would likely resolve the issues we were seeing in #6431 ("check the scalability of one kind prior to get its scale subresource from apiserver"), where the owner controller isn't a scale subresource.
  • Unfortunately, even if we cached the first use case, we still have to contend with the second one. However, if a controller has no owner (as is the case with the ArgoRollout CRD, per my understanding), then we would skip this step per this line.

My proposal:

  • I don't think we can get rid of the cache as it is because we still need it to do the lookup of a scale-subresource's owners.
  • We can however improve the efficiency of the whole system by adding another layer of caching to know whether a particular GroupResource (or GroupKind?) is scalable (or not) and then take shortcuts from there. This should significantly reduce the number of calls to API Server.

I'm thinking of prototyping this (or if anyone sees a better solution, please feel free to chime in) and seeing how many things I break.

This should remove the need for any custom integration with VPA, as a single call to /scale for a particular Kind of resource should do the trick.
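The per-GroupResource caching idea above can be sketched in a few lines. This is a minimal, self-contained model, not the real VPA fetcher: the `GroupResource` type, the `discover` callback (standing in for an API-server /scale probe), and the `lookups` counter are all illustrative assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// GroupResource identifies an API group/resource pair, e.g. argoproj.io/rollouts.
type GroupResource struct {
	Group, Resource string
}

// scalableCache remembers, per GroupResource, whether the /scale subresource
// exists, so the expensive check runs at most once per kind rather than once
// per object.
type scalableCache struct {
	mu      sync.Mutex
	known   map[GroupResource]bool
	lookups int // how many times the (expensive) discovery callback actually ran
}

func newScalableCache() *scalableCache {
	return &scalableCache{known: map[GroupResource]bool{}}
}

// IsScalable returns the cached answer, invoking discover only on a cache miss.
func (c *scalableCache) IsScalable(gr GroupResource, discover func(GroupResource) bool) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.known[gr]; ok {
		return v
	}
	c.lookups++
	v := discover(gr)
	c.known[gr] = v
	return v
}

func main() {
	cache := newScalableCache()
	// Hypothetical discovery function standing in for a /scale probe.
	discover := func(gr GroupResource) bool { return gr.Group == "argoproj.io" }
	rollouts := GroupResource{Group: "argoproj.io", Resource: "rollouts"}
	for i := 0; i < 1000; i++ {
		cache.IsScalable(rollouts, discover) // only the first call hits discover
	}
	fmt.Println(cache.lookups) // prints 1
}
```

The point of the sketch: a thousand per-object checks collapse into one per-kind discovery call, which is exactly the reduction in API-server traffic the proposal is after.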

@adrianmoisey (Member)

I'm thinking of prototyping this (or if anyone sees a better solution, please feel free to chime in) and seeing how many things I break.

Your plan sounds good, and I'm keen to see the prototype.

@voelzmo (Contributor) commented Dec 3, 2024

Hey @raywainman thanks for sharing your thoughts here! I think it would be great if we could make this whole thing more efficient!

To know whether a particular controller is scalable, we don't actually need to call /scale on it. Once we know a certain GroupResource is scalable, we don't need to do it for every single instance of it. I think there is opportunity to cache this part

Yes, absolutely. As pointed out earlier, I also think that we don't need to check /scale for every object of a kind; it is fine to do it per kind or GroupResource.

We can however improve the efficiency of the whole system by adding another layer of caching to know whether a particular GroupResource (or GroupKind?) is scalable (or not) and then take shortcuts from there. This should significantly reduce the number of calls to API Server.

Big thumbs up, I'd really like to see this!

Regarding

We call the current cache for two reasons
(...)

  • To return the owners of a scalable subresource. We need this because we are looking for the top-most scalable subresource. If there is a "well-known" controller in the chain, we stop there despite it possibly having an owner that is also a scale subresource (I would have to dig up the reason why we do this, I don't fully know off the top of my head).

What the code tries to do is find the topmost owner of a Pod which is either well-known or implements the /scale subresource. It does this by going up the ownerRef chain until it finds an owner that is neither well-known nor scalable (btw, checking the owner of an unknown controller also uses the /scale subresource). For the method isWellKnownOrScalable it seems fine to exit either if this is a well-known controller or if a /scale subresource is found, and then we go further up the chain.

I've seen errors in the past [1, 2] where the targetRef pointed to a well-known controller, such as Deployment or StatefulSet, but VPA errored out, because there was an operator deployed which created and owned those Deployments/StatefulSets and also implemented the /scale subresource. So I think this works as expected.
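The ownerRef climb described above can be modeled in a short sketch. This is a toy model under stated assumptions, not the VPA code: the `object` type, the `isScalable` callback, and the `MyOperatorApp` kind are all hypothetical stand-ins for metav1.OwnerReferences and a real /scale probe.

```go
package main

import "fmt"

// object is a minimal stand-in for a Kubernetes object with a single
// controller owner; real code walks metav1.OwnerReferences instead.
type object struct {
	Kind  string
	Name  string
	Owner *object // nil when the object has no controller owner
}

// isWellKnown reports whether a kind is one of the built-in workload controllers.
func isWellKnown(kind string) bool {
	switch kind {
	case "Deployment", "StatefulSet", "DaemonSet", "ReplicaSet", "Job", "CronJob":
		return true
	}
	return false
}

// topMostWellKnownOrScalable climbs the owner chain from obj and returns the
// highest owner that is still either well-known or scalable; it stops as soon
// as the next owner up is neither.
func topMostWellKnownOrScalable(obj *object, isScalable func(kind string) bool) *object {
	current := obj
	for current.Owner != nil && (isWellKnown(current.Owner.Kind) || isScalable(current.Owner.Kind)) {
		current = current.Owner
	}
	return current
}

func main() {
	// Pod -> ReplicaSet -> Deployment -> operator CR that implements /scale.
	cr := &object{Kind: "MyOperatorApp", Name: "app"}
	deploy := &object{Kind: "Deployment", Name: "app", Owner: cr}
	rs := &object{Kind: "ReplicaSet", Name: "app-abc", Owner: deploy}
	pod := &object{Kind: "Pod", Name: "app-abc-xyz", Owner: rs}

	isScalable := func(kind string) bool { return kind == "MyOperatorApp" }
	fmt.Println(topMostWellKnownOrScalable(pod, isScalable).Kind) // prints MyOperatorApp
}
```

This mirrors the operator scenario described above: even though the chain passes through a well-known Deployment, the climb continues to the scalable operator CR on top, which is the behavior that works as expected.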

@k8s-ci-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 10, 2024