Support scaling down a particular node with graceful termination #5187

Closed
wants to merge 2 commits

Conversation

liuxintong
Contributor

Which component this PR applies to?

cluster-autoscaler

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces a new feature to support scaling down a particular node. As described in #5109, some nodes can become non-functional for reasons specific to the cloud provider. With this change, users can explicitly tag a node by running `kubectl annotate node <nodename> cluster-autoscaler.kubernetes.io/scale-down-requested=true`. Cluster Autoscaler will then perform all the required safety checks, evict the hosted pods, and delete the node.
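For reference, a minimal usage sketch of the workflow described above; `<nodename>` is a placeholder, and the verification step is just one way to inspect the annotation:

```shell
# Request scale-down of a specific node.
kubectl annotate node <nodename> cluster-autoscaler.kubernetes.io/scale-down-requested=true

# Confirm the annotation is present.
kubectl get node <nodename> -o yaml | grep scale-down-requested

# Withdraw the request (removing the annotation has no effect once draining has already started).
kubectl annotate node <nodename> cluster-autoscaler.kubernetes.io/scale-down-requested-
```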

This PR also introduces a feature to scale up the cluster when the current node group size is smaller than the configured min size. It is disabled by default; users need to pass the boolean flag `scale-up-to-meet-node-group-min-size-enabled` to enable it.

The first feature depends slightly on the second one. For example, suppose the min size of a node group is 3 and it has exactly 3 nodes at the moment. If we tag a node for scale-down, nothing would happen, because the node group is already at its min size and the cluster can otherwise only be scaled up when there are unschedulable pods. Therefore, we need this change to take the scale-down-requested annotation into account and scale the node group up a bit when needed.

Which issue(s) this PR fixes:

Fixes #5109

Special notes for your reviewer:

Besides unit tests, this PR has been fully validated in an Azure Kubernetes cluster. I will share more details about my experiments in a separate comment on this PR.

Does this PR introduce a user-facing change?

Support scaling down a particular node if it has the `cluster-autoscaler.kubernetes.io/scale-down-requested=true` annotation.
Support scaling up the cluster if the current node group size is smaller than the configured min size (disabled by default).

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [FAQ How can I request Clsuter Autoscaler to scale down a particular node?]: https://github.com/liuxintong/autoscaler/blob/7fe865bc03b2e9feab9254227744da97049716fb/cluster-autoscaler/FAQ.md#how-can-i-request-clsuter-autoscaler-to-scale-down-a-particular-node
- [FAQ My cluster is below minimum / above maximum number of nodes, but CA did not fix that! Why?]: https://github.com/liuxintong/autoscaler/blob/7fe865bc03b2e9feab9254227744da97049716fb/cluster-autoscaler/FAQ.md#my-cluster-is-below-minimum--above-maximum-number-of-nodes-but-ca-did-not-fix-that-why
- [FAQ What are the parameters to CA?]: https://github.com/liuxintong/autoscaler/blob/7fe865bc03b2e9feab9254227744da97049716fb/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 13, 2022
@linux-foundation-easycla

linux-foundation-easycla bot commented Sep 13, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Sep 13, 2022
@k8s-ci-robot
Contributor

Welcome @liuxintong!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 13, 2022
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Sep 13, 2022
@liuxintong
Contributor Author

All changes in this PR have been verified in the AKS cluster xint-t0910-aks. This cluster has 1 node group, agentpool, and the underlying group name on the cloud provider side is aks-agentpool-22992539-vmss.

(base) ➜  ~ kubectl get nodes -o wide
NAME                                STATUS   ROLES   AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
aks-agentpool-22992539-vmss00001o   Ready    agent   138m   v1.24.3   10.224.0.5    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4
aks-agentpool-22992539-vmss00001t   Ready    agent   71m    v1.24.3   10.224.0.6    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4
aks-agentpool-22992539-vmss00001w   Ready    agent   48m    v1.24.3   10.224.0.9    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4

The AKS-managed autoscaler was intentionally turned off, and a test cluster-autoscaler container image was built with Dockerfile.amd64, pushed to the Azure Container Registry xintcr, and deployed to the kube-system namespace via the following command.

(base) ➜  cluster-autoscaler git:(0912-ca-exp) ✗ make container-arch-amd64 && docker push xintcr.azurecr.io/cluster-autoscaler-amd64:dev0911 && kubectl apply -f cloudprovider/azure/examples/cluster-autoscaler-vmss-control-plane-xint.yaml

For reference, here is the main container spec defined in cluster-autoscaler-vmss-control-plane-xint.yaml. The node group min size was 3, the max size was 10, and the new --scale-up-to-meet-node-group-min-size-enabled flag was enabled.

      containers:
        - image: xintcr.azurecr.io/cluster-autoscaler-amd64:dev0911
          imagePullPolicy: Always
          name: cluster-autoscaler
          command:
            - ./cluster-autoscaler
            - --v=5
            - --logtostderr=true
            - --cloud-provider=azure
            - --skip-nodes-with-local-storage=false
            - --nodes=3:10:aks-agentpool-22992539-vmss
            - --scale-up-to-meet-node-group-min-size-enabled

Once the cluster-autoscaler test instance was running in the cluster, the node aks-agentpool-22992539-vmss00001o was annotated with scale-down-requested via kubectl.

(base) ➜  ~ kubectl annotate node aks-agentpool-22992539-vmss00001o cluster-autoscaler.kubernetes.io/scale-down-requested=true

The next step was checking the container logs periodically. In the first round, the node group was scaled up from 3 to 4 even though there were no unschedulable pods. This is by design: a new surge node is needed before the tagged node can be scaled down.

I0912 19:52:17.182221       1 static_autoscaler.go:445] No unschedulable pods

I0912 19:52:17.182290       1 scale_up.go:669] ScaleUpToMeetNodeGroupMinSize: increased desired min size as node aks-agentpool-22992539-vmss00001o has scale-down-requested annotation
I0912 19:52:17.182390       1 scale_up.go:679] ScaleUpToMeetNodeGroupMinSize: NodeGroup aks-agentpool-22992539-vmss: TargetSize 3, DesiredMinSize 4, MinSize 3, MaxSize 10
I0912 19:52:17.182395       1 scale_up.go:698] ScaleUpToMeetNodeGroupMinSize: final scale-up plan: [{aks-agentpool-22992539-vmss 3->4 (max: 10)}]

In the second round, the node with the scale-down-requested annotation was included in the scale-down candidates. However, the scale-down was not performed, because scaleDownInCooldown was true, which means the cluster can only be scaled down after the scale-down-delay-after-add period (10 minutes by default).

I0912 19:52:29.070169       1 scale_up.go:669] ScaleUpToMeetNodeGroupMinSize: increased desired min size as node aks-agentpool-22992539-vmss00001o has scale-down-requested annotation
I0912 19:52:29.070289       1 scale_up.go:679] ScaleUpToMeetNodeGroupMinSize: NodeGroup aks-agentpool-22992539-vmss: TargetSize 4, DesiredMinSize 4, MinSize 3, MaxSize 10
I0912 19:52:29.070294       1 scale_up.go:694] ScaleUpToMeetNodeGroupMinSize: scale up not needed

I0912 19:52:29.070312       1 static_autoscaler.go:515] Calculating unneeded nodes
I0912 19:52:29.070368       1 pre_filtering_processor.go:64] GetScaleDownCandidates: adding aks-agentpool-22992539-vmss00001o as it has scale-down-requested annotation

I0912 19:52:29.070487       1 legacy.go:379] Node aks-agentpool-22992539-vmss00001o is requested to be scaled down
I0912 19:52:29.070682       1 legacy.go:452] Finding additional 3 candidates for scale down.
I0912 19:52:29.070696       1 cluster.go:160] aks-agentpool-22992539-vmss00001o for removal
I0912 19:52:29.070764       1 cluster.go:247] Looking for place for kube-system/konnectivity-agent-59c5567b84-5bw8v
I0912 19:52:29.070829       1 cluster.go:266] Pod kube-system/konnectivity-agent-59c5567b84-5bw8v can be moved to aks-agentpool-22992539-vmss00001t
I0912 19:52:29.070850       1 cluster.go:247] Looking for place for kube-system/coredns-6856d58c9d-9cs68
I0912 19:52:29.070909       1 cluster.go:266] Pod kube-system/coredns-6856d58c9d-9cs68 can be moved to aks-agentpool-22992539-vmss00001w
I0912 19:52:29.070926       1 cluster.go:247] Looking for place for kube-system/metrics-server-69559866b8-cjjlx
I0912 19:52:29.070981       1 cluster.go:266] Pod kube-system/metrics-server-69559866b8-cjjlx can be moved to aks-agentpool-22992539-vmss00001t
I0912 19:52:29.070996       1 cluster.go:182] node aks-agentpool-22992539-vmss00001o may be removed

I0912 19:52:29.071583       1 legacy.go:517] aks-agentpool-22992539-vmss00001o is unneeded since 2022-09-12 19:52:29.014669634 +0000 UTC m=+1126.492898830 duration 0s
I0912 19:52:29.071600       1 legacy.go:517] aks-agentpool-22992539-vmss00001t is unneeded since 2022-09-12 19:52:29.014669634 +0000 UTC m=+1126.492898830 duration 0s
I0912 19:52:29.071609       1 legacy.go:517] aks-agentpool-22992539-vmss00001w is unneeded since 2022-09-12 19:52:29.014669634 +0000 UTC m=+1126.492898830 duration 0s
I0912 19:52:29.071629       1 static_autoscaler.go:558] Scale down status: lastScaleUpTime=2022-09-12 19:52:17.181240736 +0000 UTC m=+1114.659469832 lastScaleDownDeleteTime=2022-09-12 19:44:15.051545164 +0000 UTC m=+632.529774260 lastScaleDownFailTime=2022-09-12 18:34:02.514967734 +0000 UTC m=-3580.006803070 scaleDownForbidden=false scaleDownInCooldown=true
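As an aside, the length of this cooldown is controlled by the existing --scale-down-delay-after-add flag. A minimal sketch of adding it to the arguments shown earlier (the 2m value is purely illustrative, and the other flags from the full spec are omitted):

```shell
./cluster-autoscaler \
  --cloud-provider=azure \
  --nodes=3:10:aks-agentpool-22992539-vmss \
  --scale-up-to-meet-node-group-min-size-enabled \
  --scale-down-delay-after-add=2m   # default is 10m
```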

In the next round, after the scale-down cooldown period, the node aks-agentpool-22992539-vmss00001o was removed as expected. The cluster had more than one candidate, but the node with the scale-down-requested annotation had higher priority to be scaled down.

I0912 20:02:21.629980       1 legacy.go:517] aks-agentpool-22992539-vmss00001x is unneeded since 2022-09-12 19:59:40.871272841 +0000 UTC m=+1558.349501937 duration 2m40.75584183s
I0912 20:02:21.629990       1 legacy.go:517] aks-agentpool-22992539-vmss00001o is unneeded since 2022-09-12 19:52:29.014669634 +0000 UTC m=+1126.492898830 duration 9m52.612444937s
I0912 20:02:21.630000       1 legacy.go:517] aks-agentpool-22992539-vmss00001t is unneeded since 2022-09-12 19:59:40.871272841 +0000 UTC m=+1558.349501937 duration 2m40.75584183s
I0912 20:02:21.630015       1 legacy.go:517] aks-agentpool-22992539-vmss00001w is unneeded since 2022-09-12 19:59:40.871272841 +0000 UTC m=+1558.349501937 duration 2m40.75584183s
I0912 20:02:21.630036       1 static_autoscaler.go:558] Scale down status: lastScaleUpTime=2022-09-12 19:52:17.181240736 +0000 UTC m=+1114.659469832 lastScaleDownDeleteTime=2022-09-12 19:44:15.051545164 +0000 UTC m=+632.529774260 lastScaleDownFailTime=2022-09-12 18:34:02.514967734 +0000 UTC m=-3580.006803070 scaleDownForbidden=false scaleDownInCooldown=false
I0912 20:02:21.630067       1 static_autoscaler.go:567] Starting scale down

I0912 20:02:21.630121       1 legacy.go:635] aks-agentpool-22992539-vmss00001x was unneeded for 2m40.75584183s
I0912 20:02:21.630171       1 legacy.go:635] aks-agentpool-22992539-vmss00001o was unneeded for 9m52.612444937s
I0912 20:02:21.630224       1 legacy.go:669] Including aks-agentpool-22992539-vmss00001o - node has scale-down-requested annotation
I0912 20:02:21.630230       1 legacy.go:635] aks-agentpool-22992539-vmss00001t was unneeded for 2m40.75584183s
I0912 20:02:21.630283       1 legacy.go:635] aks-agentpool-22992539-vmss00001w was unneeded for 2m40.75584183s

I0912 20:02:21.630363       1 cluster.go:160] aks-agentpool-22992539-vmss00001o for removal
I0912 20:02:21.630407       1 cluster.go:247] Looking for place for kube-system/konnectivity-agent-59c5567b84-5bw8v
I0912 20:02:21.630456       1 cluster.go:251] Pod kube-system/konnectivity-agent-59c5567b84-5bw8v can be moved to aks-agentpool-22992539-vmss00001t
I0912 20:02:21.630478       1 cluster.go:247] Looking for place for kube-system/coredns-6856d58c9d-9cs68
I0912 20:02:21.630509       1 cluster.go:251] Pod kube-system/coredns-6856d58c9d-9cs68 can be moved to aks-agentpool-22992539-vmss00001w
I0912 20:02:21.630536       1 cluster.go:182] node aks-agentpool-22992539-vmss00001o may be removed
I0912 20:02:21.698354       1 delete.go:103] Successfully added ToBeDeletedTaint on node aks-agentpool-22992539-vmss00001o

I0912 20:02:21.698467       1 actuator.go:194] Scale-down: removing node aks-agentpool-22992539-vmss00001o, utilization: {0.11139896373056994 0.02460904152806767 0 cpu 0.11139896373056994}, pods to reschedule: konnectivity-agent-59c5567b84-5bw8v,coredns-6856d58c9d-9cs68

I0912 20:02:32.143914       1 drain.go:151] All pods removed from aks-agentpool-22992539-vmss00001o
I0912 20:02:32.144060       1 azure_scale_set.go:351] Deleting vmss instances [azure:///subscriptions/3b96dd57-b968-4e2b-8ad7-43bf473caf64/resourceGroups/mc_xint-t0910_xint-t0910-aks_westus/providers/Microsoft.Compute/virtualMachineScaleSets/aks-agentpool-22992539-vmss/virtualMachines/60]
I0912 20:02:32.144185       1 azure_scale_set.go:401] Calling virtualMachineScaleSetsClient.DeleteInstancesAsync(&[60])
I0912 20:02:32.265154       1 azure_scale_set.go:184] Calling virtualMachineScaleSetsClient.WaitForDeleteInstancesResult(&[60]) for aks-agentpool-22992539-vmss

In addition, after the test container image was deployed, all pods in the kube-system namespace kept running without any restarts.

(base) ➜  ~ kubectl get pods -o wide -n kube-system
NAME                                  READY   STATUS    RESTARTS   AGE     IP            NODE                                NOMINATED NODE   READINESS GATES
azure-ip-masq-agent-4c28k             1/1     Running   0          9h      10.224.0.6    aks-agentpool-22992539-vmss00001t   <none>           <none>
azure-ip-masq-agent-j5ts6             1/1     Running   0          8h      10.224.0.9    aks-agentpool-22992539-vmss00001w   <none>           <none>
azure-ip-masq-agent-kk849             1/1     Running   0          7h54m   10.224.0.7    aks-agentpool-22992539-vmss00001y   <none>           <none>
cloud-node-manager-4jdk5              1/1     Running   0          7h54m   10.224.0.7    aks-agentpool-22992539-vmss00001y   <none>           <none>
cloud-node-manager-94kbz              1/1     Running   0          9h      10.224.0.6    aks-agentpool-22992539-vmss00001t   <none>           <none>
cloud-node-manager-gr6q6              1/1     Running   0          8h      10.224.0.9    aks-agentpool-22992539-vmss00001w   <none>           <none>
cluster-autoscaler-85dfdff7c7-gz52v   1/1     Running   0          8h      10.244.68.4   aks-agentpool-22992539-vmss00001w   <none>           <none>
coredns-6856d58c9d-6d9nq              1/1     Running   0          9h      10.244.65.3   aks-agentpool-22992539-vmss00001t   <none>           <none>
coredns-6856d58c9d-qkws6              1/1     Running   0          7h41m   10.244.70.3   aks-agentpool-22992539-vmss00001y   <none>           <none>
coredns-autoscaler-559d556687-vqpch   1/1     Running   0          8h      10.244.68.5   aks-agentpool-22992539-vmss00001w   <none>           <none>
csi-azuredisk-node-jd8kw              3/3     Running   0          7h54m   10.224.0.7    aks-agentpool-22992539-vmss00001y   <none>           <none>
csi-azuredisk-node-nmkkk              3/3     Running   0          8h      10.224.0.9    aks-agentpool-22992539-vmss00001w   <none>           <none>
csi-azuredisk-node-zglfl              3/3     Running   0          9h      10.224.0.6    aks-agentpool-22992539-vmss00001t   <none>           <none>
csi-azurefile-node-9hn6s              3/3     Running   0          7h54m   10.224.0.7    aks-agentpool-22992539-vmss00001y   <none>           <none>
csi-azurefile-node-n6ngs              3/3     Running   0          8h      10.224.0.9    aks-agentpool-22992539-vmss00001w   <none>           <none>
csi-azurefile-node-n8x82              3/3     Running   0          9h      10.224.0.6    aks-agentpool-22992539-vmss00001t   <none>           <none>
konnectivity-agent-59c5567b84-9zs6h   1/1     Running   0          8h      10.244.65.6   aks-agentpool-22992539-vmss00001t   <none>           <none>
konnectivity-agent-59c5567b84-h6lqz   1/1     Running   0          7h41m   10.244.68.7   aks-agentpool-22992539-vmss00001w   <none>           <none>
kube-proxy-65mt6                      1/1     Running   0          9h      10.224.0.6    aks-agentpool-22992539-vmss00001t   <none>           <none>
kube-proxy-crg77                      1/1     Running   0          8h      10.224.0.9    aks-agentpool-22992539-vmss00001w   <none>           <none>
kube-proxy-gj9dh                      1/1     Running   0          7h54m   10.224.0.7    aks-agentpool-22992539-vmss00001y   <none>           <none>
metrics-server-69559866b8-df4sr       2/2     Running   0          7h34m   10.244.68.8   aks-agentpool-22992539-vmss00001w   <none>           <none>
metrics-server-69559866b8-ghlxg       2/2     Running   0          7h34m   10.244.70.4   aks-agentpool-22992539-vmss00001y   <none>           <none>

@liuxintong
Contributor Author

@feiskyer / @x13n - Could you help review this PR? Thanks.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 15, 2022
@liuxintong
Contributor Author

@x13n - I see you are doing some refactoring; please let me know if you have any concerns about this feature before I rebase my PR.

@x13n
Member

x13n commented Sep 16, 2022

Hi @liuxintong , thanks for sending the change! I'll try to do a proper review next week. I'm indeed moving quite a lot of scale down logic around, so it will be better to wait until then with merging this PR. One high level comment I have for now is that maybe it makes sense to split this into 2 PRs? Optional enforcement of min size is a feature in itself and doesn't interfere with my changes, so maybe it could be done first.

@liuxintong
Contributor Author

Thanks @x13n! Splitting this into 2 pull requests also makes sense to me. I'll do that in a new PR. Please let me know once your scale-down optimization is done, so that I can implement the new feature based on your changes.

@MarcPow

MarcPow commented Nov 9, 2022

@x13n is the conflicting refactor now complete?

@x13n
Member

x13n commented Nov 10, 2022

Yes it is! The work here shouldn't be blocked on anything now.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 3, 2022
@liuxintong liuxintong changed the title Support explicitly tagging a node for safe eviction and removal Support scaling down a particular node with graceful termination Dec 3, 2022
@liuxintong
Contributor Author

/retest

@k8s-ci-robot
Contributor

@liuxintong: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@liuxintong
Contributor Author

/ok-to-test

@k8s-ci-robot
Contributor

@liuxintong: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@liuxintong
Contributor Author

/test all

@k8s-ci-robot
Contributor

@liuxintong: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test all

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@liuxintong
Contributor Author

In addition to unit tests, this PR has been verified in an Azure Kubernetes cluster. The following example shows how CA scales down a node that hosts the cluster-autoscaler pod.

@x13n / @MarcPow / @feiskyer, this PR is ready for review, please help take a look, thanks!

(base) ➜  ~ kubectl get node
NAME                                STATUS   ROLES   AGE     VERSION
aks-agentpool-17091672-vmss00002q   Ready    agent   66m     v1.24.3
aks-agentpool-17091672-vmss00002u   Ready    agent   9m16s   v1.24.3
aks-agentpool-17091672-vmss00002v   Ready    agent   88s     v1.24.3

(base) ➜  ~ kubectl get pod -l app=cluster-autoscaler -n kube-system -o wide
NAME                                  READY   STATUS    RESTARTS   AGE     IP            NODE                                NOMINATED NODE   READINESS GATES
cluster-autoscaler-75c6cfc6bd-6z28d   1/1     Running   0          4m58s   10.244.92.2   aks-agentpool-17091672-vmss00002u   <none>           <none>

(base) ➜  ~ kubectl annotate node aks-agentpool-17091672-vmss00002u cluster-autoscaler.kubernetes.io/scale-down-requested=5
node/aks-agentpool-17091672-vmss00002u annotated

(base) ➜  ~ kubectl logs -n kube-system cluster-autoscaler-75c6cfc6bd-6z28d -f > tmpf.log

I1206 08:13:34.678485       1 static_autoscaler.go:524] Calculating unneeded nodes
I1206 08:13:34.678864       1 eligibility.go:140] Node aks-agentpool-17091672-vmss00002u is removable (cpu utilization 0.155699), because it has scale-down-requested annotation

I1206 08:13:34.679567       1 nodes.go:84] aks-agentpool-17091672-vmss00002v is unneeded since 2022-12-06 08:12:24.397596394 +0000 UTC m=+594.022400670 duration 1m10.279817063s
I1206 08:13:34.679578       1 nodes.go:84] aks-agentpool-17091672-vmss00002q is unneeded since 2022-12-06 08:13:24.649375971 +0000 UTC m=+654.274180147 duration 10.028037586s
I1206 08:13:34.679584       1 nodes.go:84] aks-agentpool-17091672-vmss00002u is unneeded since 2022-12-06 08:08:03.447287777 +0000 UTC m=+333.072092053 duration 5m31.23012568s

I1206 08:13:34.679775       1 static_autoscaler.go:581] Starting scale down
I1206 08:13:34.775157       1 delete.go:103] Successfully added ToBeDeletedTaint on node aks-agentpool-17091672-vmss00002u
I1206 08:13:34.775266       1 actuator.go:212] Scale-down: removing node aks-agentpool-17091672-vmss00002u, utilization: {0.15569948186528498 0.056600795514555644 0 cpu 0.15569948186528498}, pods to reschedule: coredns-59b6bf8b4f-26xrt,coredns-autoscaler-5655d66f64-zcmpn,metrics-server-5f8d84558d-qmrh2,konnectivity-agent-598b769b5c-jfljw,cluster-autoscaler-75c6cfc6bd-6z28d

I1206 08:13:34.775843       1 scale_up.go:485] ScaleUpToNodeGroupMinSize: node aks-agentpool-17091672-vmss00002u in node group aks-agentpool-17091672-vmss has scale-down-requested annotation
I1206 08:13:34.775856       1 scale_up.go:490] ScaleUpToNodeGroupMinSize: NodeGroup aks-agentpool-17091672-vmss, TargetSize 3, MinSize 2, MaxSize 10, ScaleDownRequested 1
I1206 08:13:34.775863       1 scale_up.go:535] ScaleUpToNodeGroupMinSize: scale up not needed

I1206 08:13:39.776594       1 drain.go:234] Overriding max graceful termination seconds of pod kube-system/coredns-autoscaler-5655d66f64-zcmpn from 30 to 5, because node aks-agentpool-17091672-vmss00002u has scale-down-requested annotation
I1206 08:13:39.776626       1 drain.go:234] Overriding max graceful termination seconds of pod kube-system/metrics-server-5f8d84558d-qmrh2 from 30 to 5, because node aks-agentpool-17091672-vmss00002u has scale-down-requested annotation
I1206 08:13:39.776640       1 drain.go:234] Overriding max graceful termination seconds of pod kube-system/cluster-autoscaler-75c6cfc6bd-6z28d from 30 to 5, because node aks-agentpool-17091672-vmss00002u has scale-down-requested annotation
I1206 08:13:39.776695       1 drain.go:234] Overriding max graceful termination seconds of pod kube-system/konnectivity-agent-598b769b5c-jfljw from 30 to 5, because node aks-agentpool-17091672-vmss00002u has scale-down-requested annotation
I1206 08:13:39.776627       1 drain.go:234] Overriding max graceful termination seconds of pod kube-system/coredns-59b6bf8b4f-26xrt from 30 to 5, because node aks-agentpool-17091672-vmss00002u has scale-down-requested annotation
I1206 08:13:39.776595       1 drain.go:234] Overriding max graceful termination seconds of pod kube-system/csi-azuredisk-node-48xng from 30 to 5, because node aks-agentpool-17091672-vmss00002u has scale-down-requested annotation
I1206 08:13:39.776607       1 drain.go:234] Overriding max graceful termination seconds of pod kube-system/cloud-node-manager-rtwfq from 30 to 5, because node aks-agentpool-17091672-vmss00002u has scale-down-requested annotation
I1206 08:13:39.776840       1 drain.go:234] Overriding max graceful termination seconds of pod kube-system/csi-azurefile-node-njxtc from 30 to 5, because node aks-agentpool-17091672-vmss00002u has scale-down-requested annotation
I1206 08:13:39.776611       1 drain.go:234] Overriding max graceful termination seconds of pod kube-system/kube-proxy-dh2pg from 30 to 5, because node aks-agentpool-17091672-vmss00002u has scale-down-requested annotation
I1206 08:13:39.776617       1 drain.go:234] Overriding max graceful termination seconds of pod kube-system/azure-ip-masq-agent-6m6gc from 30 to 5, because node aks-agentpool-17091672-vmss00002u has scale-down-requested annotation

I1206 08:13:40.183285       1 main.go:349] Cleaned up, exiting...

(base) ➜  ~ kubectl get node
NAME                                STATUS     ROLES   AGE   VERSION
aks-agentpool-17091672-vmss00002q   Ready      agent   76m   v1.24.3
aks-agentpool-17091672-vmss00002u   NotReady   agent   19m   v1.24.3
aks-agentpool-17091672-vmss00002v   Ready      agent   11m   v1.24.3
aks-agentpool-17091672-vmss00002w   Ready      agent   33s   v1.24.3

(base) ➜  ~ kubectl get pod -l app=cluster-autoscaler -n kube-system -o wide
NAME                                  READY   STATUS    RESTARTS   AGE     IP            NODE                                NOMINATED NODE   READINESS GATES
cluster-autoscaler-75c6cfc6bd-7f2tc   1/1     Running   0          4m13s   10.244.93.5   aks-agentpool-17091672-vmss00002v   <none>           <none>

(base) ➜  ~ kubectl logs -n kube-system cluster-autoscaler-75c6cfc6bd-7f2tc > tmp.log

I1206 08:14:14.396478       1 delete.go:197] Releasing taint {Key:ToBeDeletedByClusterAutoscaler Value:1670314414 Effect:NoSchedule TimeAdded:<nil>} on node aks-agentpool-17091672-vmss00002u
I1206 08:14:14.489510       1 delete.go:228] Successfully released ToBeDeletedTaint on node aks-agentpool-17091672-vmss00002u

I1206 08:14:14.491200       1 static_autoscaler.go:524] Calculating unneeded nodes
I1206 08:14:14.491426       1 eligibility.go:140] Node aks-agentpool-17091672-vmss00002u is removable (cpu utilization 0.080311), because it has scale-down-requested annotation

I1206 08:14:14.491890       1 static_autoscaler.go:581] Starting scale down
I1206 08:14:14.491976       1 nodes.go:192] Node aks-agentpool-17091672-vmss00002u is removable, because it has scale-down-requested annotation
I1206 08:14:14.591497       1 delete.go:103] Successfully added ToBeDeletedTaint on node aks-agentpool-17091672-vmss00002u
I1206 08:14:14.591543       1 actuator.go:161] Scale-down: removing empty node "aks-agentpool-17091672-vmss00002u"

I1206 08:14:19.815745       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"e1c2d971-7210-40cc-9404-38387d3aa23f", APIVersion:"v1", ResourceVersion:"33886671", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: empty node aks-agentpool-17091672-vmss00002u removed

@x13n
Member

x13n commented Dec 19, 2022

/assign

@x13n (Member) left a comment

I started to review the code and you can see a bunch of my comments on some of the files, but I've actually started to doubt this is something we should be adding to Cluster Autoscaler. This isn't really about autoscaling, it is about automatic node repairs. If I understand correctly, the use case here is to manually tag certain broken nodes for removal. This looks like something that could already be achieved by cordoning/draining the node with kubectl, followed by VM removal via the cloud provider API. The min size enforcement will then kick in, if necessary. WDYT?
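For concreteness, a rough sketch of the manual alternative described above, assuming an Azure VMSS-backed node group as in this thread; the node name, resource group, VMSS name, and instance ID are placeholders:

```shell
# Stop new pods from landing on the node, then drain it.
kubectl cordon <nodename>
kubectl drain <nodename> --ignore-daemonsets --delete-emptydir-data

# Remove the underlying VM via the cloud provider, e.g. an Azure VMSS instance.
az vmss delete-instances \
  --resource-group <resource-group> \
  --name <vmss-name> \
  --instance-ids <instance-id>
```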

@@ -32,6 +32,7 @@ this document:
* [How can I see all the events from Cluster Autoscaler?](#how-can-i-see-all-events-from-cluster-autoscaler)
* [How can I scale my cluster to just 1 node?](#how-can-i-scale-my-cluster-to-just-1-node)
* [How can I scale a node group to 0?](#how-can-i-scale-a-node-group-to-0)
* [How can I request Clsuter Autoscaler to scale down a particular node?](#how-can-i-request-clsuter-autoscaler-to-scale-down-a-particular-node)
Member

Typo (here and in other places): clsuter->cluster.

Contributor Author

Fixed all typos.


```
kubectl annotate node <nodename> cluster-autoscaler.kubernetes.io/scale-down-requested=30
kubectl annotate node <nodename> cluster-autoscaler.kubernetes.io/scale-down-requested-
```
Member

I'd add a note that while one can remove the annotation, it doesn't guarantee the node won't be removed. If Cluster Autoscaler already started draining the node, removing the annotation will have no effect. I think this is an important caveat to document.
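As an aside, one way to check whether Cluster Autoscaler has already started acting on a node before removing the annotation is to look for its deletion taint (the taint key below is the one visible in the CA logs elsewhere in this PR):

```shell
# Prints the taint if CA has already marked the node for deletion; empty output otherwise.
kubectl get node <nodename> -o jsonpath='{.spec.taints[?(@.key=="ToBeDeletedByClusterAutoscaler")]}'
```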

Contributor Author

Good point. Added the disclaimer.


Starting with CA 1.26.0, nodes will be evicted by CA if it has the annotation requesting scale-down.
* The annotation key is `cluster-autoscaler.kubernetes.io/scale-down-requested`.
* The annotation value is a number representing the max graceful termination seconds for pods hosted on the node.
Member

Would it be possible to rename the annotation so that the meaning of value doesn't require reading the FAQ? I was thinking about something along the lines of cluster-autoscaler.kubernetes.io/enforced-scale-down-graceful-termination-seconds, but this is a bit long, hope you have a better idea :)

Contributor Author

Yeah, that makes sense, I renamed it to cluster-autoscaler.kubernetes.io/force-scale-down-with-grace-period-minutes, but I'm not sure if this is a better name.

Btw, I've moved from annotations to taints.

continue
}
if len(groupsWithNodes[ng]) == 0 {
groupsWithNodes[ng] = make([]*apiv1.Node, 0)
Member

This is unnecessary, append(nil, node) will already return a single-element slice.

Contributor Author

Yes, you are right.

@@ -486,7 +509,7 @@ func ScaleUpToNodeGroupMinSize(context *context.AutoscalingContext, processors *
continue
}

newNodeCount := ng.MinSize() - targetSize
newNodeCount := ng.MinSize() + scaleDownRequestedCount - targetSize
Member

I'm a bit worried this will cause nodes to be removed right after adding. Consider the following scenario:

  1. Annotation gets added to a node n1
  2. Scale down starts to consider n1 for deletion
  3. This code triggers a scale up and creates node n2
  4. Annotation gets added to a node n3
  5. Scale down considers n2 instead of n3 for deletion, because it is empty
  6. This code triggers a scale up and creates n4
  7. Scale down removes n1 (unneeded long enough)
  8. Scale down removes n2 (unneeded long enough)
  9. This code has to create another replacement for n3

If there's no special handling of annotated nodes here, the scale up to min is purely reactive, which would cause node count to sometimes go below min, but then recover.

Another problematic scenario:

  1. Node n1 gets annotated
  2. Node n1 starts getting drained
  3. This code triggers creation of a new node n2
  4. Pods evicted from n1 are recreated and manage to schedule on other existing nodes in the cluster
  5. Node n2 becomes ready, but is empty and scale down has to delete it

Contributor Author

Thanks for thinking through all the possible scenarios.

Here is the logic we have added to avoid node churn caused by force-scale-down nodes:

  • A scale-up is triggered only if the pods on the force-scale-down node cannot be rescheduled onto existing nodes.
  • A scale-down is triggered only if the node is still unneeded after rescheduling all pods from the force-scale-down nodes.
  • If there are multiple scale-down candidates, the force-scale-down node has higher priority.

@@ -224,6 +225,19 @@ func evictPod(ctx *acontext.AutoscalingContext, podToEvict *apiv1.Pod, isDaemonS
}
}

if utils.HasScaleDownRequestedAnnotation(node) {
Member

The logic to calculate maxTermination becomes quite complicated with this change; please extract it into a separate function.

Contributor Author

Agreed. I've moved all drain-related logic to k8s.io/autoscaler/cluster-autoscaler/simulator/drainability/rules/forcescaledown.

Btw, the drainability rules are a nice piece of refactoring from the past year.

@@ -135,6 +131,16 @@ func (c *Checker) unremovableReasonAndNodeUtilization(context *context.Autoscali
return simulator.NotAutoscaled, nil
}

utilInfo, err := utilization.Calculate(nodeInfo, context.IgnoreDaemonSetsUtilization, context.IgnoreMirrorPodsUtilization, context.CloudProvider.GPULabel(), timestamp)
Member

nit: Why move this?

Contributor Author

I've refactored this part in the new iteration. We now have only a few changes in this file.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 6, 2023
@x13n
Member

x13n commented Feb 13, 2023

Given my reasoning above and lack of activity here, I'm going to close this PR. Please reopen if you disagree.

/close

@k8s-ci-robot
Contributor

@x13n: Closed this PR.

In response to this:

Given my reasoning above and lack of activity here, I'm going to close this PR. Please reopen if you disagree.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@liuxintong
Contributor Author

Reopening as I'm currently working on it.
/reopen

@k8s-ci-robot k8s-ci-robot reopened this Mar 25, 2024
@k8s-ci-robot
Contributor

@liuxintong: Reopened this PR.

In response to this:

Reopening as I'm currently working on it.
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liuxintong
Once this PR has been reviewed and has the lgtm label, please ask for approval from x13n. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 25, 2024
@x13n
Member

x13n commented Mar 27, 2024

Hi @liuxintong ! Can you clarify what is the benefit of having this logic in autoscaler? IIUC the user will observe more or less the same behavior by annotating the pod as if they just did kubectl drain.

@liuxintong (Contributor Author) left a comment

@x13n - I was going to reply to your previous comments, but I didn't have time until today. Apologies for the delay.

Hi @liuxintong ! Can you clarify what is the benefit of having this logic in autoscaler?

Technically, we could develop a new cluster controller to meet all the business needs, but it would duplicate a lot of the logic we already have here, and we would also need to resolve scale-down conflicts between the new controller and cluster-autoscaler. CA already has a mature implementation in this area (scale-down simulation, node draining rules, cloud provider integration, etc.), and I'd like to leverage it to achieve the goal of scaling down a specific node.

Another motivation is that new node provisioning takes longer on Windows than on Linux, and we want to reduce the pod pending time to minimize the impact on services. When we need to scale down a node, we can scale up a new node at the same time. Then the pods can be moved from the old node to the new node as quickly as possible.

IIUC the user will observe more or less the same behavior by annotating the pod as if they just did kubectl drain.

You would be right if we only had a few clusters to operate. However, we manage thousands of clusters, and this simple problem becomes complicated at that scale.

For the 2 options you mentioned, I think the main differences are as follows:

  • kubectl annotate / kubectl taint: it returns immediately, and we can guarantee success with the logic in CA.
  • kubectl drain: we need to track the execution externally, and the node drain might fail (no guarantee).

In addition, all the related logic is gated behind the new flag --force-scale-down-enabled, which defaults to false. Users who don't need it won't notice any difference.

@MarcPow also shared more context in Issue #5109. Please let us know if you have any additional concerns. Thank you!
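For illustration only, a minimal sketch of how the opt-in described above would look on the command line. The --force-scale-down-enabled flag is the one proposed in this PR (default false); it is not part of any released Cluster Autoscaler, and the other arguments are reused from the earlier test setup:

```shell
./cluster-autoscaler \
  --cloud-provider=azure \
  --nodes=3:10:aks-agentpool-22992539-vmss \
  --force-scale-down-enabled=true
```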

// than the configured min size. The source of truth for the current node group
// size is the TargetSize queried directly from cloud providers. Returns
// than the required min size, which is calculated based on the node group min
// size configuration and the number of force-scale-down tainted nodes. Returns
// appropriate status or error if an unexpected error occurred.
func (o *ScaleUpOrchestrator) ScaleUpToNodeGroupMinSize(
Contributor Author

@kisieland / @BigDarkClown - Thanks for reviewing PR #5663. I'm fixing Issue #5624 here, please take another look.

FYI: @mwielgus


@Bryce-Soghigian
Member

I still don't see the benefit over just kubectl drain

@x13n
Member

x13n commented Mar 29, 2024 via email

@sftim
Contributor

sftim commented Mar 29, 2024

If the only difference between kubectl drain and kubectl annotate is the need to wait for actuation, then perhaps kubernetes/enhancements#4212 is going to address this use case better?

That KEP provides a better authorization story; we can allow users (and controllers) to request node drains, and we can allow a controller to trigger draining even without allowing it to write to a node; labelling nodes can break expectations around workload isolation.

@liuxintong if you're willing to contribute to defining that KEP, I think it provides a good way forward. The work you've done on this PR can help ensure that the cluster autoscaler is ready for the arrival of declarative node drains.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 16, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@x13n
Member

x13n commented Jun 6, 2024

I'm closing this one due to inactivity. Looks like long term we can depend on declarative node maintenance for this use case.

/close

@k8s-ci-robot
Contributor

@x13n: Closed this PR.

In response to this:

I'm closing this one due to inactivity. Looks like long term we can depend on declarative node maintenance for this use case.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels

area/cluster-autoscaler, cncf-cla: yes, kind/feature, needs-rebase, size/XXL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support explicitly tagging a node for safe eviction and removal
7 participants