
Support scaling up node groups to the configured min size if needed #5195

Merged
1 commit merged into kubernetes:master on Nov 3, 2022

Conversation

liuxintong
Contributor

@liuxintong liuxintong commented Sep 17, 2022

Which component this PR applies to?

cluster-autoscaler

What type of PR is this?

/kind feature

What this PR does / why we need it:

The node group size can be smaller than the minimum size configured in cluster-autoscaler, but cluster-autoscaler doesn't support scaling the cluster up to the desired state. Here are two common scenarios that can cause this issue.

  • The node group was initially configured with a smaller min size, and the min size was later raised to a bigger number.
  • A node was deleted directly from Kubernetes or from the cloud provider.

To support the scenarios above, we need this feature to scale up node groups that have fewer nodes than the configured node group min size.

Which issue(s) this PR fixes:

Fixes #5162
Fixes #4942

Special notes for your reviewer:

This feature is disabled by default. It has been verified in unit tests, and I'll also test it in a real cluster.

Does this PR introduce a user-facing change?

Introduced a new flag `--enforce-node-group-min-size` to enforce the node group minimum size. For node groups with fewer nodes than the configured minimum, cluster-autoscaler will scale them up to the minimum number of nodes. To enable this feature, set the flag to `true` on the command line.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [FAQ My cluster is below minimum / above maximum number of nodes, but CA did not fix that! Why?]: https://github.com/liuxintong/autoscaler/blob/691273c07c508219ace1f86fd14cec9f0fc90a42/cluster-autoscaler/FAQ.md#my-cluster-is-below-minimum--above-maximum-number-of-nodes-but-ca-did-not-fix-that-why

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 17, 2022
@liuxintong
Contributor Author

@x13n, as discussed in #5187, I've moved the "pure" scale-up feature into this PR. Please take a look.

@liuxintong
Contributor Author

This PR was verified in an AKS cluster with the following steps.

At the beginning, the node group had 3 nodes.

(base) ➜  ~ kubectl get nodes -o wide
NAME                                STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
aks-agentpool-17091672-vmss000000   Ready    agent   74m   v1.24.3   10.224.0.4    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4
aks-agentpool-17091672-vmss000001   Ready    agent   74m   v1.24.3   10.224.0.5    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4
aks-agentpool-17091672-vmss000002   Ready    agent   74m   v1.24.3   10.224.0.6    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4

Then cluster-autoscaler was deployed to the cluster with the following config. Note that the min size of 5 in the command is greater than the node group size at that moment.

          command:
            - ./cluster-autoscaler
            - --v=5
            - --logtostderr=true
            - --cloud-provider=azure
            - --skip-nodes-with-local-storage=false
            - --nodes=5:10:aks-agentpool-17091672-vmss
            - --scale-up-to-node-group-min-size-enabled

As expected, the node group was scaled up to 5 nodes in the first main loop.

I0917 05:10:37.945333       1 static_autoscaler.go:445] No unschedulable pods
I0917 05:10:37.945339       1 azure_scale_set.go:149] VMSS: aks-agentpool-17091672-vmss, returning in-memory size: 3
I0917 05:10:37.945345       1 scale_up.go:658] ScaleUpToNodeGroupMinSize: NodeGroup aks-agentpool-17091672-vmss, TargetSize 3, MinSize 5, MaxSize 10
I0917 05:10:37.945350       1 scale_up.go:677] ScaleUpToNodeGroupMinSize: final scale-up plan: [{aks-agentpool-17091672-vmss 3->5 (max: 10)}]
I0917 05:10:37.945366       1 scale_up.go:769] Scale-up: setting group aks-agentpool-17091672-vmss size to 5
I0917 05:10:37.945397       1 azure_scale_set.go:149] VMSS: aks-agentpool-17091672-vmss, returning in-memory size: 3
I0917 05:10:37.945410       1 azure_scale_set.go:247] Waiting for virtualMachineScaleSetsClient.CreateOrUpdateAsync(aks-agentpool-17091672-vmss)
I0917 05:10:37.945530       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"82516817-75db-4080-81de-7ac28ce71d43", APIVersion:"v1", ResourceVersion:"18713", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group aks-agentpool-17091672-vmss size to 5 instead of 3 (max: 10)

After a few minutes, the node group reached 5 nodes, which means cluster-autoscaler was able to scale up the node group to meet the min size requirement.

(base) ➜  ~ kubectl get nodes -o wide
NAME                                STATUS   ROLES   AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
aks-agentpool-17091672-vmss000000   Ready    agent   83m     v1.24.3   10.224.0.4    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4
aks-agentpool-17091672-vmss000001   Ready    agent   82m     v1.24.3   10.224.0.5    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4
aks-agentpool-17091672-vmss000002   Ready    agent   82m     v1.24.3   10.224.0.6    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4
aks-agentpool-17091672-vmss000003   Ready    agent   4m55s   v1.24.3   10.224.0.7    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4
aks-agentpool-17091672-vmss000004   Ready    agent   5m11s   v1.24.3   10.224.0.8    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4

The next experiment verifies that the node group can be scaled up again when a node is deleted manually.

(base) ➜  ~ kubectl drain --ignore-daemonsets --delete-emptydir-data aks-agentpool-17091672-vmss000001
(base) ➜  ~ kubectl delete node aks-agentpool-17091672-vmss000001
(base) ➜  ~ az vmss delete-instances --instance-ids 1 --name aks-agentpool-17091672-vmss --resource-group MC_xintliu-t0916_xintliu-t0916-aks_westus2

(base) ➜  ~ kubectl get nodes -o wide
NAME                                STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
aks-agentpool-17091672-vmss000000   Ready    agent   90m   v1.24.3   10.224.0.4    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4
aks-agentpool-17091672-vmss000002   Ready    agent   90m   v1.24.3   10.224.0.6    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4
aks-agentpool-17091672-vmss000003   Ready    agent   12m   v1.24.3   10.224.0.7    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4
aks-agentpool-17091672-vmss000004   Ready    agent   12m   v1.24.3   10.224.0.8    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4

Checking the log right after the instance deletion, we saw the node group being scaled up from 4 to 5. After a few minutes, the new node appeared in the node list.

I0917 05:23:43.425330       1 scale_up.go:658] ScaleUpToNodeGroupMinSize: NodeGroup aks-agentpool-17091672-vmss, TargetSize 4, MinSize 5, MaxSize 10
I0917 05:23:43.425339       1 scale_up.go:677] ScaleUpToNodeGroupMinSize: final scale-up plan: [{aks-agentpool-17091672-vmss 4->5 (max: 10)}]
I0917 05:23:43.425351       1 scale_up.go:769] Scale-up: setting group aks-agentpool-17091672-vmss size to 5

(base) ➜  ~ kubectl get nodes -o wide
NAME                                STATUS   ROLES   AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
aks-agentpool-17091672-vmss000000   Ready    agent   94m     v1.24.3   10.224.0.4    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4
aks-agentpool-17091672-vmss000002   Ready    agent   94m     v1.24.3   10.224.0.6    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4
aks-agentpool-17091672-vmss000003   Ready    agent   16m     v1.24.3   10.224.0.7    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4
aks-agentpool-17091672-vmss000004   Ready    agent   16m     v1.24.3   10.224.0.8    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4
aks-agentpool-17091672-vmss000005   Ready    agent   3m49s   v1.24.3   10.224.0.5    <none>        Ubuntu 18.04.6 LTS   5.4.0-1089-azure   containerd://1.6.4+azure-4

In addition, the cluster-autoscaler pod worked well, with no restarts.

(base) ➜  ~ kubectl get pods --all-namespaces -l app=cluster-autoscaler -o wide
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE   IP           NODE                                NOMINATED NODE   READINESS GATES
kube-system   cluster-autoscaler-854466647f-2zkwm   1/1     Running   0          24m   10.244.2.3   aks-agentpool-17091672-vmss000002   <none>           <none>

@mickare

mickare commented Sep 18, 2022

Thanks @liuxintong for the PR! ❤️

I have one question, because I would not expect an additional flag to be required to enable expected behavior.
Isn't producing the desired state the expected default behavior of the autoscaler?

@liuxintong
Contributor Author

Hi @mickare, I disabled the feature by default just in case others have dependencies on the old "unexpected" behavior. What you said also makes sense to me; I'm open to suggestions.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 6, 2022
@liuxintong
Contributor Author

@x13n, thanks for the review! I've addressed all comments in the latest iteration, please take another look.

@liuxintong
Contributor Author

Fixed the Go lint issues and the build.

@liuxintong liuxintong requested review from x13n and removed request for feiskyer October 6, 2022 06:59
@liuxintong
Contributor Author

/auto-cc

@liuxintong
Contributor Author

/cc feiskyer towca Jeffwan

@liuxintong
Contributor Author

@x13n, thanks again for your code review. The latest iteration resolves all comments. Could you please take another look? Let me know if anybody else needs to be involved in this PR.

@x13n
Member

@x13n left a comment

Overall LGTM, just two minor comments.

@liuxintong
Contributor Author

Resolved git conflicts.

@liuxintong
Contributor Author

/assign @x13n

@x13n
Member

x13n commented Oct 25, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 25, 2022
@liuxintong
Contributor Author

/assign feiskyer towca

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 2, 2022
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 3, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 3, 2022
@liuxintong
Contributor Author

Rebased to resolve git conflicts with master.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: feiskyer, liuxintong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 3, 2022
@feiskyer
Member

feiskyer commented Nov 3, 2022

Thanks for contributing the feature and squashing the commits.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 3, 2022
@k8s-ci-robot k8s-ci-robot merged commit de56060 into kubernetes:master Nov 3, 2022
@liuxintong
Contributor Author

Thanks @feiskyer and @x13n for reviewing and approving the pull request!
