Support scaling down a particular node with graceful termination #5187
Conversation
Welcome @liuxintong!
All changes in this PR have been verified in the following AKS cluster
The autoscaler on the AKS side was turned off intentionally, and a test
By the way, here was the main container image spec defined in
Once the
The next step was checking the container logs periodically. In the first round, the node group was scaled up from 3 to 4, although it had no unschedulable pods. This is by design, because we need a new surge node in order to scale down the tagged node.
In the second round, the node with
In the next round, after the scale-down cooldown period, the node
In addition, after deploying the test container image, all pods in
@x13n - I see you are doing some refactoring; please let me know if you have any concerns about this feature before I rebase my PR.
Hi @liuxintong , thanks for sending the change! I'll try to do a proper review next week. I'm indeed moving quite a lot of scale down logic around, so it will be better to wait until then before merging this PR. One high level comment I have for now is that maybe it makes sense to split this into 2 PRs? Optional enforcement of min size is a feature in itself and doesn't interfere with my changes, so maybe it could be done first.
Thanks @x13n! Splitting into 2 pull requests also makes sense to me. I'll do that in a new PR. Please let me know once your scale down optimization is done, so that I can implement the new feature based on your changes.
@x13n is the conflicting refactor now complete?
Yes it is! The work here shouldn't be blocked on anything now.
/retest
@liuxintong: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message.
/ok-to-test
@liuxintong: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message.
/test all
@liuxintong: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message.
In addition to unit tests, this PR has been verified in an Azure Kubernetes cluster. The following example shows how CA scales down a node that hosts the cluster-autoscaler pod. @x13n / @MarcPow / @feiskyer, this PR is ready for review; please take a look, thanks!
/assign
I started to review the code and you can see a bunch of my comments on some of the files, but I actually started to doubt this is something we should be adding to Cluster Autoscaler. This isn't really about autoscaling, it is about automatic node repairs. If I understand correctly, the use case here is to manually tag certain broken nodes for removal. This looks like something that could already be achieved by cordoning/draining the node with kubectl, followed by VM removal via the cloud provider API. The min size enforcement will then kick in, if necessary. WDYT?
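For context, a rough client-go sketch of the manual path described above. It covers only the cordon step; the kubeconfig loading, function names, and node name are illustrative assumptions, and a real workflow would also drain the pods and delete the VM through the cloud provider API.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// cordon marks a node unschedulable, the programmatic equivalent of
// `kubectl cordon <node>`. Draining the pods and removing the VM via the
// cloud provider API would follow as separate steps (not shown).
func cordon(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	// Assumes a local kubeconfig; in-cluster config would work the same way.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(config)
	if err := cordon(context.Background(), cs, "broken-node-1"); err != nil {
		fmt.Println("cordon failed:", err)
	}
}
```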
cluster-autoscaler/FAQ.md (Outdated)
@@ -32,6 +32,7 @@ this document:
 * [How can I see all the events from Cluster Autoscaler?](#how-can-i-see-all-events-from-cluster-autoscaler)
 * [How can I scale my cluster to just 1 node?](#how-can-i-scale-my-cluster-to-just-1-node)
 * [How can I scale a node group to 0?](#how-can-i-scale-a-node-group-to-0)
+* [How can I request Clsuter Autoscaler to scale down a particular node?](#how-can-i-request-clsuter-autoscaler-to-scale-down-a-particular-node)
Typo (here and in other places): clsuter->cluster.
Fixed all typos.
cluster-autoscaler/FAQ.md (Outdated)
```
kubectl annotate node <nodename> cluster-autoscaler.kubernetes.io/scale-down-requested=30
kubectl annotate node <nodename> cluster-autoscaler.kubernetes.io/scale-down-requested-
```
I'd add a note that while one can remove the annotation, it doesn't guarantee the node won't be removed. If Cluster Autoscaler already started draining the node, removing the annotation will have no effect. I think this is an important caveat to document.
Good point. Added the disclaimer.
cluster-autoscaler/FAQ.md (Outdated)
Starting with CA 1.26.0, nodes will be evicted by CA if it has the annotation requesting scale-down.
* The annotation key is `cluster-autoscaler.kubernetes.io/scale-down-requested`.
* The annotation value is a number representing the max graceful termination seconds for pods hosted on the node.
Would it be possible to rename the annotation so that the meaning of the value doesn't require reading the FAQ? I was thinking about something along the lines of `cluster-autoscaler.kubernetes.io/enforced-scale-down-graceful-termination-seconds`, but this is a bit long, hope you have a better idea :)
Yeah, that makes sense. I renamed it to `cluster-autoscaler.kubernetes.io/force-scale-down-with-grace-period-minutes`, but I'm not sure if this is a better name.
Btw, I've moved from annotations to taints.
cluster-autoscaler/core/scale_up.go (Outdated)
 continue
 }
 if len(groupsWithNodes[ng]) == 0 {
 groupsWithNodes[ng] = make([]*apiv1.Node, 0)
This is unnecessary, `append(nil, node)` will already return a single-element slice.
Yes, you are right.
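Minimal, self-contained Go illustration of that point; the map and node names are made up for the example and are not the PR's actual types.

```go
package main

import "fmt"

func main() {
	// append accepts a nil slice as its first argument and allocates the
	// backing array on demand, so pre-initializing the map entry with
	// make([]T, 0) is redundant.
	groups := map[string][]string{}
	groups["ng-1"] = append(groups["ng-1"], "node-1") // groups["ng-1"] was nil here
	fmt.Println(len(groups["ng-1"]))                  // prints 1
}
```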
cluster-autoscaler/core/scale_up.go (Outdated)
@@ -486,7 +509,7 @@ func ScaleUpToNodeGroupMinSize(context *context.AutoscalingContext, processors *
 continue
 }
-newNodeCount := ng.MinSize() - targetSize
+newNodeCount := ng.MinSize() + scaleDownRequestedCount - targetSize
I'm a bit worried this will cause nodes to be removed right after being added. Consider the following scenario:
- Annotation gets added to a node n1
- Scale down starts to consider n1 for deletion
- This code triggers a scale up and creates node n2
- Annotation gets added to a node n3
- Scale down considers n2 instead of n3 for deletion, because it is empty
- This code triggers a scale up and creates n4
- Scale down removes n1 (unneeded long enough)
- Scale down removes n2 (unneeded long enough)
- This code has to create another replacement for n3
If there's no special handling of annotated nodes here, the scale up to min is purely reactive, which would cause the node count to sometimes go below min, but then recover.
Another problematic scenario:
- Node n1 gets annotated
- Node n1 starts getting drained
- This code triggers creation of a new node n2
- Pods evicted from n1 are recreated and manage to schedule on other existing nodes in the cluster
- Node n2 becomes ready, but is empty and scale down has to delete it
Thanks for thinking through all possible scenarios.
Here is the logic we have added to avoid node churn caused by the force-scale-down nodes:
- The scale-up will be triggered only if the pods on the force-scale-down node cannot be rescheduled on existing nodes.
- The scale-down will be triggered only if the node is still unneeded after rescheduling all pods from the force-scale-down nodes.
- If the scale-down candidates include multiple nodes, the force-scale-down node has higher priority (see the sketch below).
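A rough sketch of that last point, prioritizing force-scale-down nodes among scale-down candidates; the taint key is the one mentioned earlier in this conversation, while the helper names and sorting approach are assumptions rather than the PR's actual code.

```go
package candidates

import (
	"sort"

	apiv1 "k8s.io/api/core/v1"
)

const forceScaleDownTaint = "cluster-autoscaler.kubernetes.io/force-scale-down-with-grace-period-minutes"

func hasForceScaleDownTaint(n *apiv1.Node) bool {
	for _, t := range n.Spec.Taints {
		if t.Key == forceScaleDownTaint {
			return true
		}
	}
	return false
}

// prioritize moves force-scale-down nodes to the front of the candidate list
// while keeping the relative order of the remaining candidates.
func prioritize(nodes []*apiv1.Node) {
	sort.SliceStable(nodes, func(i, j int) bool {
		return hasForceScaleDownTaint(nodes[i]) && !hasForceScaleDownTaint(nodes[j])
	})
}
```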
@@ -224,6 +225,19 @@ func evictPod(ctx *acontext.AutoscalingContext, podToEvict *apiv1.Pod, isDaemonS
 }
 }
 if utils.HasScaleDownRequestedAnnotation(node) {
The logic to calculate `maxTermination` becomes quite complicated with this change, please extract it to a separate function.
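To make that suggestion concrete, a hypothetical shape for such a helper; the function name, parameters, and per-node override are illustrative assumptions and do not mirror the PR's actual implementation.

```go
package drain

import apiv1 "k8s.io/api/core/v1"

// maxGracefulTerminationSeconds picks the effective grace period for evicting
// a pod: the configured default, optionally overridden for the node, and
// always capped by the pod's own terminationGracePeriodSeconds if that is lower.
func maxGracefulTerminationSeconds(defaultSec int64, nodeOverrideSec *int64, pod *apiv1.Pod) int64 {
	maxTermination := defaultSec
	if nodeOverrideSec != nil {
		maxTermination = *nodeOverrideSec
	}
	if pod.Spec.TerminationGracePeriodSeconds != nil && *pod.Spec.TerminationGracePeriodSeconds < maxTermination {
		maxTermination = *pod.Spec.TerminationGracePeriodSeconds
	}
	return maxTermination
}
```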
Agreed. I've moved all draining related logic to `k8s.io/autoscaler/cluster-autoscaler/simulator/drainability/rules/forcescaledown`.
Btw, the drainability rules are a nice refactoring from the past year.
@@ -135,6 +131,16 @@ func (c *Checker) unremovableReasonAndNodeUtilization(context *context.Autoscali
 return simulator.NotAutoscaled, nil
 }
 utilInfo, err := utilization.Calculate(nodeInfo, context.IgnoreDaemonSetsUtilization, context.IgnoreMirrorPodsUtilization, context.CloudProvider.GPULabel(), timestamp)
nit: Why move this?
I've refactored this part in the new iteration. Only a few changes remain in this file.
Given my reasoning above and lack of activity here, I'm going to close this PR. Please reopen if you disagree. /close
@x13n: Closed this PR.
Reopening as I'm currently working on it. |
@liuxintong: Reopened this PR.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: liuxintong.
Needs approval from an approver in each of these files.
Hi @liuxintong ! Can you clarify what is the benefit of having this logic in autoscaler? IIUC the user will observe more or less the same behavior by annotating the pod as if they just did `kubectl drain`.
@x13n - I was going to reply to your previous comments, but I didn't get any time until today. Apologies for the delay.
Hi @liuxintong ! Can you clarify what is the benefit of having this logic in autoscaler?
Technically, we could develop a new cluster controller to meet all the business needs, but it would duplicate a lot of the logic we already have here, and we would also need to resolve scale-down conflicts between the new controller and cluster-autoscaler. CA already has a mature implementation in this area (scale-down simulation, node draining rules, cloud provider integration, etc.), and I'd like to leverage it to achieve the goal of scaling down a specific node.
Another motivation is that new node provisioning takes longer on Windows than on Linux, and we want to reduce the pod pending time to minimize the impact on services. When we need to scale down a node, we can scale up a new node at the same time. Then the pods can be moved from the old node to the new node as quickly as possible.
IIUC the user will observe more or less the same behavior by annotating the pod as if they just did `kubectl drain`.
You are right if we only have a few clusters to operate. However, we manage thousands of clusters, and this simple problem gets complicated at that scale.
For the 2 options you mentioned, I think the main differences are as follows:
- `kubectl annotate` / `kubectl taint`: it returns immediately, and we can guarantee success with the logic in CA.
- `kubectl drain`: we need to track the execution externally, and the node draining might fail (no guarantee).
In addition, all related code paths are controlled by the new flag `--force-scale-down-enabled`, whose default value is `false`. If others don't need it, they won't notice any difference.
@MarcPow also shared more context in Issue #5109. Please let us know if you have any additional concerns. Thank you!
-// than the configured min size. The source of truth for the current node group
-// size is the TargetSize queried directly from cloud providers. Returns
+// than the required min size, which is calculated based on the node group min
+// size configuration and the number of force-scale-down tainted nodes. Returns
 // appropriate status or error if an unexpected error occurred.
 func (o *ScaleUpOrchestrator) ScaleUpToNodeGroupMinSize(
@kisieland / @BigDarkClown - Thanks for reviewing PR #5663. I'm fixing Issue #5624 here, please take another look.
FYI: @mwielgus
I still don't see the benefit over just `kubectl drain`.
If the only difference between `kubectl drain` and `kubectl annotate` is the need to wait for actuation, then perhaps kubernetes/enhancements#4212 is going to address this use case better? I think once this KEP lands, Cluster Autoscaler should start relying on it as well.
That KEP provides a better authorization story; we can allow users (and controllers) to request node drains, and we can allow a controller to trigger draining even without allowing it to write to a node; labelling nodes can break expectations around workload isolation. @liuxintong if you're willing to contribute to defining that KEP, I think it provides a good way forward. The work you've done on this PR can help ensure that the cluster autoscaler is ready for the arrival of declarative node drains.
PR needs rebase.
I'm closing this one due to inactivity. Looks like long term we can depend on declarative node maintenance for this use case. /close
@x13n: Closed this PR.
Which component this PR applies to?
cluster-autoscaler
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR introduces a new feature to support scaling down a particular node. As described in #5109, some nodes can become non-functional for reasons specific to the cloud provider. With this change, users can tag a node explicitly by running `kubectl annotate node <nodename> cluster-autoscaler.kubernetes.io/scale-down-requested=true`. Cluster Autoscaler will perform all required safety checks, evict the hosted pods, and delete the node.
This PR also introduces a feature to scale up the cluster if the current node group size is smaller than the configured min size. This feature is disabled by default; users need to pass the bool flag `scale-up-to-meet-node-group-min-size-enabled` to enable it.
The first feature depends slightly on the second. For example, suppose the min size of a node group is 3 and it has exactly 3 nodes at the moment. If we tag a node for scale-down, nothing happens, because the node group is already at its min size and the cluster would only be scaled up if it had unschedulable pods. Therefore, we need this change to take the `scale-down-requested` annotation into account and scale the node group up a bit when needed.
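To illustrate that interaction, a minimal sketch of the min-size top-up; only the formula mirrors the PR's `newNodeCount := ng.MinSize() + scaleDownRequestedCount - targetSize` change, and the function and variable names are made up.

```go
package main

import "fmt"

// requiredNewNodes returns how many nodes must be added so that removing the
// tagged (scale-down-requested) nodes does not drop the group below its min size.
func requiredNewNodes(minSize, targetSize, scaleDownRequested int) int {
	n := minSize + scaleDownRequested - targetSize
	if n < 0 {
		return 0
	}
	return n
}

func main() {
	// Example from the description: min size 3, currently 3 nodes, 1 node
	// tagged for scale-down => 1 surge node is added before the tagged node
	// is drained and deleted.
	fmt.Println(requiredNewNodes(3, 3, 1)) // prints 1
}
```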
Which issue(s) this PR fixes:
Fixes #5109
Special notes for your reviewer:
Besides unit tests, this PR has been fully validated in an Azure Kubernetes cluster. I will share more details about my experiments in a separate comment on this PR.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: