-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-4563: EvictionRequest API (fka Evacuation) #4565
base: master
Are you sure you want to change the base?
Conversation
atiratree
commented
Mar 28, 2024
•
edited
Loading
edited
- One-line PR description: introduce EvictionRequest API to allow managed graceful pod removal
- Issue link: EvictionRequest API #4563
- Other comments: part of the functionality has been split from KEP-4212: Declarative Node Maintenance #4213 into this KEP
We will introduce a new term called evacuation. This is a contract between the evacuation instigator, | ||
the evacuee, and the evacuator. The contract is enforced by the API and an evacuation controller. | ||
We can think of evacuation as a managed and safer alternative to eviction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Watch out for the risk of confusing end users.
We already have preemption and eviction and people confuse the two. Or three, because there are two kinds of eviction. And there's disruption in the mix.
Do we want to rename Scheduling, Preemption and Eviction to Scheduling, Preemption, Evacuation and Eviction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I added a mention which of what kind of eviction I mean here.
Do we want to rename Scheduling, Preemption and Eviction to Scheduling, Preemption, Evacuation and Eviction?
Yes, I think we want to add a new concept there and generally update the docs once we have an alpha.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm with Tim here.
Preemption vs Eviction is already quite confusing. And TBH, I couldn't fully understand what the "evacuation" is supposed to solve by reading the summary or motivation.
From Goals:
Changes to the eviction API to support the evacuation process.
If this is already going to be part of the Eviction API, maybe it should be named as a form of eviction. Something like "cooperative eviction" or "eviction with ack" or something along those lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm all for framing it as another type of eviction; we already have two, so the extra cognitive load for users is not so much a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alculquicondor I have updated the summary and goals, I hope it makes more sense now.
I think the name should make the most sense to the person creating the Evacuation (Evacuation Instigator ATM). So CooperativeEviction
or EvictionWithAck
is a bit misleading IMO. Because from that person's perpective there is no additional step required of them. Only the evacuators and the evacuation controller implement the cooperative evacuation process but this is hidden from the normal user.
My suggestions:
GracefulEviction
(might confuse people if it is associated with graceful pod termination, which it is not)SafeEviction
(*safer than the API-initiated one for some pods)- Or just call it
Eviction
? And tell people to use it instead of the Eviction API endpoint? This might be a bit confusing (at least in the beginning)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can bikeshed names for the API kind; I'd throw a few of my own into the hat:
- EvictionRequest
- PodEvictionRequest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have renamed the API to EvictionRequest
to make the term recognizable. A minor disadvantage is that we have to clarify what type of eviction we mean if we say evict (API-initiate eviction, or EvictionRequest)
The rest of the renames are as follows:
Evacuation (noun) -> EvictionRequest / Eviction Process
evacuation (verb) -> request an eviction / terminate / evict / process eviction
Evacuator -> Interceptor
Evacuee -> Pod
Evacuator Class -> Interceptor Class
Evacuation Instigator -> Eviction Requester
Evacuation Controller -> Eviction Request Controller
ActiveEvacuatorClass -> ActiveInterceptorClass
ActiveEvacuatorCompleted -> ActiveInterceptorCompleted
EvacuationProgressTimestamp -> ProgressTimestamp
ExpectedEvacuationFinishTime -> ExpectedInterceptorFinishTime
EvacuationCancellationPolicy -> EvictionRequestCancellationPolicy
FailedEvictionCounter -> FailedAPIEvictionCounter
<!-- | ||
What other approaches did you consider, and why did you rule them out? These do | ||
not need to be as detailed as the proposal, but should include enough | ||
information to express the idea and why it was not acceptable. | ||
--> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a number of examples of having a SomethingRequest or SomethingClaim API that then causes a something (certificate signing, node provisioning, etc).
Think of TokenRequest (a subresource for a ServiceAccount), or CertificateSigningRequest.
I would confidently and strongly prefer to have an EvictionRequest or PodEvictionRequest API, rather than an Evacuation API kind.
It's easy to teach that we have evictions and than an EvictionRequest is asking for one to happen; it's hard to teach the difference between an eviction and an evacuation.
As a side effect, this makes the feature gate easier to name (eg PodEvictionRequests
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you have mentioned, we already have different kinds of eviction. So I think it would be good to use a completely new term to distinguish it from the others.
Also, Evacuation does not always result in eviction (and PDB consultation). It depends on the controller/workload. For some workloads like DaemonSets and static pods, API eviction has never worked before. This could also be very confusing if we name it the same way.
I think Evacuation fits this better because
- The name is shorter. If we go with EvacuationRequest then the evacuation will become just an abstract term and less recognizable.
- It seems it will have quite a lot of functionality included (state synchronization between multiple instigators and multiple evacuators, state of the evacuee and evacuation). TokenRequest and CertificateSigningRequest are simpler and not involved in a complex process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggested EvictionRequest so that we don't have to have a section with the (too long) title: Scheduling, Preemption, Evacuation and Eviction. Not EvacuationRequest.
Adding another term doesn't scale so well: it means helping n people understand the difference between evacuation and eviction. It's a scaling challenge where n is not only large, it probably includes quite a few Kubernetes maintainers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for CertificateSigningRequest being simple: I don't buy it. There are three controllers, custom signers, an integration with trust bundles, the horrors of ASN.1 and X.509… trust me, it's complicated enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that it will be confusing for people, but that will happen regardless of what term we will use.
My main issue is that evacuation does not directly translate to eviction. Therefore, I think it would be preferable to choose a new term (not necessarily evacuation).
I would like to get additional opinions from people about this. And we will definitely have to come back to this in the API review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be resolved now: #4565 (comment)
13611ce
to
2d15b79
Compare
2d15b79
to
c292415
Compare
|
|
||
Example evacuation triggers: | ||
- Node maintenance controller: node maintenance triggered by an admin. | ||
- Descheduler: descheduling triggered by a descheduling rule. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the descheduler requests an eviction, what thing is being evacuated?
(the node maintenance and cluster autoscaler examples are easier: you're evacuating an entire node)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A single pod or multiple pods. The descheduler can use it as a new mechanism instead of eviction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so people typically think of “evacuate” as a near synonym of “drain” - you drain a node, you evacuate a rack or zone full of servers. Saying that you can evacuate a Pod might make people think its containers all get stopped, or just confuse readers. We do need to smooth out how we can teach this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it can be used in both scenarios https://www.merriam-webster.com/grammar/can-you-evacuate-people.
Evacuation of containers doesn't make sense because they are tied to the pod lifecycle. But, I guess it could be confusing if we do not make it explicitly clear what we are targeting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thing is, Kubernetes users typically - before we accept this KEP - use “evacuate” as a synonym for drain.
I'm (still) fine with the API idea, and still concerned about the naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to +1 the potential confusion of the term "evacuation".
Is it okay to have a glossary of terms for "evacuation", "eviction", and "drain" (or any other potentially confusing terms) added somewhere in this KEP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can include it in the KEP. And yes, we are going to change the name to something more suitable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"To evacuate a person" implies "get them out of trouble, to safety" as opposed to "to empty" (as in vacuum). It's not ENTRIELY wrong in this context, but it's not entirely right either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will change the name to EvictionRequest. It was originally chosen to distinguish it from eviction, but there is value in making it familiar to existing concepts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API is renamed to EvictionRequest now, see #4565 (comment) for more details
1900020
to
085672a
Compare
085672a
to
f81cc8c
Compare
b833a5c
to
24e034e
Compare
indicate that the pod Evacuation is complete if the pod `restartPolicy` allows it.
- move ${POD_UID}-${POD_NAME_PREFIX} format to alternatives
- better describe Evacuator responsibilities - add new PDB issues - add a follow-up to the beta graduation to track evacuations in pods
24e034e
to
b6f4d01
Compare
- Update the followups. - Add proposal summary. - Add practical use cases to the proposal summary and to the followups. - Deployment Pod Surge Example - HorizontalPodAutoscaler Pod Surge Example - Descheduling and Downscaling All renames for a reference: Evacuation (noun) -> EvictionRequest / Eviction Process evacuation (verb) -> request an eviction / terminate / evict / process eviction Evacuator -> Interceptor Evacuee -> Pod Evacuator Class -> Interceptor Class Evacuation Instigator -> Eviction Requester Evacuation Controller -> Eviction Request Controller ActiveEvacuatorClass -> ActiveInterceptorClass ActiveEvacuatorCompleted -> ActiveInterceptorCompleted EvacuationProgressTimestamp -> ProgressTimestamp ExpectedEvacuationFinishTime -> ExpectedInterceptorFinishTime EvacuationCancellationPolicy -> EvictionRequestCancellationPolicy FailedEvictionCounter -> FailedAPIEvictionCounter
b6f4d01
to
3c33535
Compare
9df74ad
to
7772213
Compare
KEP Update; notable changes:
Added practical use cases/example scenarios: |
@thockin, in the workloads space there are cross-cutting concern issues where such coordination is necessary (PTAL at examples above).
Such deletion gates could work in a system with well-meaning and diligent actors, but difficult to achieve in practice:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've read through the initial requirements and assumptions, I haven't read carefully through the Eviction protocol and the contract described in the latter part of this document.
to edit the PDB to account for the additional pod that needs to be spun to move the workload | ||
from one node to another. This has been discussed in issue [kubernetes/kubernetes#66811](https://github.com/kubernetes/kubernetes/issues/66811) | ||
and in issue [kubernetes/kubernetes#114877](https://github.com/kubernetes/kubernetes/issues/114877). | ||
2. Similar to the first point, it is difficult to use PDBs for applications that can have a variable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case seems identical with the first one. In both cases you're covering a single-replica application. How it got there (always has been, or was scaled down) doesn't matter, imo. So I'd advice to squash the two together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HPAs are fun here though because they can heal in ways are otherwise left to either deployment/statefulset controllers. So maybe it should be positioned as a consumer of this api?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is similar that it covers scaling of a single replica application, but the impact and followups are different. HPA covers 1-n replicas by default. So I think it is okay to keep it separate.
So maybe it should be positioned as a consumer of this api?
HPA is interesting because it can be both a requester (e.g. during scaling down) and an interceptor (e.g. during node drain). As is shown in the examples in the KEP.
|
||
We should discourage the creation of preventive EvictionRequests, so that they do not end up as | ||
another PDB. So we should design the API appropriately and also not allow behaviors that do not | ||
conform to the eviction request contract. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't enforce it in any way, and the API will get used in various, unpredictable ways (see PDB use cases, which you nicely outlined in the sections above). I understand the sentiment, but I don't think it will help here in any way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
answered in #4565 (comment)
|
||
Its responsibility is to observe eviction requests from requesters and periodically check that | ||
interceptors are making progress in evicting/terminating pods. It is important to see a consistent | ||
effort by the interceptors to reconcile the progress of the eviction request. This is important to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are you planning to enforce the progress here? How can you tell that a progress stuck for several minutes is good or bad? Every application will define that differently, so it's hard to define one-size fits all. Even if you require an update every X amount of time, that can be overused by dummy updates, just to block longer/indefinitely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not need to enforce the progress. This is modelled after leader election. If an interceptor fails to update the EvictionRequest status to report the progress we will choose another interceptor or evict the pod.
Leader election works fine with much lower timeouts. I think we can be more forgiving here, but the 10m mark (allowed minimum) should be sufficient.
that can be overused by dummy updates, just to block longer/indefinitely.
The same could be said about the leader election, which could be used to hijack an existing controller.
The idea behind this is that we should make it very unpleasant to block eviction (proof of work instead of just creating a PDB). And if someone implements this, they might as well implement graceful termination.
Another benefit of this approach is that an administrator can take a look at a very long EvictionRequest and see what InterceptorClass is set on it and can easily track down the misbehaving actor. We can also hook into this with metrics/alerts.
Finally, since this is a normal k8s object, we can use RBAC to restrict and observer access to the EvictionRequest.
|
||
Tracking the eviction status in a pod would be feasible, but it would be hard to coordinate between | ||
the actors as mentioned. So it would be better to decouple the eviction request process from the pod | ||
as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered improving PDB to provide more insightful status about eviction, which would allow external controllers to watch that status and react accordingly (either by scaling the app up, or doing the actions you're describing above)? This way the whole process could be tracked and synchronized outside of the core resources, and k8s would only expose sufficient information to provide that kind of automation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The main problems are cardinality and synchronization, we have a set of pods guarded by a single PDB and there is a group of actors (requesters and interceptors) interested in the lifecycle of these pods. So tracking the state of the pods in the status would not scale well. An aggregate status would be less useful and would still run into update conflicts.
- High API server load: as this status would only be used on eviction, it will have outdated information. Actors interacting with a PDB will want to have the latest information about the pods (e.g. recalculating descheduling rules according to the cpu/mem utilization) and might do frequent status updates, congesting the API server.
- Not all applications use PDBs. The EvictionRequest API tries to serve as a replacement for eviction and as the main descheduling primitive. This has use cases in cluster and pod autoscaling. For example when used with HPA, we would like to downscale any application/set of pods.
See some practical use cases for this feature: | ||
1. Ability to upscale first before terminating the pods with a Deployment: [Deployment Pod Surge Example](#deployment-pod-surge-example) | ||
based on the [EvictionRequest Process](#evictionrequest-process). | ||
2. Ability to upscale first before terminating the pods with HPA: [HorizontalPodAutoscaler Pod Surge Example](#horizontalpodautoscaler-pod-surge-example) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess dragging along the pdb to have minAvailable == HPA's current target is a pretty hacky thus the need for a non pdb blocked signal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are a bunch of problems here, user friendliness, atomicity of actions, and responding to any descheduling (e.g. node drain).
Not sure if it's relevant to the conversation, however I would expect that after this change, the user scenario 2 is solved without the user doing anything, as the maxSurge parameter and the fact that it's a rollout deployment are already specified in the deployment spec. |
Yes, the implementation of the follow-ups is only outlined, here and there should be additional KEPs for each improvement. And the majority of the things proposed should result in an immediate benefit without any user action. |
d6992ce
to
5635ffb
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: atiratree The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |