Increase the appropriate resource limits for pods after determining whether a pod is being CPU throttled or OOMKilled.
Return Kubernetes pods to a healthy state with resources available.
- The names of the pods hitting their resource limits are known. See Determine if Pods are Hitting Resource Limits.
- (ncn-mw#) Determine the current limits of a pod.

      kubectl get po -n services POD_ID -o yaml

  Look for the following section returned in the output:

      resources:
        limits:
          cpu: "2"
          memory: 2Gi
        requests:
          cpu: 10m
          memory: 64Mi
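The full YAML output can be long, so a narrower query may be easier to read. The following is a minimal sketch that uses the JSONPath output option of kubectl to print only the resources stanza of each container. The function name show_pod_resources is hypothetical, and the namespace defaults to services only because that is the namespace used in this procedure.

```shell
# Print "<container name><TAB><resources>" for every container in a pod.
# show_pod_resources is a hypothetical helper name, not part of kubectl.
show_pod_resources() {
    local pod="$1" ns="${2:-services}"
    kubectl get po -n "$ns" "$pod" \
        -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'
}
```

For example, `show_pod_resources POD_ID` prints one line per container. The exact rendering of the resources value depends on the kubectl version.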
- (ncn-mw#) Determine which Kubernetes entity (etcdcluster, deployment, statefulset, etc.) is creating the pod.

  The Kubernetes entity can be found with either of the following options:
  - Find the Kubernetes entity and grep for the pod in question.

    In the following example, replace hbtd-etcd with the pod being used.

        kubectl get deployment,statefulset,etcdcluster,postgresql,daemonsets -A | grep hbtd-etcd

    Example output:

        services   etcdcluster.etcd.database.coreos.com/cray-hbtd-etcd   32d
  - Describe the pod and look in the Labels section.

    This section is helpful for tracking down which entity is creating the pod.

        kubectl describe pod -n services POD_ID

    Excerpt from example output:

        Labels:       app=etcd
                      etcd_cluster=cray-hbtd-etcd
                      etcd_node=cray-hbtd-etcd-8r2scmpb58
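As a possible third option, the owner references recorded in the pod metadata can point at the creating entity. The sketch below assumes the controller sets ownerReferences on its pods, which not every controller does. The function name pod_owner is hypothetical. For a pod created by a deployment, the owner reported is the intermediate ReplicaSet rather than the deployment itself.

```shell
# Print "<kind>/<name>" for each owner reference recorded on a pod.
# pod_owner is a hypothetical helper name; it prints nothing if the pod
# has no ownerReferences.
pod_owner() {
    local pod="$1" ns="${2:-services}"
    kubectl get po -n "$ns" "$pod" \
        -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}{"/"}{.name}{"\n"}{end}'
}
```

Usage would be `pod_owner POD_ID`. If the output is empty, fall back to the grep or Labels options above.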
- (ncn-mw#) Edit the entity.

  In the example below, be sure to replace ENTITY_TYPE and ENTITY_NAME with the values determined in the previous step (in the example output for the previous step, these would be etcdcluster and cray-hbtd-etcd, respectively).

      kubectl edit ENTITY_TYPE -n services ENTITY_NAME
- (ncn-mw#) Increase the resource limits for the pod.

      resources: {}

  Replace the text above with the following section, increasing the limits values:

      resources:
        limits:
          cpu: "4"
          memory: 8Gi
        requests:
          cpu: 10m
          memory: 64Mi
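For built-in workload kinds that carry a pod template (deployment, statefulset, daemonset), the same change can in principle be made without an interactive editor by using kubectl set resources. This is only a sketch under that assumption: it does not apply to custom resources such as etcdcluster or postgresql, which still need the edit shown above, and it updates every container in the template unless a container is named with the -c option. The function name set_limits is hypothetical.

```shell
# Set CPU and memory limits on a deployment, statefulset, or daemonset.
# set_limits is a hypothetical helper name; requests are left unchanged.
set_limits() {
    local kind="$1" name="$2" cpu="$3" mem="$4" ns="${5:-services}"
    kubectl set resources "$kind" "$name" -n "$ns" \
        --limits="cpu=${cpu},memory=${mem}"
}
```

For example, `set_limits deployment ENTITY_NAME 4 8Gi` would request the same limits as the YAML above.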
- (ncn-mw#) Run a rolling restart of the pods.

  List the pods that need to be restarted:

      kubectl get po -n services | grep ENTITY_NAME

  Example output:

      cray-hbtd-etcd-8r2scmpb58   1/1   Running   0   5d11h
      cray-hbtd-etcd-qvz4zzjzw2   1/1   Running   0   5d11h
      cray-hbtd-etcd-vzjzmbn6nr   1/1   Running   0   5d11h
- (ncn-mw#) Kill the pods off one by one.

  Wait for each replacement pod to come up and be in a Running state before proceeding to the next pod.

      kubectl -n services delete pod POD_ID
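The delete-and-wait cycle can be scripted. The sketch below is one possible approach, not part of the documented procedure: it records how many matching pods are Running with all containers ready, deletes the pods one at a time, and polls until that count is reached again before moving on. The function names ready_count and restart_one_by_one and the POLL_SECONDS variable are hypothetical, and pods are matched by name substring in the same way as the grep used in this procedure.

```shell
# Count pods whose name contains $1, whose STATUS is Running, and whose
# READY column shows all containers ready (for example 1/1).
ready_count() {
    local name="$1" ns="${2:-services}"
    kubectl get po -n "$ns" --no-headers |
        awk -v n="$name" '
            index($1, n) && $3 == "Running" {
                split($2, r, "/")
                if (r[1] == r[2]) c++
            }
            END { print c + 0 }'
}

# Delete matching pods one at a time, waiting after each deletion until the
# number of ready pods is back to the count seen before the restart began.
restart_one_by_one() {
    local name="$1" ns="${2:-services}" want pod
    want=$(ready_count "$name" "$ns")
    for pod in $(kubectl get po -n "$ns" --no-headers |
                 awk -v n="$name" 'index($1, n) { print $1 }'); do
        kubectl -n "$ns" delete pod "$pod"
        until [ "$(ready_count "$name" "$ns")" -ge "$want" ]; do
            sleep "${POLL_SECONDS:-5}"
        done
    done
}
```

Usage would be `restart_one_by_one ENTITY_NAME`. The loop has no timeout, so it waits indefinitely if a replacement pod never becomes ready; interrupt it and investigate in that case.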
- (ncn-mw#) Verify that all pods are now Running with a more recent age.

      kubectl get po -n services | grep ENTITY_NAME

  Example output:

      cray-hbtd-etcd-8r2scmpb58   1/1   Running   0   12s
      cray-hbtd-etcd-qvz4zzjzw2   1/1   Running   0   32s
      cray-hbtd-etcd-vzjzmbn6nr   1/1   Running   0   98s
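The age column shows that the pods were recreated, but not that they picked up the new limits. As an optional extra check, the following sketch lists the limits of the matching pods with the custom-columns output option of kubectl. The function name show_limits and the column titles are hypothetical; pods with several containers show comma-separated values.

```shell
# List CPU and memory limits for pods whose name contains $1.
# show_limits is a hypothetical helper name; the header row is kept.
show_limits() {
    local name="$1" ns="${2:-services}"
    kubectl get po -n "$ns" \
        -o custom-columns='NAME:.metadata.name,CPU_LIMIT:.spec.containers[*].resources.limits.cpu,MEM_LIMIT:.spec.containers[*].resources.limits.memory' |
        awk -v n="$name" 'NR == 1 || index($1, n)'
}
```

Usage would be `show_limits ENTITY_NAME`. The limit columns should match the values entered when editing the entity.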