Etcd Unstable - Container spawning multiple PIDs, Context Deadline Exceeded warnings in log continuously #16704
-
Seems to me that your cluster is getting overloaded.
I don't think this qualifies as a bug. It's probably best to move this to https://github.com/etcd-io/etcd/discussions
-
Based on my experience, I would guess this is an issue with ReadIndex requests being dropped during leader election. This was fixed in newer versions of etcd, so please upgrade to the newest v3.4 version.
-
@serathius - We have upgraded to etcd v3.5.6; however, the issue is still seen. Any ideas on this?
-
@lavacat - We have checked the metrics, and the observations are as follows: etcd_server_proposals_pending is mostly 0. We are also seeing this issue on a single-node cluster. Please let me know if any other metrics would help to understand the issue.
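For reference, a minimal sketch of the kind of metrics check involved, assuming the etcd metrics endpoint is reachable at http://127.0.0.1:2379/metrics without client TLS (adjust the URL/transport for your deployment); the metric names listed are just the ones commonly looked at for slow ReadIndex/apply warnings:

```go
// metricscheck.go - fetch a few etcd server metrics relevant to this issue.
// Assumes the metrics endpoint is reachable at http://127.0.0.1:2379/metrics
// without TLS; adjust the URL/transport for your deployment.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:2379/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Metrics commonly checked when ReadIndex/apply warnings show up.
	wanted := []string{
		"etcd_server_proposals_pending",
		"etcd_server_slow_read_indexes_total",
		"etcd_server_leader_changes_seen_total",
		"etcd_disk_wal_fsync_duration_seconds_sum",
		"etcd_disk_backend_commit_duration_seconds_sum",
	}

	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "#") {
			continue // skip HELP/TYPE comment lines
		}
		for _, name := range wanted {
			if strings.HasPrefix(line, name) {
				fmt.Println(line)
			}
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```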
-
Did you check the IO pressure (iostat -dmx 1) or the load average (/proc/loadavg)?
Since you have 80 CPUs, if the etcd server is under pressure somehow, the Go runtime might clone additional OS threads to serve the requests.
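A minimal sketch of that kind of check, assuming you can read /proc on the host; the PID argument is hypothetical and should be the host PID of the etcd process:

```go
// threadwatch.go - periodically print the 1-minute load average and the OS
// thread count of a given process, to correlate IO/CPU pressure with the
// high PID counts reported in `docker stats`.
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: threadwatch <etcd-pid>")
		os.Exit(1)
	}
	pid := os.Args[1]

	for {
		loadavg, err := os.ReadFile("/proc/loadavg")
		if err != nil {
			fmt.Fprintln(os.Stderr, "read loadavg:", err)
			os.Exit(1)
		}

		status, err := os.ReadFile("/proc/" + pid + "/status")
		if err != nil {
			fmt.Fprintln(os.Stderr, "read status:", err)
			os.Exit(1)
		}
		threads := "unknown"
		for _, line := range strings.Split(string(status), "\n") {
			if strings.HasPrefix(line, "Threads:") {
				threads = strings.TrimSpace(strings.TrimPrefix(line, "Threads:"))
			}
		}

		fmt.Printf("%s loadavg=%s threads=%s\n",
			time.Now().Format(time.RFC3339),
			strings.Fields(string(loadavg))[0], threads)
		time.Sleep(time.Second)
	}
}
```

Comparing the thread count reported here against the 50-70 PIDs seen in docker stats should show whether the extra PIDs are really Go runtime threads created under pressure.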
-
What happened?
We have multiple multi-node bare-metal clusters running etcd v3.4.3 where etcd instability is observed over long uptimes (after 20-30 days).
The instability starts with context deadline exceeded warnings in the etcd logs, as etcd takes a long time to respond to the kube-apiserver livez requests. This also results in the kube-apiserver restarting due to continuous liveness probe failures.
Jan 30 14:27:15 cap-5n-ibp-cz-1.aqurryeyna.com etcd[57471]: {"level":"warn","ts":"2023-01-30T14:27:15.261Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":12640906947059278029,"retry-timeout":"500ms"}
Jan 30 14:27:15 cap-5n-ibp-cz-1.aqurryeyna.com etcd[57471]: {"level":"warn","ts":"2023-01-30T14:27:15.761Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":12640906947059278029,"retry-timeout":"500ms"}
Jan 30 14:27:16 cap-5n-ibp-cz-1.aqurryeyna.com etcd[57471]: {"level":"warn","ts":"2023-01-30T14:27:16.261Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":12640906947059278029,"retry-timeout":"500ms"}
Jan 30 14:27:16 cap-5n-ibp-cz-1.aqurryeyna.com etcd[57471]: {"level":"warn","ts":"2023-01-30T14:27:16.760Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"1.999967528s","expected-duration":"101ms","prefix":"read-only range ","request":"key:\"/registry/health\" ","response":"","error":"context deadline exceeded"}
Within a few days the frequency of the context deadline exceeded warnings in the logs increases, along with etcd restarting multiple times. This leads to repeated etcd leader elections, and soon the cluster gets into a completely unstable state.
Once the cluster gets into this state, the kube-apiserver does not get a chance to recover because etcd is continuously restarting. To understand the issue we also checked the docker stats of the etcd containers on the cluster. We observed that the PIDs count is very high, around 50-70, while on a stable cluster it is around 10-15. It seems etcd keeps spawning multiple threads while trying to form a cluster, as it got into a no-leader state due to restarts on all nodes.
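For diagnosis, a minimal sketch that roughly mirrors the health check seen in the logs above, timing a linearizable read of /registry/health; the endpoint, the lack of TLS, and the 2s timeout are assumptions to adjust for your cluster:

```go
// healthprobe.go - time a linearizable read of /registry/health, roughly
// mirroring the kube-apiserver etcd health check seen in the logs above.
// Endpoint and the 2s timeout are assumptions; add TLS as needed.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	start := time.Now()
	_, err = cli.Get(ctx, "/registry/health") // linearizable read -> triggers ReadIndex
	fmt.Printf("range /registry/health took %v, err=%v\n", time.Since(start), err)
}
```

If this probe regularly approaches the 2s timeout even on a single-node cluster, that would point more toward slow WAL fsync/backend commits or CPU starvation than toward leader elections alone.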
Below are further details on the cluster:
Setup Configuration
H/W of each node
What did you expect to happen?
The expectation is that etcd stays stable even over long uptimes.
How can we reproduce it (as minimally and precisely as possible)?
Running a multi-node etcd cluster for around 20-30 days should help reproduce the issue.
Etcd version
Etcd configuration
Etcd debug information