Can I stop a "broken" etcd from communicating with "dead" nodes? #15885
-
I started with three etcd nodes, which were part of a Kubernetes cluster. I have since lost two of these nodes (let's call them "14" and "30"): their data has either been lost or the machines have been erased (let's not go into that). Now I am left with one "broken" node. I use the term "broken" because:
- The etcd container isn't crashing, but it is repeatedly issuing messages to stdout/the log relating to the dead nodes.
- I am unable to issue any `etcdctl` commands.
I assume that one etcd node should still be able to operate?
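For reference, these are the kinds of checks that fail for me. This is only a sketch, assuming the v3 `etcdctl` client and a default local endpoint; your endpoints and TLS flags may differ:

```sh
# Assumes the v3 client and a default local endpoint (both illustrative).
export ETCDCTL_API=3

# Reports the endpoint as unhealthy: the health check is a quorum read,
# and quorum is lost with two of three members gone.
etcdctl --endpoints=https://127.0.0.1:2379 endpoint health

# Ordinary reads time out for the same reason (/registry is the
# usual Kubernetes prefix).
etcdctl --endpoints=https://127.0.0.1:2379 get /registry --prefix --keys-only
```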
-
I have found a solution for my scenario!
It's inspired by the articles breaking-down-and-fixing-etcd-cluster and how-to-start-a-stopped-docker-container-with-a-different-command.
The "Loss of quorum" section in the first tells me I probably need the `--force-new-cluster` flag in my etcd container. To add it, I follow the second article, which shows how to stop the container, edit the JSON that contains the start command (the run-command arguments, in my case), and then restart Docker and the container; a sketch of the steps is below.
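Roughly, the procedure looks like this. This is only a sketch of my setup: the container name, the editor, and the exact JSON key holding the arguments are assumptions that may differ on your host:

```sh
# Illustrative sketch: container named "etcd" on a plain Docker host.
# Grab the full container ID while the daemon is still running.
ID=$(docker ps -aq --no-trunc -f name=etcd)

docker stop "$ID"
systemctl stop docker    # the config file is only re-read on a daemon restart

# config.v2.json holds the container's start command; add
# --force-new-cluster to the etcd arguments (the "Args" array, in my case).
vi "/var/lib/docker/containers/$ID/config.v2.json"

systemctl start docker
docker start "$ID"
```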
Once I do this, my etcd service starts as a single node, continues to operate as normal, and (crucially) gives me `etcdctl` again...
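To confirm the recovery, I can run the same checks as before (the endpoint is again illustrative); since `--force-new-cluster` rewrites the cluster membership, the member list now shows just the surviving node:

```sh
export ETCDCTL_API=3
etcdctl --endpoints=https://127.0.0.1:2379 endpoint health  # healthy again
etcdctl --endpoints=https://127.0.0.1:2379 member list      # single member
```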