Intermittent etcd request timeouts #16624

Aldenar · 2023-09-20T13:41:32Z

I'm trying to deploy etcd into a Docker Swarm cluster consisting of 3 nodes. Deployment and initialization go fine. However, when I then try to use the cluster with the etcdctl utility, one to three commands execute fine, and then I get a "context deadline exceeded" error.

The etcd containers should communicate over the hosts' LAN, because I do not know how to use dynamic container DNS name resolution in swarm mode. Any clues here?

My docker-compose config:

 etcd:
    image: bitnami/etcd:3.4.9
    user: root
    extra_hosts:
      - "docker-1.internal:192.168.0.15"
      - "docker-2.internal:192.168.0.25"
      - "docker-3.internal:192.168.0.35"
    volumes:
      - /var/lib/etcd:/etcd_data:rw
    environment:
      ETCD_DATA_DIR: /etcd_data
      ETCD_ENABLE_V2: "true"
      ALLOW_NONE_AUTHENTICATION: "yes"
      ETCD_NAME: "{{.Node.Hostname}}"
      ETCD_ADVERTISE_CLIENT_URLS: "http://{{.Node.Hostname}}.internal:2379"
      ETCD_LISTEN_CLIENT_URLS: "http://0.0.0.0:2379"
      ETCD_LISTEN_PEER_URLS: "http://0.0.0.0:2380"
      ETCD_INITIAL_CLUSTER: "docker-1=http://docker-1.internal:2380,docker-2=http://docker-2.internal:2380,docker-3=http://docker-3.internal:2380"
      ETCD_INITIAL_CLUSTER_STATE: "new"
      ETCD_INITIAL_CLUSTER_TOKEN: "token-00"
      ETCD_INITIAL_ADVERTISE_PEER_URLS: "http://{{ .Node.Hostname }}.internal:2380"
    ports:
      - "2379:2379/tcp"
      - "2380:2380/tcp"
    networks:
      - apisix
    deploy:
      mode: replicated
      replicas: 3
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 5
      update_config:
        parallelism: 1
        delay: 10s
      placement:
        max_replicas_per_node: 1

Following the deployment (docker stack deploy), I run these commands consecutively, with less than a second in between:

root@docker-1:~# etcdctl --endpoints=docker-2.internal:2380 member list
3b4a962fff37892f, started, docker-1, http://docker-1.internal:2380, http://docker-1.internal:2379, false
506d5bf5dfe11942, started, docker-2, http://docker-2.internal:2380, http://docker-2.internal:2379, false
d26edfa3bc4f3dc1, started, docker-3, http://docker-3.internal:2380, http://docker-3.internal:2379, false

root@docker-1:~# etcdctl --endpoints=docker-2.internal:2380 member list
{"level":"warn","ts":"2023-09-20T15:29:42.957+0200","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-54a3c004-5fc8-44a6-9562-6b553c9cd32c/docker-2.internal:2380","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded

The only logs I am getting on the docker-2 instance are:

2023-09-20 13:29:37.813681 W | rafthttp: health check for peer 3b4a962fff37892f could not connect: dial tcp 192.168.0.15:2380: i/o timeout
2023-09-20 13:30:12.814165 W | rafthttp: health check for peer 3b4a962fff37892f could not connect: dial tcp 192.168.0.15:2380: i/o timeout
2023-09-20 13:30:47.814449 W | rafthttp: health check for peer 3b4a962fff37892f could not connect: dial tcp 192.168.0.15:2380: i/o timeout
2023-09-20 13:31:22.814711 W | rafthttp: health check for peer 3b4a962fff37892f could not connect: dial tcp 192.168.0.15:2380: i/o timeout
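
The spacing of these warnings can be checked directly from the timestamps. A minimal sketch in Python, using only the four log lines quoted above; the helper name `intervals_seconds` is hypothetical and not part of etcd:

```python
from datetime import datetime

# Timestamps copied from the docker-2 log lines above.
LOG_TIMES = [
    "2023-09-20 13:29:37.813681",
    "2023-09-20 13:30:12.814165",
    "2023-09-20 13:30:47.814449",
    "2023-09-20 13:31:22.814711",
]


def intervals_seconds(stamps):
    """Return the gap in seconds between each pair of consecutive timestamps."""
    parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in stamps]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(parsed, parsed[1:])]


if __name__ == "__main__":
    for gap in intervals_seconds(LOG_TIMES):
        print(f"{gap:.3f} s")
```

For these four lines every gap comes out at roughly 35 seconds, so the failed health check toward 192.168.0.15:2380 recurs on a fixed cadence in this window rather than at random moments.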

I do not know why the connection would time out either. Ping is stable with under a millisecond of latency, and as far as I'm aware, I don't have any custom iptables rules in place. I'm running rootful Docker, and the only firewall rules are those added by the Docker daemon itself:

filter:
-A DOCKER-INGRESS -p tcp -m tcp --dport 2380 -j ACCEPT
nat:
-A DOCKER-INGRESS -p tcp -m state --state RELATED,ESTABLISHED -m tcp --sport 2380 -j ACCEPT
-A DOCKER-INGRESS -p tcp -m tcp --dport 2380 -j DNAT --to-destination 172.18.0.2:2380
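
Ping only exercises ICMP, so it does not show whether TCP ports 2379 and 2380 are reachable between the hosts. A minimal sketch of a TCP-level check, assuming Python 3 is available on each host; `can_connect` and `probe` are hypothetical helper names, and the host names and ports are the ones from the compose file above:

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe(hosts, ports, timeout=2.0):
    """Map every (host, port) pair to whether a TCP connection succeeded."""
    return {(h, p): can_connect(h, p, timeout) for h in hosts for p in ports}


if __name__ == "__main__":
    # Assumption: these names resolve on the machine running the script,
    # for example through /etc/hosts entries matching the extra_hosts list.
    hosts = ["docker-1.internal", "docker-2.internal", "docker-3.internal"]
    for (host, port), ok in probe(hosts, [2379, 2380]).items():
        print(f"{host}:{port} {'open' if ok else 'unreachable'}")
```

Running it several times from each of the three hosts, and comparing the results with the moments etcdctl times out, would show whether the failures follow a particular source host, target host, or port.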

Stranger still, this exact deployment worked fine for me on a different set of 3 test Docker VMs, and it works even when all 3 etcd nodes are on a single machine (for testing / development purposes).

Is there any further debugging I can do to get this working?

Thank you!
