After "error":"invalid database", etcd-0 pod is in restart loop, unable to retrieve db with the help of the other 2 running pods (without using the snapshot backup)) #15991
-
Given it's just one, delete its data dir (remove the PV/PVC?), remove the member, restart the pod, and add it as a member again. It should sync from the existing quorum of the other two. There's more info here: https://etcd.io/docs/v3.5/op-guide/recovery/ How are you running the etcd pods, inside k8s?
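For reference, that replacement flow roughly maps to the commands below. This is only a sketch: the member name etcd-0, the endpoints, the data-dir path and the TLS flags are placeholders that depend on how your StatefulSet and PVs are set up.

    # from a healthy member, find the ID of the broken member
    # (add --cacert/--cert/--key if the cluster uses TLS)
    etcdctl --endpoints=https://etcd-1:2379 member list -w table

    # remove the broken member from the cluster, using the hex ID from the output above
    etcdctl --endpoints=https://etcd-1:2379 member remove <member-id>

    # wipe the broken member's data dir (or delete and recreate its PVC)
    rm -rf /var/lib/etcd/member

    # re-add it as a new member; on restart it syncs the full keyspace from the remaining quorum
    etcdctl --endpoints=https://etcd-1:2379 member add etcd-0 --peer-urls=https://etcd-0:2380

    # restart the pod with --initial-cluster-state=existing so it joins instead of bootstrapping a new cluster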
-
Most likely the bbolt db file is corrupted again. @manish-raut is it a test or production environment? Could you share the db file if possible? If not, please run the bbolt check <db_file> command.
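In case it helps, the bbolt CLI can be installed and run roughly like this (a sketch; the db path below is the usual <data-dir>/member/snap/db location and may differ in your setup):

    # install the bbolt CLI
    go install go.etcd.io/bbolt/cmd/bbolt@latest

    # run a consistency check against the backend file while etcd is not running
    bbolt check /var/lib/etcd/member/snap/db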
-
@tjungblu Yes, we are running inside k8s v1.23.6 with the containerd runtime. I will try again with the steps you mentioned and update, but it didn't work earlier when I tried. I will share the logs if it fails again. The documentation links mostly cover steps that assume snapshot backups.
-
@ahrtr In our case the file is /data/member/snap/db. I will try to run the commands you mention, but sorry, I won't be able to share the file, even though it's a preprod/test environment.
-
@manish-raut Please consider providing a redacted dump of the data.
-
Hello, sorry but I can't copy or provide anything related to the db from that environment.
And the pod gets into the restart loop again.
-
Hello, do we have any other workaround to fix this?
-
If you don't provide what we requested, then it's impossible to figure out the root cause. Previously we provided two commands; could you execute them and provide feedback?
Please follow https://etcd.io/docs/v3.5/op-guide/runtime-configuration/#replace-a-failed-machine
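Before removing anything, it is worth confirming which member is actually unhealthy. A sketch, with placeholder endpoints:

    # check health of all three members from any surviving pod
    etcdctl --endpoints=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 endpoint health

    # list members and their IDs
    etcdctl --endpoints=https://etcd-1:2379 member list -w table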
-
Hi @ahrtr,
-
Hello,
-
What happened?
We have an HA setup of 3 etcd pods. After a power failure while applying a patch, the db on one member failed and its logs show a "panic: failed to open database" error. That pod is now in a restart loop, and we are unable to restore the db from snapshot backups because we don't have any. We want to recover the db state with the help of the other 2 pods, which are up and running (all the pods have their associated PVCs).
Could you please provide the steps required in such a scenario to fix the restarting pod?
What did you expect to happen?
After fixing the db on the restarting pod, it should come back up and running with the latest known state of the other 2 pods.
How can we reproduce it (as minimally and precisely as possible)?
In an HA setup of 3 pods, corrupt the db on one member and reboot that pod; if it ends up in a restart loop, you should hit the issue above.
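One crude way to simulate that corruption in a throwaway cluster (illustrative only; the path is an assumption based on the default data dir, and this destroys that member's data):

    # with the member stopped, overwrite its first meta page with garbage
    dd if=/dev/urandom of=/var/lib/etcd/member/snap/db bs=4096 count=1 conv=notrunc

    # restart the pod; it should fail to open the backend ("invalid database") and enter a restart loop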
Anything else we need to know?
As we don't have snapshot backups, it is OK if data is lost for the time frame in which the pod was rebooting; we just need to recover the last known state held by the other 2 pods.
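Going forward, a periodic snapshot taken from any healthy member would make this kind of recovery much simpler. A minimal sketch, with the endpoint and output path as placeholders:

    # take a point-in-time backup from a healthy member
    etcdctl --endpoints=https://etcd-1:2379 snapshot save /backup/etcd-$(date +%F).db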
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output