After "error":"invalid database", etcd-0 pod is in restart loop, unable to retrieve db with the help of the other 2 running pods (without using the snapshot backup)) #15991
-
Given it's just one, delete its data dir (remove the PV/PVC?), remove the member, restart the pod, and add it as a member again. It should sync from the existing quorum of the other two. There's more info here: https://etcd.io/docs/v3.5/op-guide/recovery/ How are you running the etcd pods, inside k8s?
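For reference, that replacement flow roughly maps to the commands below. This is only a sketch: the member name etcd-0, the endpoints, the data-dir path and the TLS flags are placeholders that depend on how your StatefulSet and PVs are set up.

    # from a healthy member, find the ID of the broken member
    # (add --cacert/--cert/--key if the cluster uses TLS)
    etcdctl --endpoints=https://etcd-1:2379 member list -w table

    # remove the broken member from the cluster, using the hex ID from the output above
    etcdctl --endpoints=https://etcd-1:2379 member remove <member-id>

    # wipe the broken member's data dir (or delete and recreate its PVC)
    rm -rf /var/lib/etcd/member

    # re-add it as a new member; on restart it syncs the full keyspace from the remaining quorum
    etcdctl --endpoints=https://etcd-1:2379 member add etcd-0 --peer-urls=https://etcd-0:2380

    # restart the pod with --initial-cluster-state=existing so it joins instead of bootstrapping a new cluster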
-
Most likely the bbolt db file is corrupted again. @manish-raut is it a test or production environment? Could you share the db file if possible? If not, please run the bbolt check <db_file> command.
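In case it helps, the bbolt CLI can be installed and run roughly like this (a sketch; the db path below is the usual <data-dir>/member/snap/db location and may differ in your setup):

    # install the bbolt CLI
    go install go.etcd.io/bbolt/cmd/bbolt@latest

    # run a consistency check against the backend file while etcd is not running
    bbolt check /var/lib/etcd/member/snap/db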
-
@tjungblu Yes, we are running inside k8s v1.23.6 with the containerd runtime. I will try again with the steps you mentioned and update, but it didn't work earlier when I tried. I will share the logs if it fails again. The documentation links mostly cover steps that assume snapshot backups.
-
@ahrtr In our case the file is /data/member/snap/db. I will try to run the commands you mention, but sorry, I won't be able to share the file, even though it's a preprod/test environment.
-
@manish-raut Please consider providing a redacted dump of the data.
-
Hello, sorry but I can't copy or provide anything related to the db from that environment.
And the pod gets into the restart loop again.
-
Hello, do we have any other workaround to fix this?
-
If you don't provide what we requested, then it's impossible to figure out the root cause. Previously we provided two commands; could you execute them and provide feedback?
Please follow https://etcd.io/docs/v3.5/op-guide/runtime-configuration/#replace-a-failed-machine
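Before removing anything, it is worth confirming which member is actually unhealthy. A sketch, with placeholder endpoints:

    # check health of all three members from any surviving pod
    etcdctl --endpoints=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 endpoint health

    # list members and their IDs
    etcdctl --endpoints=https://etcd-1:2379 member list -w table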
-
Hi @ahrtr,
-
Hello,
-
What happened?
We have an HA setup of 3 etcd pods. After a power failure while applying a patch, the db on one member failed and its logs show a "panic: failed to open database" error. That pod is now in a restart loop, and we are unable to restore the db from snapshot backups because we don't have any. We want to recover the db state with the help of the other 2 pods, which are up and running (all the pods have their associated PVCs).
Could you please provide the steps required in such a scenario to fix the restarting pod?
What did you expect to happen?
After fixing the db on the restarting pod, it should come back up and running with the latest known state of the other 2 pods.
How can we reproduce it (as minimally and precisely as possible)?
In an HA setup of 3 pods, corrupt the db on one member and reboot that pod; if it ends up in a restart loop, you should hit the issue above.
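One crude way to simulate that corruption in a throwaway cluster (illustrative only; the path is an assumption based on the default data dir, and this destroys that member's data):

    # with the member stopped, overwrite its first meta page with garbage
    dd if=/dev/urandom of=/var/lib/etcd/member/snap/db bs=4096 count=1 conv=notrunc

    # restart the pod; it should fail to open the backend ("invalid database") and enter a restart loop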
Anything else we need to know?
As we don't have snapshot backups, it is OK if data is lost for the time frame in which the pod was rebooting; we just need to recover the last known state held by the other 2 pods.
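Going forward, a periodic snapshot taken from any healthy member would make this kind of recovery much simpler. A minimal sketch, with the endpoint and output path as placeholders:

    # take a point-in-time backup from a healthy member
    etcdctl --endpoints=https://etcd-1:2379 snapshot save /backup/etcd-$(date +%F).db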
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output