If zookeeper is not running, then decreasing replicas will make cluster of zookeeper Unrecoverable #398

Open
stop-coding opened this issue Oct 12, 2021 · 2 comments · May be fixed by #406

Comments

@stop-coding

stop-coding commented Oct 12, 2021

Description

When a ZooKeeper cluster is not running because of some error, decreasing the replica count still deletes pods automatically.
The pod runs zookeeperTeardown.sh, which fails to connect to ZooKeeper, so the node is never removed from the ZooKeeper cluster.
The pod is deleted anyway, without the ZooKeeper configuration in zoo.cfg being updated.

Importance

must-have

Location

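ZNODE_PATH="/zookeeper-operator/$CLUSTER_NAME"
# Read the cluster size from ZooKeeper; if the connection fails, CLUSTERSIZE stays empty
CLUSTERSIZE=$(java -Dlog4j.configuration=file:"$LOG4J_CONF" -jar /root/zu.jar sync "$ZKURL" "$ZNODE_PATH")
echo "CLUSTER_SIZE=$CLUSTERSIZE, MyId=$MYID"
# Remove this server only when its id is above the cluster size; an empty CLUSTERSIZE skips the removal
if [[ -n "$CLUSTERSIZE" && "$CLUSTERSIZE" -lt "$MYID" ]]; then
  java -Dlog4j.configuration=file:"$LOG4J_CONF" -jar /root/zu.jar remove "$ZKURL" "$MYID"
  echo $?
fi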
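
The guard above treats an empty CLUSTERSIZE the same way as a pod that legitimately stays in the cluster: in both cases nothing is removed and the script carries on. A minimal sketch of making the three outcomes explicit might look like the following. The function name teardown_decision and the strings it prints are hypothetical and are not part of the operator's script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of zookeeperTeardown.sh): classify the teardown outcome.
#   $1 = cluster size as printed by the sync call (may be empty if the call failed)
#   $2 = this server's myid
teardown_decision() {
  local cluster_size="$1" my_id="$2"
  if [[ ! "$cluster_size" =~ ^[0-9]+$ ]]; then
    # Sync failed or printed something unexpected: the cluster state is unknown.
    echo "unknown"
  elif (( cluster_size < my_id )); then
    # Same condition as the original guard: this server is being scaled away.
    echo "remove"
  else
    # This server is still within the requested cluster size.
    echo "keep"
  fi
}
```

A caller could then treat "unknown" as an error instead of silently skipping the removal, for example by exiting non-zero.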

Suggestions for an improvement

Fix this in zookeepercluster_controller.go:reconcileStatefulSet.
If the cluster size decreases, perform the reconfig remove there.
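
To illustrate what such a removal step would have to cover, here is a small sketch of the arithmetic only, written in shell to match the script above (the controller itself is Go). It assumes server ids are contiguous from 1 to the old cluster size, which is consistent with the "$CLUSTERSIZE" -lt "$MYID" guard but is not confirmed here. The function name ids_to_remove is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the server ids that leave the cluster when it is
# scaled from old_size down to new_size, as a comma-separated list.
# Assumption: ids are contiguous 1..old_size.
ids_to_remove() {
  local old_size="$1" new_size="$2" id
  local -a ids=()
  for (( id = new_size + 1; id <= old_size; id++ )); do
    ids+=("$id")
  done
  # Join with commas; prints an empty line when nothing needs removing.
  local IFS=,
  echo "${ids[*]}"
}
```

For the scenario in this issue, scaling from 3 to 1, ids_to_remove 3 1 yields the ids 2 and 3.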

@anishakj
Contributor

@stop-coding Could you please let us know how to reproduce this issue? After this, once ZooKeeper starts running again, is the replica set not updated correctly?

@stop-coding
Author

stop-coding commented Oct 13, 2021

@anishakj Thank you for your attention.

For example:

  1. Create a cluster of size 3.
  2. Wait until all pods are running; they are named zk-0, zk-1, and zk-2.
  3. Delete the zk-1 and zk-2 pods, so that the ZooKeeper cluster can no longer provide service.
  4. Run "kubectl edit zk" and change replicas to 1.
  5. Wait a while; the replica count decreases to 1.
  6. Now zk-0 never runs again until zoo.cfg is edited by hand with the correct configuration.
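
Steps 2 to 5 can be sketched as a script. This assumes a 3-node cluster named "zk" already exists in the current namespace (the name is inferred from the pod names), and it uses kubectl patch as a non-interactive stand-in for "kubectl edit zk". The field path spec.replicas is an assumption, and the run and reproduce helpers are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch of reproduction steps 2-5. Set DRY_RUN=1 to print the commands
# instead of executing them.
run() {
  if [[ -n "${DRY_RUN:-}" ]]; then
    echo "+ $*"
  else
    "$@"
  fi
}

reproduce() {
  # Step 2: wait until all three pods are ready.
  run kubectl wait --for=condition=Ready pod/zk-0 pod/zk-1 pod/zk-2 --timeout=300s
  # Step 3: delete two of the three pods so the cluster cannot provide service.
  run kubectl delete pod zk-1 zk-2
  # Step 4: scale down to one replica (assumed field path: spec.replicas).
  run kubectl patch zk zk --type=merge -p '{"spec":{"replicas":1}}'
  # Step 5: watch the pods while the replica count drops to 1.
  run kubectl get pods --watch
}
```

Running DRY_RUN=1 reproduce prints the four kubectl commands without touching a cluster, which is a convenient way to review them first.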

I think removing a pod needs to be atomic: it should only count as done when both the connection and the reconfig succeed.
Do you have a better suggestion?
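
As a rough illustration of that all-or-nothing idea, the sketch below succeeds only when both the connect step and the reconfig step succeed, and retries a bounded number of times otherwise. The function atomic_teardown and its parameters are hypothetical; the two commands are passed in by name so that the real zu.jar sync and remove calls could be wrapped in small functions. Whether a non-zero exit can actually delay pod deletion depends on how the script is invoked, which is not shown in this issue.

```shell
#!/usr/bin/env bash
# Hypothetical all-or-nothing teardown.
#   $1 = command that connects to ZooKeeper and prints the cluster size
#   $2 = command that removes this server from the cluster
#   $3 = maximum number of attempts (default 5)
#   $4 = seconds to sleep between attempts (default 2)
# Reads this server's id from the MYID variable, as the original script does.
atomic_teardown() {
  local sync_cmd="$1" remove_cmd="$2" attempts="${3:-5}" delay="${4:-2}" i size
  for (( i = 1; i <= attempts; i++ )); do
    # Connect step: must succeed and print a number.
    if size="$($sync_cmd)" && [[ "$size" =~ ^[0-9]+$ ]]; then
      if (( size >= MYID )); then
        return 0   # this server stays in the cluster; nothing to remove
      fi
      # Reconfig step: only reached after a successful connect.
      if $remove_cmd; then
        return 0   # both steps succeeded
      fi
    fi
    (( i < attempts )) && sleep "$delay"
  done
  return 1         # either step kept failing; let the caller decide what to do
}
```

The point of the design is that a failed connection is reported as a failure, instead of being treated as "nothing to remove".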

stop-coding changed the title from "if zookeeper is not running, then decreasing replicas will make cluster of zookeeper Unrecoverable" to "If zookeeper is not running, then decreasing replicas will make cluster of zookeeper Unrecoverable" on Oct 13, 2021