-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delay the next campaign if the node lost the vote #169
base: main
Are you sure you want to change the base?
Conversation
I think a better approach would be to verify if the rejection is due to outdated logs before delaying the next request.This approach may potentially increase the downtime of the cluster. |
Yes, I was thinking the same. But unfortunately, we don't have such info in the
It should be fine. It just delays current node's next campaign. It doesn't change other nodes' campaign timings. Also the |
@Cjen1 I think this PR can also resolve what your PR #86 is trying to fix, and with minimum change. Please take a look and let me know if you have any comment. thx. @erikgrinaker @pav-kv PTAL |
34ffcfd
to
f62aed5
Compare
@ahrtr Are you able to reproduce the problem in etcd-io/etcd#17455 in a datadriven test (the tests in |
This change seems to be concerned with the situation when a candidate loses because of having an outdated log. However, the fix may affect the situation of a split vote. When two candidates duel and both lose campaign because of a split vote, both will bump the election timeout, and the election may stall for a bit. It appears that there is no silver bullet against all these situations. Enabling |
Yes. This PR is fixing the use cases where PreVote isn't enabled at all.
The idea is to increase the current node's Let's work with an example. Assuming a cluster has 3 nodes, and one of the node (say node A) is lag behind other nodes. When they start to campaign, the node A has a possibility of 33% to start the campaign before other nodes. Its vote will definitely be rejected by other nodes. Then all the 3 nodes will conduct the next round of election fairly. The node A still has a possibility of 33% to start the campaign before other nodes. Eventually it may lead to a long time for a real leader to get elected. It's exactly this PR tries to fix. Note that
Yes, it's true. But I imagine that the possibility of running into such split vote should be far less than 33% mentioned above. Usually we recommend odd number of members in a cluster. But it isn't verified yet. Ideally we should a tool to measure the average recovery time, something like what #86 does.
The datadriven test might not be a good choice to reproduce & verify the fix. The PR doesn't change the logic, so it hasn't any impact on correctness. |
Even in split vote, we do not set all nodes' |
The datadriven test is a good way, IMO. It clearly shows the behaviour. Yes, we're not changing correctness, but we're changing the behaviour. The datadriven harness gives ways to emulate lagging and duelling campaigns, you can cause specific message delivery ordering that repro the problem. There is also a command for setting specific election timeouts which might be useful here. To demonstrate the effect of this PR, it would be good to 1) in the first commit add a test with the election rejection scenario you're targetting, 2) fix it in the second commit - the diff in the
I'm still not getting the problem. Say, all the nodes campaign in the |
Note this PR is fixing the cases where preVote isn't enabled at all. Each time when any node campaign, it will increase its term by 1. Other nodes will be converted to a follower when receiving the MsgVote. It means even a node lost its vote, all nodes' Line 1094 in d475d7e
Lines 774 to 777 in d475d7e
|
@ahrtr Just checking my intuition here. Given there is a lagging node that calls and fails to be elected, you want to bias the retried election to the other replicas? I think some issues arise when there is asynchrony or the election timeout is roughly the same order of magnitude as the round-trip-time. Specifically this means that before receiving the next For context I've been running some tests in cloudlab in a 3 node cluster with ~50ms latency between nodes. |
@ahrtr I see. You're saying that the nodes who rejected the vote, will become followers and reset the timeout. So my assumption that t4 > t2 and > t3, is incorrect? Then it's possible that the short log A campaigns first again. I wonder though if, alternatively to the fix in this PR, we should just not reset the election timeout in the nodes who reject a vote and becomeFollower(leader=None). Then they will campaign earlier than the rejected node A tries again. |
Yes, it's true. It's a real issue even without the change in this PR. It's a misconfiguration of the election timeout. Usually we recommend the heartbeat interval to be roughly the round-trip time between members, and the election timeout to be the 10 times of heartbeat interval/timeout. Please refer to https://etcd.io/docs/v3.5/tuning/#time-parameters
Correct.
it's exactly my original thought. Please see #167.
|
Yes. Normally the randomized election timeout is in range But it's only within one term. All the time parameters will be reset in next term. |
Is it possible to add whether it was rejected because of the log? It feels like there are very few things to add to MsgVoteResp.And it can stop in this raft activity, because it cannot become a leader unless it receives the appendMsg and gets the log. This may also be an optimization point. I feel that the original raft’s expression of reject is not perfect.Because rejected by high terms and logs, there is a little difference between them. |
f62aed5
to
702e99a
Compare
Added a data-driven test case. @pav-kv PTAL, suggest to review it commit by commit. @Cjen1 Please also take a look if you have time. Thanks both. Also responding to your previous comment on split vote, actually the vote result in split vote is
No such plan in this PR for now. Also note that one peer's rejection doesn't mean current node will have to necessarily lose the campaign. |
at the majority of nodes |
This isn't correct. For split vote, the vote result is |
861120d
to
1e3047d
Compare
@erikgrinaker @pav-kv PTAL when you have time. @Cjen1 do you have any immediate concern? Thanks. |
Signed-off-by: Benjamin Wang <[email protected]>
Signed-off-by: Benjamin Wang <[email protected]>
Signed-off-by: Benjamin Wang <[email protected]>
Delay the next campaign if the node lost the vote. It's highly likely it will also lose next campaign, so it makes more sense to prioritize campaigns by other nodes within the current term. Signed-off-by: Benjamin Wang <[email protected]>
…campaign When a node lost the vote, it will be de-prioritized on the next campaign. Its randomized election timeout is in range [2*electionTime, 3*electionTime). Signed-off-by: Benjamin Wang <[email protected]>
I'm concerned with a duelling leaders situation. Can we have a second
Maybe there is a 2* multiplier in this reasoning. But the idea is: wouldn't this fix introduce an extra |
I am still concerned about this impacting the minimal viable recovery time, since if leaders duel it will double the length of each iteration (afaict). I fully understand that this isn't a massive concern since the recommendation is to minimize any erroneous elections. Is there an easy way for me to build a binary to test it in reckon? Otherwise if you swap out the etcd binary in reckon that will also work. (Additionally I'm on holiday until Monday so have limited time to test this until then) |
It's true if there are only two nodes in the cluster and the two nodes are campaigning at exactly the same time.
Is it enough to provide an etcd binary on top of this PR? I am happy to help on the test. |
I've run this PR through reckon, using 100 leader loss events for each config, varying the number of nodes (3 and 5) and the election timeout. I think the result is that this PR has no significant effect on the average time to recovery of etcd. |
I share @pav-kv's concerns in #169 (comment), and also the concerns voiced in #167. Spurious election disruption is a known weakness when not using pre-vote and checkquorum, and it sounds like enabling those would adequately address etcd-io/etcd#17455. I think we should recommend employing those as a first measure, and only tweak election timeout policies when we see significant problems with those enabled. |
Thanks for the test result. My dev branch is based on etcd release-3.5, while your baseline is based on release-3.4. It might be another reason for the subtle change.
The PR is trying to mitigate such kind of known weakness. Not all applications enable pre-vote. Note etcd release-3.4 is a stable release, it has never enabled pre-vote. As mentioned above #169 (comment), a lag node always has 33% possibility (in a 3-node cluster) to start the campaign before other nodes without this PR. The possibility is 50% if there are only two available nodes. But the possibility of split vote in a 3 node cluster is below based on raft paper 9.2:
Obviously 33% vs 3-10%, or 50% vs 18-35%, this PR improves the situation. |
Delay the next campaign if the node lost the vote. It's highly likely it will also lose next campaign, so it makes more sense to prioritize campaigns by other nodes within the current term.
Alternative to resolve etcd-io/etcd#17455. It has better readability than #167.