-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Accept any snapshot that allows replication #110
Conversation
cc @pavelkalinnikov @ahrtr |
827c023
to
7c4d93a
Compare
I am curious about the design. In raft, data only flows outwards from the leader to other servers. But it seems that data could also flow from a follower to another follower in CockroachDB? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@erikgrinaker Thanks for driving this change. Left some improvements suggestions and nits. Looks like there will be some follow-up work depending on what we pick in this PR.
tracker/progress.go
Outdated
// The follower will transition back to StateReplicate if the leader | ||
// receives an MsgAppResp from it at or above PendingSnapshot, or if | ||
// ResumeReplicateBelowPendingSnapshot is enabled, one that reconnects the | ||
// follower to the leader's log (such an MsgAppResp is emitted when the | ||
// follower applies a snapshot). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is more complicated than it needs to be, hence bugs. For example, consider this part of the comment: The follower will transition back to StateReplicate if the leader receives an MsgAppResp from it at or above PendingSnapshot
. What if by the time we receive a reply >= PendingSnapshot
, the log is already truncated beyond this index? Then it's pointless to enter the StateReplicate
, and there will be another roundtrip to/from StateSnapshot
.
I think we should frame A->B state transitions in terms of the pre-conditions/invariants of state B, and only based on that elaborate the messages in state A that can induce these transitions.
In this case we should aim for something like:
// The leader transitions the follower to StateReplicate upon receiving a
// confirmation that there is no gap between the follower's and the leader's
// log, i.e. when Match >= FirstIndex-1.
//
// In StateSnapshot and StateProbe this generally happens when the leader
// receives a successful MsgAppResp at Index >= FirstIndex-1.
If we frame it this way, then we don't even need to compare against the PendingSnapshot
to decide whether to unblock, and we don't need the PendingSnapshot
variable in the first place (except maybe for informational purposes).
The only other usage of the PendingShapshot
field is this:
Line 114 in d70f523
pendingSnapshot := pr.PendingSnapshot |
but we could do better without it.
BecomeProbe
assumes that the PendingSnapshot
is / will be applied, and optimistically sets Next
to after it. It also seemingly assumes that the caller of BecomeProbe
knows what they're doing, and that PendingSnapshot
is exactly that snapshot that succeeded. We could do all that with just the Next
variable (set Next = index + 1
upon sending a snapshot @ index
and entering the StateSnapshot
).
The pending snapshot API is a bit troublesome because it assumes there is only one snapshot in flight, and it also puts some unspecified responsibility on the ReportSnapshot
caller. We should not assume that ReportSnapshot
reports the snapshot at PendingSnapshot
index, but rather allow the caller to pass the index in to be sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are likely improvements we can make here, and this is also related to cockroachdb/cockroach#87583.
But I'm inclined to make a targeted change for now, gated behind a config flag, and once we're confident in the change and enable it unconditionally we can consider a broader refactor.
// TODO(tbg): this code is very similar to the snapshot handling in | ||
// MsgAppResp above. In fact, the code there is more correct than the | ||
// code here and should likely be updated to match (or even better, the | ||
// logic pulled into a newly created Progress state machine handler). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This TODO wasn't true, because the code is quite different (we don't check Match/Next
indices here whatsoever, or communicate the Index
of the snapshot that succeeded - we should!), but I think something still needs to be done here. We should be able to transition from StateSnapshot
straight to StateReplicate
if the conditions allow for it (see other comments: the condition is that the follower's Match
is re-connected to the leader's log). And we should share the same transition code here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't check Match/Next indices here whatsoever, or communicate the Index of the snapshot that succeeded - we should!
Yeah, I think any better logic here is predicated on passing in the snapshot index. See cockroachdb/cockroach#87583.
We should be able to transition from StateSnapshot straight to StateReplicate if the conditions allow for it (see other comments: the condition is that the follower's Match is re-connected to the leader's log).
Without the snapshot index being passed in, I don't think we can know the Match until we receive an MsgAppResp, so moving to StateProbe seems like the right thing to do until that happens.
That's right, snapshots can be sent from any replica via delegated snapshots. For example, cross-region traffic is often much more expensive than inter-region traffic, so in multi-region clusters we prefer sending snapshots from other replicas in the same region where possible. This commonly happens when moving a replica from one CRDB node to a different node in the same region due to rebalancing. |
9b71e68
to
226ef1c
Compare
thx for the info. I understood the motivation now. But the implementation in this PR is a little strange to me. The behaviour of delegating snapshot is completely outside of raft's control, instead it's just application's behaviour (CockroachDB in this case), but this PR modifies raft to adapt to the behaviour which is outside of raft. Is it possible to reset the follower's state to One more not-directly-related comment, once the flag |
This behaviour will Raft code internally assumes that there is always at most one snapshot in flight. This only seems to be true. Even if raft is used the standard way, I'm moderately sure this property can be violated. Raft messages can be duplicated and reordered, and the process can be restarted. Consider the following scenario (which can be driven entirely by raft, without delegated snapshots):
With this observation in mind, the scenarios similar to what this PR fixes, can actually happen for any user. This PR is therefore a strict improvement. The question of what's raft- and what's application- specific is tricky. In my opinion, the
The only difference between this PR and #123 is the state to which we transition into. This PR translates to Could you explain why choose transition to I'll explain why transitioning to
From these definitions, it makes sense to transition to As a side note, I think we need to clean up the flow state machine and make these invariants verbose. |
For easing the mental model, think about a snapshot message as a From this standpoint, we should treat snapshot message sends and replies just the same way we treat |
CRDB drops the The delegated snapshot is initiated when the leader sees the There are also other cases where this can happen -- for example:
Yeah, I think this can probably happen across leader changes too -- see above. I can write up a test case to confirm.
I agree with this. We're processing an
I'm not particularly attached to this option. We could merge it as-is without the option, and consider a revert or option if we discover any problems with it. There shouldn't be any correctness or availability implications, just possibly additional work in rare cases. |
This isn't the case, because the follower will reject an
No, I think that was simply an eager initial snapshot that was sent before the leader transitioned to |
Yeah, then the scenario in my comment can't happen either. That's lucky, but unfortunate in some sense too. It doesn't matter which leader term sent a snapshot/append: as long as a snapshot/append "happens after" the follower's state and is recent, it should be ok to accept. E.g. if during a term change an append from previous term is slightly delayed, it still likely adheres to append-only properties etc, so with some safety checks it can be applied, and retries avoided. This gives some room for optimizations, and what enables it is moving safety checks down the stack and relaxing dependency on the "always reject non-matching term" behaviour. Anyway, my point is still that this PR (and, in broader sense, formalizing and cleaning up the |
The main concern is about adding public user-facing flag. Usually when we add a public flag for new features, e.g. But this flag I agree that the comment #110 (comment) makes more sense. I like the idea of changing state based on invariants. If we make some enhance and add a flag for it, then it makes more sense. From implementation perspective, I agree this PR is safe, because the behaviour will keep unchanged as long as users do not enable
The concerning isn't about guaranteeing at most one snapshot in flight, it's about waste of bandwidth, because a snapshot may be huge in size, e.g a couple of GBs. We should try to avoid sending snapshot as much as possible. [If CockRoachDB enable
Based on current raft protocol/implementation, when a follower is in |
This adds a test that documents the following behavior: It the leader is tracking a follower as StateSnapshot with PendingSnapshot equal to, say, 100, and the follower applies a snapshot at a lower index that reconnects the follower to the leader's log, then the leader will still ignore this snapshot. Signed-off-by: Erik Grinaker <[email protected]> Signed-off-by: Tobias Grieger <[email protected]>
226ef1c
to
f72389a
Compare
I've removed the option.
Yes, that's also the motivation for us. In CRDB, we have already sent a snapshot to the follower (below PendingSnapshot), but the leader rejects this snapshot and insists we send a new one above PendingSnapshot. We would like to avoid this.
Well, from etcd/raft's point of view, even if we did receive an out-of-date MsgAppResp that was below PendingSnapshot but above the leader's first log index, then the follower must have advanced in the meanwhile (or the leader's log storage was temporarily unavailable). If the follower has advanced and recovered from needing a snapshot, it seems fine to move it back to StateReplicate. With StateProbe, we'll eat the cost of an MsgApp roundtrip only to discover that Next: PendingSnapshot + 1 was incorrect, then resume replication at the correct index. Is the motivation for using StateProbe rather than StateReplicate to avoid replicating entries below PendingSnapshot that the follower may already have received from the snapshot in the case of an MsgAppResp race? Are there any other downsides to using StateReplicate? To be clear, I'd be fine with StateProbe here -- it's also what we use with e.g. |
Actually, that can't really happen, because the snapshot does not carry the actual log entries, and may have truncated the log. It's possible that we'll replicate log entries that will be clobbered by the later snapshot, thus wasting work, but that can also happen in the StateProbe case. |
Overall looks good, thx. Note that I am planning to release raft 3.6.0, please see #89. Please feel free to add comment under that issue on whether you are going to resolve #110 (comment) in 3.6.0 or next release e.g. 4.0.
No for now. |
I think we'll have to do the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Leave to @pav-kv to take a second look
A leader will not take into account snapshots reported by a follower unless they match or exceed the tracked PendingSnapshot index (which is the leader's last indexat the time of requesting the snapshot). This is too inflexible: the leader should take into account any snapshot that reconnects the follower to its log. This PR makes that change. In doing so, it addresses long-standing problems that we've encountered in CockroachDB. Unless you create the snapshot immediately and locally when raft emits an MsgSnap, it's difficult/impossible to later synthesize a snapshot at the requested index. It is possible to get one above the requested index which raft always accepted, but CockroachDB delegates snapshots to followers who might be behind on applying the log, and it is awkward to have to wait for log application to send the snapshot just to satisfy an overly strict condition in raft. Additionally, CockroachDB also sends snapshots preemptively when adding a new replica since there are qualitative differences between an initial snapshot and one needed to reconnect to the log and one does not want to wait for raft to round-trip to the follower to realize that a snapshot is needed. In this case, the sent snapshot is commonly behind the PendingSnapshot since the leader transitions the follower into StateProbe when a snapshot is already in flight. Touches cockroachdb/cockroach#84242. Touches cockroachdb/cockroach#87553. Touches cockroachdb/cockroach#87554. Touches cockroachdb/cockroach#97971. Touches cockroachdb/cockroach#114349. See also https://github.com/cockroachdb/cockroach/blob/2b91c3829270eb512c5380201c26a3d838fc567a/pkg/kv/kvserver/raft_snapshot_queue.go#L131-L143. Signed-off-by: Erik Grinaker <[email protected]> Signed-off-by: Tobias Grieger <[email protected]>
f72389a
to
d87942f
Compare
Moved follow-up work to #124 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you!
// TODO(tbg): we should also enter this branch if a snapshot is | ||
// received that is below pr.PendingSnapshot but which makes it | ||
// possible to use the log again. | ||
case pr.State == tracker.StateSnapshot && pr.Match+1 >= r.raftLog.firstIndex(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logically, it can also be (maybe should?) pr.Next >= r.raftLog.firstIndex()
.
Next
is the next index to send. Indices between Match+1
and Next-1
are in-flight (by invariant which we haven't formalized yet), so don't necessarily need to be present in the log.
The MaybeUpdate()
call 4 lines above should set Next
to be at least Match+1
, so this should be safe.
Feel free to leave this for #124.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do it for #124. Semantically, Match+1
seems better since that's where the leader's and follower's logs match, but they're equivalent here and I don't have a particularly strong opinion either way.
This PR is adopted from #84.
A leader will not take into account snapshots reported by a follower unless they match or exceed the tracked PendingSnapshot index (which is the leader's last indexat the time of requesting the snapshot). This is too inflexible: the leader should take into account any snapshot that reconnects the follower to its log. This PR makes that change.
In doing so, it addresses long-standing problems that we've encountered in CockroachDB. Unless you create the snapshot immediately and locally when raft emits an MsgSnap, it's difficult/impossible to later synthesize a snapshot at the requested index. It is possible to get one above the requested index which raft always accepted, but CockroachDB delegates snapshots to followers who might be behind on applying the log, and it is awkward to have to wait for log application to send the snapshot just to satisfy an overly strict condition in raft. Additionally, CockroachDB also sends snapshots preemptively when adding a new replica since there are qualitative differences between an initial snapshot and one needed to reconnect to the log and one does not want to wait for raft to round-trip to the follower to realize that a snapshot is needed. In this case, the sent snapshot is commonly behind the PendingSnapshot since the leader transitions the follower into StateProbe when a snapshot is already in flight.
Touches cockroachdb/cockroach#84242.
Touches cockroachdb/cockroach#87553.
Touches cockroachdb/cockroach#87554.
Touches cockroachdb/cockroach#97971.
Touches cockroachdb/cockroach#114349.
See also https://github.com/cockroachdb/cockroach/blob/2b91c3829270eb512c5380201c26a3d838fc567a/pkg/kv/kvserver/raft_snapshot_queue.go#L131-L143.
Signed-off-by: Erik Grinaker [email protected]
Signed-off-by: Tobias Grieger [email protected]