Spreading KVs for half of the dead node grace period. #114
I find that the code remains difficult to read despite your explanations in the README.
I would go with the following naming/explanations.
Node deletion
- Heartbeats are fed into a phi-accrual detector.
- The detector tells `live` nodes apart from `failed` nodes.
- Failed nodes are GCed after `GC_GRACE_PERIOD`.

Reliable broadcast
- In order to ensure reliable broadcast, we must propagate info about `failed` nodes for some time shorter than `GC_GRACE_PERIOD` before deleting them.
- To do so, `failed` nodes are split into two categories: `zombie` and `dead`.
- First, upon failure, `failed` nodes become `zombie` nodes, and we keep sharing data about them.
- After `ZOMBIE_GRACE_PERIOD`, `zombie` nodes transition to `dead` nodes, and we stop sharing data about them.
- `ZOMBIE_GRACE_PERIOD` is set to `GC_GRACE_PERIOD / 2`.
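
The lifecycle proposed above could be sketched roughly as follows. This is only an illustration of the suggested naming, not chitchat's actual code: `NodeState`, `classify`, `should_gossip`, and the one-hour value of `GC_GRACE_PERIOD` are all hypothetical, and the input is assumed to be the time elapsed since the failure detector flagged the node (`None` while it is considered live).

```rust
use std::time::Duration;

/// Hypothetical lifecycle states, following the naming proposed in this review.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeState {
    /// The failure detector considers the node alive.
    Live,
    /// Failed, but we keep sharing data about it.
    Zombie,
    /// Failed, and we no longer share data about it.
    Dead,
    /// Failed for at least GC_GRACE_PERIOD: its state can be garbage collected.
    Collectable,
}

/// Illustrative value only, not the library's default.
pub const GC_GRACE_PERIOD: Duration = Duration::from_secs(3600);

/// ZOMBIE_GRACE_PERIOD is set to GC_GRACE_PERIOD / 2.
pub const ZOMBIE_GRACE_PERIOD: Duration = Duration::from_secs(GC_GRACE_PERIOD.as_secs() / 2);

/// Classifies a node from the time elapsed since it was detected as failed.
/// `None` means the detector currently reports the node as live.
pub fn classify(time_since_failure: Option<Duration>) -> NodeState {
    match time_since_failure {
        None => NodeState::Live,
        Some(elapsed) if elapsed < ZOMBIE_GRACE_PERIOD => NodeState::Zombie,
        Some(elapsed) if elapsed < GC_GRACE_PERIOD => NodeState::Dead,
        Some(_) => NodeState::Collectable,
    }
}

/// Only live and zombie nodes are included in gossip.
pub fn should_gossip(state: NodeState) -> bool {
    matches!(state, NodeState::Live | NodeState::Zombie)
}
```

With these assumed values, a node that failed 10 seconds ago is still a `Zombie` and gets gossiped, while one that failed 30 minutes ago is `Dead` and is silently kept around until the GC deadline.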
The chitchat library does not include any mechanism to prevent this from happening. They should however eventually get deleted (after a bit more than `DEAD_NODE_GRACE_PERIOD`) if the node is really dead.

If the node is alive, it should be able to fix everyone's state via reset or regular delta.
At this point, chitchat is so specific to Quickwit, I'd rather move it to `quickwit-gossip` inside `quickwit`. Updates will be easier. Same for mrecordlog.
I think it is fine outside of quickwit: we don't modify it much, and I like that it is really uncoupled.
Also, it is, surprisingly, used by a couple of external projects.
In the future, I'd love to improve it and make it into a protocol to replicate pure operation-based CRDTs.
Quickwit would get cleaner.
See README for more information. This change is made because we use chitchat as a reliable broadcast to update published position in Quickwit.