Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spreading KVs for half of the dead node grace period. #114

Merged
merged 1 commit into from
Feb 15, 2024

Conversation

fulmicoton
Copy link
Contributor

@fulmicoton fulmicoton commented Feb 13, 2024

See README for more information.

This change is made because we use chitchat as a reliable broadcast to
update published position in Quickwit.

@fulmicoton fulmicoton force-pushed the transmitting-dead-node-kvs branch 4 times, most recently from db503f9 to 488820c Compare February 13, 2024 08:53
@fulmicoton fulmicoton requested a review from guilload February 13, 2024 08:55
@fulmicoton fulmicoton force-pushed the transmitting-dead-node-kvs branch from 488820c to b2d0765 Compare February 13, 2024 08:57
@fulmicoton fulmicoton changed the title Transmitting dead node kvs Spreading KVs for half of the dead node grace period. Feb 13, 2024
@fulmicoton fulmicoton force-pushed the transmitting-dead-node-kvs branch 3 times, most recently from 166dd41 to c349456 Compare February 13, 2024 13:15
Copy link
Member

@guilload guilload left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find that the code remains difficult to read despite your explanations in the README.

I would go with the following naming/explanations.

Node deletion

  • Heartbeats are fed into a phi-accrual detector.
  • Detector tells live nodes from failed nodes apart.
  • Failed nodes are GCed after GC_GRACE_PERIOD.

Reliable broadcast

  • In order to ensure reliable broadcast, we must propagate info about failed nodes for some time shorter than GC_GRACE_PERIOD before deleting them.
  • To do so, failed nodes are split into two categories: zombie and dead.
  • First, upon failure, failed nodes become zombie nodes, and we keep sharing data about them.
  • After ZOMBIE_GRACE_PERIOD, zombie nodes transition to dead nodes, and we stop sharing data about them.
  • ZOMBIE_GRACE_PERIOD is set to GC_GRACE_PERIOD / 2

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved

The chitchat library does not include any mechanism to prevent this from happening. They should however eventually get deleted (after a bit more than `DEAD_NODE_GRACE_PERIOD`) if the node is really dead.

If the node is alive, it should be able to fix everyone's state via reset or regular delta.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, chitchat is so specific to Quickwit, I'd rather move it to quickwit-gossip inside quickwit. Updates will be easier. Same for mrecorlog.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine outside of quickwit: we don't modify it much and I like that it is really uncoupled.
Also, it is surprisingly used by a couple of external projects.

In the future I'd love to improve it, and make it into a protocol to replicate pure operation based Crdts.
Quickwit would get cleaner.

@fulmicoton fulmicoton force-pushed the transmitting-dead-node-kvs branch from 633fb6a to 9582399 Compare February 15, 2024 05:34
See README for more information.

This change is made because we use chitchat as a reliable broadcast to
update published position in Quickwit.
@fulmicoton fulmicoton force-pushed the transmitting-dead-node-kvs branch from 9582399 to 73fd3e2 Compare February 15, 2024 05:46
@fulmicoton fulmicoton merged commit 32459cd into main Feb 15, 2024
2 checks passed
@fulmicoton fulmicoton deleted the transmitting-dead-node-kvs branch February 15, 2024 06:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants