[Membership] Use an expander graph to improve eviction speed when multiple hosts fail simultaneously #9301

Merged

Conversation

ReubenBond
Member

@ReubenBond ReubenBond commented Jan 29, 2025

This PR changes how silos select which other silos to monitor. The current scheme arranges all silos into a hash ring, and each silo monitors the NumProbedSilos subsequent silos in the ring. This works well when there are very few simultaneous failures, but when multiple silos fail simultaneously, it can be slow to detect the failures. To understand why, consider this example ring:
[image: example ring of silos, with the failed silos A, B, and C shown in red]
The three red silos, A, B, and C, have failed but have not been evicted yet. If two votes are required to evict a silo, then either A or B must be evicted before C can be. Note that for each silo there exists another silo whose monitored set differs by only a single silo, so neighboring silos on the ring watch almost the same targets. We can improve detection in this correlated failure scenario by selecting monitored silos using an expander graph instead.
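To make the ring scenario concrete, here is a minimal Python sketch of it. This is not Orleans code: the function names `ring_monitors` and `live_watchers` are hypothetical, the silos are placed on the ring in alphabetical order rather than by hash, and the values of 3 probed silos and 2 required votes are assumptions chosen for illustration.

```python
def ring_monitors(silos, num_probed):
    """Map each silo to the num_probed silos that follow it on the ring."""
    n = len(silos)
    return {
        silos[i]: [silos[(i + k) % n] for k in range(1, num_probed + 1)]
        for i in range(n)
    }


def live_watchers(monitors, target, failed):
    """Return the silos that monitor `target` and have not failed themselves."""
    return [
        silo
        for silo, targets in monitors.items()
        if target in targets and silo not in failed
    ]


# Hypothetical 8-silo ring; A, B, and C fail at the same time.
silos = ["A", "B", "C", "D", "E", "F", "G", "H"]
failed = {"A", "B", "C"}
votes_required = 2  # assumed value for this illustration

monitors = ring_monitors(silos, num_probed=3)
for target in sorted(failed):
    watchers = live_watchers(monitors, target, failed)
    print(target, watchers, len(watchers) >= votes_required)
# A ['F', 'G', 'H'] True
# B ['G', 'H'] True
# C ['H'] False
```

In this sketch C is watched by H, A, and B, and two of those have failed, so only one live silo can vote against C until the ring is recomputed without A or B.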

This PR implements that approach by probabilistically constructing an expander graph. This minimizes overlap in monitoring sets between any two silos, thus helping to avoid cases where one failed silo must be evicted before another failed silo will be monitored by enough silos to have it evicted.
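As a rough illustration of what a probabilistic construction can look like, the sketch below follows the multi-ring idea from the Rapid paper cited below: build several independently hashed rings and have each silo monitor its successor on every ring. This is an assumption for illustration only and may differ from what this PR actually implements; the names `ring_order` and `expander_monitors` are hypothetical.

```python
import hashlib


def ring_order(silos, ring_index):
    """Order silos on one ring by hashing each name together with the ring index."""
    def position(silo):
        digest = hashlib.sha256(f"{ring_index}:{silo}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(silos, key=position)


def expander_monitors(silos, num_probed):
    """Each silo monitors its successor on each of num_probed pseudo-random rings."""
    monitors = {silo: [] for silo in silos}
    for ring_index in range(num_probed):
        ring = ring_order(silos, ring_index)
        for i, silo in enumerate(ring):
            successor = ring[(i + 1) % len(ring)]
            # Successors on different rings may coincide, so skip duplicates.
            if successor != silo and successor not in monitors[silo]:
                monitors[silo].append(successor)
    return monitors


# Hypothetical cluster of 20 silos, each probing up to 3 others.
silos = [f"silo-{i}" for i in range(20)]
monitors = expander_monitors(silos, num_probed=3)
for silo in silos[:3]:
    print(silo, "monitors", monitors[silo])
```

Because every ring uses a different ordering, two silos that are neighbors on one ring are unlikely to be neighbors on the others, so their monitored sets rarely line up the way they do on a single ring. The selection depends only on the silo names, so every silo that sees the same membership computes the same graph.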

The idea to use an expander graph is taken from "Stable and Consistent Membership at Scale with Rapid" by Lalith Suresh et al:
https://www.usenix.org/conference/atc18/presentation/suresh


@ReubenBond ReubenBond force-pushed the fix/disaster-recovery/use-expander-graph branch from 952413c to 38435a6 Compare February 5, 2025 19:32
@ReubenBond ReubenBond merged commit e911c48 into dotnet:main Feb 5, 2025
16 checks passed
@ReubenBond ReubenBond deleted the fix/disaster-recovery/use-expander-graph branch February 5, 2025 20:02