loadbalancer: Add a P2C host selector #2743
Conversation
```java
import static java.util.concurrent.atomic.AtomicReferenceFieldUpdater.newUpdater;
import static java.util.stream.Collectors.toList;

final class DefaultHost<Addr, C extends LoadBalancedConnection> implements Host<Addr, C> {
```
With the exception of implementing address() and marking the methods public and @Override, this should just be a rename.
```java
final boolean t1Healthy = t1.isActiveAndHealthy();
final boolean t2Healthy = t2.isActiveAndHealthy();
```
Right now this follows the pattern set by the RoundRobinSelector, which is to consider health only when making new connections. I think that is born of the only current circuit breaker being the L4 consecutive-failures detector and the current hesitancy to mark a connection as inactive.
I think we should change that behavior: consider health for all connections, and also allow an unhealthy connection to be selected as a last resort (likely configurable). That gives us the same expected behavior for all circuit breakers and we don't need to special-case the L4 detector. Additionally, I think the L4 case is even improved, because if we can't connect to a host there is a good chance existing connections are either unresponsive or draining. A rough sketch of the idea is below.
However, that is a reasonably big reinterpretation of what health means. I'd like to change it for RR and P2C at the same time, or maybe switch RR first and then merge this. Feedback appreciated on that.
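A minimal sketch of "health for all selections, unhealthy as a last resort"; the helper and its parameters are hypothetical, and only the isActiveAndHealthy-style check comes from this PR:

```java
import java.util.List;
import java.util.function.Predicate;

// Illustrative only: prefer healthy hosts for any selection, but optionally
// hand back an unhealthy host instead of failing outright.
final class LastResortSelection {
    static <H> H select(List<H> candidates, Predicate<H> isActiveAndHealthy,
                        boolean allowUnhealthyLastResort) {
        for (H host : candidates) {
            if (isActiveAndHealthy.test(host)) {
                return host; // applies to both existing and new connections
            }
        }
        // No healthy host left: either fail (return null) or pick one anyway.
        return allowUnhealthyLastResort && !candidates.isEmpty() ? candidates.get(0) : null;
    }
}
```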
I agree on that. RR is a good, battle-tested source for feedback, but if we are designing the building blocks of an LB abstraction we should reconsider some of the design decisions and evaluate the tradeoffs.
My mental model for this was (see the sketch after this list):
- List of hosts as seen from SD
- Subset of hosts as identified by any feedback controller (e.g., L4 / L7)
- Select among these
- Fallback strategy (default to what RR is doing)
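A loose sketch of that staged model; every type and name below is hypothetical and only illustrates the composition, not an actual ServiceTalk API:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.UnaryOperator;

final class StagedSelection<H> {
    private final UnaryOperator<List<H>> feedbackFilter;   // 2. subset per L4/L7 feedback
    private final Function<List<H>, Optional<H>> selector; // 3. select among these
    private final Function<List<H>, Optional<H>> fallback; // 4. fallback (e.g., RR-like)

    StagedSelection(UnaryOperator<List<H>> feedbackFilter,
                    Function<List<H>, Optional<H>> selector,
                    Function<List<H>, Optional<H>> fallback) {
        this.feedbackFilter = feedbackFilter;
        this.selector = selector;
        this.fallback = fallback;
    }

    Optional<H> select(List<H> hostsFromSd) { // 1. hosts as seen from SD
        final List<H> eligible = feedbackFilter.apply(hostsFromSd);
        return selector.apply(eligible).or(() -> fallback.apply(hostsFromSd));
    }
}
```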
I agree. I think we can potentially go further than RR and offer a panic mode, maybe in two flavors: existing connections only, and also allowing new connections, so that the LB can at least try. At least for the latter it's probably important to add some restrictions at the connection-factory level to avoid connection storms, so we should proceed with caution there.
+1 for improving and iterating on this area and shaping the definition of the "unhealthy" state.
> if we can't connect to a host there is a good chance existing connections are either not responsive or draining
In order to keep track of use-cases and make sure we don't miss anything, it's worth mentioning that the ServerListenContext API allows users to pause acceptance of new connections without affecting pre-existing connections. We should account for that as well.
```java
/**
 * Whether the host is both considered active by service discovery and healthy by the failure
 * detection mechanisms.
 */
boolean isActiveAndHealthy();
```
This isn't a comment for this PR but for the overall effort.
I think we can provide a richer API that more closely resembles what a control plane could also advertise about an endpoint (i.e., a Condition enum: Terminating, NotReady, Ready) and a separate binary data-plane status (healthy / unhealthy). A hypothetical shape is sketched below.
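The enum values mirror the comment; everything else is an assumption about what such an API could look like:

```java
// Hypothetical status API, not part of this PR.
enum Condition {
    TERMINATING, // control plane is draining the endpoint
    NOT_READY,   // registered but not yet serving
    READY        // serving traffic
}

interface EndpointStatus {
    // Control-plane view of the endpoint lifecycle.
    Condition condition();

    // Separate binary data-plane signal (e.g., from outlier detection).
    boolean healthy();
}
```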
+1. I had some thoughts about this already, but agree that we should change it in its own PR.
Agree as well. To add some context from xDS/EDS: we definitely have this information available. "Health" from the perspective of the protocol is richer than a binary decision, so it could be an interesting exercise to understand whether we care.
```java
// try to make a new one if the host is healthy. If it's not healthy, we fail
// and let the higher level retries decide what to do.
if (!host.isActiveAndHealthy()) {
    return noActiveHostsException(hosts);
```
nit: this isn't an exception, it's a failed Single. I might expect return noActiveHosts(hosts).
Sorry, I didn't see this until I had merged. I'll clean it up in a follow-up.
The failing checks were flaky tests, so merging.
LGTM, I have 2 questions:
```java
    }
}

@Nullable
```
This method always returns a non-null value, so this annotation is not required.
```java
@Nullable
private Single<C> p2c(int size, List<Host<ResolvedAddress, C>> hosts, Random random, Predicate<C> selector,
                      boolean forceNewConnectionAndReserve, @Nullable ContextMap contextMap) {
    for (int j = maxEffort; j > 0; j--) {
```
Would it make sense to limit the number of attempts based on the number of hosts? For example, if there are only 2 hosts and maxEffort > 1, should we try again or fail fast? A sketch of one way to cap this is below.
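One hypothetical way to cap the loop; maxEffort and size are the existing names from this method, while cappedEffort is illustrative:

```java
// With n hosts there are only n * (n - 1) / 2 distinct pairs, so attempts
// beyond that cannot examine a pair we haven't already seen (assuming host
// state doesn't change between iterations).
final int distinctPairs = size * (size - 1) / 2;
final int cappedEffort = Math.min(maxEffort, Math.max(1, distinctPairs));
for (int j = cappedEffort; j > 0; j--) {
    // ... existing two-choice selection ...
}
```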
```java
if (t1Healthy) {
    return t1.newConnection(selector, forceNewConnectionAndReserve, contextMap);
} else if (t2Healthy) {
    return t2.newConnection(selector, forceNewConnectionAndReserve, contextMap);
```
While you clarified the reasoning for moving re-iteration into a separate, bigger redesign, would it make sense for the time being to follow this logic (see the sketch after this list)?
- Try t1.pickConnection; if that fails,
- check t1 health, and if it's ok do t1.newConnection; if not,
- try t2.pickConnection; if that fails,
- check t2 health, and if it's ok do t2.newConnection; if not,
- go to the next effort.
Motivation
We have a round-robin selector but also want to support a form of random selection. P2C is a common random selection algorithm that gives us the ability to also distinguish hosts based on a load metric, or score, which can help us distribute load more evenly and avoid hosts that can be considered outliers by a variety of mechanisms.
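For reference, a minimal illustration of the power-of-two-choices idea described above; the score function is a stand-in for whatever load metric the real selector uses, and the class name is hypothetical:

```java
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

final class PowerOfTwoChoices {
    // Pick two distinct hosts at random and prefer the one with the lower score.
    static <H> H select(List<H> hosts, Random random, ToDoubleFunction<H> score) {
        final int size = hosts.size();
        if (size == 1) {
            return hosts.get(0);
        }
        final int i1 = random.nextInt(size);
        int i2 = random.nextInt(size - 1);
        if (i2 >= i1) {
            i2++; // shift to guarantee i2 != i1
        }
        final H h1 = hosts.get(i1);
        final H h2 = hosts.get(i2);
        return score.applyAsDouble(h1) <= score.applyAsDouble(h2) ? h1 : h2;
    }
}
```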
Modifications
- Make Host an interface and rename the existing version to DefaultHost to make testing easier and the contract explicit. This doesn't mean the contract won't change as we continue to iterate.
- Add the P2CSelector type and its tests.
Note: we are not yet integrating this selection algorithm into the NewRoundRobinLoadBalancer, which will entail a bunch of questions around the API for doing so. That will be done as a follow-up to minimize the patch size.