Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

loadbalancer: Add a P2C host selector #2743

Merged
merged 5 commits into from
Nov 9, 2023

Conversation

bryce-anderson
Copy link
Contributor

@bryce-anderson bryce-anderson commented Nov 7, 2023

Motivation

We have a round-robin selector but also want to support
a form of random selection. P2C is a common random
selection algorithm that gives us the ability to also distinguish
hosts based on a load metric, or score, which can help us
distribute load more evenly and avoid hosts that can be
considered outliers by a variety of mechanisms.

Modifications

  • Make Host an interface and rename the existing version
    to DefaultHost to make testing easier and the contract
    explicit. This doesn't mean the contract won't change as we
    continue to iterate.
  • Add the P2CSelector type and it's tests.

Note: we are not yet integrating this selection algorithm into the
NewRoundRobinLoadBalancer which will entail a bunch of
questions around the API for doing so. That will be done as a
followup to minimize the patch size.

import static java.util.concurrent.atomic.AtomicReferenceFieldUpdater.newUpdater;
import static java.util.stream.Collectors.toList;

final class DefaultHost<Addr, C extends LoadBalancedConnection> implements Host<Addr, C> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the exception of implementing address(), marking the methods public and @Override, this should just be a rename.

Comment on lines +120 to +121
final boolean t1Healthy = t1.isActiveAndHealthy();
final boolean t2Healthy = t2.isActiveAndHealthy();
Copy link
Contributor Author

@bryce-anderson bryce-anderson Nov 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now this follows the pattern set by the RoundRobinSelector which is to consider health only for making new connections. I think that is born of the only current circuit breaker being the L4 consecutive failures and the current hesitancy to mark a connection as inactive.
I think we should change that behavior in that we should consider health for any connections and also allow an unhealthy connection to be selected as a last resort (likely configurable). This allows us to have the same expected behavior for all circuit breakers and we don't need to special case the L4. Additionally, I think the L4 case is even improved because if we can't connect to a host there is a good chance existing connections are either not responsive or draining.
However, that is a reasonably big reinterpretation of what health means. I'd like to change it for RR and P2C at the same time, or maybe switch RR first and then merge this. Feedback appreciated on that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on that - RR is a good battle tested source for feedback, but if we are designing building blocks of a LB abstraction we should re-consider some of the design decisions and evaluate tradeoffs.
My mental model about this was

  • List of hosts as seen from SD
  • Subset of hosts as identified by any feedback controller (e.g., L4 / L7)
  • Select among these
  • Fallback strategy (default to what RR is doing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I think we can potentially go further than RR and offer a panic mode, maybe in two flavors: existing connections only and even for new connections, that at least lets the LB try. At least for the latter it's probably important to add some restrictions at the connection factory level to avoid connection storms so we should proceed with caution there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for improving and interating on this area and shaping definition of "unhealthy" state.

if we can't connect to a host there is a good chance existing connections are either not responsive or draining

In order to keep track of use-cases and make sure we don't miss anything, it's worth mentioning that ServerListenContext API allows users to pause acceptance of new connections without affecting pre-existing connections. We should account for that as well.

@bryce-anderson bryce-anderson changed the title Bl anderson/add p2c loadbalancer: Add a P2C selector type for the new load balancing architecture Nov 7, 2023
@bryce-anderson bryce-anderson changed the title loadbalancer: Add a P2C selector type for the new load balancing architecture loadbalancer: Add a P2C host selector Nov 7, 2023
* Whether the host is both considered active by service discovery and healthy by the failure
* detection mechanisms.
*/
boolean isActiveAndHealthy();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a comment for this PR but the overall effort.
I think we can provide a richer API that closer resembles what a control plane could also advertise about an endpoint (ie. Condition enum -- Terminating, NotReady, Ready) and a separate binary data plane status (healthy / unhealthy)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I had some thoughts about already but agree that we should change it in it's own PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree as well. To add some context from xDS/EDS we definitely have this information available. "Health" from the perspective of the protocol is richer than a binary decision so it could be an interesting exercise understanding if we care.

Comment on lines +120 to +121
final boolean t1Healthy = t1.isActiveAndHealthy();
final boolean t2Healthy = t2.isActiveAndHealthy();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on that - RR is a good battle tested source for feedback, but if we are designing building blocks of a LB abstraction we should re-consider some of the design decisions and evaluate tradeoffs.
My mental model about this was

  • List of hosts as seen from SD
  • Subset of hosts as identified by any feedback controller (e.g., L4 / L7)
  • Select among these
  • Fallback strategy (default to what RR is doing)

// try to make a new one if the host is healthy. If it's not healthy, we fail
// and let the higher level retries decide what to do.
if (!host.isActiveAndHealthy()) {
return noActiveHostsException(hosts);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this isn't an exception it's a failed Single. I might expect return noActiveHosts(hosts).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't see this until I had merged. I'll clean it up in a follow up.

@bryce-anderson
Copy link
Contributor Author

The failing checks were flaky tests, so merging.

@bryce-anderson bryce-anderson merged commit 34da727 into apple:main Nov 9, 2023
14 of 15 checks passed
@bryce-anderson bryce-anderson deleted the bl_anderson/add_p2c branch November 9, 2023 19:12
Copy link
Member

@idelpivnitskiy idelpivnitskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, have 2 questions:

}
}

@Nullable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method always returns a non-null value, this annotation is not required

@Nullable
private Single<C> p2c(int size, List<Host<ResolvedAddress, C>> hosts, Random random, Predicate<C> selector,
boolean forceNewConnectionAndReserve, @Nullable ContextMap contextMap) {
for (int j = maxEffort; j > 0; j--) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it make sense to limit efforts based on number of hosts? For example, if there are only 2 hosts with maxEffort > 1, should we try again or fail fast?

if (t1Healthy) {
return t1.newConnection(selector, forceNewConnectionAndReserve, contextMap);
} else if (t2Healthy) {
return t2.newConnection(selector, forceNewConnectionAndReserve, contextMap);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you clarified reasoning to move re-iteration in a separate big redesign, will it make sense for the time being to follow the following logic?

  1. Try t1.pickConnection, if not
  2. Check t1 health, if ok do t1.newConnection if not:
  3. Try t2.pickConnection, if not
  4. Check t2 health, if ok do t1.newConnection if not:
  5. go to the next effort

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants