
[POC] "cluster-aware" rate limiting approximation inside the inference API #117505

Conversation

@timgrein (Contributor) commented Nov 25, 2024

POC for "cluster-aware" rate limiting approximation inside the inference API (I didn't think about good abstractions or edge cases, so take this with a grain of salt; this is just to get the main idea across).

The current rate limiting mechanism for inference endpoints works as follows: you specify a rate limit (or the default is used) as requests_per_minute. Requests are counted by the RateLimiter, which implements the token bucket algorithm.
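For reference, the token bucket idea can be sketched like this (a minimal, self-contained illustration; TokenBucket is a hypothetical class for this sketch, not Elasticsearch's actual RateLimiter):

```java
// Minimal token-bucket sketch: the bucket holds up to `capacity` tokens,
// refills continuously at the configured rate, and each request consumes
// one token if available.
final class TokenBucket {
    private final double capacity;       // maximum tokens the bucket can hold
    private final double refillPerNano;  // tokens added per elapsed nanosecond
    private double tokens;
    private long lastRefill;

    TokenBucket(double requestsPerMinute) {
        this.capacity = requestsPerMinute;
        this.refillPerNano = requestsPerMinute / 60_000_000_000.0;
        this.tokens = requestsPerMinute;   // start full
        this.lastRefill = System.nanoTime();
    }

    // Returns true and consumes a token if the request is allowed,
    // false if the caller should be throttled (or wait).
    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```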

The problem is that the counting does not use "cluster-wide" state but "local node" state, meaning that each node performs its own request counting. Assuming an approximately uniform distribution of requests, the effective cluster-wide rate limit is |nodes| * requests_per_minute. As we need a cluster-wide rate limit for our inference service, we approximate one by dividing the specified rate limit by the number of nodes.

The rate limit update happens as soon as a node joins or leaves the cluster.
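The division can be sketched as follows (PerNodeRateLimit and perNodeRequestsPerMinute are illustrative names, not from this PR; the example values mirror the log excerpt below, where a 1000-token limit becomes 500 per node once a second node joins):

```java
// Hypothetical helper illustrating the "divide by node count" approximation.
final class PerNodeRateLimit {
    // Each node runs its own limiter at clusterLimit / nodeCount, so the
    // summed cluster-wide throughput approximates the configured limit
    // (assuming roughly uniform request distribution across nodes).
    static long perNodeRequestsPerMinute(long clusterRequestsPerMinute, int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        // Integer division rounds down; floor at 1 so a very small limit on a
        // large cluster never drops to zero (a design choice for this sketch).
        return Math.max(1, clusterRequestsPerMinute / nodeCount);
    }
}
```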

I couldn't figure out yet how to capture the logs of newly started nodes, but I can see the correct log messages coming from the UpdateRateLimitsClusterService:

Node added to cluster:

...
  1> [2024-11-26T13:38:07,923][INFO ][o.e.x.i.c.UpdateRateLimitsClusterService] [node_s0] Number of nodes in the cluster: 2
  1> [2024-11-26T13:38:07,923][INFO ][o.e.x.i.c.UpdateRateLimitsClusterService] [node_s0] Updating rate limits for 1 endpoints
  1> [2024-11-26T13:38:07,923][INFO ][o.e.x.i.c.UpdateRateLimitsClusterService] [node_s0] Decreasing per node rate limit for endpoint -163709627 from 1000 to 500 tokens per time unit (node added)
...

Node removed from cluster:

...
  1> [2024-11-26T13:38:08,027][INFO ][o.e.x.i.c.UpdateRateLimitsClusterService] [node_s0] Number of nodes in the cluster: 1
  1> [2024-11-26T13:38:08,027][INFO ][o.e.x.i.c.UpdateRateLimitsClusterService] [node_s0] Updating rate limits for 1 endpoints
  1> [2024-11-26T13:38:08,028][INFO ][o.e.x.i.c.UpdateRateLimitsClusterService] [node_s0] Increasing per node rate limit for endpoint -163709627 from 500 to 1000 tokens per time unit (node removed)

...

Command for running the test (it doesn't assert anything useful, but you can check the logs for the rate limit update):

./gradlew ":x-pack:plugin:inference:test" --tests "org.elasticsearch.xpack.inference.common.UpdateRateLimitsClusterServiceTests.testNodeJoinsRateLimitsUpdated"

@joshdevins (Member)
How does multi-project Serverless affect this approach? Can we use the project ID somehow as part of the key in the rate limiter?

@timgrein (Contributor, Author)

How does multi-project Serverless affect this approach? Can we use the project ID somehow as part of the key in the rate limiter?

The current rate limiting inside the inference API supports grouping on basically anything:

private final ConcurrentMap<Object, RateLimitingEndpointHandler> rateLimitGroupings = new ConcurrentHashMap<>();

So right now I would assume that we can use the project id as a key for rate limit groupings.
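As a sketch of that idea (GroupedRateLimits and RateLimitKey are hypothetical names for this illustration, assuming a Java record to get the value-based equals/hashCode a map key needs):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Composite key scoping a rate limit grouping by project id in addition
// to the endpoint. Records provide equals/hashCode over both components.
record RateLimitKey(String projectId, String endpointId) {}

final class GroupedRateLimits {
    // Keyed on Object, mirroring the grouping map quoted above, so any
    // key type with sensible equality works.
    private final ConcurrentMap<Object, Long> rateLimitGroupings = new ConcurrentHashMap<>();

    // Returns the limit for this (project, endpoint) pair, creating it
    // from the supplied default on first use.
    long limitFor(String projectId, String endpointId, long defaultRequestsPerMinute) {
        return rateLimitGroupings.computeIfAbsent(
            new RateLimitKey(projectId, endpointId),
            k -> defaultRequestsPerMinute
        );
    }
}
```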

@joshdevins (Member)

Ok. I guess for multi-project we also have to ensure that all nodes in a cluster are used by all projects, so that the math for calculating the actual rate limit still holds.

@timgrein (Contributor, Author)

This can be closed, as #120400 has been merged.

@timgrein closed this Jan 31, 2025