[POC] "cluster-aware" rate limiting approximation inside the inference API #117505

timgrein · 2024-11-25T17:05:39Z

POC for "cluster-aware" rate limiting approximation inside the inference API (I didn't think about good abstractions or edge cases, so take this with a grain of salt; this is just to get the main idea across).

The current rate limiting mechanism for inference endpoints works as follows: you specify a rate limit as requests_per_minute (or the default is used), and we count requests with the RateLimiter, which implements the token bucket algorithm.
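For readers unfamiliar with the token bucket algorithm, here is a minimal, self-contained sketch of the idea. It is not the actual RateLimiter class from Elasticsearch: the class name `TokenBucketSketch`, its method names, and the explicit clock parameter are assumptions made purely for illustration.

```java
// Illustrative token bucket; NOT the Elasticsearch RateLimiter implementation.
// All names here are hypothetical.
final class TokenBucketSketch {
    private static final double NANOS_PER_MINUTE = 60_000_000_000.0;

    private final double requestsPerMinute; // refill rate and bucket capacity
    private double availableTokens;
    private long lastRefillNanos;

    TokenBucketSketch(double requestsPerMinute, long nowNanos) {
        this.requestsPerMinute = requestsPerMinute;
        this.availableTokens = requestsPerMinute; // assumption: the bucket starts full
        this.lastRefillNanos = nowNanos;
    }

    // Returns true and consumes one token if a request may proceed at nowNanos.
    synchronized boolean tryAcquire(long nowNanos) {
        refill(nowNanos);
        if (availableTokens >= 1.0) {
            availableTokens -= 1.0;
            return true;
        }
        return false;
    }

    // Adds tokens in proportion to the elapsed time, capped at the capacity.
    private void refill(long nowNanos) {
        long elapsedNanos = Math.max(0L, nowNanos - lastRefillNanos);
        double refilled = elapsedNanos * requestsPerMinute / NANOS_PER_MINUTE;
        availableTokens = Math.min(requestsPerMinute, availableTokens + refilled);
        lastRefillNanos = nowNanos;
    }

    public static void main(String[] args) {
        TokenBucketSketch bucket = new TokenBucketSketch(2, 0L); // 2 requests per minute
        System.out.println(bucket.tryAcquire(0L)); // consumes the first token
        System.out.println(bucket.tryAcquire(0L)); // consumes the second token
        System.out.println(bucket.tryAcquire(0L)); // bucket is empty at this point
        System.out.println(bucket.tryAcquire(31_000_000_000L)); // 31 s later, after a refill
    }
}
```

Passing the time in explicitly keeps the sketch deterministic; a real limiter would read a clock instead.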

The problem is that counting uses "local node" state rather than "cluster-wide" state, meaning that each node performs its own request counting. Assuming a roughly uniform distribution of requests across nodes, the effective cluster-wide rate limit is |nodes| * requests_per_minute. Since we need a cluster-wide rate limit for our inference service, we approximate one by dividing the specified rate limit by the number of nodes.
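The arithmetic can be sketched as follows. The class and method names are hypothetical, and the rounding behaviour (integer division with a floor of 1) is an assumption for illustration, not necessarily what this POC does.

```java
// Hypothetical helper illustrating the arithmetic; names and rounding are assumptions.
final class ClusterRateLimitMath {

    // With purely node-local counting and roughly uniform traffic,
    // every node admits the full configured limit.
    static long effectiveClusterLimit(long requestsPerMinute, int nodeCount) {
        requirePositive(nodeCount);
        return requestsPerMinute * nodeCount;
    }

    // Approximates a cluster-wide limit by giving each node an equal share.
    // Assumption: round down, but never below 1 so that a node is not blocked entirely.
    static long perNodeLimit(long requestsPerMinute, int nodeCount) {
        requirePositive(nodeCount);
        return Math.max(1L, requestsPerMinute / nodeCount);
    }

    private static void requirePositive(int nodeCount) {
        if (nodeCount < 1) {
            throw new IllegalArgumentException("nodeCount must be >= 1");
        }
    }

    public static void main(String[] args) {
        System.out.println(effectiveClusterLimit(1000, 2)); // 2000: what the cluster admits today
        System.out.println(perNodeLimit(1000, 2));          // 500: per-node share after dividing
    }
}
```

Note that the floor of 1 means the cluster-wide total can exceed the configured limit once there are more nodes than requests per minute; that is one of the edge cases this sketch does not try to solve.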

The rate limit update happens as soon as a node joins or leaves the cluster.
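A rough, self-contained sketch of that update step is shown below. It deliberately avoids Elasticsearch types: `PerNodeRateLimitUpdater` and `onNodeCountChanged` are hypothetical names, not the UpdateRateLimitsClusterService from this PR. In Elasticsearch itself, a callback like this would presumably be driven by a ClusterStateListener.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the UpdateRateLimitsClusterService from the PR.
final class PerNodeRateLimitUpdater {
    private final Map<String, Long> configuredLimits = new HashMap<>();
    private final Map<String, Long> perNodeLimits = new HashMap<>();
    private int nodeCount = 1;

    synchronized void registerEndpoint(String endpointId, long requestsPerMinute) {
        configuredLimits.put(endpointId, requestsPerMinute);
        perNodeLimits.put(endpointId, share(requestsPerMinute, nodeCount));
    }

    // Call whenever a node joins or leaves the cluster.
    synchronized void onNodeCountChanged(int newNodeCount) {
        if (newNodeCount < 1) {
            throw new IllegalArgumentException("newNodeCount must be >= 1");
        }
        nodeCount = newNodeCount;
        // Always recompute from the configured limit instead of scaling the
        // current per-node value, so rounding errors do not accumulate.
        for (Map.Entry<String, Long> entry : configuredLimits.entrySet()) {
            perNodeLimits.put(entry.getKey(), share(entry.getValue(), newNodeCount));
        }
    }

    synchronized long perNodeLimit(String endpointId) {
        Long limit = perNodeLimits.get(endpointId);
        if (limit == null) {
            throw new IllegalArgumentException("unknown endpoint: " + endpointId);
        }
        return limit;
    }

    // Assumption: integer division with a floor of 1.
    private static long share(long requestsPerMinute, int nodes) {
        return Math.max(1L, requestsPerMinute / nodes);
    }

    public static void main(String[] args) {
        PerNodeRateLimitUpdater updater = new PerNodeRateLimitUpdater();
        updater.registerEndpoint("endpoint-a", 1000);
        updater.onNodeCountChanged(2); // node added
        System.out.println(updater.perNodeLimit("endpoint-a")); // prints 500
        updater.onNodeCountChanged(1); // node removed
        System.out.println(updater.perNodeLimit("endpoint-a")); // prints 1000
    }
}
```

Keeping the originally configured limit around is the main design choice here: a sequence of joins and leaves then always lands on the same per-node value for a given node count.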

I couldn't figure out how to capture logs of newly started nodes yet, but I see the correct log messages coming from the UpdateRateLimitsClusterService:

Node added to cluster:

...
  1> [2024-11-26T13:38:07,923][INFO ][o.e.x.i.c.UpdateRateLimitsClusterService] [node_s0] Number of nodes in the cluster: 2
  1> [2024-11-26T13:38:07,923][INFO ][o.e.x.i.c.UpdateRateLimitsClusterService] [node_s0] Updating rate limits for 1 endpoints
  1> [2024-11-26T13:38:07,923][INFO ][o.e.x.i.c.UpdateRateLimitsClusterService] [node_s0] Decreasing per node rate limit for endpoint -163709627 from 1000 to 500 tokens per time unit (node added)
...

Node removed from cluster:

...
  1> [2024-11-26T13:38:08,027][INFO ][o.e.x.i.c.UpdateRateLimitsClusterService] [node_s0] Number of nodes in the cluster: 1
  1> [2024-11-26T13:38:08,027][INFO ][o.e.x.i.c.UpdateRateLimitsClusterService] [node_s0] Updating rate limits for 1 endpoints
  1> [2024-11-26T13:38:08,028][INFO ][o.e.x.i.c.UpdateRateLimitsClusterService] [node_s0] Increasing per node rate limit for endpoint -163709627 from 500 to 1000 tokens per time unit (node removed)

...

Command for running the test (it doesn't assert anything useful, but you can check the logs for the rate limit update):

./gradlew ":x-pack:plugin:inference:test" --tests "org.elasticsearch.xpack.inference.common.UpdateRateLimitsClusterServiceTests.testNodeJoinsRateLimitsUpdated"


joshdevins · 2024-11-28T16:13:48Z

How does multi-project Serverless affect this approach? Can we use the project ID somehow as part of the key in the rate limiter?

timgrein · 2024-11-28T16:30:57Z

> How does multi-project Serverless affect this approach? Can we use the project ID somehow as part of the key in the rate limiter?

The current rate limiting inside the inference API supports grouping on basically anything:

elasticsearch/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/external/http/sender/RequestExecutorService.java

Line 100 in ab604ad

    
           private final ConcurrentMap<Object, RateLimitingEndpointHandler> rateLimitGroupings = new ConcurrentHashMap<>();

So right now I would assume that we can use the project ID as a key for the rate limit groupings.
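As a rough illustration of that idea, the sketch below keys a grouping map on a composite of project ID and the existing grouping object. Everything except the `ConcurrentMap<Object, ...>` shape is an assumption: `GroupingKey`, `HandlerStub` (a stand-in for RateLimitingEndpointHandler), and `handlerFor` are hypothetical names, and how a project ID would actually be obtained is not covered.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of project-scoped rate limit groupings.
final class ProjectScopedGroupings {

    // Hypothetical composite key: project ID plus whatever a service already groups on.
    record GroupingKey(String projectId, Object serviceGrouping) {}

    // Stand-in for RateLimitingEndpointHandler; it only counts requests.
    static final class HandlerStub {
        final AtomicLong requests = new AtomicLong();
    }

    private final ConcurrentMap<Object, HandlerStub> rateLimitGroupings = new ConcurrentHashMap<>();

    // Returns the handler for this project and grouping, creating it on first use.
    HandlerStub handlerFor(String projectId, Object serviceGrouping) {
        GroupingKey key = new GroupingKey(projectId, serviceGrouping);
        return rateLimitGroupings.computeIfAbsent(key, k -> new HandlerStub());
    }

    int groupCount() {
        return rateLimitGroupings.size();
    }

    public static void main(String[] args) {
        ProjectScopedGroupings groupings = new ProjectScopedGroupings();
        groupings.handlerFor("project-1", "some-api-key").requests.incrementAndGet();
        groupings.handlerFor("project-2", "some-api-key").requests.incrementAndGet();
        System.out.println(groupings.groupCount()); // prints 2: one group per project
    }
}
```

Because records compare by component values, two requests with the same project ID and grouping object resolve to the same handler, while the same grouping under a different project gets its own.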

joshdevins · 2024-11-28T16:43:51Z

OK. I guess for multi-project we also have to ensure that all nodes in a cluster are used by all projects, so that the maths for calculating the actual rate limit stay the same.

timgrein · 2025-01-31T14:19:59Z

Can be closed, as #120400 has been merged.

POC for "cluster-aware" rate limiting approximation inside the inference API

b912f77

elasticsearchmachine added the v9.0.0 label Nov 25, 2024

timgrein added 4 commits November 25, 2024 18:08

TODO: Treat node joining and node leaving differently

8e9eaa4

Fix expectation

55164a3

Adapt TODO

faace9e

Handle node added/removed correctly

079db62

timgrein mentioned this pull request Nov 28, 2024

[POC] Route completion request per service always to same node in a cluster #117705

Closed

timgrein mentioned this pull request Jan 20, 2025

[Inference API] Add node-local rate limiting for the inference API #120400

Merged

elasticsearchmachine added v9.1.0 and removed v9.0.0 labels Jan 30, 2025

timgrein closed this Jan 31, 2025
