[POC] "cluster-aware" rate limiting approximation inside the inference API #117505
POC for "cluster-aware" rate limiting approximation inside the inference API (I didn't think about good abstractions or edge cases, so take this with a grain of salt; this is just to get the main idea across).
The current rate limiting mechanism for inference endpoints works as follows: you specify a rate limit (or the default one is used) as `requests_per_minute`. We count the requests using the `RateLimiter`, which implements the token bucket algorithm.

The problem is that counting uses "local node" state rather than "cluster-wide" state, meaning that each node performs its own request counting. Assuming a roughly uniform distribution of requests across nodes, the effective cluster-wide rate limit is `|nodes| * requests_per_minute`. Because we need a cluster-wide rate limit for our inference service, we want to approximate one by dividing the specified rate limit by the number of nodes.

The rate limit update happens as soon as a node joins or leaves the cluster.
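To make the idea concrete, here is a minimal sketch of a token bucket whose limit can be replaced at runtime with the cluster-wide limit divided by the node count. It is not the `RateLimiter` from this PR: the class `TokenBucketSketch`, the helper `perNodeLimit`, the floor-with-minimum-of-1 rounding, and the injectable clock are all assumptions made for illustration.

```java
import java.util.function.LongSupplier;

/** Illustrative token bucket; NOT the RateLimiter class used by the inference API. */
public final class TokenBucketSketch {
    private final LongSupplier nanoClock; // injectable clock so the sketch can be tested
    private double tokensPerNano;
    private double capacity;
    private double tokens;
    private long lastRefillNanos;

    public TokenBucketSketch(long requestsPerMinute, LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
        this.lastRefillNanos = nanoClock.getAsLong();
        applyLimit(requestsPerMinute);
        this.tokens = capacity; // start with a full bucket
    }

    /** Cluster-wide limit divided by node count (assumed rounding: floor, never below 1). */
    public static long perNodeLimit(long clusterRequestsPerMinute, int nodeCount) {
        if (clusterRequestsPerMinute <= 0 || nodeCount <= 0) {
            throw new IllegalArgumentException("limit and node count must be positive");
        }
        return Math.max(1L, clusterRequestsPerMinute / nodeCount);
    }

    /** Swap in a new limit, e.g. after the node count changed; extra tokens are capped. */
    public synchronized void updateLimit(long requestsPerMinute) {
        refill(); // credit elapsed time at the old rate first
        applyLimit(requestsPerMinute);
        tokens = Math.min(tokens, capacity);
    }

    /** Takes one token if available; returns false when the request should be throttled. */
    public synchronized boolean tryAcquire() {
        refill();
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    private void applyLimit(long requestsPerMinute) {
        if (requestsPerMinute <= 0) {
            throw new IllegalArgumentException("requestsPerMinute must be positive");
        }
        this.capacity = requestsPerMinute;
        this.tokensPerNano = requestsPerMinute / 60e9; // 60e9 nanoseconds per minute
    }

    private void refill() {
        long now = nanoClock.getAsLong();
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * tokensPerNano);
        lastRefillNanos = now;
    }
}
```

With this rounding, a limit of 1000 requests per minute on 3 nodes gives each node 333, so the effective cluster-wide limit is 999. The approximation only holds when requests are spread roughly evenly across nodes.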
I couldn't figure out how to capture logs of newly started nodes yet, but I see the correct log messages coming from the `UpdateRateLimitsClusterService`:
Node added to cluster:
Node removed from cluster:
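As a rough illustration of the update-on-membership-change flow described above, the sketch below recomputes the per-node limit whenever a node is added or removed. It is not the `UpdateRateLimitsClusterService` from this PR and does not use Elasticsearch's cluster state APIs: the class `NodeCountRateLimitUpdater`, its `nodeAdded`/`nodeRemoved` methods, and the `LongConsumer` sink are hypothetical names chosen for the example.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.LongConsumer;

/** Hypothetical stand-in for membership handling; NOT the PR's UpdateRateLimitsClusterService. */
public final class NodeCountRateLimitUpdater {
    private final long clusterRequestsPerMinute;
    private final LongConsumer perNodeLimitSink; // e.g. forwards the value to a local rate limiter
    private final Set<String> nodeIds = new HashSet<>();

    public NodeCountRateLimitUpdater(long clusterRequestsPerMinute, LongConsumer perNodeLimitSink) {
        if (clusterRequestsPerMinute <= 0) {
            throw new IllegalArgumentException("clusterRequestsPerMinute must be positive");
        }
        this.clusterRequestsPerMinute = clusterRequestsPerMinute;
        this.perNodeLimitSink = perNodeLimitSink;
    }

    /** Call when a node joins; duplicate notifications for the same node are ignored. */
    public synchronized void nodeAdded(String nodeId) {
        if (nodeIds.add(nodeId)) {
            publish();
        }
    }

    /** Call when a node leaves; nothing is published if no nodes remain. */
    public synchronized void nodeRemoved(String nodeId) {
        if (nodeIds.remove(nodeId) && !nodeIds.isEmpty()) {
            publish();
        }
    }

    private void publish() {
        // Assumed rounding: floor division, but never below 1 request per minute.
        long perNode = Math.max(1L, clusterRequestsPerMinute / nodeIds.size());
        perNodeLimitSink.accept(perNode);
    }
}
```

For a cluster-wide limit of 1000, adding three nodes one by one publishes 1000, 500, and 333; removing one node afterwards publishes 500 again.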
Command for running the test (it doesn't assert anything useful, but you can check the logs for the rate limit update):