[Inference API] Add node-local rate limiting for the inference API #120400
Conversation
Fix integration tests by using new LocalStateInferencePlugin instead of InferencePlugin and adjust formatting.
```java
List<DiscoveryNode> assignedNodes = new ArrayList<>();

// TODO: here we can probably be smarter: if |num nodes in cluster| > |num nodes per task types|
```
This is something I kept out of this PR's scope for now, as we'll only need it once we support multiple services and/or task types. A minimal sketch of the kind of assignment the TODO hints at is shown below.
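A hypothetical sketch (not the PR's actual code) of capping how many nodes are assigned to a service/task type instead of always using every node in the cluster; the `Node` record and `nodesPerTaskType` parameter are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

record Node(String id) {}

final class NodeAssignmentSketch {
    static List<Node> assignNodes(List<Node> clusterNodes, int nodesPerTaskType) {
        List<Node> sorted = new ArrayList<>(clusterNodes);
        // Sort deterministically so every node derives the same
        // assignment from the same cluster state.
        sorted.sort(Comparator.comparing(Node::id));
        int count = Math.min(nodesPerTaskType, sorted.size());
        return new ArrayList<>(sorted.subList(0, count));
    }
}
```

With this shape, a cluster larger than `nodesPerTaskType` would leave the surplus nodes free to serve other services or task types rather than splitting one rate limit ever thinner.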
Hi @timgrein, I've created a changelog YAML for you.
```java
}

private NodeRoutingDecision determineRouting(String serviceName, Request request, UnparsedModel unparsedModel) {
    if (INFERENCE_API_CLUSTER_AWARE_RATE_LIMITING_FEATURE_FLAG.isEnabled() == false) {
```
Not strictly necessary, but we can keep it for now and remove it once the feature flag goes away.
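A minimal self-contained sketch of the early-return guard shown in the diff above; the `Decision` enum and the system property name are assumptions, and the point is only the feature-flag short-circuit:

```java
final class RoutingGuardSketch {
    enum Decision { HANDLE_LOCALLY, ROUTE_TO_ASSIGNED_NODE }

    // Stand-in for INFERENCE_API_CLUSTER_AWARE_RATE_LIMITING_FEATURE_FLAG.isEnabled()
    static final boolean CLUSTER_AWARE_RATE_LIMITING_ENABLED =
        Boolean.getBoolean("es.inference_cluster_aware_rate_limiting");

    static Decision determineRouting(String serviceName) {
        if (CLUSTER_AWARE_RATE_LIMITING_ENABLED == false) {
            // Flag off: behave exactly as before, no rerouting.
            return Decision.HANDLE_LOCALLY;
        }
        // Flag on: the request may be rerouted to one of the nodes
        // responsible for `serviceName` (elided in this sketch).
        return Decision.ROUTE_TO_ASSIGNED_NODE;
    }
}
```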
💚 Backport successful
[Inference API] Add node-local rate limiting for the inference API (#120400) (#121251)

* [Inference API] Add node-local rate limiting for the inference API (#120400)
* Add node-local rate limiting for the inference API
* Fix integration tests by using new LocalStateInferencePlugin instead of InferencePlugin and adjust formatting.
* Correct feature flag name
* Add more docs, reorganize methods and make some methods package private
* Clarify comment in BaseInferenceActionRequest
* Fix wrong merge
* Fix checkstyle
* Fix checkstyle in tests
* Check that the service we want to read the rate limit config for actually exists
* [CI] Auto commit changes from spotless
* checkStyle apply
* Update docs/changelog/120400.yaml
* Move rate limit division logic to RequestExecutorService
* Spotless apply
* Remove debug sout
* Adding a few suggestions
* Adam feedback
* Fix compilation error
* [CI] Auto commit changes from spotless
* Add BWC test case to InferenceActionRequestTests
* Add BWC test case to UnifiedCompletionActionRequestTests
* Update x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/common/InferenceServiceNodeLocalRateLimitCalculator.java (Co-authored-by: Adam Demjen <[email protected]>)
* Update x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/common/InferenceServiceNodeLocalRateLimitCalculator.java (Co-authored-by: Adam Demjen <[email protected]>)
* Remove addressed TODO
* Spotless apply
* Only use new rate limit specific feature flag
* Use ThreadLocalRandom
* [CI] Auto commit changes from spotless
* Use Randomness.get()
* [CI] Auto commit changes from spotless
* Fix import
* Use ConcurrentHashMap in InferenceServiceNodeLocalRateLimitCalculator
* Check for null value in getRateLimitAssignment and remove AtomicReference
* Remove newAssignments
* Up the default rate limit for completions
* Put deprecated feature flag back in
* Check feature flag in BaseTransportInferenceAction
* spotlessApply
* Export inference.common
* Do not export inference.common
* Provide noop rate limit calculator, if feature flag is disabled
* Add proper dependency injection

---------

Co-authored-by: elasticsearchmachine <[email protected]>
Co-authored-by: Jonathan Buttner <[email protected]>
Co-authored-by: Adam Demjen <[email protected]>

* Use .get(0) as getFirst() doesn't exist in 8.18 (probably JDK difference?)

---------

Co-authored-by: elasticsearchmachine <[email protected]>
Co-authored-by: Jonathan Buttner <[email protected]>
Co-authored-by: Adam Demjen <[email protected]>
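The commit "Move rate limit division logic to RequestExecutorService" refers to splitting an endpoint-wide rate limit across the nodes responsible for it. A rough sketch of that division, where the method name and rounding behavior are assumptions rather than the PR's exact code:

```java
final class RateLimitDivisionSketch {
    /** Node-local share of an endpoint-wide requests-per-time-unit limit. */
    static long nodeLocalLimit(long requestsPerTimeUnit, int responsibleNodes) {
        if (responsibleNodes <= 0) {
            throw new IllegalArgumentException("responsibleNodes must be positive");
        }
        // Integer division; guarantee at least one request per node so a
        // large cluster cannot round a small limit down to zero.
        return Math.max(1L, requestsPerTimeUnit / responsibleNodes);
    }
}
```

For example, an endpoint limit of 1,000 requests per minute spread over 4 responsible nodes would give each node a local budget of 250 requests per minute.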
This PR combines the approaches described in the linked PRs (each idea is described in isolation in its own PR):
Some important notes:

* The new behavior is gated behind the `inference_cluster_aware_rate_limiting` feature flag.
* For now it only applies to the `elastic` inference provider in combination with the `sparse_embedding` task type.

The combined high-level overview looks like the following:
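As a hedged, self-contained approximation of that flow (all class and method names below are illustrative assumptions, not the PR's actual classes): a request is forwarded when the local node is not responsible for the service, and otherwise consumes the node-local share of the rate limit before calling the provider.

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.Semaphore;

final class InferenceFlowSketch {
    record NodeId(String id) {}

    private final NodeId localNode;
    private final List<NodeId> assignedNodes;  // nodes responsible for this service/task type
    private final Semaphore nodeLocalPermits;  // stand-in for the node-local rate limiter
    private final Random random = new Random();

    InferenceFlowSketch(NodeId localNode, List<NodeId> assignedNodes, int nodeLocalLimit) {
        this.localNode = localNode;
        this.assignedNodes = assignedNodes;
        this.nodeLocalPermits = new Semaphore(nodeLocalLimit);
    }

    void handle(String request) throws InterruptedException {
        if (assignedNodes.contains(localNode) == false) {
            // 1. Not responsible: forward to a random assigned node.
            NodeId target = assignedNodes.get(random.nextInt(assignedNodes.size()));
            System.out.println("forwarding " + request + " to " + target.id());
            return;
        }
        // 2. Responsible: enforce this node's share of the rate limit...
        nodeLocalPermits.acquire();
        try {
            // 3. ...then call the upstream inference provider.
            System.out.println("executing " + request + " on " + localNode.id());
        } finally {
            nodeLocalPermits.release();
        }
    }
}
```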