Enable speculative decoding #2777
Conversation
- [meta-llama/CodeLlama-7b-hf](https://huggingface.co/meta-llama/CodeLlama-7b-hf) as a main model
- [AMD-Llama-135m](https://huggingface.co/amd/AMD-Llama-135m) as a draft model

both in FP16 precision.
why FP16? Can it be loaded on dGPU?
No particular reason. There were no tests on GPU, but speculative decoding reuses most of the regular CB pipeline logic, so there should be no issue. Specifying target_device will propagate to the draft model as well.
Models used in this demo - `meta-llama/CodeLlama-7b-hf` and `AMD-Llama-135m` - are not chat models, so we will use the `completions` endpoint to interact with the pipeline.

Below you can see an exemplary unary request (you can switch the `stream` parameter to enable a streamed response). Compared to calls to a regular continuous batching model, this request has an additional parameter, `num_assistant_tokens`, which specifies how many tokens the draft model should generate before the main model validates them.
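The demo's exact request is not reproduced in this excerpt. As a rough illustration only (the host, port, endpoint path, servable name and prompt below are assumptions, not taken from the demo), a unary call carrying `num_assistant_tokens` could look like this:

```python
# Hedged sketch of a unary completions request with a speculative decoding
# parameter. Host, port, endpoint path and model name are assumptions;
# adjust them to match the actual deployment.
import requests

payload = {
    "model": "meta-llama/CodeLlama-7b-hf",  # assumed servable name
    "prompt": "def quicksort(numbers):",
    "max_tokens": 100,
    "stream": False,                # switch to True for a streamed response
    "num_assistant_tokens": 5,      # draft model proposes 5 tokens per validation cycle
}

response = requests.post(
    "http://localhost:8000/v3/completions",  # assumed OpenAI-compatible endpoint
    json=payload,
    timeout=60,
)
print(response.json())
```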
What are the default values if both params are omitted?
There are none. One of those must be provided in the request.
I didn't specify any default because it's hard to recommend a single value that would work well for different combinations of main and draft models.
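For completeness, a minimal sketch of the alternative: the same hypothetical payload driven by `assistant_confidence_threshold` instead of `num_assistant_tokens` (the value is an arbitrary illustration, not a recommendation; the two parameters are mutually exclusive):

```python
# Same hypothetical request body as above, but using the confidence-threshold
# variant; only one of the two speculative decoding parameters may be set.
payload = {
    "model": "meta-llama/CodeLlama-7b-hf",  # assumed servable name
    "prompt": "def quicksort(numbers):",
    "max_tokens": 100,
    "assistant_confidence_threshold": 0.4,  # arbitrary example value
}
```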
> The draft model predicts the next K tokens one by one in an autoregressive manner. The main model validates these predictions and corrects them if necessary - in case of a discrepancy, the main model prediction is used. Then, the draft model acquires this token and runs prediction of the next K tokens, thus repeating the cycle.

This demo shows how to use speculative decoding in a model serving scenario by deploying main and draft models in a speculative decoding pipeline, in a manner similar to regular continuous batching deployments.
Add a note that the goal of this algorithm is to reduce latency while keeping the main model's accuracy. It gives the biggest gain with low-concurrency requests.
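For readers less familiar with the algorithm, here is a minimal, purely illustrative sketch of the draft/main validation loop quoted from the README above. The `draft_model` and `main_model` objects and their `next_token()` method are hypothetical stand-ins, not the OpenVINO GenAI API:

```python
def speculative_decode(main_model, draft_model, tokens, k, max_new_tokens):
    """Illustrative pseudocode of the speculative decoding cycle."""
    generated = 0
    while generated < max_new_tokens:
        # 1. The draft model predicts the next K tokens autoregressively.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            token = draft_model.next_token(ctx)  # hypothetical greedy helper
            draft.append(token)
            ctx.append(token)

        # 2. The main model validates the draft; at the first discrepancy its
        #    own prediction is used and the remaining draft tokens are dropped.
        for token in draft:
            prediction = main_model.next_token(tokens)
            tokens.append(prediction)
            generated += 1
            if prediction != token or generated >= max_new_tokens:
                break
        # 3. The draft model then continues from the accepted sequence,
        #    repeating the cycle.
    return tokens
```

Note that in a real pipeline the main model scores all draft tokens in a single forward pass rather than one by one; the per-token loop above is only for readability.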
// when draft_models_path is set, the pipeline will use speculative decoding
// other values are by default inherited from the main model when speculative decoding is enabled, but can be overridden
optional string draft_models_path = 11;
Minor: consider creating an internal sub-message for all draft model settings. It would be cleaner to define the draft model at an inner level:
node_options: {
  [type.googleapis.com/mediapipe.LLMCalculatorOptions]: {
    models_path: "/ovms/src/test/llm_testing/facebook/opt-125m",
    plugin_config: "{\"INFERENCE_PRECISION_HINT\":\"f32\"}"
    cache_size: 1
    draft_model: {
      model_path: "/ovms/src/test/llm_testing/facebook/opt-125m"
    }
  }
}
This way you can clearly see that the model path is not optional if draft_model is included at all. It would also naturally rule out situations where the user specifies draft.device but not draft.model_name, which is an invalid configuration, and it would not require manual handling in llmnoderesources.cpp.
@@ -36,6 +36,7 @@

#include "../logging.hpp"
#include "../stringutils.hpp"
#include "src/llm/llm_calculator.pb.h"
We are already in src/llm, why do we need to specify it?

Suggested change: `#include "src/llm/llm_calculator.pb.h"` → `#include "llm_calculator.pb.h"`
It seems we have to. Doesn't compile with your suggestion.
@@ -153,6 +152,21 @@ Status LLMNodeResources::initializeLLMNodeResources(LLMNodeResources& nodeResour

nodeResources.device = nodeOptions.device();

if (!nodeOptions.draft_models_path().empty()) { |
The user is not made aware of any issue for a pbtxt like this:
...
node_options: {
  [type.googleapis.com/mediapipe.LLMCalculatorOptions]: {
    models_path: "/ovms/src/test/llm_testing/facebook/opt-125m",
    plugin_config: "{\"INFERENCE_PRECISION_HINT\":\"f32\"}"
    cache_size: 1
    draft_device: "CPU"
  }
}
...
src/llm/apis/openai_completions.hpp (outdated)
std::optional<float> repetitionPenalty{std::nullopt};
std::optional<float> lengthPenalty{std::nullopt};
std::optional<int> numReturnSequences{std::nullopt};
bool logprobs = 0;
bool = 0?
int = false?
Also, `{false}` and `= false;` initialization styles are mixed here.
// Speculative decoding specific
if (numAssistantTokens.has_value())
    config.num_assistant_tokens = numAssistantTokens.value();
if (assistantConfidenceThreshold.has_value())
I don't see a place where setting both parameters (which are exclusive) fails the request. Do we have a test for that?
We have only positive tests now
added negative tests
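For illustration only (not the actual test code; host, port and servable name are placeholders), a client-side version of such a negative case sends both mutually exclusive parameters and expects the request to be rejected:

```python
import requests

# Both exclusive speculative decoding parameters are set on purpose;
# the server is expected to reject this request.
bad_payload = {
    "model": "meta-llama/CodeLlama-7b-hf",   # placeholder servable name
    "prompt": "def quicksort(numbers):",
    "max_tokens": 16,
    "num_assistant_tokens": 5,
    "assistant_confidence_threshold": 0.4,
}

response = requests.post("http://localhost:8000/v3/completions", json=bad_payload, timeout=60)
assert response.status_code != 200, "request with both exclusive parameters should fail"
```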
src/llm/llmnoderesources.cpp (outdated)
|| nodeOptions.has_draft_dynamic_split_fuse() || nodeOptions.has_draft_max_num_seqs()
|| nodeOptions.has_draft_block_size() || nodeOptions.has_draft_device()) {
    // Consider moving draft parameters to separate structure in node options, so it's validated on the proto level
    SPDLOG_ERROR("Draft model path is not provided, but draft scheduler options are set. Ignoring draft scheduler options.");
Wrong message: it is logged as an error, yet it says the draft scheduler options are merely being ignored.