Enable speculative decoding #2777
Conversation
- [meta-llama/CodeLlama-7b-hf](https://huggingface.co/meta-llama/CodeLlama-7b-hf) as a main model
- [AMD-Llama-135m](https://huggingface.co/amd/AMD-Llama-135m) as a draft model

both in FP16 precision.
why FP16? Can it be loaded on dGPU?
No particular reason. There were no tests on GPU, but speculative decoding reuses most of the regular CB pipeline logic, so there should be no issue. Specifying target_device will propagate to the draft model as well.
Models used in this demo - `meta-llama/CodeLlama-7b-hf` and `AMD-Llama-135m` - are not chat models, so we will use the `completions` endpoint to interact with the pipeline.

Below you can see an exemplary unary request (you can switch the `stream` parameter to enable a streamed response). Compared to calls to a regular continuous batching model, this request has an additional parameter, `num_assistant_tokens`, which specifies how many tokens the draft model should generate before the main model validates them.
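The demo's exact request is not reproduced in this excerpt. As a rough illustration only (the host, port, endpoint path, servable name and prompt below are assumptions, not taken from the demo), a unary call carrying `num_assistant_tokens` could look like this:

```python
# Hedged sketch of a unary completions request with a speculative decoding
# parameter. Host, port, endpoint path and model name are assumptions;
# adjust them to match the actual deployment.
import requests

payload = {
    "model": "meta-llama/CodeLlama-7b-hf",  # assumed servable name
    "prompt": "def quicksort(numbers):",
    "max_tokens": 100,
    "stream": False,                # switch to True for a streamed response
    "num_assistant_tokens": 5,      # draft model proposes 5 tokens per validation cycle
}

response = requests.post(
    "http://localhost:8000/v3/completions",  # assumed OpenAI-compatible endpoint
    json=payload,
    timeout=60,
)
print(response.json())
```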
What are the default values if both params are omitted?
There are none. One of those must be provided in the request.
I didn't specify any default because it's hard to recommend a single value that would work well for different combinations of main and draft models.
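For completeness, a minimal sketch of the alternative: the same hypothetical payload driven by `assistant_confidence_threshold` instead of `num_assistant_tokens` (the value is an arbitrary illustration, not a recommendation; the two parameters are mutually exclusive):

```python
# Same hypothetical request body as above, but using the confidence-threshold
# variant; only one of the two speculative decoding parameters may be set.
payload = {
    "model": "meta-llama/CodeLlama-7b-hf",  # assumed servable name
    "prompt": "def quicksort(numbers):",
    "max_tokens": 100,
    "assistant_confidence_threshold": 0.4,  # arbitrary example value
}
```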
> The draft model predicts the next K tokens one by one in an autoregressive manner. The main model validates these predictions and corrects them if necessary - in case of a discrepancy, the main model prediction is used. Then, the draft model acquires this token and runs prediction of the next K tokens, thus repeating the cycle.

This demo shows how to use speculative decoding in a model serving scenario by deploying main and draft models in a speculative decoding pipeline, in a manner similar to regular continuous batching deployments.
Add a note that the goal of this algorithm is to reduce latency while keeping the main model's accuracy. It gives the biggest gain with low-concurrency requests.
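For readers less familiar with the algorithm, here is a minimal, purely illustrative sketch of the draft/main validation loop quoted from the README above. The `draft_model` and `main_model` objects and their `next_token()` method are hypothetical stand-ins, not the OpenVINO GenAI API:

```python
def speculative_decode(main_model, draft_model, tokens, k, max_new_tokens):
    """Illustrative pseudocode of the speculative decoding cycle."""
    generated = 0
    while generated < max_new_tokens:
        # 1. The draft model predicts the next K tokens autoregressively.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            token = draft_model.next_token(ctx)  # hypothetical greedy helper
            draft.append(token)
            ctx.append(token)

        # 2. The main model validates the draft; at the first discrepancy its
        #    own prediction is used and the remaining draft tokens are dropped.
        for token in draft:
            prediction = main_model.next_token(tokens)
            tokens.append(prediction)
            generated += 1
            if prediction != token or generated >= max_new_tokens:
                break
        # 3. The draft model then continues from the accepted sequence,
        #    repeating the cycle.
    return tokens
```

Note that in a real pipeline the main model scores all draft tokens in a single forward pass rather than one by one; the per-token loop above is only for readability.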
// when draft_models_path is set, the pipeline will use speculative decoding
// other values are by default inherited from the main model when speculative decoding is enabled, but can be overridden
optional string draft_models_path = 11;
Minor: consider creating an internal sub-message for all draft model settings. It would be cleaner to define the draft model at an inner level:
node_options: {
  [type.googleapis.com/mediapipe.LLMCalculatorOptions]: {
    models_path: "/ovms/src/test/llm_testing/facebook/opt-125m",
    plugin_config: "{\"INFERENCE_PRECISION_HINT\":\"f32\"}"
    cache_size: 1
    draft_model: {
      model_path: "/ovms/src/test/llm_testing/facebook/opt-125m"
    }
  }
}
This way you can clearly see that the model path is not optional if draft_model is included at all. It would also naturally rule out situations where the user specifies draft.device but not draft.model_name, which is an invalid configuration, and it would not require manual handling in llmnoderesources.cpp.
@@ -36,6 +36,7 @@

#include "../logging.hpp"
#include "../stringutils.hpp"
#include "src/llm/llm_calculator.pb.h"
We are already in src/llm, why do we need to specify it?

Suggested change: `#include "src/llm/llm_calculator.pb.h"` → `#include "llm_calculator.pb.h"`
It seems we have to. Doesn't compile with your suggestion.
@@ -153,6 +152,21 @@ Status LLMNodeResources::initializeLLMNodeResources(LLMNodeResources& nodeResour

nodeResources.device = nodeOptions.device();

if (!nodeOptions.draft_models_path().empty()) { |
The user is not made aware of any issue for a pbtxt like this:
...
node_options: {
  [type.googleapis.com/mediapipe.LLMCalculatorOptions]: {
    models_path: "/ovms/src/test/llm_testing/facebook/opt-125m",
    plugin_config: "{\"INFERENCE_PRECISION_HINT\":\"f32\"}"
    cache_size: 1
    draft_device: "CPU"
  }
}
...
src/llm/apis/openai_completions.hpp (outdated)
std::optional<float> repetitionPenalty{std::nullopt};
std::optional<float> lengthPenalty{std::nullopt};
std::optional<int> numReturnSequences{std::nullopt};
bool logprobs = 0;
bool = 0?
int = false?
Also, `{false}` and `= false;` initialization styles are mixed here.
// Speculative decoding specific
if (numAssistantTokens.has_value())
    config.num_assistant_tokens = numAssistantTokens.value();
if (assistantConfidenceThreshold.has_value())
I don't see a place where setting both parameters (which are exclusive) fails the request. Do we have a test for that?
We have only positive tests now
added negative tests
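For illustration only (not the actual test code; host, port and servable name are placeholders), a client-side version of such a negative case sends both mutually exclusive parameters and expects the request to be rejected:

```python
import requests

# Both exclusive speculative decoding parameters are set on purpose;
# the server is expected to reject this request.
bad_payload = {
    "model": "meta-llama/CodeLlama-7b-hf",   # placeholder servable name
    "prompt": "def quicksort(numbers):",
    "max_tokens": 16,
    "num_assistant_tokens": 5,
    "assistant_confidence_threshold": 0.4,
}

response = requests.post("http://localhost:8000/v3/completions", json=bad_payload, timeout=60)
assert response.status_code != 200, "request with both exclusive parameters should fail"
```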
src/llm/llmnoderesources.cpp (outdated)
|| nodeOptions.has_draft_dynamic_split_fuse() || nodeOptions.has_draft_max_num_seqs()
|| nodeOptions.has_draft_block_size() || nodeOptions.has_draft_device()) {
    // Consider moving draft parameters to separate structure in node options, so it's validated on the proto level
    SPDLOG_ERROR("Draft model path is not provided, but draft scheduler options are set. Ignoring draft scheduler options.");
Wrong message: it is logged as an error, yet it says the draft scheduler options are merely being ignored.