Enable speculative decoding #2777

Merged: 18 commits merged into main from the speculative_decoding branch on Jan 29, 2025

Conversation

@mzegla (Collaborator) commented on Oct 30, 2024

🛠 Summary

JIRA/Issue if applicable.
Describe the changes.

🧪 Checklist

  • Unit tests added.
  • Documentation updated.
  • Change follows security best practices.

@mzegla force-pushed the speculative_decoding branch 2 times, most recently from 03b8654 to 46acb7d on December 12, 2024 09:15
@mzegla force-pushed the speculative_decoding branch from f07f188 to 6912a7d on January 17, 2025 09:34
@mzegla marked this pull request as ready for review on January 22, 2025 13:43
@mzegla added this to the 2025.0_RC milestone on Jan 27, 2025
- [meta-llama/CodeLlama-7b-hf](https://huggingface.co/meta-llama/CodeLlama-7b-hf) as a main model
- [AMD-Llama-135m](https://huggingface.co/amd/AMD-Llama-135m) as a draft model

both in FP16 precision.

Collaborator:
why FP16? Can it be loaded on dGPU?

Collaborator Author (@mzegla):
No particular reason. There were no tests on GPU, but speculative decoding reuses most of the regular CB pipeline logic, so there should be no issue. Specifying target_device will also propagate to the draft model.
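
For illustration, a GPU deployment with a draft model could look roughly like this in `graph.pbtxt` (a sketch only: the paths and the `cache_size` value below are placeholders, not taken from the demo):

    node_options: {
        [type.googleapis.com/mediapipe.LLMCalculatorOptions]: {
          models_path: "/models/CodeLlama-7b-hf"        # placeholder path to the main model
          cache_size: 8                                 # placeholder value
          device: "GPU"                                 # propagates to the draft model unless draft_device is set
          draft_models_path: "/models/AMD-Llama-135m"   # setting this path is what enables speculative decoding
        }
    }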


The models used in this demo, `meta-llama/CodeLlama-7b-hf` and `AMD-Llama-135m`, are not chat models, so we will use the `completions` endpoint to interact with the pipeline.

Below you can see an example unary request (you can set the `stream` parameter to enable a streamed response). Compared to calls to a regular continuous batching model, this request has an additional parameter, `num_assistant_tokens`, which specifies how many tokens the draft model should generate before the main model validates them.
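
For reference, such a request might look like the following (the port, the `/v3/completions` path and the served model name are assumptions for illustration, not copied from the demo):

    curl http://localhost:8000/v3/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/CodeLlama-7b-hf",
        "prompt": "def quicksort(numbers):",
        "max_tokens": 100,
        "stream": false,
        "num_assistant_tokens": 5
      }'

Setting `"stream": true` in the same payload enables the streamed response mentioned above.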

Collaborator:
What are the default values if both params are omitted?

Collaborator Author (@mzegla):
There are none. One of those must be provided in the request.
I didn't specify any default because it's hard to recommend a single value that would work well for different combinations of main and draft models.

> The draft model predicts the next K tokens one by one in an autoregressive manner. The main model validates these predictions and corrects them if necessary - in case of a discrepancy, the main model prediction is used. Then, the draft model acquires this token and runs prediction of the next K tokens, thus repeating the cycle.

This demo shows how to use speculative decoding in a model serving scenario by deploying the main and draft models in a speculative decoding pipeline, in a manner similar to regular deployments with continuous batching.

Collaborator:
Add a note that the goal of this algorithm is to reduce latency while keeping the main model's accuracy. It gives the biggest gain with low-concurrency requests.
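
As a side illustration of the quoted draft/validate cycle, a toy sketch (not the OpenVINO GenAI implementation; a real pipeline scores all K draft tokens in a single main-model pass, which is where the latency reduction comes from) could look like this:

    // Toy sketch of the speculative decoding loop: a cheap draft predictor proposes
    // K tokens autoregressively, the main predictor validates them, keeps its own
    // token at the first mismatch and discards the remaining draft tokens.
    #include <cstddef>
    #include <functional>
    #include <vector>

    using Token = int;
    using NextTokenFn = std::function<Token(const std::vector<Token>&)>;  // next-token predictor

    std::vector<Token> speculativeDecode(const NextTokenFn& mainModel, const NextTokenFn& draftModel,
                                         std::vector<Token> tokens, std::size_t K, std::size_t maxNewTokens) {
        // assumes K >= 1
        for (std::size_t generated = 0; generated < maxNewTokens;) {
            // Draft model proposes K tokens one by one.
            std::vector<Token> draft;
            auto draftContext = tokens;
            for (std::size_t i = 0; i < K; ++i) {
                draft.push_back(draftModel(draftContext));
                draftContext.push_back(draft.back());
            }
            // Main model validates the proposals; on a mismatch its own prediction
            // is kept and drafting restarts from the corrected context.
            for (Token proposed : draft) {
                Token validated = mainModel(tokens);
                tokens.push_back(validated);
                ++generated;
                if (validated != proposed || generated >= maxNewTokens)
                    break;
            }
        }
        return tokens;
    }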

@mzegla force-pushed the speculative_decoding branch from cc4e9b8 to f2e1801 on January 28, 2025 15:27
// when draft_models_path is set, the pipeline will use speculative decoding
// other values are by default inherited from the main model when speculative decoding is enabled, but can be overridden
optional string draft_models_path = 11;

Collaborator:
minor: consider creating an internal sub-class for all draft model settings. It would be cleaner to define the draft model at an inner level:

    node_options: {
        [type.googleapis.com/mediapipe.LLMCalculatorOptions]: {
          models_path: "/ovms/src/test/llm_testing/facebook/opt-125m",
          plugin_config: "{\"INFERENCE_PRECISION_HINT\":\"f32\"}"
          cache_size: 1
          draft_model: {
              model_path: "/ovms/src/test/llm_testing/facebook/opt-125m"
          }
        }
    }

This way it is clear that the model path is not optional if we decide to include draft_model at all. It would naturally prevent situations where the user specifies draft.device but does not specify draft.model_name, which is an invalid configuration, and it would not require manual handling in llmnoderesources.cpp.
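
A possible shape for such a sub-message in `llm_calculator.proto` (just a sketch of this suggestion, not the merged design; field names, numbers and types are assumptions, and `required` assumes proto2-style syntax):

    message DraftModelOptions {
        required string models_path = 1;   // visibly non-optional once draft_model is present
        optional string device = 2;        // falls back to the main model's device when unset
        optional uint64 max_num_seqs = 3;
        optional uint64 block_size = 4;
        optional bool dynamic_split_fuse = 5;
    }

    // in LLMCalculatorOptions:
    optional DraftModelOptions draft_model = 11;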

@@ -36,6 +36,7 @@

#include "../logging.hpp"
#include "../stringutils.hpp"
#include "src/llm/llm_calculator.pb.h"

Collaborator:
We are already in src/llm, so why do we need to specify it?

Suggested change
#include "src/llm/llm_calculator.pb.h"
#include "llm_calculator.pb.h"

Collaborator Author (@mzegla):
It seems we have to. Doesn't compile with your suggestion.

@@ -153,6 +152,21 @@ Status LLMNodeResources::initializeLLMNodeResources(LLMNodeResources& nodeResour

nodeResources.device = nodeOptions.device();

if (!nodeOptions.draft_models_path().empty()) {

Collaborator (@dkalinowski) commented Jan 29, 2025:
The user is not made aware of any issue for a pbtxt like this:

    ...
    node_options: {
        [type.googleapis.com/mediapipe.LLMCalculatorOptions]: {
          models_path: "/ovms/src/test/llm_testing/facebook/opt-125m",
          plugin_config: "{\"INFERENCE_PRECISION_HINT\":\"f32\"}"
          cache_size: 1
          draft_device: "CPU"
        }
    }
    ...

std::optional<float> repetitionPenalty{std::nullopt};
std::optional<float> lengthPenalty{std::nullopt};
std::optional<int> numReturnSequences{std::nullopt};
bool logprobs = 0;

Collaborator:
bool = 0?
int = false?

Collaborator:
{false} and = false; mix and match

// Speculative decoding specific
if (numAssistantTokens.has_value())
config.num_assistant_tokens = numAssistantTokens.value();
if (assistantConfidenceThreshold.has_value())

Collaborator:
I don't see a place where setting both parameters (which are exclusive) fails the request. Do we have a test for that?

Collaborator Author (@mzegla):
We have only positive tests now

Collaborator Author (@mzegla):
added negative tests
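
For reference, the exclusivity check itself could look roughly like this (a sketch, not necessarily the code added in this PR):

    #include <optional>
    #include <string>

    // Hypothetical helper: num_assistant_tokens and assistant_confidence_threshold
    // are mutually exclusive, so a request providing both should be rejected.
    // Returns an error message, or std::nullopt when the combination is valid.
    std::optional<std::string> validateSpeculativeDecodingParams(
            const std::optional<int>& numAssistantTokens,
            const std::optional<float>& assistantConfidenceThreshold) {
        if (numAssistantTokens.has_value() && assistantConfidenceThreshold.has_value())
            return "num_assistant_tokens and assistant_confidence_threshold are mutually exclusive";
        return std::nullopt;
    }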

|| nodeOptions.has_draft_dynamic_split_fuse() || nodeOptions.has_draft_max_num_seqs()
|| nodeOptions.has_draft_block_size() || nodeOptions.has_draft_device()) {
// Consider moving draft parameters to separate structure in node options, so it's validated on the proto level
SPDLOG_ERROR("Draft model path is not provided, but draft scheduler options are set. Ignoring draft scheduler options.");

Collaborator:
wrong msg
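
Presumably the wording should match the actual behavior at this point, e.g. something along these lines (a hypothetical rewording, not the merged text):

    SPDLOG_ERROR("Draft scheduler options are set, but draft_models_path is not provided.");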

@mzegla merged commit 6799329 into main on Jan 29, 2025
14 checks passed
bstrzele pushed a commit that referenced this pull request Jan 31, 2025