diff --git a/serving/docs/lmi/README.md b/serving/docs/lmi/README.md
index 24bd5bb60..7199a242e 100644
--- a/serving/docs/lmi/README.md
+++ b/serving/docs/lmi/README.md
@@ -30,7 +30,7 @@ LMI containers provide many features, including:
 LMI containers provide these features through integrations with popular inference libraries.
 A unified configuration format enables users to easily leverage the latest optimizations and technologies across libraries.
 We will refer to each of these libraries as `backends` throughout the documentation.
-The term backend refers to a combination of Engine (LMI uses the Python Engine) and inference library.
+The term backend refers to a combination of Engine (LMI uses the Python Engine) and inference library (like vLLM).
 You can learn more about the components of LMI [here](deployment_guide/README.md#components-of-lmi).

 ## QuickStart
@@ -74,11 +74,10 @@ This information is also available on the SageMaker DLC [GitHub repository](http

 | Backend                | SageMakerDLC    | Example URI                                                                                |
 |------------------------|-----------------|-------------------------------------------------------------------------------------------|
-| `vLLM`                 | djl-lmi         | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124          |
-| `lmi-dist`             | djl-lmi         | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124          |
-| `hf-accelerate`        | djl-lmi         | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124          |
+| `vLLM`                 | djl-lmi         | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124          |
+| `lmi-dist`             | djl-lmi         | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124          |
 | `tensorrt-llm`         | djl-tensorrtllm | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-tensorrtllm0.12.0-cu125  |
-| `transformers-neuronx` | djl-neuronx     | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-neuronx-sdk2.19.1        |
+| `transformers-neuronx` | djl-neuronx     | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-neuronx-sdk2.20.1        |

 ## Advanced Features
@@ -92,6 +91,3 @@ LMI contains also contain several advanced features that can be used for more co
 The LMI team maintains sample SageMaker notebooks in the [djl-demo repository](https://github.com/deepjavalibrary/djl-demo/tree/master/aws/sagemaker/large-model-inference/sample-llm).
 This repository contains the most up-to-date notebooks for LMI.
 Notebooks are updated with every release, and new notebooks are added to demonstrate new features and capabilities.
-
-Additionally, the [SageMaker GenAI Hosting Examples](https://github.com/aws-samples/sagemaker-genai-hosting-examples) repository contains additional examples.
-However, the notebooks here are not updated as frequently and may be stale.
diff --git a/serving/docs/lmi/user_guides/chat_input_output_schema.md b/serving/docs/lmi/user_guides/chat_input_output_schema.md
index d119b1a79..133bef525 100644
--- a/serving/docs/lmi/user_guides/chat_input_output_schema.md
+++ b/serving/docs/lmi/user_guides/chat_input_output_schema.md
@@ -9,13 +9,15 @@ If the request contains the "messages" field, LMI will treat the request as a ch
 back with the chat completions response style.

 When using the Chat Completions Schema, you should make sure that the model you are serving has a chat template.
-The chat template ensures that the messages object is tokenized appropriately for your model.
+The chat template ensures that the payload is tokenized appropriately for your model.
 See [the HuggingFace documentation on chat templates](https://huggingface.co/docs/transformers/main/en/chat_templating) for more information.

 This processing happens per request, meaning that you can support our [standard schema](lmi_input_output_schema.md), as well as chat completions schema in the same endpoint.

 Note: This is an experimental feature. The complete spec has not been implemented.
+In particular, function calling is not currently available.
+We are targeting function calling support in our next release, 0.32.0, planned for January 2025.

 ## Request Schema
@@ -116,9 +118,67 @@ The response is returned token by token as application/jsonlines content-type:

 Example response:
 ```
-{"id": "chatcmpl-0", "object": "chat.completion.chunk", "created": 1712792433, "choices": [{"index": 0, "delta": {"content": " Oh", "role": "assistant"}, "logprobs": [{"content": [{"token": " Oh", "logprob": -4.499478340148926, "bytes": [32, 79, 104], "top_logprobs": [{"token": -4.499478340148926, "logprob": -4.499478340148926, "bytes": [32, 79, 104]}]}]}], "finish_reason": null}]}
+{
+  "id": "chatcmpl-0",
+  "object": "chat.completion.chunk",
+  "created": 1712792433,
+  "choices": [
+    {
+      "index": 0,
+      "delta": {"content": " Oh", "role": "assistant"},
+      "logprobs": [
+        {
+          "content": [
+            {
+              "token": " Oh",
+              "logprob": -4.499478340148926,
+              "bytes": [32, 79, 104],
+              "top_logprobs": [
+                {
+                  "token": -4.499478340148926,
+                  "logprob": -4.499478340148926,
+                  "bytes": [32, 79, 104]
+                }
+              ]
+            }
+          ]
+        }
+      ],
+      "finish_reason": null
+    }
+  ]
+}
 ...
-{"id": "chatcmpl-0", "object": "chat.completion.chunk", "created": 1712792436, "choices": [{"index": 0, "delta": {"content": " assist"}, "logprobs": [{"content": [{"token": " assist", "logprob": -1.019672155380249, "bytes": [32, 97, 115, 115, 105, 115, 116], "top_logprobs": [{"token": -1.019672155380249, "logprob": -1.019672155380249, "bytes": [32, 97, 115, 115, 105, 115, 116]}]}]}], "finish_reason": "length"}]}
+{
+  "id": "chatcmpl-0",
+  "object": "chat.completion.chunk",
+  "created": 1712792436,
+  "choices": [
+    {
+      "index": 0,
+      "delta": {"content": " assist"},
+      "logprobs": [
+        {
+          "content": [
+            {
+              "token": " assist",
+              "logprob": -1.019672155380249,
+              "bytes": [32, 97, 115, 115, 105, 115, 116],
+              "top_logprobs": [
+                {
+                  "token": -1.019672155380249,
+                  "logprob": -1.019672155380249,
+                  "bytes": [32, 97, 115, 115, 105, 115, 116]
+                }
+              ]
+            }
+          ]
+        }
+      ],
+      "finish_reason": "length"
+    }
+  ]
+}
 ```

 ## API Object Schemas
@@ -146,7 +206,6 @@ Example:

 #### Vision/Image Support

-Starting in v0.29.0, we have added experimental support for vision language models.
 You can specify an image as part of the content when using a vision language model.
 Image data can either be specified as a url, or via a base64 encoding of the image data.

@@ -171,18 +230,19 @@ Example:
 ```

 We recommend that you use the base64 encoding to ensure no network failures occur when retrieving the image within the endpoint.
+Network calls to fetch images can increase latency and introduce another failure point.

 ### Choice

 The choice object represents a chat completion choice.
 It contains the following fields:

-| Field Name      | Type                  | Description                                        | Example                                    |
-|-----------------|-----------------------|---------------------------------------------------|-------------------------------------------|
-| `index`         | int                   | The index of the choice                            | 0                                          |
-| `message`       | [Message](#message)   | A chat completion message generated by the model. | See the [Message](#message) documentation  |
-| `logprobs`      | [Logprobs](#logprobs) | The log probability of the token                   | See the [Logprobs](#logprob) documentation |
-| `finish_reason` | string enum           | The reason the model stopped generating tokens     | "length", "eos_token", "stop_sequence"     |
+| Field Name      | Type                  | Description                                        | Example                                    |
+|-----------------|-----------------------|----------------------------------------------------|--------------------------------------------|
+| `index`         | int                   | The index of the choice                            | 0                                          |
+| `message`       | [Message](#message)   | A chat completion message generated by the model.  | See the [Message](#message) documentation  |
+| `logprobs`      | [Logprobs](#logprobs) | The log probability of the token                   | See the [Logprobs](#logprob) documentation |
+| `finish_reason` | string enum           | The reason the model stopped generating tokens     | "length", "eos_token", "stop_sequence"     |

 Example:
@@ -213,7 +273,28 @@ It contains the following fields:

 Example:
 ```
-{"index": 0, "delta": {"content": " Oh", "role": "assistant"}, "logprobs": [{"content": [{"token": " Oh", "logprob": -4.499478340148926, "bytes": [32, 79, 104], "top_logprobs": [{"token": -4.499478340148926, "logprob": -4.499478340148926, "bytes": [32, 79, 104]}]}]}
+{
+  "index": 0,
+  "delta": {"content": " Oh", "role": "assistant"},
+  "logprobs": [
+    {
+      "content": [
+        {
+          "token": " Oh",
+          "logprob": -4.499478340148926,
+          "bytes": [32, 79, 104],
+          "top_logprobs": [
+            {
+              "token": -4.499478340148926,
+              "logprob": -4.499478340148926,
+              "bytes": [32, 79, 104]
+            }
+          ]
+        }
+      ]
+    }
+  ]
+}
 ```

 ### Logprobs
diff --git a/serving/docs/lmi/user_guides/embedding-user-guide.md b/serving/docs/lmi/user_guides/embedding-user-guide.md
index 5fbde0ed8..64c44ad92 100644
--- a/serving/docs/lmi/user_guides/embedding-user-guide.md
+++ b/serving/docs/lmi/user_guides/embedding-user-guide.md
@@ -18,7 +18,7 @@ LMI supports Text Embedding Inference with the following engines:
 - Rust
 - Python

-Currently, the OnnxRuntime engine provides the best performance for text embedding in LMI.
+Currently, the Rust engine provides the best performance for text embedding in LMI.

 ## Quick Start Configurations

@@ -36,11 +36,11 @@ SERVING_BATCH_SIZE=32

 ### environment variables

-You can specify the `HF_MODEL_ID` environment variable to load a model from HuggingFace hub. DJLServing
-will download the model from HuggingFace hub and optimize the model with OnnxRuntime at runtime.
+You can specify the `HF_MODEL_ID` environment variable to load a model from HuggingFace hub, DJL Model Zoo, AWS S3, or a local path.
+DJLServing will download the model from HuggingFace hub and optimize the model with the selected engine at runtime.

 ```
-OPTION_ENGINE=OnnxRuntime
+OPTION_ENGINE=Rust
 HF_MODEL_ID=BAAI/bge-base-en-v1.5
 # Optional
 SERVING_BATCH_SIZE=32
 ```

@@ -52,7 +52,7 @@ to deploy a model with environment variable configuration on SageMaker.
 ### serving.properties

 ```
-engine=OnnxRuntime
+engine=Rust
 option.model_id=BAAI/bge-base-en-v1.5
 translatorFactory=ai.djl.huggingface.translator.TextEmbeddingTranslatorFactory
 # Optional
diff --git a/serving/docs/lmi/user_guides/lmi-dist_user_guide.md b/serving/docs/lmi/user_guides/lmi-dist_user_guide.md
index 2b6e6a692..4c691bd32 100644
--- a/serving/docs/lmi/user_guides/lmi-dist_user_guide.md
+++ b/serving/docs/lmi/user_guides/lmi-dist_user_guide.md
@@ -8,16 +8,16 @@ LMI-Dist expects the model to be in the [standard HuggingFace format](../deploym

 **Text Generation Models**

-LMI-Dist supports the same set of text-generation models as [vllm 0.6.2](https://docs.vllm.ai/en/v0.6.2/models/supported_models.html#decoder-only-language-models).
+LMI-Dist supports the same set of text-generation models as [vllm 0.6.3.post1](https://docs.vllm.ai/en/v0.6.3.post1/models/supported_models.html#decoder-only-language-models).

 In addition to the vllm models, LMI-Dist also supports the t5 model family (e.g. google/flan-t5-xl).

 **Multi Modal Models**

-LMI-Dist supports the same set of multi-modal models as [vllm 0.6.2](https://docs.vllm.ai/en/v0.6.2/models/supported_models.html#decoder-only-language-models).
+LMI-Dist supports the same set of multi-modal models as [vllm 0.6.3.post1](https://docs.vllm.ai/en/v0.6.3.post1/models/supported_models.html#decoder-only-language-models).

 However, the one known exception is MLlama (Llama3.2 multimodal models).
-MLlama support is expected in the v13 (0.31.0) release.
+MLlama support is expected in the next release (0.32.0).

 ### Model Coverage in CI

@@ -92,7 +92,7 @@ Please check that your base model [supports LoRA adapters in vLLM](https://docs.

 ## Quantization Support

-LMI-Dist supports the same quantization techniques as [vllm 0.6.2](https://docs.vllm.ai/en/v0.6.2/quantization/supported_hardware.html).
+LMI-Dist supports the same quantization techniques as [vllm 0.6.3.post1](https://docs.vllm.ai/en/v0.6.3.post1/quantization/supported_hardware.html).

 We highly recommend that regardless of which quantization technique you are using that you pre-quantize the model.
 Runtime quantization adds additional overhead to the endpoint startup time, and depending on the quantization technique, this can be significant overhead.
diff --git a/serving/docs/lmi/user_guides/lmi_input_output_schema.md b/serving/docs/lmi/user_guides/lmi_input_output_schema.md
index 16e4eabfe..11747fd38 100644
--- a/serving/docs/lmi/user_guides/lmi_input_output_schema.md
+++ b/serving/docs/lmi/user_guides/lmi_input_output_schema.md
@@ -1,7 +1,7 @@
 # LMI handlers Inference API Schema

 This document provides the default API schema for the inference endpoints (`/invocations`, `/predictions/`) when using the built-in inference handlers in LMI containers.
-This schema is applicable to our latest release, v0.28.0.
+This schema is applicable to our latest release, v0.31.0.
 Documentation for previous releases is available on our GitHub on the relevant version branch (e.g. 0.27.0-dlc).
 LMI provides two distinct schemas depending on what type of batching you use:
@@ -41,7 +41,7 @@ curl -X POST https://my.sample.endpoint.com/invocations \

 ### Response Schema

-When not using streaming (this is the default), the response is returned as application/json content-type:
+When not using streaming (the default), the response is returned as application/json content-type:

 | Field Name       | Field Type          | Always Returned                            | Possible Values                                                                       |
 |------------------|---------------------|--------------------------------------------|---------------------------------------------------------------------------------------|
@@ -84,7 +84,10 @@ Example response:
 }
 ```

-When using streaming, if you want Server Side Events, then you could use `option.output_formatter=sse`. If you `stream=True`, the default `output_formatter` is `jsonlines`. So you would want to explicitly provide `option.output_formatter=sse` when you want SSE with streaming. Check out `TGI_COMPAT` option below, enabling that option will make SSE as the default formatter with streaming.
+When using streaming, if you want Server Side Events, then you could use `option.output_formatter=sse`.
+If you set `stream=True`, the default `output_formatter` is `jsonlines`.
+So you would want to explicitly provide `option.output_formatter=sse` when you want SSE with streaming.
+Check out the `TGI_COMPAT` option below; enabling it makes SSE the default formatter with streaming.
 When using SSE the jsonline will have the prefix `data`.

 Example response:
@@ -103,7 +106,7 @@

 #### Error Responses

-Errors can typically happen in 2 places:
+Errors can typically happen in two places:
 - Before inference has started
 - During token generation (in the middle of inference)

@@ -154,7 +157,8 @@ When using streaming:

 ## Response with TGI compatibility

-In order to get the same response output as HuggingFace's Text Generation Inference, you can use the env `OPTION_TGI_COMPAT=true` or `option.tgi_compat=true` in your serving.properties. Right now, DJLServing for LMI with rolling batch has minor differences in the response schema compared to TGI.
+In order to get the same response output as HuggingFace's Text Generation Inference, you can use the env `OPTION_TGI_COMPAT=true` or `option.tgi_compat=true` in your serving.properties.
+Right now, DJLServing for LMI with rolling batch has minor differences in the response schema compared to TGI.
 This feature is designed for customers transitioning from TGI, making their lives easier by allowing them to continue using their client-side code without any special modifications for our LMI containers or DJLServing.
 Enabling the tgi_compat option would make the response look like below:

@@ -418,7 +422,10 @@

 ### BestOfSequence

-Generated text and its details is the one with the highest log probability. Others sequences are returned as best_of_sequences. You can enable this with n > 1. It is also returned when beam search is enabled with the option num_beams > 1.
+The generated text and its details are those of the sequence with the highest log probability.
+Other sequences are returned as best_of_sequences.
+You can enable this with n > 1.
+It is also returned when beam search is enabled with the option num_beams > 1.

 Note that best_of_sequences will only work with non-streaming case.

@@ -434,4 +441,4 @@ Note that best_of_sequences will only work with non-streaming case.
 If you wish to create your own pre-processing and post-processing for our handlers, check out these guides [Custom input format schema guide](input_formatter_schema.md) and [Custom output format schema guide](output_formatter_schema.md).

-This is not an officially supported use-case. The API signature, as well as implementation, is subject to change at any time.
\ No newline at end of file
+This is an experimental use-case. The API signature, as well as implementation, is subject to change at any time.
\ No newline at end of file
diff --git a/serving/docs/lmi/user_guides/release_notes.md b/serving/docs/lmi/user_guides/release_notes.md
index d44a5a171..dcb234c3f 100644
--- a/serving/docs/lmi/user_guides/release_notes.md
+++ b/serving/docs/lmi/user_guides/release_notes.md
@@ -1,4 +1,4 @@
-# LMI V12 DLC containers release
+# LMI V13 DLC containers release

 This document will contain the latest releases of our LMI containers for use on SageMaker.
 For details on any other previous releases, please refer our [github release page](https://github.com/deepjavalibrary/djl-serving/releases)
@@ -7,39 +7,13 @@ For details on any other previous releases, please refer our [github release pag

 ### Key Features

-#### DJL Serving Changes (applicable to all containers)
-* Fixed a bug related to HTTP error code and response handling when using a rolling batch/continuous batching engine:
-  * When the python process returned outputs back to the frontend, the frontend was not using the provided HTTP error code (always returned 200)
-* For all inference backends, we now rely on the tokenizer created by the engine for all processing. Previously, there were some cases where we created a separate tokenizer for processing.
-* Enabled specified Java logging level to apply to Python process log level
-  * For example, if you set `SERVING_OPTS="-Dai.djl.logging.level=debug"`, this will also enable debug level logging on the python code
-* Improved validation logic on request schema and improved returned validation exception messages
-* Added requestId logging for better per-request visibility and debugging
-* Fixed a race condition that could result in a model worker dying for seemingly no reason
-  * If a request resulted in an error such that the python process was restarted, during the restart it was possible for a new request to trump
-  the restart process. As a result, the frontend lost knowledge of the restart progress and would shut down the worker after `model_loading_timeout` seconds.
+#### LMI Container (vllm, lmi-dist) - Release 11-23-2024
+* vLLM updated to version 0.6.3.post1
+* Support for SageMaker Fast Model Loading: https://aws.amazon.com/blogs/machine-learning/introducing-fast-model-loader-in-sagemaker-inference-accelerate-autoscaling-for-your-large-language-models-llms-part-1/
+* Support for Multi-Lora Inference natively on SageMaker: https://aws.amazon.com/blogs/machine-learning/easily-deploy-and-manage-hundreds-of-lora-adapters-with-sagemaker-efficient-multi-adapter-inference/

-#### LMI Container (vllm, lmi-dist) - Release 10-28-2024
-* vLLM updated to version 0.6.2
-* Added support for new multi-modal models including pixtral and Llama3.2
-* Added support for Tensor Parallel + Pipeline Parallel execution to support multi-node inference
-* Various performance improvements and enhacements for the lmi-dist engine
-* Please make note of specific behavior changes documented in the [breaking changes](../announcements/breaking_changes.md) section.
+#### TensorRT-LLM Container - Coming Soon -#### TensorRT-LLM Container - Release 11-15-2024 -* TensorRT-LLM updated to version 0.12.0 -* Support for Llama3.1 models -* Please make note of specific behavior changes documented in the [breaking changes](../announcements/breaking_changes.md) section. - - -#### Transformers NeuronX Container - Release 11-20-2024 -* Neuron artifacts are updated to 2.20.1 -* Transformers neuronx is updated to 0.12.313 -* Vllm is updated to 0.6.2 -* Compilation time improvement. HF model can directly be loaded into NeuronAutoModel, so split and save step is no longer needed. - - -#### Text Embedding (using the LMI container) -* Various performance improvements +#### Transformers NeuronX Container - Coming Soon diff --git a/serving/docs/lmi/user_guides/starting-guide.md b/serving/docs/lmi/user_guides/starting-guide.md index 2d82ea17a..4d1ee53ea 100644 --- a/serving/docs/lmi/user_guides/starting-guide.md +++ b/serving/docs/lmi/user_guides/starting-guide.md @@ -2,19 +2,10 @@ Most models can be served using the single `HF_MODEL_ID=` environment variable. However, some models require additional configuration. -You can refer to our example notebooks [here](https://github.com/deepjavalibrary/djl-demo/tree/master/aws/sagemaker/large-model-inference/sample-llm) for model specific examples. +You can refer to our example notebooks [here](https://github.com/deepjavalibrary/djl-demo/tree/master/aws/sagemaker/large-model-inference/sample-llm) for model-specific examples. If you are unable to deploy a model using just `HF_MODEL_ID`, and there is no example in the notebook repository, please cut us a Github issue so we can investigate and help. -Based on the selected container, LMI will automatically: - -* select the best backend based on the model architecture -* enable continuous batching if supported for the model architecture to increase throughput -* configure the engine and operation mode -* maximize hardware use through tensor parallelism -* calculate maximum possible tokens and allocate the KV-Cache -* enable CUDA kernels and optimizations based on the available hardware and drivers - The following code example demonstrates this configuration UX using the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk). This example will use the [Llama 3.1 8b Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model. @@ -54,11 +45,11 @@ outputs = predictor.predict({ ## Supported Model Architectures -If you are deploying with the LMI container (e.g. `763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124`), you can find the list of supported models [here](lmi-dist_user_guide.md#supported-model-architectures). +If you are deploying with the LMI container (e.g. `763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124`), you can find the list of supported models [here](lmi-dist_user_guide.md#supported-model-architectures). If you are deploying with the LMI-TRT container (e.g. `763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-tensorrtllm0.12.0-cu125`), you can find the list of supported models [here](trt_llm_user_guide.md#supported-model-architectures). -If you are deploying with the LMI-Neuron container (e.g. `763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-neuronx-sdk2.19.1`), you can find the list of supported models [here](tnx_user_guide.md#supported-model-architecture). +If you are deploying with the LMI-Neuron container (e.g. 
 ## Available Environment Variable Configurations

@@ -67,7 +58,7 @@ The following environment variables are exposed as part of this simplified UX:

 **HF_MODEL_ID**

 This configuration is used to specify the location of your model artifacts.
-It can either be a HuggingFace Hub model-id (e.g. TheBloke/Llama-2-7B-fp16), a S3 uri (e.g. s3://my-bucket/my-model/), or a local path.
+It can either be a HuggingFace Hub model-id (e.g. meta-llama/Meta-Llama-3.1-8B-Instruct), an S3 uri (e.g. s3://my-bucket/my-model/), or a local path.
 If you are using [SageMaker's capability to specify uncompressed model artifacts](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-uncompressed.html), you should set this value to `/opt/ml/model`.
 `/opt/ml/model` is the path in the container where model artifacts are mounted if using this mechanism.
diff --git a/serving/docs/lmi/user_guides/vllm_user_guide.md b/serving/docs/lmi/user_guides/vllm_user_guide.md
index 5206eba0f..a584b807e 100644
--- a/serving/docs/lmi/user_guides/vllm_user_guide.md
+++ b/serving/docs/lmi/user_guides/vllm_user_guide.md
@@ -8,11 +8,11 @@ vLLM expects the model artifacts to be in the [standard HuggingFace format](../d

 **Text Generation Models**

-Here is the list of text generation models supported in [vllm 0.6.2](https://docs.vllm.ai/en/v0.6.2/models/supported_models.html#decoder-only-language-models).
+Here is the list of text generation models supported in [vllm 0.6.3.post1](https://docs.vllm.ai/en/v0.6.3.post1/models/supported_models.html#decoder-only-language-models).

 **Multi Modal Models**

-Here is the list of multi-modal models supported in [vllm 0.6.2](https://docs.vllm.ai/en/v0.6.2/models/supported_models.html#decoder-only-language-models).
+Here is the list of multi-modal models supported in [vllm 0.6.3.post1](https://docs.vllm.ai/en/v0.6.3.post1/models/supported_models.html#decoder-only-language-models).

 ### Model Coverage in CI

@@ -34,7 +34,7 @@ The following set of models are tested in our nightly tests

 ## Quantization Support

-The quantization techniques supported in vLLM 0.6.2 are listed [here](https://docs.vllm.ai/en/v0.6.2/quantization/supported_hardware.html).
+The quantization techniques supported in vLLM 0.6.3.post1 are listed [here](https://docs.vllm.ai/en/v0.6.3.post1/quantization/supported_hardware.html).

 We highly recommend that regardless of which quantization technique you are using that you pre-quantize the model.
 Runtime quantization adds additional overhead to the endpoint startup time, and depending on the quantization technique, this can be significant overhead.
@@ -47,7 +47,7 @@ The following quantization techniques are supported for runtime quantization:

 You can leverage these techniques by specifying `option.quantize=` in serving.properties, or `OPTION_QUANTIZE=` environment variable.
 Other quantization techniques supported by vLLM require ahead of time quantization to be served with LMI.
-You can find details on how to leverage those quantization techniques from the vLLM docs [here](https://docs.vllm.ai/en/v0.6.2/quantization/supported_hardware.html).
+You can find details on how to leverage those quantization techniques from the vLLM docs [here](https://docs.vllm.ai/en/v0.6.3.post1/quantization/supported_hardware.html).

 ## Quick Start Configurations
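Taken together, the changes above update the container image URIs, the `HF_MODEL_ID`-based configuration, and the standard and chat completions request schemas. The sketch below ties those pieces together end to end with the SageMaker Python SDK and boto3; it is illustrative only and not part of the patch. The image URI and model id come from the updated docs, while the IAM role ARN, endpoint name, and instance type are placeholder assumptions, and the payloads follow `lmi_input_output_schema.md` and `chat_input_output_schema.md`.

```python
"""Minimal sketch, not part of the patch: deploy an LMI (vLLM/lmi-dist) endpoint
and invoke it with both the standard and chat completions schemas."""
import json

import boto3
import sagemaker
from sagemaker.model import Model

# Image URI and model id are taken from the updated docs; the role ARN and
# endpoint name are hypothetical placeholders for this example.
IMAGE_URI = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
ROLE_ARN = "arn:aws:iam::111122223333:role/MySageMakerExecutionRole"
ENDPOINT_NAME = "my-lmi-endpoint"

# A single HF_MODEL_ID environment variable drives the simplified configuration UX.
# Gated models additionally need a HuggingFace token (e.g. HF_TOKEN) in env.
model = Model(
    image_uri=IMAGE_URI,
    role=ROLE_ARN,
    env={"HF_MODEL_ID": MODEL_ID},
    sagemaker_session=sagemaker.Session(),
)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name=ENDPOINT_NAME,
)

smr = boto3.client("sagemaker-runtime")

# Standard LMI schema: "inputs" plus optional "parameters".
standard_payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 128, "do_sample": True},
}

# Chat completions schema: a "messages" field switches LMI to the chat handler,
# provided the served model ships a chat template.
chat_payload = {
    "messages": [{"role": "user", "content": "What is Deep Learning?"}],
    "max_tokens": 128,
}

for payload in (standard_payload, chat_payload):
    response = smr.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    print(json.loads(response["Body"].read()))
```

The same endpoint accepts both payload styles because LMI routes any request containing a `messages` field to the chat completions handler, as described in `chat_input_output_schema.md`.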