[lmi][docs] minor doc updates for latest 0.31.0 release (#2649)
siddvenk authored Dec 30, 2024
1 parent d37a465 commit b836683
Showing 8 changed files with 134 additions and 85 deletions.
12 changes: 4 additions & 8 deletions serving/docs/lmi/README.md
@@ -30,7 +30,7 @@ LMI containers provide many features, including:
LMI containers provide these features through integrations with popular inference libraries.
A unified configuration format enables users to easily leverage the latest optimizations and technologies across libraries.
We will refer to each of these libraries as `backends` throughout the documentation.
The term backend refers to a combination of Engine (LMI uses the Python Engine) and inference library.
The term backend refers to a combination of Engine (LMI uses the Python Engine) and inference library (like vLLM).
You can learn more about the components of LMI [here](deployment_guide/README.md#components-of-lmi).

## QuickStart
@@ -74,11 +74,10 @@ This information is also available on the SageMaker DLC [GitHub repository](http

| Backend | SageMakerDLC | Example URI |
|------------------------|-----------------|-------------------------------------------------------------------------------------------|
| `vLLM` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124 |
| `lmi-dist` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124 |
| `hf-accelerate` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124 |
| `vLLM` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124 |
| `lmi-dist` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124 |
| `tensorrt-llm` | djl-tensorrtllm | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-tensorrtllm0.12.0-cu125 |
| `transformers-neuronx` | djl-neuronx | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-neuronx-sdk2.19.1 |
| `transformers-neuronx` | djl-neuronx | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-neuronx-sdk2.20.1 |
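
As a rough illustration (not an official example), one of the image URIs above can be used to create and deploy a model with the SageMaker Python SDK. The model ID, IAM role, and instance type in this sketch are placeholders.

```
# Minimal sketch: deploying the djl-lmi container from the table above on SageMaker.
# The model ID, role, and instance type are placeholders.
import sagemaker
from sagemaker.model import Model

image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"

model = Model(
    image_uri=image_uri,
    role=sagemaker.get_execution_role(),  # assumes this runs with a SageMaker execution role
    env={"HF_MODEL_ID": "<your-model-id>"},  # placeholder model to serve
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # placeholder instance type
)
```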

## Advanced Features

@@ -92,6 +91,3 @@ LMI contains also contain several advanced features that can be used for more co
The LMI team maintains sample SageMaker notebooks in the [djl-demo repository](https://github.com/deepjavalibrary/djl-demo/tree/master/aws/sagemaker/large-model-inference/sample-llm).
This repository contains the most up-to-date notebooks for LMI.
Notebooks are updated with every release, and new notebooks are added to demonstrate new features and capabilities.

Additionally, the [SageMaker GenAI Hosting Examples](https://github.com/aws-samples/sagemaker-genai-hosting-examples) repository contains additional examples.
However, the notebooks here are not updated as frequently and may be stale.
103 changes: 92 additions & 11 deletions serving/docs/lmi/user_guides/chat_input_output_schema.md
@@ -9,13 +9,15 @@ If the request contains the "messages" field, LMI will treat the request as a ch
back with the chat completions response style.

When using the Chat Completions Schema, you should make sure that the model you are serving has a chat template.
The chat template ensures that the messages object is tokenized appropriately for your model.
The chat template ensures that the payload is tokenized appropriately for your model.
See [the HuggingFace documentation on chat templates](https://huggingface.co/docs/transformers/main/en/chat_templating) for more information.

This processing happens per request, meaning that you can support our [standard schema](lmi_input_output_schema.md),
as well as chat completions schema in the same endpoint.
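
As a hedged sketch (not part of the schema definition), a chat-style request to a deployed endpoint could look like the example below. The endpoint name is a placeholder, and the payload fields follow the OpenAI-style chat completions schema described on this page.

```
# Minimal sketch of a chat completions request through the SageMaker runtime.
# The endpoint name is a placeholder.
import json
import boto3

client = boto3.client("sagemaker-runtime")
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    "max_tokens": 256,
}
response = client.invoke_endpoint(
    EndpointName="my-lmi-endpoint",  # placeholder
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```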

Note: This is an experimental feature. The complete spec has not been implemented.
In particular, function calling is not currently available.
We are targeting function calling support in our next release, 0.32.0, planned for January 2025.

## Request Schema

@@ -116,9 +118,67 @@ The response is returned token by token as application/jsonlines content-type:
Example response:

```
{"id": "chatcmpl-0", "object": "chat.completion.chunk", "created": 1712792433, "choices": [{"index": 0, "delta": {"content": " Oh", "role": "assistant"}, "logprobs": [{"content": [{"token": " Oh", "logprob": -4.499478340148926, "bytes": [32, 79, 104], "top_logprobs": [{"token": -4.499478340148926, "logprob": -4.499478340148926, "bytes": [32, 79, 104]}]}]}], "finish_reason": null}]}
{
  "id": "chatcmpl-0",
  "object": "chat.completion.chunk",
  "created": 1712792433,
  "choices": [
    {
      "index": 0,
      "delta": {"content": " Oh", "role": "assistant"},
      "logprobs": [
        {
          "content": [
            {
              "token": " Oh",
              "logprob": -4.499478340148926,
              "bytes": [32, 79, 104],
              "top_logprobs": [
                {
                  "token": -4.499478340148926,
                  "logprob": -4.499478340148926,
                  "bytes": [32, 79, 104]
                }
              ]
            }
          ]
        }
      ],
      "finish_reason": null
    }
  ]
}
...
{"id": "chatcmpl-0", "object": "chat.completion.chunk", "created": 1712792436, "choices": [{"index": 0, "delta": {"content": " assist"}, "logprobs": [{"content": [{"token": " assist", "logprob": -1.019672155380249, "bytes": [32, 97, 115, 115, 105, 115, 116], "top_logprobs": [{"token": -1.019672155380249, "logprob": -1.019672155380249, "bytes": [32, 97, 115, 115, 105, 115, 116]}]}]}], "finish_reason": "length"}]}
{
  "id": "chatcmpl-0",
  "object": "chat.completion.chunk",
  "created": 1712792436,
  "choices": [
    {
      "index": 0,
      "delta": {"content": " assist"},
      "logprobs": [
        {
          "content": [
            {
              "token": " assist",
              "logprob": -1.019672155380249,
              "bytes": [32, 97, 115, 115, 105, 115, 116],
              "top_logprobs": [
                {
                  "token": -1.019672155380249,
                  "logprob": -1.019672155380249,
                  "bytes": [32, 97, 115, 115, 105, 115, 116]
                }
              ]
            }
          ]
        }
      ],
      "finish_reason": "length"
    }
  ]
}
```
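
As a rough sketch, a client can consume this jsonlines stream by reading the response line by line. The URL below assumes a locally hosted DJLServing endpoint and should be adjusted for your deployment.

```
# Sketch: consuming the application/jsonlines stream shown above with the requests library.
# The URL is a placeholder for wherever the endpoint is hosted.
import json
import requests

payload = {
    "messages": [{"role": "user", "content": "Tell me a joke."}],
    "stream": True,
}
with requests.post("http://localhost:8080/invocations", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)
```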

## API Object Schemas
@@ -146,7 +206,6 @@ Example:

#### Vision/Image Support

Starting in v0.29.0, we have added experimental support for vision language models.
You can specify an image as part of the content when using a vision language model.
Image data can either be specified as a url, or via a base64 encoding of the image data.

@@ -171,18 +230,19 @@ Example:
```

We recommend that you use the base64 encoding to ensure no network failures occur when retrieving the image within the endpoint.
Network calls to fetch images can increase latency and introduce another failure point.
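
As a sketch only, the base64 encoding can be prepared on the client and embedded in the message content. The content-entry layout below is an assumption based on the OpenAI-style `image_url` format; verify it against the example above. The file path and prompt are placeholders.

```
# Sketch: building a base64-encoded image content entry for a vision language model request.
# The field names assume the OpenAI-style content layout; the file path and prompt are placeholders.
import base64

with open("example.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
    ],
}
```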

### Choice

The choice object represents a chat completion choice.
It contains the following fields:

| Field Name | Type | Description | Example |
|-----------------|-----------------------|---------------------------------------------------|-------------------------------------------|
| `index` | int | The index of the choice | 0 |
| `message` | [Message](#message) | A chat completion message generated by the model. | See the [Message](#message) documentation |
| `logprobs` | [Logprobs](#logprobs) | The log probability of the token | See the [Logprobs](#logprob) documentation |
| `finish_reason` | string enum | The reason the model stopped generating tokens | "length", "eos_token", "stop_sequence" |
| Field Name | Type | Description | Example |
|-----------------|-----------------------|---------------------------------------------------|--------------------------------------------|
| `index` | int | The index of the choice | 0 |
| `message` | [Message](#message) | A chat completion message generated by the model. | See the [Message](#message) documentation |
| `logprobs` | [Logprobs](#logprobs) | The log probability of the token | See the [Logprobs](#logprob) documentation |
| `finish_reason` | string enum | The reason the model stopped generating tokens | "length", "eos_token", "stop_sequence" |

Example:

@@ -213,7 +273,28 @@ It contains the following fields:
Example:

```
{"index": 0, "delta": {"content": " Oh", "role": "assistant"}, "logprobs": [{"content": [{"token": " Oh", "logprob": -4.499478340148926, "bytes": [32, 79, 104], "top_logprobs": [{"token": -4.499478340148926, "logprob": -4.499478340148926, "bytes": [32, 79, 104]}]}]}
{
  "index": 0,
  "delta": {"content": " Oh", "role": "assistant"},
  "logprobs": [
    {
      "content": [
        {
          "token": " Oh",
          "logprob": -4.499478340148926,
          "bytes": [32, 79, 104],
          "top_logprobs": [
            {
              "token": -4.499478340148926,
              "logprob": -4.499478340148926,
              "bytes": [32, 79, 104]
            }
          ]
        }
      ]
    }
  ]
}
```

### Logprobs
10 changes: 5 additions & 5 deletions serving/docs/lmi/user_guides/embedding-user-guide.md
@@ -18,7 +18,7 @@ LMI supports Text Embedding Inference with the following engines:
- Rust
- Python

Currently, the OnnxRuntime engine provides the best performance for text embedding in LMI.
Currently, the Rust engine provides the best performance for text embedding in LMI.

## Quick Start Configurations

@@ -36,11 +36,11 @@ SERVING_BATCH_SIZE=32

### environment variables

You can specify the `HF_MODEL_ID` environment variable to load a model from HuggingFace hub. DJLServing
will download the model from HuggingFace hub and optimize the model with OnnxRuntime at runtime.
You can specify the `HF_MODEL_ID` environment variable to load a model from HuggingFace hub, DJL Model Zoo, AWS S3, or a local path.
DJLServing will download the model from the specified location and optimize it with the selected engine at runtime.
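
As a rough illustration, the environment variables shown in the block below can be supplied when creating a SageMaker model. The container image URI, role, and instance type in this sketch are placeholders.

```
# Sketch: passing the embedding configuration below as SageMaker environment variables.
# The image URI, role, and instance type are placeholders.
import sagemaker
from sagemaker.model import Model

model = Model(
    image_uri="<djl-lmi-container-image-uri>",  # placeholder; see the LMI container table
    role=sagemaker.get_execution_role(),
    env={
        "OPTION_ENGINE": "Rust",
        "HF_MODEL_ID": "BAAI/bge-base-en-v1.5",
        "SERVING_BATCH_SIZE": "32",  # optional
    },
)
model.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")  # placeholder instance type
```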

```
OPTION_ENGINE=OnnxRuntime
OPTION_ENGINE=Rust
HF_MODEL_ID=BAAI/bge-base-en-v1.5
# Optional
SERVING_BATCH_SIZE=32
@@ -52,7 +52,7 @@ to deploy a model with environment variable configuration on SageMaker.
### serving.properties

```
engine=OnnxRuntime
engine=Rust
option.model_id=BAAI/bge-base-en-v1.5
translatorFactory=ai.djl.huggingface.translator.TextEmbeddingTranslatorFactory
# Optional
8 changes: 4 additions & 4 deletions serving/docs/lmi/user_guides/lmi-dist_user_guide.md
@@ -8,16 +8,16 @@ LMI-Dist expects the model to be in the [standard HuggingFace format](../deploym

**Text Generation Models**

LMI-Dist supports the same set of text-generation models as [vllm 0.6.2](https://docs.vllm.ai/en/v0.6.2/models/supported_models.html#decoder-only-language-models).
LMI-Dist supports the same set of text-generation models as [vllm 0.6.3.post1](https://docs.vllm.ai/en/v0.6.3.post1/models/supported_models.html#decoder-only-language-models).

In addition to the vllm models, LMI-Dist also supports the t5 model family (e.g. google/flan-t5-xl).

**Multi Modal Models**

LMI-Dist supports the same set of multi-modal models as [vllm 0.6.2](https://docs.vllm.ai/en/v0.6.2/models/supported_models.html#decoder-only-language-models).
LMI-Dist supports the same set of multi-modal models as [vllm 0.6.3.post1](https://docs.vllm.ai/en/v0.6.3.post1/models/supported_models.html#decoder-only-language-models).

However, the one known exception is MLlama (Llama3.2 multimodal models).
MLlama support is expected in the v13 (0.31.0) release.
MLlama support is expected in the next release (0.32.0).

### Model Coverage in CI

@@ -92,7 +92,7 @@ Please check that your base model [supports LoRA adapters in vLLM](https://docs.

## Quantization Support

LMI-Dist supports the same quantization techniques as [vllm 0.6.2](https://docs.vllm.ai/en/v0.6.2/quantization/supported_hardware.html).
LMI-Dist supports the same quantization techniques as [vllm 0.6.3.post1](https://docs.vllm.ai/en/v0.6.3.post1/quantization/supported_hardware.html).

We highly recommend that, regardless of which quantization technique you are using, you pre-quantize the model.
Runtime quantization adds overhead to the endpoint startup time, and depending on the quantization technique, this overhead can be significant.
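
As an illustration only, the two approaches might be configured as shown below. These configuration keys are assumptions based on LMI's `option.*` / `OPTION_*` naming convention, not values stated on this page; the model IDs are placeholders.

```
# Sketch: pointing LMI-Dist at a pre-quantized checkpoint vs. requesting runtime quantization.
# Model IDs are placeholders; OPTION_QUANTIZE is assumed from LMI's option.* naming convention.
prequantized_env = {
    "HF_MODEL_ID": "<pre-quantized-awq-model-id>",  # preferred: quantization applied offline
}

runtime_quantization_env = {
    "HF_MODEL_ID": "<fp16-model-id>",
    "OPTION_QUANTIZE": "awq",  # quantize at load time; adds the startup overhead noted above
}
```
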
21 changes: 14 additions & 7 deletions serving/docs/lmi/user_guides/lmi_input_output_schema.md
@@ -1,7 +1,7 @@
# LMI handlers Inference API Schema

This document provides the default API schema for the inference endpoints (`/invocations`, `/predictions/<model_name>`) when using the built-in inference handlers in LMI containers.
This schema is applicable to our latest release, v0.28.0.
This schema is applicable to our latest release, v0.31.0.
Documentation for previous releases is available on our GitHub on the relevant version branch (e.g. 0.27.0-dlc).

LMI provides two distinct schemas depending on what type of batching you use:
@@ -41,7 +41,7 @@ curl -X POST https://my.sample.endpoint.com/invocations \

### Response Schema

When not using streaming (this is the default), the response is returned as application/json content-type:
When not using streaming (the default), the response is returned as application/json content-type:

| Field Name | Field Type | Always Returned | Possible Values |
|------------------|---------------------|-------------------------------------------|-------------------------------------------------------------------------------------|
@@ -84,7 +84,10 @@ Example response:
}
```

When using streaming, if you want Server Side Events, then you could use `option.output_formatter=sse`. If you `stream=True`, the default `output_formatter` is `jsonlines`. So you would want to explicitly provide `option.output_formatter=sse` when you want SSE with streaming. Check out `TGI_COMPAT` option below, enabling that option will make SSE as the default formatter with streaming.
When using streaming, if you want Server-Sent Events (SSE), you can use `option.output_formatter=sse`.
If you set `stream=true`, the default `output_formatter` is `jsonlines`.
So you need to explicitly provide `option.output_formatter=sse` when you want SSE with streaming.
Check out the `TGI_COMPAT` option below; enabling that option makes SSE the default formatter with streaming.
When using SSE, each json line will have the prefix `data`.

Example response:
@@ -103,7 +106,7 @@ data:{

#### Error Responses

Errors can typically happen in 2 places:
Errors can typically happen in two places:

- Before inference has started
- During token generation (in the middle of inference)
@@ -154,7 +157,8 @@ When using streaming:

## Response with TGI compatibility

In order to get the same response output as HuggingFace's Text Generation Inference, you can use the env `OPTION_TGI_COMPAT=true` or `option.tgi_compat=true` in your serving.properties. Right now, DJLServing for LMI with rolling batch has minor differences in the response schema compared to TGI.
In order to get the same response output as HuggingFace's Text Generation Inference, you can use the env `OPTION_TGI_COMPAT=true` or `option.tgi_compat=true` in your serving.properties.
Right now, DJLServing for LMI with rolling batch has minor differences in the response schema compared to TGI.

This feature is designed for customers transitioning from TGI, making their lives easier by allowing them to continue using their client-side code without any special modifications for our LMI containers or DJLServing.
Enabling the tgi_compat option would make the response look like below:
@@ -418,7 +422,10 @@ Example:

### BestOfSequence

Generated text and its details is the one with the highest log probability. Others sequences are returned as best_of_sequences. You can enable this with n > 1. It is also returned when beam search is enabled with the option num_beams > 1.
The generated text and its details are the ones with the highest log probability.
Other sequences are returned as best_of_sequences.
You can enable this with n > 1.
It is also returned when beam search is enabled with the option num_beams > 1.

Note that best_of_sequences will only work in the non-streaming case.
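
As a sketch, a non-streaming request that enables beam search (and therefore returns best_of_sequences) might look like the payload below. The exact placement of `num_beams` under `parameters` is an assumption to verify against the parameter documentation.

```
# Hypothetical non-streaming request that enables beam search so best_of_sequences is returned.
payload = {
    "inputs": "The future of artificial intelligence is",
    "parameters": {
        "num_beams": 2,  # beam search with more than one beam
        "max_new_tokens": 64,
    },
}
```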

@@ -434,4 +441,4 @@

If you wish to create your own pre-processing and post-processing for our handlers, check out these guides [Custom input format schema guide](input_formatter_schema.md) and [Custom output format schema guide](output_formatter_schema.md).

This is not an officially supported use-case. The API signature, as well as implementation, is subject to change at any time.
This is an experimental use-case. The API signature, as well as implementation, is subject to change at any time.