[lmi][docs] minor doc updates for latest 0.31.0 release (#2649)
siddvenk authored Dec 30, 2024
1 parent d37a465 commit b836683
Showing 8 changed files with 134 additions and 85 deletions.
12 changes: 4 additions & 8 deletions serving/docs/lmi/README.md
@@ -30,7 +30,7 @@ LMI containers provide many features, including:
LMI containers provide these features through integrations with popular inference libraries.
A unified configuration format enables users to easily leverage the latest optimizations and technologies across libraries.
We will refer to each of these libraries as `backends` throughout the documentation.
The term backend refers to a combination of Engine (LMI uses the Python Engine) and inference library.
The term backend refers to a combination of Engine (LMI uses the Python Engine) and inference library (like vLLM).
You can learn more about the components of LMI [here](deployment_guide/README.md#components-of-lmi).

## QuickStart
@@ -74,11 +74,10 @@ This information is also available on the SageMaker DLC [GitHub repository](http

| Backend | SageMakerDLC | Example URI |
|------------------------|-----------------|-------------------------------------------------------------------------------------------|
| `vLLM` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124 |
| `lmi-dist` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124 |
| `hf-accelerate` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124 |
| `vLLM` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124 |
| `lmi-dist` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124 |
| `tensorrt-llm` | djl-tensorrtllm | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-tensorrtllm0.12.0-cu125 |
| `transformers-neuronx` | djl-neuronx | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-neuronx-sdk2.19.1 |
| `transformers-neuronx` | djl-neuronx | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-neuronx-sdk2.20.1 |
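
As a rough illustration (not an official example), one of the image URIs above can be used to create and deploy a model with the SageMaker Python SDK. The model ID, IAM role, and instance type in this sketch are placeholders.

```
# Minimal sketch: deploying the djl-lmi container from the table above on SageMaker.
# The model ID, role, and instance type are placeholders.
import sagemaker
from sagemaker.model import Model

image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"

model = Model(
    image_uri=image_uri,
    role=sagemaker.get_execution_role(),  # assumes this runs with a SageMaker execution role
    env={"HF_MODEL_ID": "<your-model-id>"},  # placeholder model to serve
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # placeholder instance type
)
```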

## Advanced Features

@@ -92,6 +91,3 @@ LMI contains also contain several advanced features that can be used for more co
The LMI team maintains sample SageMaker notebooks in the [djl-demo repository](https://github.com/deepjavalibrary/djl-demo/tree/master/aws/sagemaker/large-model-inference/sample-llm).
This repository contains the most up-to-date notebooks for LMI.
Notebooks are updated with every release, and new notebooks are added to demonstrate new features and capabilities.

Additionally, the [SageMaker GenAI Hosting Examples](https://github.com/aws-samples/sagemaker-genai-hosting-examples) repository contains additional examples.
However, the notebooks here are not updated as frequently and may be stale.
103 changes: 92 additions & 11 deletions serving/docs/lmi/user_guides/chat_input_output_schema.md
@@ -9,13 +9,15 @@ If the request contains the "messages" field, LMI will treat the request as a ch
back with the chat completions response style.

When using the Chat Completions Schema, you should make sure that the model you are serving has a chat template.
The chat template ensures that the messages object is tokenized appropriately for your model.
The chat template ensures that the payload is tokenized appropriately for your model.
See [the HuggingFace documentation on chat templates](https://huggingface.co/docs/transformers/main/en/chat_templating) for more information.

This processing happens per request, meaning that you can support our [standard schema](lmi_input_output_schema.md),
as well as chat completions schema in the same endpoint.
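
As a hedged sketch (not part of the schema definition), a chat-style request to a deployed endpoint could look like the example below. The endpoint name is a placeholder, and the payload fields follow the OpenAI-style chat completions schema described on this page.

```
# Minimal sketch of a chat completions request through the SageMaker runtime.
# The endpoint name is a placeholder.
import json
import boto3

client = boto3.client("sagemaker-runtime")
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    "max_tokens": 256,
}
response = client.invoke_endpoint(
    EndpointName="my-lmi-endpoint",  # placeholder
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```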

Note: This is an experimental feature. The complete spec has not been implemented.
In particular, function calling is not currently available.
We are targeting function calling support in our next release, 0.32.0, planned for January 2025.

## Request Schema

@@ -116,9 +118,67 @@ The response is returned token by token as application/jsonlines content-type:
Example response:

```
{"id": "chatcmpl-0", "object": "chat.completion.chunk", "created": 1712792433, "choices": [{"index": 0, "delta": {"content": " Oh", "role": "assistant"}, "logprobs": [{"content": [{"token": " Oh", "logprob": -4.499478340148926, "bytes": [32, 79, 104], "top_logprobs": [{"token": -4.499478340148926, "logprob": -4.499478340148926, "bytes": [32, 79, 104]}]}]}], "finish_reason": null}]}
{
  "id": "chatcmpl-0",
  "object": "chat.completion.chunk",
  "created": 1712792433,
  "choices": [
    {
      "index": 0,
      "delta": {"content": " Oh", "role": "assistant"},
      "logprobs": [
        {
          "content": [
            {
              "token": " Oh",
              "logprob": -4.499478340148926,
              "bytes": [32, 79, 104],
              "top_logprobs": [
                {
                  "token": -4.499478340148926,
                  "logprob": -4.499478340148926,
                  "bytes": [32, 79, 104]
                }
              ]
            }
          ]
        }
      ],
      "finish_reason": null
    }
  ]
}
...
{"id": "chatcmpl-0", "object": "chat.completion.chunk", "created": 1712792436, "choices": [{"index": 0, "delta": {"content": " assist"}, "logprobs": [{"content": [{"token": " assist", "logprob": -1.019672155380249, "bytes": [32, 97, 115, 115, 105, 115, 116], "top_logprobs": [{"token": -1.019672155380249, "logprob": -1.019672155380249, "bytes": [32, 97, 115, 115, 105, 115, 116]}]}]}], "finish_reason": "length"}]}
{
  "id": "chatcmpl-0",
  "object": "chat.completion.chunk",
  "created": 1712792436,
  "choices": [
    {
      "index": 0,
      "delta": {"content": " assist"},
      "logprobs": [
        {
          "content": [
            {
              "token": " assist",
              "logprob": -1.019672155380249,
              "bytes": [32, 97, 115, 115, 105, 115, 116],
              "top_logprobs": [
                {
                  "token": -1.019672155380249,
                  "logprob": -1.019672155380249,
                  "bytes": [32, 97, 115, 115, 105, 115, 116]
                }
              ]
            }
          ]
        }
      ],
      "finish_reason": "length"
    }
  ]
}
```
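
As a rough sketch, a client can consume this jsonlines stream by reading the response line by line. The URL below assumes a locally hosted DJLServing endpoint and should be adjusted for your deployment.

```
# Sketch: consuming the application/jsonlines stream shown above with the requests library.
# The URL is a placeholder for wherever the endpoint is hosted.
import json
import requests

payload = {
    "messages": [{"role": "user", "content": "Tell me a joke."}],
    "stream": True,
}
with requests.post("http://localhost:8080/invocations", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)
```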

## API Object Schemas
@@ -146,7 +206,6 @@ Example:

#### Vision/Image Support

Starting in v0.29.0, we have added experimental support for vision language models.
You can specify an image as part of the content when using a vision language model.
Image data can either be specified as a url, or via a base64 encoding of the image data.

@@ -171,18 +230,19 @@ Example:
```

We recommend that you use the base64 encoding to ensure no network failures occur when retrieving the image within the endpoint.
Network calls to fetch images can increase latency and introduce another failure point.
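
As a sketch only, the base64 encoding can be prepared on the client and embedded in the message content. The content-entry layout below is an assumption based on the OpenAI-style `image_url` format; verify it against the example above. The file path and prompt are placeholders.

```
# Sketch: building a base64-encoded image content entry for a vision language model request.
# The field names assume the OpenAI-style content layout; the file path and prompt are placeholders.
import base64

with open("example.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
    ],
}
```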

### Choice

The choice object represents a chat completion choice.
It contains the following fields:

| Field Name | Type | Description | Example |
|-----------------|-----------------------|---------------------------------------------------|-------------------------------------------|
| `index` | int | The index of the choice | 0 |
| `message` | [Message](#message) | A chat completion message generated by the model. | See the [Message](#message) documentation |
| `logprobs` | [Logprobs](#logprobs) | The log probability of the token | See the [Logprobs](#logprob) documentation |
| `finish_reason` | string enum | The reason the model stopped generating tokens | "length", "eos_token", "stop_sequence" |
| Field Name | Type | Description | Example |
|-----------------|-----------------------|---------------------------------------------------|--------------------------------------------|
| `index` | int | The index of the choice | 0 |
| `message` | [Message](#message) | A chat completion message generated by the model. | See the [Message](#message) documentation |
| `logprobs` | [Logprobs](#logprobs) | The log probability of the token | See the [Logprobs](#logprob) documentation |
| `finish_reason` | string enum | The reason the model stopped generating tokens | "length", "eos_token", "stop_sequence" |

Example:

@@ -213,7 +273,28 @@ It contains the following fields:
Example:

```
{"index": 0, "delta": {"content": " Oh", "role": "assistant"}, "logprobs": [{"content": [{"token": " Oh", "logprob": -4.499478340148926, "bytes": [32, 79, 104], "top_logprobs": [{"token": -4.499478340148926, "logprob": -4.499478340148926, "bytes": [32, 79, 104]}]}]}
{
  "index": 0,
  "delta": {"content": " Oh", "role": "assistant"},
  "logprobs": [
    {
      "content": [
        {
          "token": " Oh",
          "logprob": -4.499478340148926,
          "bytes": [32, 79, 104],
          "top_logprobs": [
            {
              "token": -4.499478340148926,
              "logprob": -4.499478340148926,
              "bytes": [32, 79, 104]
            }
          ]
        }
      ]
    }
  ]
}
```

### Logprobs
10 changes: 5 additions & 5 deletions serving/docs/lmi/user_guides/embedding-user-guide.md
@@ -18,7 +18,7 @@ LMI supports Text Embedding Inference with the following engines:
- Rust
- Python

Currently, the OnnxRuntime engine provides the best performance for text embedding in LMI.
Currently, the Rust engine provides the best performance for text embedding in LMI.

## Quick Start Configurations

@@ -36,11 +36,11 @@ SERVING_BATCH_SIZE=32

### environment variables

You can specify the `HF_MODEL_ID` environment variable to load a model from HuggingFace hub. DJLServing
will download the model from HuggingFace hub and optimize the model with OnnxRuntime at runtime.
You can specify the `HF_MODEL_ID` environment variable to load a model from HuggingFace hub, DJL Model Zoo, AWS S3, or a local path.
DJLServing will download the model from the specified location and optimize it with the selected engine at runtime.
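
As a rough illustration, the environment variables shown in the block below can be supplied when creating a SageMaker model. The container image URI, role, and instance type in this sketch are placeholders.

```
# Sketch: passing the embedding configuration below as SageMaker environment variables.
# The image URI, role, and instance type are placeholders.
import sagemaker
from sagemaker.model import Model

model = Model(
    image_uri="<djl-lmi-container-image-uri>",  # placeholder; see the LMI container table
    role=sagemaker.get_execution_role(),
    env={
        "OPTION_ENGINE": "Rust",
        "HF_MODEL_ID": "BAAI/bge-base-en-v1.5",
        "SERVING_BATCH_SIZE": "32",  # optional
    },
)
model.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")  # placeholder instance type
```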

```
OPTION_ENGINE=OnnxRuntime
OPTION_ENGINE=Rust
HF_MODEL_ID=BAAI/bge-base-en-v1.5
# Optional
SERVING_BATCH_SIZE=32
@@ -52,7 +52,7 @@ to deploy a model with environment variable configuration on SageMaker.
### serving.properties

```
engine=OnnxRuntime
engine=Rust
option.model_id=BAAI/bge-base-en-v1.5
translatorFactory=ai.djl.huggingface.translator.TextEmbeddingTranslatorFactory
# Optional
8 changes: 4 additions & 4 deletions serving/docs/lmi/user_guides/lmi-dist_user_guide.md
@@ -8,16 +8,16 @@ LMI-Dist expects the model to be in the [standard HuggingFace format](../deploym

**Text Generation Models**

LMI-Dist supports the same set of text-generation models as [vllm 0.6.2](https://docs.vllm.ai/en/v0.6.2/models/supported_models.html#decoder-only-language-models).
LMI-Dist supports the same set of text-generation models as [vllm 0.6.3.post1](https://docs.vllm.ai/en/v0.6.3.post1/models/supported_models.html#decoder-only-language-models).

In addition to the vllm models, LMI-Dist also supports the t5 model family (e.g. google/flan-t5-xl).

**Multi Modal Models**

LMI-Dist supports the same set of multi-modal models as [vllm 0.6.2](https://docs.vllm.ai/en/v0.6.2/models/supported_models.html#decoder-only-language-models).
LMI-Dist supports the same set of multi-modal models as [vllm 0.6.3.post1](https://docs.vllm.ai/en/v0.6.3.post1/models/supported_models.html#decoder-only-language-models).

However, the one known exception is MLlama (Llama3.2 multimodal models).
MLlama support is expected in the v13 (0.31.0) release.
MLlama support is expected in the next release (0.32.0).

### Model Coverage in CI

@@ -92,7 +92,7 @@ Please check that your base model [supports LoRA adapters in vLLM](https://docs.

## Quantization Support

LMI-Dist supports the same quantization techniques as [vllm 0.6.2](https://docs.vllm.ai/en/v0.6.2/quantization/supported_hardware.html).
LMI-Dist supports the same quantization techniques as [vllm 0.6.3.post1](https://docs.vllm.ai/en/v0.6.3.post1/quantization/supported_hardware.html).

We highly recommend that, regardless of which quantization technique you are using, you pre-quantize the model.
Runtime quantization adds overhead to the endpoint startup time, and depending on the quantization technique, this overhead can be significant.
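
As an illustration only, the two approaches might be configured as shown below. These configuration keys are assumptions based on LMI's `option.*` / `OPTION_*` naming convention, not values stated on this page; the model IDs are placeholders.

```
# Sketch: pointing LMI-Dist at a pre-quantized checkpoint vs. requesting runtime quantization.
# Model IDs are placeholders; OPTION_QUANTIZE is assumed from LMI's option.* naming convention.
prequantized_env = {
    "HF_MODEL_ID": "<pre-quantized-awq-model-id>",  # preferred: quantization applied offline
}

runtime_quantization_env = {
    "HF_MODEL_ID": "<fp16-model-id>",
    "OPTION_QUANTIZE": "awq",  # quantize at load time; adds the startup overhead noted above
}
```
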
21 changes: 14 additions & 7 deletions serving/docs/lmi/user_guides/lmi_input_output_schema.md
@@ -1,7 +1,7 @@
# LMI handlers Inference API Schema

This document provides the default API schema for the inference endpoints (`/invocations`, `/predictions/<model_name>`) when using the built-in inference handlers in LMI containers.
This schema is applicable to our latest release, v0.28.0.
This schema is applicable to our latest release, v0.31.0.
Documentation for previous releases is available on our GitHub on the relevant version branch (e.g. 0.27.0-dlc).

LMI provides two distinct schemas depending on what type of batching you use:
@@ -41,7 +41,7 @@ curl -X POST https://my.sample.endpoint.com/invocations \

### Response Schema

When not using streaming (this is the default), the response is returned as application/json content-type:
When not using streaming (the default), the response is returned as application/json content-type:

| Field Name | Field Type | Always Returned | Possible Values |
|------------------|---------------------|-------------------------------------------|-------------------------------------------------------------------------------------|
@@ -84,7 +84,10 @@ Example response:
}
```

When using streaming, if you want Server Side Events, then you could use `option.output_formatter=sse`. If you `stream=True`, the default `output_formatter` is `jsonlines`. So you would want to explicitly provide `option.output_formatter=sse` when you want SSE with streaming. Check out `TGI_COMPAT` option below, enabling that option will make SSE as the default formatter with streaming.
When using streaming, if you want Server-Sent Events (SSE), you can use `option.output_formatter=sse`.
If you set `stream=true`, the default `output_formatter` is `jsonlines`.
So you need to explicitly provide `option.output_formatter=sse` when you want SSE with streaming.
Check out the `TGI_COMPAT` option below; enabling that option makes SSE the default formatter with streaming.
When using SSE, each json line will have the prefix `data`.

Example response:
@@ -103,7 +106,7 @@ data:{

#### Error Responses

Errors can typically happen in 2 places:
Errors can typically happen in two places:

- Before inference has started
- During token generation (in the middle of inference)
@@ -154,7 +157,8 @@ When using streaming:

## Response with TGI compatibility

In order to get the same response output as HuggingFace's Text Generation Inference, you can use the env `OPTION_TGI_COMPAT=true` or `option.tgi_compat=true` in your serving.properties. Right now, DJLServing for LMI with rolling batch has minor differences in the response schema compared to TGI.
In order to get the same response output as HuggingFace's Text Generation Inference, you can use the env `OPTION_TGI_COMPAT=true` or `option.tgi_compat=true` in your serving.properties.
Right now, DJLServing for LMI with rolling batch has minor differences in the response schema compared to TGI.

This feature is designed for customers transitioning from TGI, making their lives easier by allowing them to continue using their client-side code without any special modifications for our LMI containers or DJLServing.
Enabling the tgi_compat option would make the response look like below:
@@ -418,7 +422,10 @@ Example:

### BestOfSequence

Generated text and its details is the one with the highest log probability. Others sequences are returned as best_of_sequences. You can enable this with n > 1. It is also returned when beam search is enabled with the option num_beams > 1.
The generated text and its details are the ones with the highest log probability.
Other sequences are returned as best_of_sequences.
You can enable this with n > 1.
It is also returned when beam search is enabled with the option num_beams > 1.

Note that best_of_sequences will only work in the non-streaming case.
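
As a sketch, a non-streaming request that enables beam search (and therefore returns best_of_sequences) might look like the payload below. The exact placement of `num_beams` under `parameters` is an assumption to verify against the parameter documentation.

```
# Hypothetical non-streaming request that enables beam search so best_of_sequences is returned.
payload = {
    "inputs": "The future of artificial intelligence is",
    "parameters": {
        "num_beams": 2,  # beam search with more than one beam
        "max_new_tokens": 64,
    },
}
```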

@@ -434,4 +441,4 @@

If you wish to create your own pre-processing and post-processing for our handlers, check out these guides [Custom input format schema guide](input_formatter_schema.md) and [Custom output format schema guide](output_formatter_schema.md).

This is not an officially supported use-case. The API signature, as well as implementation, is subject to change at any time.
This is an experimental use-case. The API signature, as well as implementation, is subject to change at any time.