Support vllm runtime #608

Closed
zhuangqh opened this issue Sep 28, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@zhuangqh
Collaborator

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Today, KAITO supports the popular Hugging Face runtime. We should support other runtimes such as vLLM.

Describe alternatives you've considered

Additional context

@zhuangqh zhuangqh added the enhancement New feature or request label Sep 28, 2024
@zhuangqh zhuangqh moved this to In Progress in Kaito Roadmap Sep 28, 2024
@zhuangqh
Collaborator Author

zhuangqh commented Oct 14, 2024

Motivation

Today, KAITO uses the Hugging Face transformers runtime to build its inference and tuning services. It offers an out-of-the-box experience for nearly all transformer-based models hosted on Hugging Face.

However, there are other LLM inference libraries, such as vLLM and TensorRT-LLM, that focus on inference performance and resource efficiency. Many of our users rely on these libraries as their inference engine.

Goals

  • Set vllm as the default runtime.
  • Support OpenAI-compatible serving API.

Non-Goals

  • Unify the args and API request parameters across the huggingface and vllm runtimes.

Design Details

Inference server API

inference API

  • OpenAI-compatible serving API (e.g. /v1/chat/completions, as used in the example below)

health check API

  • /healthz

metrics

  • /metrics: provides a Prometheus-style metrics endpoint (see the sketch below)
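
A minimal sketch of probing these endpoints, assuming the server listens on port 5000 as in the OOM example later in this issue; the endpoint paths follow the list above and the port is not a confirmed default.

# Hypothetical probe of the proposed endpoints; port 5000 is an assumption
# taken from the example later in this issue.
import urllib.request

BASE = "http://localhost:5000"

# Liveness/readiness: /healthz should return 200 once the engine is serving.
with urllib.request.urlopen(f"{BASE}/healthz") as resp:
    print("healthz:", resp.status)

# /metrics exposes Prometheus text format, one sample per line.
with urllib.request.urlopen(f"{BASE}/metrics") as resp:
    for line in resp.read().decode().splitlines()[:10]:
        print(line)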

Workspace CRD change

Change the default runtime from huggingface/transformers to vllm. To preserve compatibility for out-of-tree models, provide an annotation that allows the user to fall back to the huggingface/transformers runtime.

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-phi-3-mini
  annotations:
    workspace.kaito.io/runtime: "transformers"
resource:
  instanceType: "Standard_NC6s_v3"
  labelSelector:
    matchLabels:
      apps: phi-3
inference:
  preset:
    name: phi-3-mini-4k-instruct

Engine default parameter

Choose better default engine arguments for users (a sketch of how these might be assembled is included after the notes link below).

  • model=/workspace/tfs/weights: load the default local model weights
  • dtype=float64: default data type for model weights and activations
  • cpu-offload-gb=0 ?
  • gpu-memory-utilization=0.9 ?
  • swap-space=4 ?

notes: https://docs.vllm.ai/en/latest/models/engine_args.html
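
A minimal sketch, assuming a simple wrapper that merges these defaults with user overrides before launching vLLM's OpenAI-compatible server; the dtype placeholder and the wrapper shape are assumptions, not the final design.

# Illustrative only: assembles the candidate defaults listed above and launches
# vLLM's OpenAI-compatible server. Values marked "?" above are still open questions.
import subprocess

DEFAULT_ENGINE_ARGS = {
    "model": "/workspace/tfs/weights",   # locally mounted preset weights
    "dtype": "float16",                  # assumed placeholder; the default dtype is still under discussion
    "gpu-memory-utilization": "0.9",
    "swap-space": "4",
    "cpu-offload-gb": "0",
}

def build_command(overrides=None):
    """Merge user-supplied engine args over the defaults and build the command line."""
    args = {**DEFAULT_ENGINE_ARGS, **(overrides or {})}
    cmd = ["python", "-m", "vllm.entrypoints.openai.api_server"]
    for key, value in args.items():
        cmd += [f"--{key}", str(value)]
    return cmd

if __name__ == "__main__":
    # Example: raise the swap space, as in the OOM experiment below.
    subprocess.run(build_command({"swap-space": "8"}), check=True)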

TODO

Appendix

Support matrix

  • huggingface: 272 supported models (https://huggingface.co/docs/transformers/index#supported-models-and-frameworks)
  • vLLM: 78 supported models (https://docs.vllm.ai/en/latest/models/supported_models.html)
  • TensorRT-LLM: 54 supported models (https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html)

Performance benchmark

At the end of this blog.

Out of Memory problem

  1. Start a vLLM inference server with zero CPU swap space.
python ./inference_api_vllm.py --swap-space 0

Make a request that generates a large number of sequences.

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:5000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

completion = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "What is kubernetes?"
        }
    ],
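    # request a very large number of completions to exhaust the KV cache and CPU swap space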
    n=10000,
)

print(completion.choices[0].message)

The server exits with an error:

INFO 10-14 11:24:04 logger.py:36] Received request chat-90e9bde7074e402bb284fd0ab0c7d7e8: prompt: '<s><|user|>\nWhat is kubernetes?<|end|>\n<|assistant|>\n', params: SamplingParams(n=100, best_of=100, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4087, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1, 32010, 1724, 338, 413, 17547, 29973, 32007, 32001], lora_request: None, prompt_adapter_request: None.
INFO 10-14 11:24:04 engine.py:288] Added request chat-90e9bde7074e402bb284fd0ab0c7d7e8.
WARNING 10-14 11:24:09 scheduler.py:1439] Sequence group chat-90e9bde7074e402bb284fd0ab0c7d7e8 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
CRITICAL 10-14 11:24:09 launcher.py:72] AsyncLLMEngine has failed, terminating server process
INFO:     ::1:58040 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-14 11:24:09 engine.py:157] RuntimeError('Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error.')
  2. Start the server with a larger swap space.
python ./inference_api_vllm.py --swap-space 8

The request was successfully processed.

INFO 10-14 11:28:42 logger.py:36] Received request chat-f9f440781d3a45e5be01e8f3fd16f661: prompt: '<s><|user|>\nWhat is kubernetes?<|end|>\n<|assistant|>\n', params: SamplingParams(n=100, best_of=100, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4087, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1, 32010, 1724, 338, 413, 17547, 29973, 32007, 32001], lora_request: None, prompt_adapter_request: None.
INFO 10-14 11:28:42 engine.py:288] Added request chat-f9f440781d3a45e5be01e8f3fd16f661.
WARNING 10-14 11:28:47 scheduler.py:1439] Sequence group chat-f9f440781d3a45e5be01e8f3fd16f661 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
WARNING 10-14 11:28:48 scheduler.py:691] Failing the request chat-f9f440781d3a45e5be01e8f3fd16f661 because there's not enough kv cache blocks to run the entire sequence.
INFO:     ::1:50168 - "POST /v1/chat/completions HTTP/1.1" 200 OK

@zhuangqh zhuangqh moved this to In Progress in Kaito Nov 4, 2024
@zhuangqh zhuangqh added this to Kaito Nov 4, 2024
@zawachte
Contributor

Any interest in supporting ollama? Would be good for folks trying to play around with kaito that have no gpu quota 😄.

@zhuangqh
Collaborator Author

Any interest in supporting ollama? Would be good for folks trying to play around with kaito that have no gpu quota 😄.

Currently, we are focusing on GPU support. 😊

@zhuangqh zhuangqh changed the title Support more llm runtime Support vllm runtime Dec 5, 2024
@zhuangqh zhuangqh closed this as completed Dec 6, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Kaito Dec 6, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Kaito Roadmap Dec 6, 2024