Support vllm runtime #608
Motivation

Nowadays, KAITO uses the Hugging Face transformers runtime to build the inference and tuning services. It offers an out-of-the-box experience for nearly all transformer-based models hosted on Hugging Face. However, other LLM inference libraries, such as vLLM and TensorRT-LLM, focus more on inference performance and resource efficiency, and many of our users want to use these libraries as their inference engine.

Goals
Non-Goals
Design Details

Inference server API
- inference api
- health check api
- metric
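
As a rough illustration of how a client would exercise these three endpoints, here is a minimal sketch. It assumes vLLM's OpenAI-compatible server layout (/v1/chat/completions, /health, /metrics), the port 5000 used in the example later in this proposal, and the phi-3-mini-4k-instruct preset from the workspace example below.

import requests

base = "http://localhost:5000"

# health check api: returns HTTP 200 once the engine is ready to serve
print("health:", requests.get(f"{base}/health").status_code)

# metric api: Prometheus text exposition format
print("metrics sample:", requests.get(f"{base}/metrics").text[:200])

# inference api: OpenAI-compatible chat completion
resp = requests.post(
    f"{base}/v1/chat/completions",
    json={
        "model": "phi-3-mini-4k-instruct",  # assumed preset name, see workspace example below
        "messages": [{"role": "user", "content": "What is kubernetes?"}],
        "max_tokens": 64,
    },
)
print("inference:", resp.json()["choices"][0]["message"]["content"])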
Workspace CRD change

Change the default runtime from huggingface/transformers to vllm. To keep compatibility for out-of-tree models, provide an annotation that allows the user to fall back to the huggingface/transformers runtime.

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-phi-3-mini
  annotations:
    workspace.kaito.io/runtime: "transformers"
resource:
  instanceType: "Standard_NC6s_v3"
  labelSelector:
    matchLabels:
      apps: phi-3
inference:
  preset:
    name: phi-3-mini-4k-instruct

Engine default parameter

Choose better default engine arguments for the user.
notes: https://docs.vllm.ai/en/latest/models/engine_args.html

TODO
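
As an illustration of the kind of defaults at stake, the sketch below constructs a vLLM engine with a handful of the arguments documented at the link above. The model name and the concrete values are assumptions chosen for illustration, not the defaults KAITO will ship.

from vllm import LLM, SamplingParams

# Illustrative engine arguments; the values below are assumptions, not KAITO defaults.
llm = LLM(
    model="microsoft/Phi-3-mini-4k-instruct",  # hypothetical model choice for the preset
    gpu_memory_utilization=0.90,               # fraction of GPU memory reserved for weights + KV cache
    max_model_len=4096,                        # cap the context window to bound KV-cache size
    swap_space=4,                              # GiB of CPU swap used when sequences are preempted
    tensor_parallel_size=1,                    # number of GPUs to shard the model across
)

outputs = llm.generate(["What is kubernetes?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)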
Appendix

Support matrix

Performance benchmark

At the end of this blog.

Out of Memory problem
Make a request that generates a large number of sequences.

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:5000/v1"
client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
completion = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is kubernetes?"},
    ],
    n=10000,
)
print(completion.choices[0].message)

Server will exit with error.
The request was successfully processed.
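For comparison, a bounded variant of the same request caps both the number of returned sequences and the tokens generated per sequence. This is only an illustrative sketch that continues from the client and model set up above; it is not a statement about which runtime requires it.

# Bounded variant of the request above (reuses `client` and `model` from the
# previous snippet): cap n and max_tokens so the total number of generated
# tokens stays small regardless of the runtime.
completion = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is kubernetes?"},
    ],
    n=2,             # only two candidate completions
    max_tokens=128,  # at most 128 generated tokens per completion
)
for choice in completion.choices:
    print(choice.message.content)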
Any interest in supporting ollama? Would be good for folks trying to play around with kaito that have no gpu quota 😄.
Currently, we are focusing on GPU support. 😊
Is your feature request related to a problem? Please describe.
Describe the solution you'd like
Today KAITO supports the popular Hugging Face runtime. We should support other runtimes like vLLM.
Describe alternatives you've considered
Additional context