upgrade vllm inference demo to use 0.7.0 and VLLM_USE_V1. #1064
I upgraded to the newest version of vLLM (0.7.0), which includes an alpha version of its substantially faster V1 engine and a refactor of model configuration. If people are reusing these examples for demos and projects, this should be helpful.
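For context, the gist of the change is pinning the new release and opting into the V1 engine via an environment variable. A minimal sketch of what that looks like in a Modal image definition (the image layers and Python version here are illustrative, not the exact diff):

```python
import modal

# Pin vllm to the new minor version (per the repo's pinning checklist)
# and opt into the alpha V1 engine via an environment variable.
vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("vllm==0.7.0")
    .env({"VLLM_USE_V1": "1"})
)
```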
Big speedup, especially at high concurrency. Here are some numbers from testing with Llama3-70B-fp8 on 1 H100, comparing:

- `vllm==0.7.0` with `VLLM_USE_V1=1`
- `vllm==0.6.3.post1`

See full results: https://gist.github.com/2timesjay/ebc7773aa8fb01115172f37dae86bc47
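For anyone wanting to reproduce a comparison like this, a rough sketch using vLLM's offline API (the model and request count are placeholders, not the benchmark setup from the gist):

```python
import os
import time

# The V1 engine is opt-in; the variable must be set before vllm is imported.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# Placeholder model; the numbers above used an fp8 Llama3-70B on one H100.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(max_tokens=128)

# Submit many prompts at once to exercise high-concurrency batching.
prompts = ["Formula for room temperature superconductor:"] * 64
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s")
```

Running the same script with `VLLM_USE_V1` unset (and the older pin installed) gives the baseline side of the comparison.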
Type of Change
Checklist
(all of these are satisfied by keeping the changes to a minimum)
- `lambda-test: false` is added to example frontmatter (`---`)
- Example runs with `modal run`, or an alternative `cmd` is provided in the example frontmatter (e.g. `cmd: ["modal", "deploy"]`)
- Required `args` are provided in the example frontmatter (e.g. `args: ["--prompt", "Formula for room temperature superconductor:"]`)
- No dependency uses `latest`; a `python_version` is specified for the base image, if it is used
- Dependencies are pinned to at least minor version, `~=x.y.z` or `==x.y`; dependencies with version < 1 are pinned to patch version, `==0.y.z`
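For illustration, frontmatter in these examples lives in a comment block at the top of the Python file; a hypothetical block combining the keys from the checklist above (values are placeholders, not from this PR):

```python
# ---
# cmd: ["modal", "deploy"]
# args: ["--prompt", "Formula for room temperature superconductor:"]
# lambda-test: false
# ---
```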
Outside contributors
Jacob Jensen (2timesjay)