upgrade vllm inference demo to use 0.7.0 and VLLM_USE_V1. #1064

Open · 2timesjay wants to merge 1 commit into main
Conversation

@2timesjay commented Feb 3, 2025

I upgraded to the newest version of vLLM (0.7.0), which includes an alpha version of their substantially faster V1 engine and a refactor of model configuration. If people are reusing these examples for demos and projects, this should be helpful.
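For reference, the shape of the change in a Modal-style example looks roughly like this (an illustrative sketch, not the exact diff in this PR; the base image and `python_version` are assumptions):

```python
import modal

# Pin vllm to the new release and opt in to the alpha V1 engine via its
# environment variable. (Image details here are illustrative.)
vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("vllm==0.7.0")
    .env({"VLLM_USE_V1": "1"})  # V1 is opt-in in 0.7.0
)
```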

Big speedup, especially at high concurrency. Here are some numbers from testing with Llama3-70B-fp8 on one H100:

vllm==0.7.0, VLLM_USE_V1=1

| Max Parallelism | Number of Prompts | Average Latency (s) | p95 Latency (s) | Throughput (requests/s) |
|---:|---:|---:|---:|---:|
| 8 | 32 | 3.3245 | 3.5712 | 2.3654 |
| 16 | 32 | 3.7085 | 3.8151 | 4.2802 |
| 32 | 32 | 4.5872 | 4.6662 | 6.8342 |
| 64 | 64 | 5.8669 | 6.0471 | 10.4833 |
| 128 | 128 | 8.6023 | 8.8094 | 14.3457 |
| 256 | 256 | 14.7483 | 18.9714 | 13.2442 |

vllm==0.6.3.post1

| Max Parallelism | Number of Prompts | Average Latency (s) | p95 Latency (s) | Throughput (requests/s) |
|---:|---:|---:|---:|---:|
| 8 | 32 | 4.1822 | 4.3813 | 1.9079 |
| 16 | 32 | 4.6502 | 5.1558 | 3.1282 |
| 32 | 32 | 6.9919 | 9.2724 | 3.4463 |
| 64 | 64 | 11.4092 | 18.2382 | 3.4930 |
| 128 | 128 | 75.9388 | 90.1596 | 1.4170 |
| 256 | 256 | 93.1698 | 123.8264 | 2.0409 |

see full results: https://gist.github.com/2timesjay/ebc7773aa8fb01115172f37dae86bc47
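The sweep bounds in-flight requests at a given max parallelism while measuring per-request latency and aggregate throughput. A rough sketch of that kind of harness is below (the endpoint, model name, and payload are placeholders, not the actual benchmark script from the gist; it assumes an OpenAI-compatible HTTP server):

```python
import asyncio
import time

import aiohttp  # assumption: results are gathered over HTTP


URL = "http://localhost:8000/v1/completions"  # placeholder endpoint


async def one_request(session, sem, payload, latencies):
    async with sem:  # caps in-flight requests at "Max Parallelism"
        t0 = time.perf_counter()
        async with session.post(URL, json=payload) as resp:
            await resp.read()
        latencies.append(time.perf_counter() - t0)


async def bench(num_prompts: int = 32, max_parallelism: int = 8):
    sem = asyncio.Semaphore(max_parallelism)
    latencies: list[float] = []
    payload = {"model": "llama-3-70b", "prompt": "Hello", "max_tokens": 64}  # placeholders
    t0 = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *(one_request(session, sem, payload, latencies) for _ in range(num_prompts))
        )
    wall = time.perf_counter() - t0
    lat = sorted(latencies)
    p95 = lat[int(0.95 * (len(lat) - 1))]
    print(
        f"avg={sum(lat) / len(lat):.4f}s p95={p95:.4f}s "
        f"throughput={num_prompts / wall:.4f} req/s"
    )


if __name__ == "__main__":
    asyncio.run(bench())
```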

Type of Change

  • New example
  • Example updates (Bug fixes, new features, etc.)
  • Other (changes to the codebase, but not to examples)

Checklist

(all of these are satisfied by keeping the changes to a minimum)

  • Example is testable in synthetic monitoring system, or lambda-test: false is added to example frontmatter (---)
    • Example is tested by executing with modal run or an alternative cmd is provided in the example frontmatter (e.g. cmd: ["modal", "deploy"])
    • Example is tested by running with no arguments or the args are provided in the example frontmatter (e.g. args: ["--prompt", "Formula for room temperature superconductor:"]); a frontmatter sketch follows this checklist
  • Example is documented with comments throughout, in a Literate Programming style.
  • Example does not require third-party dependencies to be installed locally
  • Example pins its dependencies
    • Example pins container images to a stable tag, not a dynamic tag like latest
    • Example specifies a python_version for the base image, if it is used
    • Example pins all dependencies to at least minor version, ~=x.y.z or ==x.y
    • Example dependencies with version < 1 are pinned to patch version, ==0.y.z
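
For reference, the frontmatter fields named above sit in comments at the top of the example file, roughly like this (the filename is illustrative, assuming the repo's comment-based frontmatter convention):

```python
# ---
# cmd: ["modal", "run", "vllm_inference.py"]
# args: ["--prompt", "Formula for room temperature superconductor:"]
# ---
```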

Outside contributors

Jacob Jensen (2timesjay)

@charlesfrye (Collaborator)

Thanks for the PR! cc @jackcook

@jackcook commented Feb 4, 2025

Looks good to me! I did some benchmarking today to look at the effects of the new V1 engine, and we're seeing similar improvements internally.

@bhaktatejas922

There are a fair number of things not yet implemented on V1; it could be worth adding a disclaimer.
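
(Since V1 is opt-in via an environment variable in 0.7.0, falling back is cheap when an example hits an unimplemented feature; the snippet below is an illustrative sketch, not part of this PR.)

```python
import os

# vllm 0.7.0 treats V1 as opt-in: leaving VLLM_USE_V1 unset, or setting it
# to "0", keeps the stable V0 engine. This must run before vllm is imported.
os.environ["VLLM_USE_V1"] = "0"
```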
