upgrade vllm inference demo to use 0.7.0 and VLLM_USE_V1. #1064
I upgraded to the newest version of vLLM (0.7.0), which includes an alpha version of its substantially faster V1 engine and a refactor of model configuration. If people are reusing these examples for demos and projects, this should be helpful.
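For context, the gist of the change is pinning the new release and opting into the V1 engine via an environment variable. A minimal sketch of what that looks like in a Modal image definition (the image layers and Python version here are illustrative, not the exact diff):

```python
import modal

# Pin vllm to the new minor version (per the repo's pinning checklist)
# and opt into the alpha V1 engine via an environment variable.
vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("vllm==0.7.0")
    .env({"VLLM_USE_V1": "1"})
)
```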
Big speedup, especially at high concurrency. Here are some numbers from testing with Llama3-70B-fp8 on 1 H100, comparing:

- `vllm==0.7.0` with `VLLM_USE_V1=1`
- `vllm==0.6.3.post1`

See full results: https://gist.github.com/2timesjay/ebc7773aa8fb01115172f37dae86bc47
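For anyone wanting to reproduce a comparison like this, a rough sketch using vLLM's offline API (the model and request count are placeholders, not the benchmark setup from the gist):

```python
import os
import time

# The V1 engine is opt-in; the variable must be set before vllm is imported.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# Placeholder model; the numbers above used an fp8 Llama3-70B on one H100.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(max_tokens=128)

# Submit many prompts at once to exercise high-concurrency batching.
prompts = ["Formula for room temperature superconductor:"] * 64
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s")
```

Running the same script with `VLLM_USE_V1` unset (and the older pin installed) gives the baseline side of the comparison.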
Type of Change
Checklist
(all of these are satisfied by keeping the changes to a minimum)
- `lambda-test: false` is added to example frontmatter (`---`)
- Example runs with `modal run`, or an alternative `cmd` is provided in the example frontmatter (e.g. `cmd: ["modal", "deploy"]`)
- Required `args` are provided in the example frontmatter (e.g. `args: ["--prompt", "Formula for room temperature superconductor:"]`)
- No dependency uses `latest`; a `python_version` is specified for the base image, if it is used
- Dependencies are pinned to at least minor version, `~=x.y.z` or `==x.y`; dependencies with version < 1 are pinned to patch version, `==0.y.z`
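For illustration, frontmatter in these examples lives in a comment block at the top of the Python file; a hypothetical block combining the keys from the checklist above (values are placeholders, not from this PR):

```python
# ---
# cmd: ["modal", "deploy"]
# args: ["--prompt", "Formula for room temperature superconductor:"]
# lambda-test: false
# ---
```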
Outside contributors
Jacob Jensen (2timesjay)