Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow performance with the latest outlines + latest (or previous than latest vllm) #1351

Open
denadai2 opened this issue Dec 30, 2024 · 3 comments
Labels

Comments

@denadai2
Copy link
Contributor

denadai2 commented Dec 30, 2024

Describe the issue as clearly as possible:

I have some issues with vllm + outlines. It seems the performance is WAAAY worse in the new outlines version. Could you help me? Using it with an H100. I observe that the GPU compute utilization is much lower.

Note that here I use llama 3.2 1B, but the difference increase with larger models: llama 3.3 70B in a private usecase I have has ~ 800 output tokens per second while the new outlines has ~ 70 of them.

New outlines e.g. vllm v0.6.5 (see the processed prompts speed) - outlines 0.1.8

WARNING 12-30 13:46:30 cuda.py:32] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
INFO 12-30 13:46:43 config.py:478] This model supports multiple tasks: {'score', 'reward', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
WARNING 12-30 13:46:43 arg_utils.py:1086] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 12-30 13:46:43 config.py:1364] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 12-30 13:46:43 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='/data/models/llama-v3.2-1b-it/', speculative_config=None, tokenizer='/data/models/llama-v3.2-1b-it/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/models/llama-v3.2-1b-it/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
INFO 12-30 13:46:45 selector.py:120] Using Flash Attention backend.
INFO 12-30 13:46:48 model_runner.py:1092] Starting to load model /data/models/llama-v3.2-1b-it/...
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.80it/s]
INFO 12-30 13:46:49 model_runner.py:1097] Loading model weights took 2.3185 GB
INFO 12-30 13:46:49 worker.py:241] Memory profiling takes 0.46 seconds
INFO 12-30 13:46:49 worker.py:241] the current vLLM instance can use total_gpu_memory (79.10GiB) x gpu_memory_utilization (0.90) = 71.19GiB
INFO 12-30 13:46:49 worker.py:241] model weights take 2.32GiB; non_torch_memory takes 0.17GiB; PyTorch activation peak memory takes 1.20GiB; the rest of the memory reserved for KV Cache is 67.50GiB.
INFO 12-30 13:46:50 gpu_executor.py:76] # GPU blocks: 138231, # CPU blocks: 8192
INFO 12-30 13:46:50 gpu_executor.py:80] Maximum concurrency for 131072 tokens per request: 16.87x
INFO 12-30 13:46:51 model_runner.py:1413] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-30 13:46:51 model_runner.py:1417] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 12-30 13:47:01 model_runner.py:1527] Graph capturing finished in 10 secs, took 0.21 GiB
INFO 12-30 13:47:01 llm_engine.py:446] init engine (profile, create kv cache, warmup model) took 12.39 seconds
Processed prompts: 100%|██████████| 100[/100](http://10.169.23.198:8081/100) [01:34<00:00,  1.06it[/s](http://10.169.23.198:8081/s), est. speed input: 81.68 toks[/s](http://10.169.23.198:8081/s), output: 104.59 toks[/s](http://10.169.23.198:8081/s)]

Old outlines vllm v0.6.1 - outlines 0.0.46

WARNING 12-30 13:48:29 arg_utils.py:930] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:29 config.py:1010] Chunked prefill is enabled with max_num_batched_tokens=512.
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:29 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/data/models/llama-v3.2-1b-it/', speculative_config=None, tokenizer='/data/models/llama-v3.2-1b-it/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/models/llama-v3.2-1b-it/, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:33 model_runner.py:1014] Starting to load model /data/models/llama-v3.2-1b-it/...
Loading safetensors checkpoint shards:   0% Completed | 0[/1](http://10.169.59.132:8081/1) [00:00<?, ?it[/s](http://10.169.59.132:8081/s)]
Loading safetensors checkpoint shards: 100% Completed | 1[/1](http://10.169.59.132:8081/1) [00:00<00:00,  1.98it[/s](http://10.169.59.132:8081/s)]
Loading safetensors checkpoint shards: 100% Completed | 1[/1](http://10.169.59.132:8081/1) [00:00<00:00,  1.98it[/s](http://10.169.59.132:8081/s)]
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:34 model_runner.py:1025] Loading model weights took 2.3185 GB
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:34 gpu_executor.py:122] # GPU blocks: 138166, # CPU blocks: 8192
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:36 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:36 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:46 model_runner.py:1456] Graph capturing finished in 10 secs.
Compiling FSM index for all state transitions:   0%|          | 0[/47](http://10.169.59.132:8081/47) [00:00<?, ?it[/s](http://10.169.59.132:8081/s)]
Compiling FSM index for all state transitions:   2%|▏         | 1[/47](http://10.169.59.132:8081/47) [00:00<00:12,  3.63it[/s](http://10.169.59.132:8081/s)]
Compiling FSM index for all state transitions:  11%|█         | 5[/47](http://10.169.59.132:8081/47) [00:00<00:02, 15.28it[/s](http://10.169.59.132:8081/s)]
Compiling FSM index for all state transitions:  19%|█▉        | 9[/47](http://10.169.59.132:8081/47) [00:00<00:01, 22.37it[/s](http://10.169.59.132:8081/s)]
Compiling FSM index for all state transitions:  28%|██▊       | 13[/47](http://10.169.59.132:8081/47) [00:00<00:01, 20.41it[/s](http://10.169.59.132:8081/s)]
Compiling FSM index for all state transitions:  36%|███▌      | 17[/47](http://10.169.59.132:8081/47) [00:00<00:01, 23.81it[/s](http://10.169.59.132:8081/s)]
Compiling FSM index for all state transitions:  45%|████▍     | 21[/47](http://10.169.59.132:8081/47) [00:00<00:00, 26.30it[/s](http://10.169.59.132:8081/s)]
Compiling FSM index for all state transitions:  53%|█████▎    | 25[/47](http://10.169.59.132:8081/47) [00:01<00:00, 28.93it[/s](http://10.169.59.132:8081/s)]
Compiling FSM index for all state transitions:  62%|██████▏   | 29[/47](http://10.169.59.132:8081/47) [00:01<00:00, 31.01it[/s](http://10.169.59.132:8081/s)]
Compiling FSM index for all state transitions:  70%|███████   | 33[/47](http://10.169.59.132:8081/47) [00:01<00:00, 32.52it[/s](http://10.169.59.132:8081/s)]
Compiling FSM index for all state transitions:  79%|███████▊  | 37[/47](http://10.169.59.132:8081/47) [00:01<00:00, 32.99it[/s](http://10.169.59.132:8081/s)]
Compiling FSM index for all state transitions:  87%|████████▋ | 41[/47](http://10.169.59.132:8081/47) [00:01<00:00, 26.43it[/s](http://10.169.59.132:8081/s)]
Compiling FSM index for all state transitions:  96%|█████████▌| 45[/47](http://10.169.59.132:8081/47) [00:01<00:00, 28.13it[/s](http://10.169.59.132:8081/s)]
Compiling FSM index for all state transitions: 100%|██████████| 47[/47](http://10.169.59.132:8081/47) [00:02<00:00, 21.38it[/s](http://10.169.59.132:8081/s)]
Processed prompts:   0%|          | 0[/100](http://10.169.59.132:8081/100) [00:00<?, ?it[/s](http://10.169.59.132:8081/s), est. speed input: 0.00 toks[/s](http://10.169.59.132:8081/s), output: 0.00 toks[/s](http://10.169.59.132:8081/s)]
Processed prompts:   1%|          | 1[/100](http://10.169.59.132:8081/100) [00:00<00:34,  2.83it[/s](http://10.169.59.132:8081/s), est. speed input: 203.86 toks[/s](http://10.169.59.132:8081/s), output: 50.96 toks[/s](http://10.169.59.132:8081/s)]
Processed prompts:  12%|█▏        | 12[/100](http://10.169.59.132:8081/100) [00:00<00:02, 32.77it[/s](http://10.169.59.132:8081/s), est. speed input: 1866.35 toks[/s](http://10.169.59.132:8081/s), output: 520.57 toks[/s](http://10.169.59.132:8081/s)]
Processed prompts:  61%|██████    | 61[/100](http://10.169.59.132:8081/100) [00:00<00:00, 134.34it[/s](http://10.169.59.132:8081/s), est. speed input: 6475.71 toks[/s](http://10.169.59.132:8081/s), output: 2667.18 toks[/s](http://10.169.59.132:8081/s)]
Processed prompts:  93%|█████████▎| 93[/100](http://10.169.59.132:8081/100) [00:00<00:00, 179.39it[/s](http://10.169.59.132:8081/s), est. speed input: 8432.94 toks[/s](http://10.169.59.132:8081/s), output: 4387.65 toks[/s](http://10.169.59.132:8081/s)]
Processed prompts: 100%|██████████| 100[/100](http://10.169.59.132:8081/100) [00:01<00:00, 94.83it[/s](http://10.169.59.132:8081/s), est. speed input: 6828.35 toks[/s](http://10.169.59.132:8081/s), output: 3900.61 toks[/s](http://10.169.59.132:8081/s)]

Steps/code to reproduce the bug:

New outlines

"""Example of integrating `outlines` with `vllm`."""

import vllm
from pydantic import BaseModel
from transformers import AutoTokenizer
from outlines.models.vllm import adapt_tokenizer

from outlines.processors import JSONLogitsProcessor


class Person(BaseModel):
    name: str
    description: str

MODEL_ID = "/data/models/llama-v3.2-1b-it/"
llm = vllm.LLM(model=MODEL_ID)
tokenizer = adapt_tokenizer(AutoTokenizer.from_pretrained(MODEL_ID))
logits_processor = JSONLogitsProcessor(schema=Person, tokenizer=tokenizer)
result = llm.generate(
    ["""<s>[INST] <<SYS>>
    You are a json text extractor. return the following json {"name": "the game name", "description": "description of the game in around 400 words"}
    <</SYS>>
    { CD Projekt Red is ramping up production on The Witcher 4, and of course it's looking into using AI } [/INST]"""]*100,
    sampling_params=vllm.SamplingParams(
        temperature=0.6,
        max_tokens=1024,
        logits_processors=[logits_processor],
    ),
)
print(result)

Old outlines

import ray

import vllm
from pydantic import BaseModel
    
from outlines.integrations.vllm import JSONLogitsProcessor
    
class Person(BaseModel):
    name: str
    description: str
    
MODEL_ID = "/data/models/llama-v3.2-1b-it/"
llm = vllm.LLM(model=MODEL_ID)
logits_processor = JSONLogitsProcessor(schema=Person, llm=llm)
result = llm.generate(
        ["""<s>[INST] <<SYS>>
    You are a json text extractor. return the following json {"name": "the game name", "description": "description of the game"}
    <</SYS>>
    { CD Projekt Red is ramping up production on The Witcher 4, and of course it's looking into using AI } [/INST]"""]*100,
        sampling_params=vllm.SamplingParams(
            temperature=0.6,
            max_tokens=1024,
    )
print(result)


### Expected result:

```shell
A similar speed, which happens if I remove the outlines call `logits_processors=[logits_processor],`

Error message:

No response

Outlines/Python version information:

You can see that from the logs

Context for the issue:

No response

@denadai2 denadai2 added the bug label Dec 30, 2024
@saattrupdan
Copy link
Contributor

I second this. Further, aside from taking longer, it also accumulates way more RAM during (what seems to be) the compilation stage. Uses >50GB RAM and crashes when using vLLM with ~2k samples. The previous version is way faster and only uses ~4GB RAM

@rlouf
Copy link
Member

rlouf commented Jan 15, 2025

Is this still the case on the latest release?

@kayvane1
Copy link

Yes it is, tried to upgrade to the latest version last week and had to revert the change as a result

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants