
[Bug] Long output and issues when running benchmark_serving.py on DeepSeek-V3 #2746

lhl opened this issue Jan 6, 2025 · 5 comments

lhl commented Jan 6, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I have a curious issue. I am testing sglang running DeepSeek-V3 on 2 x 8 x H100 (refreshingly easy setup) w/ the latest pip package (sglang-0.4.1.post4). This is an isolated mamba venv on an AWS SageMaker instance w/ just the latest stable PyTorch (2.5.1+cu121) and the sglang pip package installed (sglang.check_env output at the bottom).

I am running a basic concurrency sweep with vLLM's benchmark_serving.py:

for c in 64 128 256 512; do echo "Running with concurrency $c..." && python ~/vllm/benchmarks/benchmark_serving.py --backend openai-chat --host localhost --port 30000 --endpoint='/v1/chat/completions' --model "deepseek-ai/DeepSeek-V3" --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1024 --max-concurrency $c --seed 42; done

This works fine (if very slow atm w/ DeepSeek-V3) in vLLM and, while slightly non-deterministic, it consistently outputs ~196,500 tokens. At concurrency 64, the latest vLLM output looks like this (also at tp16):

============ Serving Benchmark Result ============
Successful requests:                     1024
Benchmark duration (s):                  849.46
Total input tokens:                      229783
Total generated tokens:                  196323
Request throughput (req/s):              1.21
Output token throughput (tok/s):         231.11
Total Token throughput (tok/s):          501.62
---------------Time to First Token----------------
Mean TTFT (ms):                          2639.01
Median TTFT (ms):                        4076.09
P99 TTFT (ms):                           5234.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          319.29
Median TPOT (ms):                        236.82
P99 TPOT (ms):                           1464.94
---------------Inter-token Latency----------------
Mean ITL (ms):                           4421.57
Median ITL (ms):                         4402.53
P99 ITL (ms):                            5306.83
==================================================

Maybe of interest: here's what concurrency 64 looks like on sglang - much faster and generally more responsive; however, it generates a lot more tokens for some reason:

============ Serving Benchmark Result ============
Successful requests:                     1024
Benchmark duration (s):                  1185.05
Total input tokens:                      229783
Total generated tokens:                  720840
Request throughput (req/s):              0.86
Output token throughput (tok/s):         608.28
Total Token throughput (tok/s):          802.18
---------------Time to First Token----------------
Mean TTFT (ms):                          1291.68
Median TTFT (ms):                        345.92
P99 TTFT (ms):                           17530.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          86.67
Median TPOT (ms):                        85.40
P99 TPOT (ms):                           110.14
---------------Inter-token Latency----------------
Mean ITL (ms):                           85.37
Median ITL (ms):                         66.07
P99 ITL (ms):                            298.63
==================================================

There are always strays at the end of the loop, with requests generating tens of thousands if not 100K+ tokens. At ~20 tok/s this can go on for quite a while, as you can imagine. I decided to just fix the output length with --sharegpt-output-len 1024 and rerun the script for sglang.
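That is, roughly the same loop as above with the output length pinned:

for c in 64 128 256 512; do echo "Running with concurrency $c..." && python ~/vllm/benchmarks/benchmark_serving.py --backend openai-chat --host localhost --port 30000 --endpoint='/v1/chat/completions' --model "deepseek-ai/DeepSeek-V3" --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --sharegpt-output-len 1024 --num-prompts 1024 --max-concurrency $c --seed 42; done

I left this running overnight and the c=64 run seemed to mostly work, but you can see something weird - the output token throughput is quite low, and only 1023 requests finished: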

============ Serving Benchmark Result ============
Successful requests:                     1023
Benchmark duration (s):                  19962.99
Total input tokens:                      239590
Total generated tokens:                  908359
Request throughput (req/s):              0.05
Output token throughput (tok/s):         45.50
Total Token throughput (tok/s):          57.50
---------------Time to First Token----------------
Mean TTFT (ms):                          439.19
Median TTFT (ms):                        342.20
P99 TTFT (ms):                           1815.35
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          83.64
Median TPOT (ms):                        84.31
P99 TPOT (ms):                           95.29
---------------Inter-token Latency----------------
Mean ITL (ms):                           88.22
Median ITL (ms):                         66.73
P99 ITL (ms):                            296.43
==================================================

For one request with unfinished output, the client reports:

Token indices sequence length is longer than the specified maximum sequence length for this model (163839 > 131072). Running this sequence through the model will result in indexing errors

Here's what is logged on the server side. Interestingly, it looks like the second-to-last request just goes on forever (300K+ tokens, far beyond the context window) until it OOMs:

[2025-01-06 02:58:11 TP0] Decode batch. #running-req: 2, #token: 307481, token usage: 1.00, gen throughput (token/s): 10.92, #queue-req: 0
[2025-01-06 02:58:19 TP0] Decode batch. #running-req: 2, #token: 307561, token usage: 1.00, gen throughput (token/s): 10.92, #queue-req: 0
[2025-01-06 02:58:26 TP0] Decode batch. #running-req: 2, #token: 307641, token usage: 1.00, gen throughput (token/s): 10.92, #queue-req: 0
[2025-01-06 02:58:33 TP0] Decode batch. #running-req: 2, #token: 307721, token usage: 1.00, gen throughput (token/s): 10.94, #queue-req: 0
[2025-01-06 02:58:41 TP0] Decode batch. #running-req: 2, #token: 307801, token usage: 1.00, gen throughput (token/s): 10.93, #queue-req: 0
[2025-01-06 02:58:44 TP0] Decode out of memory happened. #retracted_reqs: 1, #new_token_ratio: 0.0980 -> 0.9446
[2025-01-06 02:58:44 TP5] Decode out of memory happened. #retracted_reqs: 1, #new_token_ratio: 0.0980 -> 0.9446
[2025-01-06 02:58:44 TP7] Decode out of memory happened. #retracted_reqs: 1, #new_token_ratio: 0.0980 -> 0.9446
[2025-01-06 02:58:44 TP2] Decode out of memory happened. #retracted_reqs: 1, #new_token_ratio: 0.0980 -> 0.9446
[2025-01-06 02:58:44 TP6] Decode out of memory happened. #retracted_reqs: 1, #new_token_ratio: 0.0980 -> 0.9446
[2025-01-06 02:58:44 TP4] Decode out of memory happened. #retracted_reqs: 1, #new_token_ratio: 0.0980 -> 0.9446
[2025-01-06 02:58:44 TP3] Decode out of memory happened. #retracted_reqs: 1, #new_token_ratio: 0.0980 -> 0.9446
[2025-01-06 02:58:44 TP1] Decode out of memory happened. #retracted_reqs: 1, #new_token_ratio: 0.0980 -> 0.9446

And then there's one remaining request that keeps going on for a while until it apparently hangs and times out:

[2025-01-06 02:58:48 TP0] Decode batch. #running-req: 1, #token: 154772, token usage: 0.50, gen throughput (token/s): 8.16, #queue-req: 1
...
[2025-01-06 03:25:56 TP0] Decode batch. #running-req: 1, #token: 163652, token usage: 0.53, gen throughput (token/s): 5.30, #queue-req: 1
[2025-01-06 03:26:03 TP0] Decode batch. #running-req: 1, #token: 163692, token usage: 0.53, gen throughput (token/s): 5.28, #queue-req: 1
[2025-01-06 03:26:11 TP0] Decode batch. #running-req: 1, #token: 163732, token usage: 0.53, gen throughput (token/s): 5.29, #queue-req: 1
[2025-01-06 03:26:19 TP0] Decode batch. #running-req: 1, #token: 163772, token usage: 0.53, gen throughput (token/s): 5.30, #queue-req: 1
[2025-01-06 03:26:26 TP0] Decode batch. #running-req: 1, #token: 163812, token usage: 0.53, gen throughput (token/s): 5.30, #queue-req: 1
[2025-01-06 03:26:31 TP0] Prefill batch. #new-seq: 1, #new-token: 8192, #cached-token: 236, cache hit rate: 2.04%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-06 03:26:31 TP0] Prefill batch. #new-seq: 1, #new-token: 8192, #cached-token: 0, cache hit rate: 2.02%, token usage: 0.03, #running-req: 0, #queue-req: 1
[2025-01-06 03:31:34 TP5] Watchdog timeout (self.watchdog_timeout=300)
[2025-01-06 03:31:34 TP2] Watchdog timeout (self.watchdog_timeout=300)
[2025-01-06 03:31:34 TP6] Watchdog timeout (self.watchdog_timeout=300)
[2025-01-06 03:31:34 TP4] Watchdog timeout (self.watchdog_timeout=300)
[2025-01-06 03:31:34 TP1] Watchdog timeout (self.watchdog_timeout=300)
[2025-01-06 03:31:34 TP3] Watchdog timeout (self.watchdog_timeout=300)
[2025-01-06 03:31:34 TP0] Watchdog timeout (self.watchdog_timeout=300)
[2025-01-06 03:31:34 TP7] Watchdog timeout (self.watchdog_timeout=300)
Killed

I'm brand new to running sglang, but here are a few potential issues:

  • I know that results aren't deterministic (https://sgl-project.github.io/references/faq.html), but it seems the output just keeps going on - is this an EOS/stop token issue, or is something else going on?

  • Does sglang not respect max_tokens or the model's context-window length? It seems like it shouldn't be going past that at the very least. Checking the command parameters, it looks like it should at least respect the model's max context length by default (a quick per-request sanity check is sketched right after the flag description below):

  --context-length CONTEXT_LENGTH
                        The model's maximum context length. Defaults to None (will use the value from the model's config.json instead).
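
As that sanity check, a single request to the same OpenAI-compatible endpoint with max_tokens pinned should show whether the cap is honored per request (if it is, finish_reason should come back as "length" with usage.completion_tokens at 1024); the prompt text here is just a placeholder:

# single request against the running sglang server on port 30000
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3",
        "messages": [{"role": "user", "content": "Write a very long story."}],
        "max_tokens": 1024
      }'

And, per the --context-length help text above, the window can presumably also be pinned explicitly at launch, e.g. by adding --context-length 131072 (the limit the client-side tokenizer warning mentions) to the launch_server command.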

Reproduction

I am running it without additional options:

# node0
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init node0:50000 --nnodes 2 --node-rank 0 --trust-remote-code

# node1
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init node0:50000 --nnodes 2 --node-rank 1 --trust-remote-code

I am running DeepSeek-V3

Environment

$ python3 -m sglang.check_env
[2025-01-06 05:52:40] INFO _client.py:1038: HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
Python: 3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:24:40) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda-12.1
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 550.127.05
PyTorch: 2.5.1+cu121
sglang: 0.4.1.post4
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.47.1
torchao: 0.7.0
numpy: 1.26.3
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.8
huggingface_hub: 0.27.0
interegular: 0.3.3
modelscope: 1.21.0
orjson: 3.10.12
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.4
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.58.1
anthropic: 0.42.0
decord: 0.6.0
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    0-47    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    0-47    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    0-47    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    0-47    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    48-95   1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    48-95   1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    48-95   1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      48-95   1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 8192

zhyncs commented Jan 6, 2025

May you try python3 -m sglang.bench_serving --backend sglang --num-prompts 1024 instead?


jischein commented Jan 7, 2025

May you try python3 -m sglang.bench_serving --backend sglang --num-prompts 1024 instead?

I tried running this on a similar configuration (2x8xH100). The requests hang and my server crashes.


jischein commented Jan 7, 2025

python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.233.88.177:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 8001

What I am seeing (on node 2):

[2025-01-06 17:48:40 TP15] Using configuration from /opt/vllm-foundry/env/src/sglang/python/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-06 17:48:40 TP14] Using configuration from /opt/vllm-foundry/env/src/sglang/python/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-06 17:48:40 TP8] Using configuration from /opt/vllm-foundry/env/src/sglang/python/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-06 17:48:40 TP10] Using configuration from /opt/vllm-foundry/env/src/sglang/python/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-06 17:48:40 TP12] Using configuration from /opt/vllm-foundry/env/src/sglang/python/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-06 17:48:40 TP13] Using configuration from /opt/vllm-foundry/env/src/sglang/python/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-06 17:48:40 TP11] Using configuration from /opt/vllm-foundry/env/src/sglang/python/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-06 17:48:40 TP9] Using configuration from /opt/vllm-foundry/env/src/sglang/python/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-07 15:55:04 TP8] Cache flushed successfully!
[2025-01-07 15:55:04 TP10] Cache flushed successfully!
[2025-01-07 15:55:04 TP11] Cache flushed successfully!
[2025-01-07 15:55:04 TP12] Cache flushed successfully!
[2025-01-07 15:55:04 TP14] Cache flushed successfully!
[2025-01-07 15:55:04 TP9] Cache flushed successfully!
[2025-01-07 15:55:04 TP13] Cache flushed successfully!
[2025-01-07 15:55:04 TP15] Cache flushed successfully!

On my node running the server (node 1):

[2025-01-07 15:55:07] INFO:     127.0.0.1:37978 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:37994 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38010 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38022 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38030 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38038 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38052 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38068 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38074 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38084 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38094 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38106 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38116 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38126 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38134 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38150 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38152 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38162 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38178 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38192 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38196 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38208 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38216 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38232 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38234 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38236 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38248 - "POST /generate HTTP/1.1" 200 OK
[2025-01-07 15:55:07] INFO:     127.0.0.1:38254 - "POST /generate HTTP/1.1" 200 OK


jischein commented Jan 7, 2025

Note: I was able to run the benchmark by adding additional arguments found in this comment:

python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.233.88.177:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 8001 --watchdog-timeout 36000 --max-running-requests 200 --schedule-conservativeness 1.2 --max-total-tokens 1638400 --enable-torch-compile --kv-cache-dtype fp8_e5m2 --mem-fraction-static 0.9 --disable-cuda-graph

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max reqeuest concurrency:                not set
Successful requests:                     1024
Benchmark duration (s):                  708.26
Total input tokens:                      234540
Total generated tokens:                  196165
Total generated tokens (retokenized):    195258
Request throughput (req/s):              1.45
Input token throughput (tok/s):          331.15
Output token throughput (tok/s):         276.97
Total token throughput (tok/s):          608.12
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   262832.30
Median E2E Latency (ms):                 267866.26
---------------Time to First Token----------------
Mean TTFT (ms):                          183638.60
Median TTFT (ms):                        194075.98
P99 TTFT (ms):                           386002.62
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          499.99
Median TPOT (ms):                        420.31
P99 TPOT (ms):                           1934.15
---------------Inter-token Latency----------------
Mean ITL (ms):                           416.74
Median ITL (ms):                         375.70
P99 ITL (ms):                            2492.53


lhl commented Jan 7, 2025

May you try python3 -m sglang.bench_serving --backend sglang --num-prompts 1024 instead?

Hi, sorry for the delay; the nodes were busy yesterday, so I just got a chance to revisit. This seems to run without the overruns:

(sglang) ubuntu@ip-10-1-1-135:~$ python3 -m sglang.bench_serving --backend sglang --num-prompts 1024
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='sharegpt', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1024, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.0, request_rate=inf, max_concurrency=None, seed=1, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)

Downloading from https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json to /tmp/ShareGPT_V3_unfiltered_cleaned_split.json
/tmp/ShareGPT_V3_unfiltered_cleaned_split.json: 100%|███████████████████████████████████████████████████████████████████████████████████| 642M/642M [00:18<00:00, 35.9MB/s]
#Input tokens: 234540
#Output tokens: 196165
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [03:15<00:00,  5.23it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max reqeuest concurrency:                not set
Successful requests:                     1024
Benchmark duration (s):                  195.62
Total input tokens:                      234540
Total generated tokens:                  196165
Total generated tokens (retokenized):    195255
Request throughput (req/s):              5.23
Input token throughput (tok/s):          1198.97
Output token throughput (tok/s):         1002.80
Total token throughput (tok/s):          2201.76
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   88791.54
Median E2E Latency (ms):                 79855.56
---------------Time to First Token----------------
Mean TTFT (ms):                          28480.15
Median TTFT (ms):                        23161.81
P99 TTFT (ms):                           54696.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          734.86
Median TPOT (ms):                        395.56
P99 TPOT (ms):                           4234.36
---------------Inter-token Latency----------------
Mean ITL (ms):                           317.39
Median ITL (ms):                         193.56
P99 ITL (ms):                            3201.96
==================================================

I then ran the follow-up sweep using the same seed as the vLLM benchmark_serving.py run:

(sglang) ubuntu@ip-10-1-1-135:~$ for c in 64 128 256 512; do echo "Running with concurrency $c..." && python3 -m sglang.bench_serving --backend sglang --num-prompts 1024 --max-concurrency $c --seed 42; done
Running with concurrency 64...
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='sharegpt', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1024, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.0, request_rate=inf, max_concurrency=64, seed=42, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)

#Input tokens: 229783
#Output tokens: 207591
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [06:38<00:00,  2.57it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max reqeuest concurrency:                64
Successful requests:                     1024
Benchmark duration (s):                  398.99
Total input tokens:                      229783
Total generated tokens:                  207591
Total generated tokens (retokenized):    206702
Request throughput (req/s):              2.57
Input token throughput (tok/s):          575.91
Output token throughput (tok/s):         520.29
Total token throughput (tok/s):          1096.20
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   23386.87
Median E2E Latency (ms):                 15854.39
---------------Time to First Token----------------
Mean TTFT (ms):                          494.63
Median TTFT (ms):                        409.65
P99 TTFT (ms):                           1562.19
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          112.12
Median TPOT (ms):                        116.39
P99 TPOT (ms):                           156.07
---------------Inter-token Latency----------------
Mean ITL (ms):                           113.76
Median ITL (ms):                         65.04
P99 ITL (ms):                            489.22
==================================================

...

And it seems to run w/o problems, so that's a bit of a mystery. It looks like I can use sglang.bench_serving for testing vLLM, so maybe I'll try the reverse (see the sketch below) and not worry too much about the strange behavior I observed with benchmark_serving.py...
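
Probably something along these lines, pointing bench_serving's vllm backend at wherever the vLLM server is listening (host/port below are placeholders, assuming vLLM's default 8000):

# same sweep, but driving the vLLM server with sglang's bench_serving client
for c in 64 128 256 512; do echo "Running with concurrency $c..." && python3 -m sglang.bench_serving --backend vllm --host localhost --port 8000 --model "deepseek-ai/DeepSeek-V3" --dataset-name sharegpt --num-prompts 1024 --max-concurrency $c --seed 42; done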
