[Bug] Long output and issues when running benchmark_serving.py on DeepSeek-V3 #2746
Comments
Could you try
I tried running this on a similar configuration (2x8xH100). The requests hang and my server crashes.
What I am seeing (on node 2)
On my node running the server (node 1):
Note: I was able to run the benchmark by adding additional arguments found in this comment:
Hi, sorry for the delay; nodes were busy yesterday, so I just got a chance to revisit. This seems to run without the overruns:
I ran the follow-up script using the same seed as the vLLM run.
And it seems to run without problems. So that's a big mystery. It looks like I can use
Checklist
Describe the bug
I have a curious issue. I am testing sglang running DeepSeek-V3 on 2 x H100 nodes (refreshingly easy setup) with the latest pip package (sglang-0.4.1.post4). This is an isolated mamba venv on an AWS SageMaker instance with just the latest stable PyTorch (2.5.1+cu121) and the sglang pip package installed (collect_env output at the bottom). I am running a basic concurrency sweep with vLLM's benchmark_serving.py:
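Roughly along these lines; the host/port, dataset path, and prompt count here are placeholders rather than my exact values:

```bash
# Sweep over client concurrency against the already-running server
# (placeholder host/port/dataset values).
for c in 1 8 16 32 64; do
  python benchmark_serving.py \
    --backend sglang \
    --host 127.0.0.1 --port 30000 \
    --model deepseek-ai/DeepSeek-V3 \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1024 \
    --max-concurrency "$c"
done
```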
This works fine (if very slow at the moment with DeepSeek-V3) in vLLM and, while slightly non-deterministic, consistently outputs ~196,500 tokens. At concurrency 64, the latest vLLM output looks like this (also at tp16):
Maybe of interest, here's what concurrency 64 looks like on sglang: much faster and generally more responsive, but it generates a lot more tokens for some reason:
There are always strays at the end of the loop, with requests generating tens of thousands, if not 100K+, tokens. At ~20 tok/s this can go on for quite a while, as you can imagine. I decided to just pin the output length with --sharegpt-output-len 1024 and rerun the script for sglang (sketch below).
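That is, the same sweep as above with the per-request output length fixed (again with placeholder values):

```bash
# Same benchmark invocation, but every request's output length is pinned to 1024 tokens
python benchmark_serving.py \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model deepseek-ai/DeepSeek-V3 \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1024 \
  --max-concurrency 64 \
  --sharegpt-output-len 1024
```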
I left this running overnight, and c=64 seemed to mostly work, but you can see something weird: the output token throughput is quite low, and only 1023 requests finished. The client reports the following for the unfinished output of one request:
Here's what is logged on the server-side.
So interestingly, it looks like there's actually a second-to-last request that just goes on forever (300K tokens, far beyond the context window) until it OOMs:
And then there's one remaining request that keeps going on for a while until it apparently hangs and times out:
I'm brand new to running sglang, but here are a few potential issues:
I know that results aren't deterministic (https://sgl-project.github.io/references/faq.html), but it seems the output just keeps going. Is this an EOS/stop-token issue, or is something else going on?
Does sglang not respect max_tokens or the model's context-window length? It seems like it shouldn't be going past that at the very least. Checking the command parameters, it looks like it should respect the model's max context length by default (see the sanity-check sketch after this list).
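As a sanity check on the max_tokens question, something like the following against the OpenAI-compatible endpoint should show whether a per-request cap is honored (port and model name are whatever the server was launched with; this is a sketch, not part of the benchmark):

```bash
# Send one request with an explicit max_tokens cap and print the token usage
# the server reports back (assumes the server listens on port 30000).
curl -s http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3",
        "messages": [{"role": "user", "content": "Tell me a very long story."}],
        "max_tokens": 256
      }' | python -c "import json, sys; print(json.load(sys.stdin)['usage'])"
```

If completion_tokens comes back well above the cap, that would point at the server side rather than the benchmark client.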
Reproduction
I am running it without additional options:
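For context, the two-node launch is along these lines (IP, ports, and model path are placeholders; this is a sketch of the shape of the command rather than my literal invocation):

```bash
# Node 1 (rank 0); NODE1_IP is a placeholder for the first node's address
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 16 \
  --dist-init-addr NODE1_IP:20000 \
  --nnodes 2 --node-rank 0 \
  --trust-remote-code \
  --host 0.0.0.0 --port 30000

# Node 2 (rank 1)
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 16 \
  --dist-init-addr NODE1_IP:20000 \
  --nnodes 2 --node-rank 1 \
  --trust-remote-code
```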
I am running DeepSeek-V3
Environment