Hi,

I am attempting to reproduce the results for the Llama-3.1-8B-Instruct model by following the steps provided in the README. Everything is set up within your Docker environment, and I am using vLLM for inference. My setup includes a single H100 GPU with a batch size of 8, as specified in the example scripts.
With this configuration, the runtime for processing a 128k context length (synthetic task) is approximately 2 days. Is this runtime expected? If not, could you please share the configuration or optimizations you used to efficiently handle this context length?
Hi @eldarkurtic, I don't apply any additional optimizations when running inference. I usually use 8 GPUs with tensor parallelism (TP=8) in vLLM. It takes around 2 hours to run the 128K length with 500 samples for Llama-3.1-8B-Instruct.
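For reference, a minimal sketch of what an 8-way tensor-parallel run could look like with vLLM's offline `LLM` API. The model name and 128K context length come from this thread; the prompt list and sampling settings are placeholders, and this is not the exact script used for the reported numbers:

```python
from vllm import LLM, SamplingParams

# Assumed setup: 8 GPUs on a single node, sharded with tensor parallelism.
# max_model_len=131072 covers the 128K-token synthetic tasks.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=8,
    max_model_len=131072,
)

# Placeholder prompts; in practice these would be the 500 task samples.
prompts = ["<your 128K-token synthetic task prompt>"]
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```

With TP=1 on a single H100 the same script runs, but throughput drops sharply at 128K context, which is consistent with the much longer runtime reported above.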