Running on non-GPU laptops #8
From your server.log, I noticed your POST "/api/generate" is taking too long, 1h40m51s:

[GIN] 2024/05/23 - 11:45:01 | 200 | 1h40m51s | 127.0.0.1 | POST "/api/generate"
[GIN] 2024/05/23 - 11:45:31 | 200 | 48.6085ms | 127.0.0.1 | GET "/api/version"
[GIN] 2024/05/23 - 11:46:26 | 200 | 54.0535545s | 127.0.0.1 | POST "/api/pull"
[GIN] 2024/05/23 - 11:46:30 | 200 | 4.4838203s | 127.0.0.1 | POST "/api/pull"
[GIN] 2024/05/23 - 11:46:34 | 200 | 3.9806745s | 127.0.0.1 | POST "/api/pull"
[GIN] 2024/05/23 - 11:46:38 | 200 | 4.1286402s | 127.0.0.1 | POST "/api/pull"
[GIN] 2024/05/23 - 11:46:41 | 200 | 2.5784539s | 127.0.0.1 | POST "/api/pull"
[GIN] 2024/05/23 - 11:46:45 | 200 | 4.1422973s | 127.0.0.1 | POST "/api/pull"
[GIN] 2024/05/23 - 11:46:51 | 200 | 6.5098253s | 127.0.0.1 | POST "/api/pull"
[GIN] 2024/05/23 - 11:46:51 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/05/23 - 11:46:51 | 200 | 2.607ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/05/23 - 13:41:53 | 200 | 1h55m1s | 127.0.0.1 | POST "/api/generate"

On my Windows machine, I used Windows PowerShell to invoke Python, and /api/generate takes 3 to 4 minutes.

As far as I know, /api/generate does the following (a minimal sketch of such a call follows the list):
- Receive input data: the endpoint receives a request containing input data. This could be a text prompt, parameters specifying the type of generation required, and possibly additional settings like temperature and maximum token count.
- Process input: the server processes the input data, which might include pre-processing steps such as tokenization, input validation, and ensuring the input meets the required format.
- Generate output using the model: the server passes the processed input to a pre-trained model, which produces the output. The model could be a language model, an image-generation model, or any other type of generative model.
- Post-process output: the generated output is post-processed into a suitable format for the user. This could include converting tokens back to text, formatting the output, and applying any necessary filters.
- Send response: the server sends the generated content back to the client, typically along with metadata about the generation process (time taken, tokens used, and so on).
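For reference, a minimal sketch of timing one such call directly, assuming the requests library and Ollama's default local endpoint at http://localhost:11434 (eval_count and eval_duration are part of the non-streaming /api/generate JSON response, with durations in nanoseconds); this is an illustration, not llm_benchmark's actual code:

import time
import requests

# Time a single non-streaming /api/generate request against a local Ollama server.
payload = {
    "model": "mistral:7b",
    "prompt": "Write a step-by-step guide on how to bake a chocolate cake from scratch.",
    "stream": False,
}
start = time.perf_counter()
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=600)
resp.raise_for_status()
elapsed = time.perf_counter() - start

data = resp.json()
# Durations are reported in nanoseconds, so convert before computing tokens/s.
tokens_per_s = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"wall time: {elapsed:.1f}s, generation speed: {tokens_per_s:.2f} tokens/s")

If the wall time of a direct call like this matches the CLI but not the benchmark, the slowdown is more likely in how the benchmark drives the server than in the model itself.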
For my llm_benchmark, it calls

result = subprocess.run([ollamabin, 'run', model_name, one_prompt['prompt'], '--verbose'],
                        capture_output=True, text=True, check=True, encoding='utf-8')

Maybe you can consult the ollama author to help investigate why /api/generate is taking so long.
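As a rough illustration of what that call provides, here is a sketch that pulls the generation speed out of the --verbose statistics. It is hypothetical, not llm_benchmark's actual parsing, and it assumes the verbose statistics land on stderr in lines such as "eval rate: 12.34 tokens/s", which may differ between ollama versions:

import re
import subprocess

# Hypothetical helper: run the same CLI command and extract the generation speed
# from the --verbose statistics (assumed to be printed on stderr).
def cli_eval_rate(ollamabin, model_name, prompt):
    result = subprocess.run(
        [ollamabin, 'run', model_name, prompt, '--verbose'],
        capture_output=True, text=True, check=True, encoding='utf-8',
    )
    match = re.search(r'^eval rate:\s*([\d.]+)', result.stderr, re.MULTILINE)
    return float(match.group(1)) if match else None
|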
Thanks for the reply. I missed that when I looked at the logs. Yes, 1hr 55mins is a bit too long.
If I run the query on the CLI it goes through fine.
I'll go have a look and see what I can ascertain.
|
Hide system info to protect the user's privacy. Remove the sensitive hardware specs from the question details. |
Hey. I got to the bottom of this. I hacked around in the code and made my own script to carry out the tests. The server was timing out: I had to set a 300-second timeout on the tests or they would fail, and I have since bumped that up to 600 seconds, as I just want the tests to pass (a sketch of that kind of timeout guard follows the output below). Below is the output from the first two tests. As you can see, the Total Duration time is ridiculous: 169 seconds and 200 seconds :(

Model: mistral:7b
Prompt: Write a step-by-step guide on how to bake a chocolate cake from scratch.
Total Duration Time (ms): 169432.6
Load Duration Time (ms): 6.76
Prompt Eval Time (ms): 1757.56, Eval Count: 21
Performance (tokens/s): 4.29

Model: mistral:7b
Prompt: Develop a python function that solves the following problem - sudoku game.
Total Duration Time (ms): 200958.45
Load Duration Time (ms): 5.4
Prompt Eval Time (ms): 1451.98, Eval Count: 17
Performance (tokens/s): 4.26
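The timeout guard mentioned above could look something like the sketch below. It is hypothetical, not the actual test script; it simply mirrors the llm_benchmark-style CLI call with the 600-second limit so that a stalled run raises subprocess.TimeoutExpired instead of hanging for hours:

import subprocess

# Hypothetical timeout guard: same CLI invocation as above, but a run that exceeds
# timeout_s raises subprocess.TimeoutExpired instead of blocking indefinitely.
def run_prompt(ollamabin, model_name, prompt, timeout_s=600):
    return subprocess.run(
        [ollamabin, 'run', model_name, prompt, '--verbose'],
        capture_output=True, text=True, check=True, encoding='utf-8',
        timeout=timeout_s,
    )
|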
If you think your hacks can benefit the whole ollama community, please fork my code and then create a pull request so I can review your changes. We both want the community to benefit from the tool. |
Hey.
I am writing an article comparing and contrasting my desktop PC to my laptop.
It runs fine on the desktop and gets decent throughput.
Desktop
{ "hide it" }
My laptop isn't as beefy.
Laptop
Running 'llm_benchmark run' in a Python virtual environment on my laptop is taking a very long time just to execute the first prompt against the mistral:7b model. It has been running for well over two hours.
The program did pull the 7 LLMs it required.
Looking at performance on my laptop, I see the following in Task Manager (screenshots: Windows Task Manager and Disk Space).
Any pointers to make this run? I am convinced it can't be the ollama install, as I can run "Write a step-by-step guide on how to bake a chocolate cake from scratch" against ollama running llama3:8b and it completes in a little under 3 minutes (that's a rough guesstimate from scrolling back through the logs).
Running the same prompt from the 'ollama run mistral:7b' CLI, it completes even faster.
Why does it not complete from llm_benchmark?
I have attached the server and app logs from my laptop to the issue:
app.log
server.log