
Running on Non GPU laptops #8

Open
twelsh37 opened this issue May 23, 2024 · 5 comments
Labels: Investigation (Investigate user's questions)
Comments

@twelsh37

twelsh37 commented May 23, 2024

Hey.

I am writing an article comparing and contrasting my desktop PC with my laptop.
It runs fine on the desktop and gets decent throughput.
Desktop

{
    "hide it"
}

My laptop isn't as beefy:
Laptop

hide it

Running 'llm_benchmark run' in a Python virtual environment on my laptop is taking a very long time just to execute the first prompt against the mistral:7b model. It has been running for well over two hours.

The program did pull the 7 LLMs it required.

Looking at performance on my laptop, I see the following:

Windows Task Manager

CPU Utilisation: 80%
CPU Speed: 4.64 GHz
Memory in Use: 10.4 GB
Memory Available: 5.2 GB

Disk Space

Total Disk Space: 474 GB
Disk Space Available: 236 GB

Any pointers to make this run? I am convinced it can't be the ollama install, as I can run "Write a step-by-step guide on how to bake a chocolate cake from scratch" against ollama running llama3:8b and it completes in a little under 3 minutes (that's a rough guesstimate from scrolling back through the logs).

Running the same prompt from the 'ollama run mistral:7b' CLI, it completes even faster.

Why does it not complete when run from llm_benchmark?

I have attached the server and app logs from my laptop to the issue:
app.log
server.log

@chuangtc chuangtc added the Investigation Investigate user's questions label May 23, 2024
@chuangtc
Member

From your server.log, I noticed your POST "/api/generate" is taking far too long: 1h40m51s.

[GIN] 2024/05/23 - 11:45:01 | 200 |      1h40m51s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2024/05/23 - 11:45:31 | 200 |     48.6085ms |       127.0.0.1 | GET      "/api/version"
[GIN] 2024/05/23 - 11:46:26 | 200 |   54.0535545s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:30 | 200 |    4.4838203s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:34 | 200 |    3.9806745s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:38 | 200 |    4.1286402s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:41 | 200 |    2.5784539s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:45 | 200 |    4.1422973s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:51 | 200 |    6.5098253s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:51 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/05/23 - 11:46:51 | 200 |       2.607ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/05/23 - 13:41:53 | 200 |       1h55m1s |       127.0.0.1 | POST     "/api/generate"

On my Windows machine, where I use Windows PowerShell to invoke Python, /api/generate takes 3 to 4 minutes.

As far as I know, /api/generate does the following (a minimal example of calling it follows the list):

  • Receive Input Data:
    The endpoint receives a request containing input data. This could be in the form of a text prompt, parameters specifying the type of generation required, and possibly additional settings like temperature, maximum token count, etc.
  • Process Input:
    The server processes the input data, which might include pre-processing steps such as tokenization, input validation, and ensuring the input meets the required format.
  • Generate Output Using Model:
    The server uses a pre-trained model to generate the desired output based on the input. This involves passing the processed input data to the model, which then produces an output. The model could be a language model, image generation model, or any other type of generative model.
  • Post-Process Output:
    The generated output is post-processed to ensure it is in a suitable format for the user. This could include converting tokens back to text, formatting the output, and applying any necessary filters.
  • Send Response:
    The server sends back the generated content to the client as a response. This response typically includes the generated text or data, and possibly metadata about the generation process (such as time taken, tokens used, etc.).
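
For concreteness, a minimal sketch of calling that endpoint directly from Python with the requests library (assuming Ollama is listening on its default port 11434; the duration fields in the response are reported in nanoseconds):

import requests

# Send one prompt to a local Ollama server and read the timing fields it returns.
resp = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={
        "model": "mistral:7b",    # example model
        "prompt": "Write a step-by-step guide on how to bake a chocolate cake from scratch.",
        "stream": False,          # return a single JSON object instead of a stream
    },
    timeout=600,                  # don't block forever on a slow, CPU-only machine
)
data = resp.json()

# Durations are in nanoseconds; convert to seconds for readability.
print("total_duration (s):", data["total_duration"] / 1e9)
print("eval_count (tokens):", data["eval_count"])
print("tokens/s:", data["eval_count"] / (data["eval_duration"] / 1e9))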

As for my llm_benchmark, it calls:
result = subprocess.run([ollamabin, 'run', model_name, one_prompt['prompt'],'--verbose'], capture_output=True, text=True, check=True, encoding='utf-8')
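
Written out with example values and comments, that invocation amounts to roughly the following sketch (the chocolate-cake prompt from earlier stands in for the benchmark's prompt variable):

import subprocess

# Shell out to the ollama CLI for one prompt and capture its output.
result = subprocess.run(
    ['ollama', 'run', 'mistral:7b',
     'Write a step-by-step guide on how to bake a chocolate cake from scratch.',
     '--verbose'],
    capture_output=True,   # collect stdout/stderr instead of printing them
    text=True,             # decode the output as str rather than bytes
    check=True,            # raise CalledProcessError on a non-zero exit code
    encoding='utf-8',
)
# Note: no timeout is passed, so on a slow CPU-only machine this call simply
# blocks until ollama finishes, however long that takes.
print(result.stdout)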

Maybe you can consult the ollama author to help investigate why /api/generate is taking so long.

@twelsh37
Author

twelsh37 commented May 23, 2024 via email

@chuangtc
Member

Hid the system info to protect the user's privacy and removed the sensitive hardware specs from the question details.

@chuangtc chuangtc self-assigned this May 23, 2024
@twelsh37
Author

Hey. I got to the bottom of this. I hacked around in the code and made my own script to carry out the tests.

The server was timing out. I had to set a 300-second timeout on the tests, or they would fail. I have since bumped that up to 600 seconds, as I just want the tests to pass (a sketch of that change follows the output below).

Below is the output from the first two tests. As you can see, the Total Duration times are ridiculous: 169 seconds and 200 seconds :(

Model: mistral:7b
Prompt: Write a step-by-step guide on how to bake a chocolate cake from scratch.
Total Duration Time (ms): 169432.6
Load Duration Time (ms): 6.76
Prompt Eval Time (ms): 1757.56, Eval Count: 21
Performance (tokens/s): 4.29


Model: mistral:7b
Prompt: Develop a python function that solves the following problem - sudoku game.
Total Duration Time (ms): 200958.45
Load Duration Time (ms): 5.4
Prompt Eval Time (ms): 1451.98, Eval Count: 17
Performance (tokens/s): 4.26
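
For illustration, a minimal sketch of the timeout workaround described above (not the author's actual script; the 600-second default is simply the figure mentioned in the comment):

import subprocess

def run_prompt(model: str, prompt: str, timeout_s: int = 600) -> str:
    """Run one prompt through the ollama CLI, failing fast instead of hanging."""
    result = subprocess.run(
        ['ollama', 'run', model, prompt, '--verbose'],
        capture_output=True, text=True, check=True, encoding='utf-8',
        timeout=timeout_s,   # raises subprocess.TimeoutExpired if exceeded
    )
    return result.stdout

# Example: the first benchmark prompt against mistral:7b
print(run_prompt('mistral:7b',
                 'Write a step-by-step guide on how to bake a chocolate cake from scratch.'))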

@chuangtc
Member

If you think your hacking can bring benefits to the whole ollama community, please fork my code and then create a pull request, and let me review your changes. We both want the community to benefit from the tool.
Please see my post on LinkedIn:
https://www.linkedin.com/pulse/ollama-benchmark-helps-buyers-decide-which-hardware-spec-chuang-ob7dc/
