
Running on Non GPU laptops #8

Open
twelsh37 opened this issue May 23, 2024 · 5 comments
Labels: Investigation (Investigate user's questions)
Comments

@twelsh37

twelsh37 commented May 23, 2024

Hey.

I am writing an article comparing and contrasting my desktop PC with my laptop.
It runs fine on the desktop and gets decent throughput.
Desktop

{
    "hide it"
}

My laptop isn't as beefy:
Laptop

hide it

Running 'llm_benchmark run' in a Python virtual environment on my laptop is taking a very long time just to execute the first prompt against the mistral:7b model. It has been running for well over two hours.

The program did pull the 7 LLMs it required.

Looking at performance on my laptop, I see the following:

Windows Task Manager

CPU Utilisation: 80%
CPU Speed: 4.64 GHz
Memory in Use: 10.4 GB
Memory Available: 5.2 GB

Disk Space

Total Disk Space: 474 GB
Disk Space Available: 236 GB

Any pointers to make this run? I am convinced it can't be the ollama install, as I can run "Write a step-by-step guide on how to bake a chocolate cake from scratch" against ollama running llama3:8b and it completes in a little under 3 minutes (that's a rough guesstimate from scrolling back through the logs).

Running the same prompt from the 'ollama run mistral:7b' CLI, it completes even faster.

Why does it not complete when run from llm_benchmark?

I have attached the server and app logs from my laptop to the issue:
app.log
server.log

@chuangtc chuangtc added the Investigation Investigate user's questions label May 23, 2024
@chuangtc
Member

From your server.log, I noticed your POST "/api/generate" is taking far too long: 1h40m51s.

[GIN] 2024/05/23 - 11:45:01 | 200 |      1h40m51s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2024/05/23 - 11:45:31 | 200 |     48.6085ms |       127.0.0.1 | GET      "/api/version"
[GIN] 2024/05/23 - 11:46:26 | 200 |   54.0535545s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:30 | 200 |    4.4838203s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:34 | 200 |    3.9806745s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:38 | 200 |    4.1286402s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:41 | 200 |    2.5784539s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:45 | 200 |    4.1422973s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:51 | 200 |    6.5098253s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:51 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/05/23 - 11:46:51 | 200 |       2.607ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/05/23 - 13:41:53 | 200 |       1h55m1s |       127.0.0.1 | POST     "/api/generate"

On my Windows machine, where I use Windows PowerShell to invoke Python, /api/generate takes 3 to 4 minutes.

As far as I know, /api/generate does the following (a minimal example of calling it follows the list):

  • Receive Input Data:
    The endpoint receives a request containing input data. This could be in the form of a text prompt, parameters specifying the type of generation required, and possibly additional settings like temperature, maximum token count, etc.
  • Process Input:
    The server processes the input data, which might include pre-processing steps such as tokenization, input validation, and ensuring the input meets the required format.
  • Generate Output Using Model:
    The server uses a pre-trained model to generate the desired output based on the input. This involves passing the processed input data to the model, which then produces an output. The model could be a language model, image generation model, or any other type of generative model.
  • Post-Process Output:
    The generated output is post-processed to ensure it is in a suitable format for the user. This could include converting tokens back to text, formatting the output, and applying any necessary filters.
  • Send Response:
    The server sends back the generated content to the client as a response. This response typically includes the generated text or data, and possibly metadata about the generation process (such as time taken, tokens used, etc.).
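
For concreteness, a minimal sketch of calling that endpoint directly from Python with the requests library (assuming Ollama is listening on its default port 11434; the duration fields in the response are reported in nanoseconds):

import requests

# Send one prompt to a local Ollama server and read the timing fields it returns.
resp = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={
        "model": "mistral:7b",    # example model
        "prompt": "Write a step-by-step guide on how to bake a chocolate cake from scratch.",
        "stream": False,          # return a single JSON object instead of a stream
    },
    timeout=600,                  # don't block forever on a slow, CPU-only machine
)
data = resp.json()

# Durations are in nanoseconds; convert to seconds for readability.
print("total_duration (s):", data["total_duration"] / 1e9)
print("eval_count (tokens):", data["eval_count"])
print("tokens/s:", data["eval_count"] / (data["eval_duration"] / 1e9))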

As for my llm_benchmark, it calls:
result = subprocess.run([ollamabin, 'run', model_name, one_prompt['prompt'],'--verbose'], capture_output=True, text=True, check=True, encoding='utf-8')
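
Written out with example values and comments, that invocation amounts to roughly the following sketch (the chocolate-cake prompt from earlier stands in for the benchmark's prompt variable):

import subprocess

# Shell out to the ollama CLI for one prompt and capture its output.
result = subprocess.run(
    ['ollama', 'run', 'mistral:7b',
     'Write a step-by-step guide on how to bake a chocolate cake from scratch.',
     '--verbose'],
    capture_output=True,   # collect stdout/stderr instead of printing them
    text=True,             # decode the output as str rather than bytes
    check=True,            # raise CalledProcessError on a non-zero exit code
    encoding='utf-8',
)
# Note: no timeout is passed, so on a slow CPU-only machine this call simply
# blocks until ollama finishes, however long that takes.
print(result.stdout)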

Maybe you can consult the ollama author to help investigate why /api/generate is taking so long.

@twelsh37
Author

twelsh37 commented May 23, 2024 via email

@chuangtc
Member

Hid the system info to protect the user's privacy and removed the sensitive hardware specs from the question details.

@chuangtc chuangtc self-assigned this May 23, 2024
@twelsh37
Author

Hey. I got to the bottom of this. I hacked around in the code and made my own script to carry out the tests.

The server was timing out. I had to set a 300-second timeout on the tests, or they would fail. I have since bumped that up to 600 seconds, as I just want the tests to pass (a sketch of that change follows the output below).

Below is the output from the first two tests. As you can see, the Total Duration times are ridiculous: 169 seconds and 200 seconds :(

Model: mistral:7b
Prompt: Write a step-by-step guide on how to bake a chocolate cake from scratch.
Total Duration Time (ms): 169432.6
Load Duration Time (ms): 6.76
Prompt Eval Time (ms): 1757.56, Eval Count: 21
Performance (tokens/s): 4.29


Model: mistral:7b
Prompt: Develop a python function that solves the following problem - sudoku game.
Total Duration Time (ms): 200958.45
Load Duration Time (ms): 5.4
Prompt Eval Time (ms): 1451.98, Eval Count: 17
Performance (tokens/s): 4.26
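
For illustration, a minimal sketch of the timeout workaround described above (not the author's actual script; the 600-second default is simply the figure mentioned in the comment):

import subprocess

def run_prompt(model: str, prompt: str, timeout_s: int = 600) -> str:
    """Run one prompt through the ollama CLI, failing fast instead of hanging."""
    result = subprocess.run(
        ['ollama', 'run', model, prompt, '--verbose'],
        capture_output=True, text=True, check=True, encoding='utf-8',
        timeout=timeout_s,   # raises subprocess.TimeoutExpired if exceeded
    )
    return result.stdout

# Example: the first benchmark prompt against mistral:7b
print(run_prompt('mistral:7b',
                 'Write a step-by-step guide on how to bake a chocolate cake from scratch.'))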

@chuangtc
Member

If you think your hacking can bring benefits to the whole ollama community, please fork my code and then create a pull request, and let me review your changes. We both want the community to benefit from the tool.
Please see my post on LinkedIn:
https://www.linkedin.com/pulse/ollama-benchmark-helps-buyers-decide-which-hardware-spec-chuang-ob7dc/
