Local inference with llama.cpp #5

Draft
zhangir-azerbayev wants to merge 6 commits into master

Conversation

zhangir-azerbayev

This PR adds support for an inference server powered by llama.cpp that runs efficiently on CPU. I am able to get a tactic suggestion in a few seconds on my 5-year-old ThinkPad.

See the additions to the README for installation and configuration instructions.
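As a quick sanity check once the server is running, a request along the following lines should return a suggestion. This is only a sketch: the port, route, and JSON field names below are assumptions for illustration rather than the documented interface, so follow whatever the README specifies.

```bash
# Hypothetical smoke test of the local suggestion server.
# The port (5000), route (/), and JSON fields are placeholders,
# not the project's documented API; see the README for the real interface.
curl -X POST http://localhost:5000/ \
  -H "Content-Type: application/json" \
  -d '{"tactic_state": "⊢ 1 + 1 = 2", "prefix": ""}'
```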

Unfortunately, the GPTNeoX architecture of the official llmstep model isn't supported by llama.cpp. I threw together a quick finetune of open_llama_3b_v2, which you can find here: open-llama-3b_next-tactic_dev0.2. Note that this model is extremely undertrained (less than one A100-hour), so it should be viewed only as a proof of concept that inference latency on CPU can be acceptable. Training an OpenLLaMA model that matches the llmstep model should be straightforward.

A substantial refactor of python/train/tune.py was required to train the OpenLLaMA model. That code is still quite messy, so I will PR it once I clean it up a bit.

The reasons this PR is a draft:

  1. The inference server crashes if you move the cursor around too much after typing `llmstep ""`. For now it is best not to move the cursor after typing the empty string; actually putting text inside the string almost always causes a crash.
  2. This implementation currently generates only one tactic suggestion per `llmstep` call. A CPU won't handle parallel decoding as well as a GPU can, but we might want to add a "get another suggestion" button or something of the sort.
  3. We probably want to train a better model compatible with llama.cpp before merging into main.

zhangir-azerbayev marked this pull request as draft August 18, 2023 12:41
yangky11 commented Aug 18, 2023

Just wondering: is the "one tactic suggestion per call" limitation a restriction of llama.cpp? Is it possible to use beam search or other decoding algorithms to get multiple tactics?

zhangir-azerbayev (Author)
> Just wondering: is the "one tactic suggestion per call" limitation a restriction of llama.cpp? Is it possible to use beam search or other decoding algorithms to get multiple tactics?

Based on this section of the llama.cpp docs, it appears to be a restriction of llama.cpp. In fact, I suspect it's less a restriction of llama.cpp and more a fundamental limitation of CPU inference, since CPUs probably don't have the FLOPs for batch-level parallelism.
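One CPU-friendly workaround, rather than batched beam search, is to sample several candidates sequentially with different seeds or temperatures; each generation pays the full latency, but no batch-level parallelism is required. A rough sketch against the llama.cpp `main` binary (the model path, prompt, and generation length are placeholders, and the prompt format would need to match whatever the tactic model was trained on):

```bash
# Sketch: draw several candidate tactics one at a time by varying the seed.
# $PATH_TO_QUANTIZED and the prompt are placeholders; -m, -p, -n, -s, and
# --temp are the standard llama.cpp flags for model, prompt, length, seed,
# and sampling temperature.
for seed in 1 2 3 4; do
  ./main -m $PATH_TO_QUANTIZED -p "<goal and prefix here>" -n 64 -s $seed --temp 0.7
done
```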

zhangir-azerbayev (Author)
I trained a new tactic generator for 15 epochs on the llmstep training set. In principle, this model should be competitive with the llmstep model but I haven't extensively checked the training code for bugs that may be impacting performance.
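(Side note for anyone reproducing this locally: before the quantization step shown in the README snippet below, the Hugging Face checkpoint presumably has to be converted into a ggml f32 file. A sketch using llama.cpp's convert.py, with `$PATH_TO_MODEL` standing in for the downloaded checkpoint directory; the exact options depend on your llama.cpp checkout, so verify against `convert.py --help`.)

```bash
# Sketch: convert the fine-tuned Hugging Face checkpoint to ggml f32,
# producing the ggml-model-f32.bin file used in the quantization step below.
# $PATH_TO_MODEL is a placeholder; check options against your llama.cpp checkout.
python convert.py $PATH_TO_MODEL --outtype f32
```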


Review thread on README.md:

In the following step, you will quantize your model to a reduced-precision format. The available formats are `F16, Q8_0, Q5_1, Q5_0, Q4_1, Q4_0`, with lower-precision formats trading accuracy for latency and memory. I would recommend starting with `Q4_0` and increasing the precision if your hardware handles the lower-precision model comfortably.
```bash
./quantize $PATH_TO_MODEL/ggml-model-f32.bin $PATH_TO_QUANTIZED
```
Contributor

This is missing the type argument for quantize (e.g. q4_0) after $PATH_TO_QUANTIZED.
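For reference, a corrected invocation incorporating this suggestion would append the quantization type as the final argument:

```bash
# The same command with the quantization type (q4_0) appended, per the review comment.
./quantize $PATH_TO_MODEL/ggml-model-f32.bin $PATH_TO_QUANTIZED q4_0
```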

kim-em (Contributor) commented Aug 21, 2023

I got this running with the v0.3 model, quantized as q4_0. Unfortunately, it is rather unsuccessful on Examples.lean.

kim-em (Contributor) commented Aug 21, 2023

I only saw one server crash, a segfault.
