Local inference with llama.cpp #5
base: master
Conversation
Just wondering if the "one tactic suggestion per `llmstep` call" limitation is a restriction of llama.cpp? Is it possible to use beam search or other decoding algorithms to get multiple tactics?
Based on this section of the llama.cpp docs, it appears to be a restriction of llama.cpp. In fact, I suspect it's less a restriction of llama.cpp and more a fundamental limitation of CPU inference, since CPUs probably don't have the FLOPs for batch-level parallelism.
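One CPU-friendly workaround (not something this PR implements) would be to sample candidates sequentially rather than in a batch, e.g. by invoking llama.cpp's `main` example several times with different seeds. A minimal sketch, where the model path and the prompt format are placeholder assumptions:

```bash
# Sequential sampling sketch: run the generator a few times with different
# seeds to collect multiple candidate tactics on CPU.
# The model path and prompt format below are placeholders, not the format
# this PR's server actually uses.
for seed in 1 2 3; do
  ./main -m "$PATH_TO_QUANTIZED" \
    -p "[GOAL]n : ℕ ⊢ n + 0 = n[PROOFSTEP]" \
    -n 64 --temp 0.7 --seed "$seed"
done
```

Each run decodes one completion, so latency scales linearly with the number of suggestions, but it avoids relying on batched decoding.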
I trained a new tactic generator for 15 epochs on the llmstep training set. In principle, this model should be competitive with the llmstep model, but I haven't extensively checked the training code for bugs that may be impacting performance.
In the following step, you will quantize your model to a reduced-precision format. The available formats are `F16, Q8_0, Q5_1, Q5_0, Q4_1, Q4_0`, with lower-precision formats trading accuracy for latency and memory. I would recommend starting with `Q4_0` and increasing precision if your hardware handles lower precisions comfortably.
```bash
./quantize $PATH_TO_MODEL/ggml-model-f32.bin $PATH_TO_QUANTIZED
```
This is missing the type argument for `quantize` (e.g. `q4_0`) after `$PATH_TO_QUANTIZED`.
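For reference, the corrected invocation would presumably look something like this (argument order taken from the comment above; paths are placeholders from the README snippet):

```bash
# Corrected command per the comment above: the quantization type (e.g. q4_0)
# goes after the output path. Paths are placeholders.
./quantize $PATH_TO_MODEL/ggml-model-f32.bin $PATH_TO_QUANTIZED q4_0
```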
I got this running with the v0.3 model, quantized as `q4_0`. Unfortunately it is rather unsuccessful on the
I only saw one server crash, a segfault.
Co-authored-by: Scott Morrison <[email protected]>
This PR adds support for an inference server powered by llama.cpp that runs efficiently on CPU. I am able to get a tactic suggestion in a few seconds on my 5-year-old ThinkPad.
See the additions to the README for installation and configuration instructions.
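As a quick smoke test once the server is up, something along these lines should return a suggestion. Note that the port, endpoint path, and JSON field names here are illustrative assumptions rather than the actual values, which are documented in the README additions:

```bash
# Hypothetical smoke test: the host, port, endpoint path, and JSON fields are
# illustrative assumptions only -- see the README additions in this PR for
# the real server configuration.
curl -X POST http://localhost:5000/suggest \
  -H "Content-Type: application/json" \
  -d '{"tactic_state": "n : ℕ ⊢ n + 0 = n"}'
```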
Unfortunately, the GPTNeoX architecture of the official llmstep model isn't supported by the llama.cpp library. I threw together a quick finetune of open_llama_3b_v2, which you can find here: open-llama-3b_next-tactic_dev0.2. Note this model is extremely undertrained (less than one A100-hour), so it should be viewed only as a proof of concept that inference latency on CPU can be acceptable. Training an open-llama model that matches the llmstep model should be trivial.
A substantial refactor of `python/train/tune.py` was required to train the open-llama model, but that code is still incredibly messy and I will PR it once I clean it up a bit.

The reasons this PR is a draft are:

- The server is unstable when calling `llmstep ""`. For now it is best not to move your cursor after typing the empty string, and actually putting text in the string almost always causes a crash.
- Only one tactic suggestion is returned per `llmstep` call. The CPU won't handle parallel decoding well like the GPU can, but we might want to add a "get another suggestion" button or something of the sort.