Local inference with llama.cpp #5
base: master
Conversation
Just wondering if the "one tactic suggestion per `llmstep` call" limitation is a restriction of llama.cpp? Is it possible to use beam search or other decoding algorithms to get multiple tactics?
Based on this section of the llama.cpp docs, it appears to be a restriction of llama.cpp. In fact, I suspect it's less a restriction of llama.cpp and more a fundamental limitation of CPU inference, since CPUs probably don't have the FLOPs for batch-level parallelism.
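One CPU-friendly workaround (not something this PR implements) would be to sample candidates sequentially rather than in a batch, e.g. by invoking llama.cpp's `main` example several times with different seeds. A minimal sketch, where the model path and the prompt format are placeholder assumptions:

```bash
# Sequential sampling sketch: run the generator a few times with different
# seeds to collect multiple candidate tactics on CPU.
# The model path and prompt format below are placeholders, not the format
# this PR's server actually uses.
for seed in 1 2 3; do
  ./main -m "$PATH_TO_QUANTIZED" \
    -p "[GOAL]n : ℕ ⊢ n + 0 = n[PROOFSTEP]" \
    -n 64 --temp 0.7 --seed "$seed"
done
```

Each run decodes one completion, so latency scales linearly with the number of suggestions, but it avoids relying on batched decoding.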
I trained a new tactic generator for 15 epochs on the llmstep training set. In principle, this model should be competitive with the llmstep model, but I haven't extensively checked the training code for bugs that may be impacting performance.
In the following step, you will quantize your model to a reduced-precision format. The available formats are `F16, Q8_0, Q5_1, Q5_0, Q4_1, Q4_0`, with lower-precision formats trading accuracy for latency and memory. I would recommend starting with `Q4_0` and increasing precision if your hardware handles lower precisions comfortably.
```bash
./quantize $PATH_TO_MODEL/ggml-model-f32.bin $PATH_TO_QUANTIZED
```
This is missing the type argument for `quantize` (e.g. `q4_0`) after `$PATH_TO_QUANTIZED`.
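For reference, the corrected invocation would presumably look something like this (argument order taken from the comment above; paths are placeholders from the README snippet):

```bash
# Corrected command per the comment above: the quantization type (e.g. q4_0)
# goes after the output path. Paths are placeholders.
./quantize $PATH_TO_MODEL/ggml-model-f32.bin $PATH_TO_QUANTIZED q4_0
```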
I got this running with the v0.3 model, quantized as `q4_0`. Unfortunately it is rather unsuccessful on the
I only saw one server crash, a segfault.
Co-authored-by: Scott Morrison <[email protected]>
This PR adds support for an inference server powered by llama.cpp that runs efficiently on CPU. I am able to get a tactic suggestion in a few seconds on my 5-year-old ThinkPad.
See the additions to the README for installation and configuration instructions.
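As a quick smoke test once the server is up, something along these lines should return a suggestion. Note that the port, endpoint path, and JSON field names here are illustrative assumptions rather than the actual values, which are documented in the README additions:

```bash
# Hypothetical smoke test: the host, port, endpoint path, and JSON fields are
# illustrative assumptions only -- see the README additions in this PR for
# the real server configuration.
curl -X POST http://localhost:5000/suggest \
  -H "Content-Type: application/json" \
  -d '{"tactic_state": "n : ℕ ⊢ n + 0 = n"}'
```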
Unfortunately, the GPTNeoX architecture of the official llmstep model isn't supported by the llama.cpp library. I threw together a quick finetune of open_llama_3b_v2, which you can find here: open-llama-3b_next-tactic_dev0.2. Note this model is extremely undertrained (less than one A100-hour), so it should be viewed only as a proof of concept that inference latency on CPU can be acceptable. Training an open-llama model that matches the llmstep model should be trivial.
A substantial refactor of `python/train/tune.py` was required to train the open-llama model, but that code is still incredibly messy and I will PR it once I clean it up a bit.

The reasons this PR is a draft are:

- The server is unstable when calling `llmstep ""`. For now it is best not to move your cursor after typing the empty string, and actually putting text in the string almost always causes a crash.
- Only one tactic suggestion is returned per `llmstep` call. The CPU won't handle parallel decoding well like the GPU can, but we might want to add a "get another suggestion" button or something of the sort.