Merge pull request #18 from dusty-nv/20231012-oogabooga

more oogabooga updates

dusty-nv authored Oct 13, 2023
2 parents c9744fd + cff39a3, commit d71c661

Showing 1 changed file with 14 additions and 26 deletions: docs/tutorial_text-generation.md
Interact with a local AI assistant by running an LLM with oobabooga's [`text-generation-webui`](https://github.com/oobabooga/text-generation-webui) on NVIDIA Jetson!

[^1]: Limited to 7B model (4-bit quantized).

## Set up a container for text-generation-webui

### Clone `jetson-containers`

!!! tip ""

    The jetson-containers project provides pre-built Docker images for [`text-generation-webui`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/text-generation-webui) along with all of the loader APIs built with CUDA enabled (llama.cpp, ExLlama, AutoGPTQ, Transformers, etc.). You can clone the repo to use its utilities that will automatically pull/start the correct container for you, or you can do it [manually](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/text-generation-webui#user-content-run).

```
git clone --depth=1 https://github.com/dusty-nv/jetson-containers
cd jetson-containers
sudo apt update; sudo apt install -y python3-pip
pip3 install -r requirements.txt
```

!!! info

    **JetsonHacks** provides an informative walkthrough video on [`jetson-containers`](https://github.com/dusty-nv/jetson-containers), showcasing the usage of both the [`stable-diffusion-webui`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/diffusion/stable-diffusion-webui) and [`text-generation-webui`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/text-generation-webui) containers. You can find the complete article with detailed instructions [here](https://jetsonhacks.com/2023/09/04/use-these-jetson-docker-containers-tutorial/).

    <iframe width="720" height="405" src="https://www.youtube.com/embed/HlH3QkS1F5Y" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

## How to start

> If you are running this for the first time, go through the [pre-setup](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md) and see the [`jetson-containers/text-generation-webui` container readme](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/text-generation-webui/README.md).
Use the `run.sh` and `autotag` scripts to automatically pull or build a compatible container image:

```
cd jetson-containers
./run.sh $(./autotag text-generation-webui)
```
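Under the hood, `autotag` picks an image tag that matches your version of JetPack, and `run.sh` wraps `docker run` with the flags Jetson needs (NVIDIA runtime, host networking, a mounted `data` directory). As a rough sketch of what that resolves to - the image tag and mount shown here are illustrative, and `autotag` selects the real tag for your system:

```
sudo docker run --runtime nvidia -it --rm --network=host \
  --volume $(pwd)/data:/data \
  dustynv/text-generation-webui:r35.4.1
```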

The container has a default run command (`CMD`) that automatically starts the web server.
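If you ever need to start the web server manually inside the container, a minimal sketch would look like the following - assuming the webui is installed under `/opt/text-generation-webui` and models live under `/data/models/text-generation-webui` (both jetson-containers conventions); `--model-dir` and `--listen` are upstream oobabooga flags:

```
cd /opt/text-generation-webui
python3 server.py \
  --model-dir=/data/models/text-generation-webui \
  --listen
```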

## Download a model

See the [oobabooga documentation](https://github.com/oobabooga/text-generation-webui) for more information on downloading models.

From within the web UI, select the **Model** tab and navigate to the "**Download model or LoRA**" section.

You can find text generation models on [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending), then enter the Hugging Face username/model path (which you can copy to your clipboard from the Hub). Then click the **Download** button.
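For example, the username/model path for a quantized Llama-2 chat model might look like this (a hypothetical entry - copy the exact path from the model page on the Hub):

```
TheBloke/Llama-2-13B-chat-GPTQ
```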

### GGUF models

The fastest oobabooga model loader to use is currently [llama.cpp](https://github.com/dusty-nv/jetson-containers/blob/dev/packages/llm/llama_cpp) with 4-bit quantized GGUF models.

You can download a single model file for a particular quantization, like `*.Q4_K_M.bin`. Input the file name and hit the **Download** button.
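For example, to fetch only the 4-bit `Q4_K_M` file of a Llama-2 chat model, you might enter something like the following (hypothetical repo and file names - check the model page for the exact ones):

```
TheBloke/Llama-2-13B-chat-GGUF
llama-2-13b-chat.Q4_K_M.gguf
```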

| Model | Quantization | Memory (MB) |
|---------------------------------------------------------------------------------|:-----------------------------:|:-----------:|

### Model selection for Jetson Orin Nano

<span class="blobLightGreen4">Jetson Orin Nano Developer Kit</span> has only 8GB RAM for both CPU (system) and GPU, so you need to pick a model that fits in the RAM size - see the [Model Size](#model-size-tested) section below. The 7B models with 4-bit quantization are the ones to use on Jetson Orin Nano. Make sure you go through the [RAM optimization](./tips_ram-optimization.md) steps before attempting to load such a model.
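As a rough back-of-the-envelope estimate (not a measured figure) of why 4-bit 7B models fit:

```
7B params x 0.5 bytes/param (4-bit)  = ~3.5 GB for weights
3.5 GB + KV cache + runtime overhead < 8 GB shared CPU/GPU RAM
```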

## Load a model

For a GGUF model, remember to
- Set `n-gpu-layers` to `128`
- Set `n_gqa` to `8` if you're using Llama-2-70B (on Jetson AGX Orin 64GB)

Then click the "Load" button.
Then click the **Load** button.
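These loader settings can also be passed as flags if you start the server yourself; a sketch using upstream oobabooga options (the model file name is hypothetical):

```
python3 server.py --model llama-2-13b-chat.Q4_K_M.gguf \
  --loader llama.cpp --n-gpu-layers 128 --listen
```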

## Chat Template

If you're using a Llama model fine-tuned for chat, like the models listed above (except for `LLaMA-30b`), you need to use the oobabooga Instruct mode. On the **Parameters** tab, go to the **Instruction Template** sub-tab, then select `Llama-v2` from the **Instruction Template** drop-down (or Vicuna, Guanaco, etc. if you are using those models).
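For reference, the `Llama-v2` template wraps your messages in Llama-2's chat format, which looks like this (see the prompting guide linked below for details):

```
[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_message} [/INST]
```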

!!! tip ""

    For the base text completion models (like `LLaMA-30b`), use the **Default** or **Notebook** tab.

Selecting the right template will make sure the model is being [prompted correctly](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) - you can also change the system prompt in the **Context** box to alter the agent's personality and behavior. There are a lot of other settings under the **Generation** tab, like the maximum length it should output per reply, and token sampling parameters like [`temperature` and `top_p`](https://medium.com/@dixnjakindah/top-p-temperature-and-other-parameters-1a53d2f8d7d7) for controlling randomness.
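For example, a common starting point looks like this (illustrative values only, not tuned recommendations):

```
temperature = 0.7   # lower values make replies more deterministic
top_p = 0.9         # sample only from tokens covering the top 90% of probability
```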

Then change back to the **Chat** tab, and under the mode section, make sure **Instruct** is selected (confusingly, not chat mode). Then you can start chatting with the LLM!

## Results

![](./images/text-generation-webui_sf-trip.gif)

## Things to do with your LLM

[Here](https://modal.com/docs/guide/ex/vllm_inference#run-the-model) are some common test prompts for coding, math, history, etc. You can also ask it about geography, travel, nature, recipes, fixing things, general life advice, and practically everything else. Also, Llama-2 is quite playful and likes to play games to test its logic abilities!

```
>> What games do you like to play?
```
