Skip to content

Latest commit

 

History

History
171 lines (151 loc) · 9.38 KB

README_GPU.md

File metadata and controls

171 lines (151 loc) · 9.38 KB

GPU

GPU via CUDA is supported via Hugging Face type models and LLaMa.cpp models.

Google Colab

A Google Colab version of a 3B GPU model is at:

h2oGPT GPU

A local copy of that GPU Google Colab is h2oGPT_GPU.ipynb.


GPU (CUDA)

For help installing cuda toolkit, see CUDA Toolkit.

git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118
pip install -r reqs_optional/requirements_optional_langchain.txt
pip install -r reqs_optional/requirements_optional_gpt4all.txt
pip install -r reqs_optional/requirements_optional_langchain.gpllike.txt
pip install -r reqs_optional/requirements_optional_langchain.urls.txt
# Optional: support docx, pptx, ArXiv, etc.
sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libtesseract-dev libreoffice
# Optional: for supporting unstructured package
python -m nltk.downloader all

then check that can see CUDA from Torch:

import torch
print(torch.cuda.is_available())

should print True.

To run in ChatBot mode, do:

python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b --load_8bit=True

Then point browser at http://0.0.0.0:7860 (linux) or http://localhost:7860 (windows/mac) or the public live URL printed by the server (disable shared link with --share=False). For 4-bit or 8-bit support, older GPUs may require older bitsandbytes installed as pip uninstall bitsandbytes -y ; pip install bitsandbytes==0.38.1. For production uses, we recommend at least the 12B model, ran as:

python generate.py --base_model=h2oai/h2ogpt-oasst1-512-12b --load_8bit=True

and one can use --h2ocolors=False to get soft blue-gray colors instead of H2O.ai colors. Here is a list of environment variables that can control some things in generate.py.

Note if you download the model yourself and point --base_model to that location, you'll need to specify the prompt_type as well by running:

python generate.py --base_model=<user path> --load_8bit=True --prompt_type=human_bot

for some user path <user path> and the prompt_type must match the model or a new version created in prompter.py or added in UI/CLI via prompt_dict.

For quickly using a private document collection for Q/A, place documents (PDFs, text, etc.) into a folder called user_path and run

python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b  --load_8bit=True --langchain_mode=UserData --user_path=user_path

For more details about document Q/A, see LangChain Readme.

For 4-bit support, when running generate pass --load_4bit=True, which is only supported for certain architectures like GPT-NeoX-20B, GPT-J, LLaMa, etc.

Any other instruct-tuned base models can be used, including non-h2oGPT ones. Larger models require more GPU memory.

AutoGPTQ

To support AutoGPTQ models, run:

pip install auto-gptq[triton]

although to avoid building the package you can run the specific version, e.g.

pip install https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.3.0/auto_gptq-0.3.0+cu118-cp310-cp310-linux_x86_64.whl

However, if one sees issues like CUDA extension not installed. mentioned during loading of model, need to recompile, because, otherwise, the generation will be much slower even if uses GPU. If you have CUDA 11.8 installed from NVIDIA, run:

pip uninstall -y auto-gptq ; CUDA_HOME=/usr/local/cuda-11.8 GITHUB_ACTIONS=true pip install auto-gptq --no-cache-dir

If one used conda cudatoolkit:

conda install -c conda-forge cudatoolkit-dev

then use that location instead:

pip uninstall -y auto-gptq ; CUDA_HOME=$CONDA_PREFIX GITHUB_ACTIONS=true pip install auto-gptq --no-cache-dir

An example with AutoGPTQ is:

python generate.py --base_model=TheBloke/Nous-Hermes-13B-GPTQ --score_model=None --load_gptq=nous-hermes-13b-GPTQ-4bit-128g.no-act.order --use_safetensors=True --prompt_type=instruct --langchain_mode=UserData

This will use about 9800MB. You can also add --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2 to save some memory on embedding to reach 9340MB.

For LLaMa2 70B model quantized in 4-bit AutoGPTQ, can run:

CUDA_VISIBLE_DEVICES=0 python generate.py --base_model=Llama-2-70B-chat-GPTQ --load_gptq="gptq_model-4bit--1g" --use_safetensors=True --prompt_type=llama2 --save_dir='70bgptq4bit`

which gives about 12 tokens/sec. For 7b run:

python generate.py --base_model=TheBloke/Llama-2-7b-Chat-GPTQ --load_gptq="gptq_model-4bit-128g" --use_safetensors=True --prompt_type=llama2 --save_dir='7bgptq4bit`

For full 16-bit with 16k context across all GPUs:

python generate.py --base_model=meta-llama/Llama-2-70b-chat-hf --prompt_type=llama2 --rope_scaling="{'type': 'linear', 'factor': 4}" --use_gpu_id=False --save_dir=savemeta70b

and running on 4xA6000 gives about 4tokens/sec consuming about 35GB per GPU of 4 GPUs when idle. Currently, Hugging Face transformers does not support GPTQ directly except in text-generation-inference (TGI) server, but TGI does not support RoPE scaling. Also, vLLM supports LLaMa2 and AutoGPTQ but not RoPE scaling. Only exllama supports AutoGPTQ with RoPE scaling.


GPU with LLaMa

  • Install langchain, and GPT4All, and python LLaMa dependencies:
pip install -r reqs_optional/requirements_optional_langchain.txt
pip install -r reqs_optional/requirements_optional_gpt4all.txt

then compile llama-cpp-python with CUDA support:

conda install -c "nvidia/label/cuda-12.1.1" cuda-toolkit  # maybe optional
pip uninstall -y llama-cpp-python
export LLAMA_CUBLAS=1
export CMAKE_ARGS=-DLLAMA_CUBLAS=on
export FORCE_CMAKE=1
export CUDA_HOME=$HOME/miniconda3/envs/h2ogpt
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.68 --no-cache-dir --verbose

and uncomment # n_gpu_layers=20 in .env_gpt4all, one can try also 40 instead of 20. If one sees /usr/bin/nvcc mentioned in errors, that file needs to be removed as would likely conflict with version installed for conda. Then run:

python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None --langchain_mode='UserData' --user_path=user_path

when loading you should see something like:

Using Model llama
Prep: persist_directory=db_dir_UserData exists, user_path=user_path passed, adding any changed or new documents
load INSTRUCTOR_Transformer
max_seq_length  512
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Loaded 0 sources for potentially adding to UserData
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti
  Device 1: NVIDIA GeForce RTX 2080
llama.cpp: loading model from WizardLM-7B-uncensored.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 1792
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090 Ti) as main device
llama_model_load_internal: mem required  = 4518.85 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 368 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 20 repeating layers to GPU
llama_model_load_internal: offloaded 20/35 layers to GPU
llama_model_load_internal: total VRAM used: 4470 MB
llama_new_context_with_model: kv self size  =  896.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'wizard2', 'prompt_dict': {'promptA': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.', 'promptB': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.', 'PreInstruct': '\n### Instruction:\n', 'PreInput': None, 'PreResponse': '\n### Response:\n', 'terminate_response': ['\n### Response:\n'], 'chat_sep': '\n', 'chat_turn_sep': '\n', 'humanstr': '\n### Instruction:\n', 'botstr': '\n### Response:\n', 'generates_leading_space': False}}
Running on local URL:  http://0.0.0.0:7860
Running on public URL: https://1ccb24d03273a3d085.gradio.live

and GPU usage when using. Note that once llama-cpp-python is compiled to support CUDA, it no longer works for CPU mode, so one would have to reinstall it without the above options to recovers CPU mode or have a separate h2oGPT env for CPU mode.