Error with LoRA Weights Data Type in Quantized TensorRT-LLM Model Execution #2628

Open · Alireza3242 opened this issue on Dec 25, 2024 · 0 comments
Labels: Investigating, Lora/P-tuning, triaged

Comments

@Alireza3242

I quantized a Gemma model with AWQ. Now I want to use LoRA at runtime. However, when I send the LoRA weights with a request, I receive the following error:

[TensorRT-LLM][ERROR] Encountered an error when fetching new request: [TensorRT-LLM][ERROR] Assertion failed: Expected lora weights to be the same data type as base model (/workspace/tensorrt_llm/cpp/tensorrt_llm/runtime/loraUtils.cpp:66)
1       0x7f6918b7bc64 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7f6918b8b005 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x78a005) [0x7f6918b8b005]
3       0x7f691afad798 tensorrt_llm::batch_manager::PeftCacheManager::addRequestPeft(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, bool) + 184
4       0x7f691afd0242 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::updatePeftCache(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 82
5       0x7f691b0256bf tensorrt_llm::executor::Executor::Impl::fetchNewRequests[abi:cxx11](int, std::optional<float>) + 2543
6       0x7f691b027698 tensorrt_llm::executor::Executor::Impl::executionLoop() + 1176
7       0x7f69e86b0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f69e86b0253]
8       0x7f69e843fac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f69e843fac3]
9       0x7f69e84d0a04 clone + 68
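
The assertion is raised by the LoRA dtype check in the C++ runtime (loraUtils.cpp, referenced in the message above). Purely as an illustration of that logic, and not the actual implementation, the check amounts to the following; the function name here is hypothetical:

import numpy as np

def check_lora_dtype(lora_weights: np.ndarray, expected_dtype: str) -> None:
    # The runtime rejects a request whose LoRA weight tensors do not use
    # the same data type as the base model / engine.
    if str(lora_weights.dtype) != expected_dtype:
        raise RuntimeError(
            "Expected lora weights to be the same data type as base model"
        )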

Steps Taken:

1- Using TensorRT-LLM 0.15
2- In /usr/local/lib/python3.10/dist-packages/tensorrt_llm/top_model_mixin.py, I added the following import:

from .lora_manager import LoraConfig, use_lora

And in the TopModelMixin class, I added the following method:

def use_lora(self, lora_config: LoraConfig):
    # Delegates to the module-level use_lora() helper imported above.
    use_lora(self, lora_config)
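
For context, here is a minimal sketch of how this patched method would be exercised when building the engine with the LoRA flags from step 4 below. The from_checkpoint call and the LoraConfig field names are assumptions about the TensorRT-LLM 0.15 Python API, not code from this report:

from tensorrt_llm.lora_manager import LoraConfig
from tensorrt_llm.models import GemmaForCausalLM

# Load the AWQ-quantized checkpoint produced in step 3 (path taken from the commands below).
# from_checkpoint is an assumed API for this sketch.
model = GemmaForCausalLM.from_checkpoint("/app/data/tllm_checkpoint")

# Mirror the --lora_dir / --max_lora_rank / --lora_target_modules flags passed to trtllm-build;
# the field names are assumptions.
lora_config = LoraConfig(
    lora_dir=["/app/data/gemma2_27b/lora/torch/1"],
    max_lora_rank=16,
    lora_target_modules=["attn_qkv", "attn_q", "attn_k", "attn_v",
                         "attn_dense", "mlp_h_to_4h", "mlp_4h_to_h", "mlp_gate"],
)
model.use_lora(lora_config)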

3- Running:

python3 /app/src/quantization/quantize.py --model_dir /app/data/gemma2_27b/model --dtype bfloat16 --qformat int4_awq --output_dir /app/data/tllm_checkpoint --awq_block_size 128 --calib_size 512 --calib_dataset /app/src/quantization/dataset
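
To confirm what the quantized checkpoint records, the config.json that quantize.py writes into --output_dir can be inspected. This is a sketch under the assumption that the checkpoint config carries top-level "dtype" and "quantization" entries:

import json

with open("/app/data/tllm_checkpoint/config.json") as f:
    ckpt_config = json.load(f)

# Expect "bfloat16" here, matching the --dtype flag above.
print("checkpoint dtype:", ckpt_config.get("dtype"))
print("quantization:", ckpt_config.get("quantization"))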

4- Running:

trtllm-build --checkpoint_dir /app/data/tllm_checkpoint --output_dir /app/model_repository/tensorrt_llm/1 --gemm_plugin auto --max_batch_size 32 --max_input_len 4096 --max_num_tokens 8192 --lora_plugin auto --lora_dir /app/data/gemma2_27b/lora/torch/1 --max_lora_rank 16 --lora_target_modules attn_qkv attn_q attn_k attn_v attn_dense mlp_h_to_4h mlp_4h_to_h mlp_gate

5- Running:

python3 /app/src/lora/hf_lora_convert.py --in-file /app/data/gemma2_27b/lora/torch/1 --storage-type float16 --out-dir /app/data/gemma2_27b/lora/numpy/1
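
Note that this conversion uses --storage-type float16, while the checkpoint in step 3 was produced with --dtype bfloat16. A quick way to check whether the runtime ends up comparing mismatched dtypes is sketched below; the file names and JSON keys are assumptions about what hf_lora_convert.py and trtllm-build write out:

import json
import numpy as np

# hf_lora_convert.py is assumed to write model.lora_weights.npy into --out-dir.
lora_weights = np.load("/app/data/gemma2_27b/lora/numpy/1/model.lora_weights.npy")
print("LoRA weights dtype:", lora_weights.dtype)  # float16, per --storage-type above

# trtllm-build is assumed to write a config.json next to the engine.
with open("/app/model_repository/tensorrt_llm/1/config.json") as f:
    engine_config = json.load(f)
print("engine dtype:", engine_config["pretrained_config"]["dtype"])  # bfloat16 from step 3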

6- Starting Triton server:

tritonserver --model-repository /app/src/../model_repository

7- Running:

python3 /app/src/lora/inflight_batcher_llm_client.py --top-k 0 --top-p 0.5 --request-output-len 10 --text hello --tokenizer-dir /app/data/gemma2_27b/lora/torch/1 --lora-path /app/data/gemma2_27b/lora/numpy/1 --lora-task-id 1 --streaming
github-actions bot added the Investigating and triaged labels on Jan 6, 2025