Error with LoRA Weights Data Type in Quantized TensorRT-LLM Model Execution #2628

Open · Alireza3242 opened this issue on Dec 25, 2024 · 0 comments
Labels: Investigating, Lora/P-tuning, triaged

Comments

@Alireza3242

I quantized a Gemma model with AWQ. Now I want to use LoRA at runtime. However, when I send the LoRA weights with a request, I receive the following error:

[TensorRT-LLM][ERROR] Encountered an error when fetching new request: [TensorRT-LLM][ERROR] Assertion failed: Expected lora weights to be the same data type as base model (/workspace/tensorrt_llm/cpp/tensorrt_llm/runtime/loraUtils.cpp:66)
1       0x7f6918b7bc64 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7f6918b8b005 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x78a005) [0x7f6918b8b005]
3       0x7f691afad798 tensorrt_llm::batch_manager::PeftCacheManager::addRequestPeft(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, bool) + 184
4       0x7f691afd0242 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::updatePeftCache(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 82
5       0x7f691b0256bf tensorrt_llm::executor::Executor::Impl::fetchNewRequests[abi:cxx11](int, std::optional<float>) + 2543
6       0x7f691b027698 tensorrt_llm::executor::Executor::Impl::executionLoop() + 1176
7       0x7f69e86b0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f69e86b0253]
8       0x7f69e843fac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f69e843fac3]
9       0x7f69e84d0a04 clone + 68
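
The assertion is raised by the LoRA dtype check in the C++ runtime (loraUtils.cpp, referenced in the message above). Purely as an illustration of that logic, and not the actual implementation, the check amounts to the following; the function name here is hypothetical:

import numpy as np

def check_lora_dtype(lora_weights: np.ndarray, expected_dtype: str) -> None:
    # The runtime rejects a request whose LoRA weight tensors do not use
    # the same data type as the base model / engine.
    if str(lora_weights.dtype) != expected_dtype:
        raise RuntimeError(
            "Expected lora weights to be the same data type as base model"
        )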

Steps Taken:

1- Using TensorRT-LLM 0.15
2- In /usr/local/lib/python3.10/dist-packages/tensorrt_llm/top_model_mixin.py, I added the following import:

from .lora_manager import LoraConfig, use_lora

And in the TopModelMixin class, I added the following method:

def use_lora(self, lora_config: LoraConfig):
    # Delegates to the module-level use_lora() helper imported above.
    use_lora(self, lora_config)
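
For context, here is a minimal sketch of how this patched method would be exercised when building the engine with the LoRA flags from step 4 below. The from_checkpoint call and the LoraConfig field names are assumptions about the TensorRT-LLM 0.15 Python API, not code from this report:

from tensorrt_llm.lora_manager import LoraConfig
from tensorrt_llm.models import GemmaForCausalLM

# Load the AWQ-quantized checkpoint produced in step 3 (path taken from the commands below).
# from_checkpoint is an assumed API for this sketch.
model = GemmaForCausalLM.from_checkpoint("/app/data/tllm_checkpoint")

# Mirror the --lora_dir / --max_lora_rank / --lora_target_modules flags passed to trtllm-build;
# the field names are assumptions.
lora_config = LoraConfig(
    lora_dir=["/app/data/gemma2_27b/lora/torch/1"],
    max_lora_rank=16,
    lora_target_modules=["attn_qkv", "attn_q", "attn_k", "attn_v",
                         "attn_dense", "mlp_h_to_4h", "mlp_4h_to_h", "mlp_gate"],
)
model.use_lora(lora_config)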

3- Running:

python3 /app/src/quantization/quantize.py --model_dir /app/data/gemma2_27b/model --dtype bfloat16 --qformat int4_awq --output_dir /app/data/tllm_checkpoint --awq_block_size 128 --calib_size 512 --calib_dataset /app/src/quantization/dataset
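
To confirm what the quantized checkpoint records, the config.json that quantize.py writes into --output_dir can be inspected. This is a sketch under the assumption that the checkpoint config carries top-level "dtype" and "quantization" entries:

import json

with open("/app/data/tllm_checkpoint/config.json") as f:
    ckpt_config = json.load(f)

# Expect "bfloat16" here, matching the --dtype flag above.
print("checkpoint dtype:", ckpt_config.get("dtype"))
print("quantization:", ckpt_config.get("quantization"))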

4- Running:

trtllm-build --checkpoint_dir /app/data/tllm_checkpoint --output_dir /app/model_repository/tensorrt_llm/1 --gemm_plugin auto --max_batch_size 32 --max_input_len 4096 --max_num_tokens 8192 --lora_plugin auto --lora_dir /app/data/gemma2_27b/lora/torch/1 --max_lora_rank 16 --lora_target_modules attn_qkv attn_q attn_k attn_v attn_dense mlp_h_to_4h mlp_4h_to_h mlp_gate

5- Running:

python3 /app/src/lora/hf_lora_convert.py --in-file /app/data/gemma2_27b/lora/torch/1 --storage-type float16 --out-dir /app/data/gemma2_27b/lora/numpy/1
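
Note that this conversion uses --storage-type float16, while the checkpoint in step 3 was produced with --dtype bfloat16. A quick way to check whether the runtime ends up comparing mismatched dtypes is sketched below; the file names and JSON keys are assumptions about what hf_lora_convert.py and trtllm-build write out:

import json
import numpy as np

# hf_lora_convert.py is assumed to write model.lora_weights.npy into --out-dir.
lora_weights = np.load("/app/data/gemma2_27b/lora/numpy/1/model.lora_weights.npy")
print("LoRA weights dtype:", lora_weights.dtype)  # float16, per --storage-type above

# trtllm-build is assumed to write a config.json next to the engine.
with open("/app/model_repository/tensorrt_llm/1/config.json") as f:
    engine_config = json.load(f)
print("engine dtype:", engine_config["pretrained_config"]["dtype"])  # bfloat16 from step 3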

6- Starting Triton server:

tritonserver --model-repository /app/src/../model_repository

7- Running:

python3 /app/src/lora/inflight_batcher_llm_client.py --top-k 0 --top-p 0.5 --request-output-len 10 --text hello --tokenizer-dir /app/data/gemma2_27b/lora/torch/1 --lora-path /app/data/gemma2_27b/lora/numpy/1 --lora-task-id 1 --streaming
github-actions bot added the Investigating and triaged labels on Jan 6, 2025