Yi-9B model 4-bit quantization fails with an error, how can I fix it? #457
Unanswered
codeman0987 asked this question in Q&A
Replies: 1 comment
-
On my side I have only tried llama.cpp's q4 quantization; the instruction generation it produces feels fairly good. For reference:
https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#prepare-and-quantize
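For concreteness, the llama.cpp q4 flow referenced above looks roughly like the two commands below. This is only a sketch: the output file names are placeholders invented here, and the exact script and binary names have changed across llama.cpp versions, so check the linked "Prepare and quantize" section for the steps matching your checkout.

# Convert the HF checkpoint to GGUF, then quantize to 4-bit (Q4_K_M).
# Output file names are placeholders, not taken from this thread.
python convert_hf_to_gguf.py models/01-ai__Yi-9B/ --outfile yi-9b-f16.gguf
./llama-quantize yi-9b-f16.gguf yi-9b-Q4_K_M.gguf Q4_K_M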
-
Running AWQ quantization with the official code:
python quantization/awq/quant_autoawq.py --model models/01-ai__Yi-9B/ --output_dir models/yi-9b-int4 --bits 4 --group_size 128 --trust_remote_code
It fails with the following error:
Generating validation split: 214670 examples [00:03, 55198.64 examples/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (8947 > 4096). Running this sequence through the model will result in indexing errors
AWQ: 2%|████▏ | 1/48 [00:22<17:49, 22.75s/it]
Traceback (most recent call last):
File "quantization/awq/quant_autoawq.py", line 53, in
run_quantization(args)
File "quantization/awq/quant_autoawq.py", line 21, in run_quantization
model.quantize(tokenizer, quant_config=quant_config)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/awq/models/base.py", line 176, in quantize
self.quantizer.quantize()
File "/opt/conda/lib/python3.8/site-packages/awq/quantize/quantizer.py", line 147, in quantize
input_feat = self._get_input_feat(self.modules[i], named_linears)
File "/opt/conda/lib/python3.8/site-packages/awq/quantize/quantizer.py", line 535, in _get_input_feat
self.inps = layer(self.inps, **module_kwargs)[0]
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 740, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 377, in forward
causal_mask = attention_mask[:, :, cache_position, : key_states.shape[-2]]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)
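The final RuntimeError points to a cross-GPU device mismatch: the message mentions cuda:1, so the model was evidently sharded across multiple GPUs, and the cache_position indices used to slice the causal mask ended up on a different device than the attention-mask tensor being indexed. A commonly suggested workaround for this class of error (an assumption on my part, not a confirmed fix from this thread) is to keep the whole calibration run on a single GPU:

# Pin the process to one GPU so the model is not sharded across devices.
# Same arguments as the failing command above; only the env var is new.
CUDA_VISIBLE_DEVICES=0 python quantization/awq/quant_autoawq.py --model models/01-ai__Yi-9B/ --output_dir models/yi-9b-int4 --bits 4 --group_size 128 --trust_remote_code

If the model does not fit on one GPU, the other avenue people typically try is aligning the transformers and autoawq versions, since the cache_position-based mask indexing shown in the traceback comes from newer transformers releases.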