Replies: 2 comments
-
Hi! Any update on this error? I replicated it and got the same message: `ValueError: .half() is not supported for 4-bit or 8-bit models. Please use the model as it is, since the model has already been casted to the correct dtype.`
-
@ocesp98, @rubenCrayon, can you please try the updated ZeRO-Inference with 4-bit quantization?
-
Is it possible to use DeepSpeed inference with a 4/8-bit quantized model using bitsandbytes?
I use the bitsandbytes package like this:
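(The original snippet isn't preserved in this thread; below is a minimal sketch of a typical 4-bit bitsandbytes load through transformers, assuming that is roughly the setup used. The model name and quantization parameters are placeholders, not the poster's exact code.)

```python
# Illustrative sketch only; not the poster's exact snippet.
# Assumes transformers and bitsandbytes are installed; the model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "facebook/opt-1.3b"  # placeholder

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # or load_in_8bit=True
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
```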
However, it throws an error: `ValueError: .half() is not supported for 4-bit or 8-bit models. Please use the model as it is, since the model has already been casted to the correct dtype.`
The ultimate goal is to combine the quantization with DeepSpeed ZeRO-Infinity offload, in the hope of running a larger model that currently does not fit on my GPU.
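(For context, a minimal sketch of the ZeRO-Inference/ZeRO-Infinity offload setup being aimed at, without the bitsandbytes quantization step that triggers the error above. The model name, config values, and launch details are illustrative only, not a tested recipe.)

```python
# Illustrative sketch of ZeRO stage-3 inference with parameter offload.
# Typically launched with the DeepSpeed launcher, e.g.: deepspeed --num_gpus 1 script.py
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

model_name = "facebook/opt-13b"  # placeholder

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},  # or "nvme" with "nvme_path"
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Must be created before from_pretrained() so weights are partitioned/offloaded at load time.
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
engine.module.eval()

tokenizer = AutoTokenizer.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(device)
outputs = engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```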