
Add transformers integration section #1904

Merged · 7 commits · Mar 15, 2024
46 changes: 45 additions & 1 deletion quanto-introduction.md
@@ -100,7 +100,51 @@ TO BE COMPLETED

## Integration in 🤗 transformers


(Note: this is the “transformers integration” paragraph from https://github.com/huggingface/blog/pull/1832.)

The Quanto library is seamlessly integrated into the Hugging Face transformers library. You can quantize any model and push it to the Hub by passing a `QuantoConfig` to `from_pretrained`!

Currently, you need to use the latest version of accelerate to make sure the integration is fully compatible.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize the model weights to int8 while loading
quantization_config = QuantoConfig(weights="int8")

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config
)
```

You can quantize the weights and/or activations to int8, float8, int4, or int2 simply by passing the corresponding argument to `QuantoConfig`. Activations can be either int8 or float8. Note that float8 requires hardware compatible with float8 precision; otherwise, quanto will silently upcast the weights and activations to `torch.float32` or `torch.float16` (depending on the model's original data type) when performing the matmul.
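
For instance, here is a minimal sketch of quantizing both weights and activations (the int4/int8 combination and the model choice are illustrative, not prescribed by the post):

```python
from transformers import AutoModelForCausalLM, QuantoConfig

# int4 weights with int8 activations -- one of the combinations
# described above. The model is just a small example.
quantization_config = QuantoConfig(weights="int4", activations="int8")

quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=quantization_config,
)
```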
**Contributor:** What you describe here also happens if activations are not quantized, because mixed matmuls are not supported by any hardware. Note also that PyTorch will raise an exception if you try to use float8 on an MPS device.

**Contributor (author):** Makes sense! I added something along the lines you suggested.

Quanto is device-agnostic: you can quantize your model whether you are on CPU, GPU, or MPS (Apple Silicon), and run the quantized model on any of these devices (with the exception of float8 precision).
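
As an illustrative sketch (the runtime device selection here is an assumption, not from the post), the same loading code works across devices:

```python
import torch
from transformers import AutoModelForCausalLM, QuantoConfig

# Pick whichever device is available; the loading code itself does not
# change. Float8 remains the exception (e.g., it raises on MPS).
device = "cuda" if torch.cuda.is_available() else "cpu"

quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=QuantoConfig(weights="int8"),
    device_map=device,
)
```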

Quanto is also `torch.compile`-friendly: you can quantize a model with quanto and call `torch.compile` on the model to compile it for faster generation.
**Contributor:** This won't work if some dynamic quantization is involved, i.e., if activations are quantized.

**Contributor (author):** I added a few explanations of the edge cases in b778f8d! Also note that one can change the activation type through the transformers integration, so users should keep activations set to `None` for `torch.compile`.

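A minimal sketch of that workflow, keeping activations unquantized (the default) as the discussion above recommends:

```python
import torch
from transformers import AutoModelForCausalLM, QuantoConfig

# Weights-only quantization: activations stay at their default of None,
# so no dynamic quantization interferes with torch.compile.
quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=QuantoConfig(weights="int8"),
)
compiled_model = torch.compile(quantized_model)
```
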

Note that it is also possible to quantize any model, regardless of its modality, using quanto! Below, we demonstrate how to quantize the `openai/whisper-large-v3` model to int8 using quanto.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, QuantoConfig

model_id = "openai/whisper-large-v3"
quanto_config = QuantoConfig(weights="int8")

model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="cuda",
quantization_config=quanto_config
)
```
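
As a hypothetical follow-up (the pipeline wiring and the `sample.wav` path are illustrative, not from the post), transcription then works as usual:

```python
from transformers import AutoProcessor, pipeline

# Reuse the quantized model loaded above inside a standard ASR pipeline.
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)
print(pipe("sample.wav")["text"])  # "sample.wav" is a placeholder audio file
```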

Check out this notebook for a complete tutorial on how to properly use quanto with transformers!

## Contributing to 🤗 quanto
