
Add transformers integration section #1904

Merged: 7 commits into DavidCorvoysier/quanto on Mar 15, 2024

Conversation

younesbelkada (Contributor)

@SunMarc (Member) left a comment:


Thanks for adding the transformers integration section!

quanto-introduction.md: 4 review threads (outdated, resolved)
@SunMarc SunMarc requested a review from dacorvo March 14, 2024 20:49
```python
from transformers import AutoModelForCausalLM, QuantoConfig

# weight-only int8 quantization through QuantoConfig; the model id is illustrative
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=quantization_config,
)
```

You can quantize the weights and/or activations to int8, float8, int4, or int2 by simply passing the corresponding argument to `QuantoConfig`. The activations can be either int8 or float8. Note that float8 requires hardware compatible with float8 precision; otherwise quanto will silently upcast the weights and activations to `torch.float32` or `torch.float16` (depending on the original data type of the model) when the matmul is performed.
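As a sketch of what that looks like (assuming the `weights` and `activations` keyword arguments of the transformers `QuantoConfig`; the model id is illustrative):

```python
from transformers import AutoModelForCausalLM, QuantoConfig

# int4 weights combined with int8 activations; weights accept "int8",
# "float8", "int4" or "int2", activations accept "int8" or "float8"
quantization_config = QuantoConfig(weights="int4", activations="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # illustrative model id
    quantization_config=quantization_config,
)
```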
dacorvo (Contributor):

What you describe here also happens if activations are not quantized, because mixed matmuls are not supported by any hardware. Note also that PyTorch will raise an exception if you try to use float8 on an MPS device.

younesbelkada (Contributor, Author):

Makes sense! I added something along the lines you suggested.


Quanto is device-agnostic: you can quantize your model whether you are on CPU, GPU, or MPS (Apple Silicon), and you can run quantized models on any of these devices.

Quanto is also `torch.compile`-friendly: you can quantize a model with quanto and call `torch.compile` on it to compile it for faster generation.
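A minimal sketch of that flow, assuming weight-only quantization (the model id is illustrative; see the discussion below about quantized activations):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "facebook/opt-350m"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=QuantoConfig(weights="int8"),
)

# compile the quantized model; the first forward pass triggers compilation
quantized_model = torch.compile(quantized_model)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
with torch.no_grad():
    logits = quantized_model(**inputs).logits
```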
dacorvo (Contributor):

This won't work if some dynamic quantization is involved, i.e. when the activations are quantized.

younesbelkada (Contributor, Author):

I added a few explanations of the edge cases in b778f8d! Also note that one can change the activation type with the transformers integration, so users should keep activations set to `None` for `torch.compile`.
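Concretely (a sketch, assuming the `activations` keyword of the transformers `QuantoConfig`):

```python
import torch
from transformers import AutoModelForCausalLM, QuantoConfig

# leaving activations=None (the default) keeps the model weight-only
# quantized, which stays torch.compile-friendly; dynamic activation
# quantization does not compose with torch.compile
config = QuantoConfig(weights="int8", activations=None)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # illustrative model id
    quantization_config=config,
)
model = torch.compile(model)
```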

@younesbelkada younesbelkada requested a review from dacorvo March 15, 2024 09:09
@dacorvo dacorvo merged commit fa24d45 into DavidCorvoysier/quanto Mar 15, 2024
@dacorvo dacorvo deleted the younesbelkada-patch-add-quanto branch March 15, 2024 09:10