Add transformers integration section #1904
Conversation
Thanks for adding the transformers integration section!
Co-authored-by: Marc Sun <[email protected]>
quanto-introduction.md (outdated)

You can quantize the weights and/or activations in int8, float8, int4 or int2 by simply passing the corresponding argument to `QuantoConfig`. The activations can be either int8 or float8. Note that for float8 you need hardware compatible with float8 precision; otherwise quanto will silently upcast the weights and activations to torch.float32 or torch.float16 (depending on the original data type of the model) when the matmul is performed.
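As a rough illustration of how that argument gets passed, a minimal sketch with the transformers integration could look like the following (the model id is only an example, and the exact `QuantoConfig` options are assumed from the description above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "facebook/opt-125m"  # example model; any causal LM should work

# Quantize weights to int8; leave activations unquantized (None)
quantization_config = QuantoConfig(weights="int8", activations=None)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```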
What you describe here also happens if activations are not quantized, because mixed matmuls are not supported by any hardware. Note also that PyTorch will raise an exception if you try to use float8 on an MPS device.
Makes sense! I added something along the lines you suggested.
quanto-introduction.md (outdated)

Quanto is device agnostic, meaning you can quantize your model regardless of whether you are on CPU, GPU or MPS (Apple Silicon), so you can run quantized models on any of these devices.

Quanto is also torch.compile friendly: you can quantize a model with quanto and call `torch.compile` on the model to compile it for faster generation.
This won't work if some dynamic quantization is involved, i.e.:
- QAT,
- quantized activations.

See: QTensor cannot be created from inside a dynamo graph (optimum-quanto#46)

So basically for frozen models it is OK (I think this is always the case in transformers); see the sketch below.
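A minimal sketch of that frozen-model flow, assuming quanto's `quantize`/`freeze` API (the toy model is illustrative):

```python
import torch
from quanto import quantize, freeze, qint8

# Toy model; any weight-only quantized, frozen model should behave the same
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 8),
)

# Weight-only quantization: activations=None means no dynamic
# quantization happens inside the graph at inference time
quantize(model, weights=qint8, activations=None)
freeze(model)  # materialize static quantized weights

compiled = torch.compile(model)
out = compiled(torch.randn(1, 64))
```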
I added a few explanations of the edge cases in b778f8d! Also note that one can change the activation type with the transformers integration, so users should keep activations set to None for torch.compile.
cc @dacorvo @SunMarc