
⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡


minmin-intel/intel-extension-for-transformers

 
 


Intel® Extension for Transformers

An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere

Release Notes

🏭Architecture   |   💬NeuralChat   |   😃Inference   |   💻Examples   |   📖Documentations

🚀Latest News

  • NeuralChat has been showcased in Intel Innovation’23 Keynote and Google Cloud Next'23 to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors.
  • NeuralChat supports custom chatbot development and deployment on a broad range of Intel hardware, including Xeon Scalable Processors, Gaudi2, Xeon CPU Max Series, Data Center GPU Max Series, Arc Series, and Core Processors. Check out the Notebooks and see the sample code below.
# pip install intel-extension-for-transformers
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
  • LLM Runtime extends the Hugging Face Transformers API to provide seamless low-precision inference for popular LLMs, supporting mainstream low-precision data types such as INT8/FP8/INT4/FP4/NF4.

🏃Installation

Quick Install from PyPI

pip install intel-extension-for-transformers

For other installation methods, see the Installation Page.
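To confirm that the install succeeded, a small check like the one below may help. It is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and is not part of the toolkit.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "intel-extension-for-transformers") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is a hypothetical helper written for this example.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("not installed; run: pip install intel-extension-for-transformers")
    else:
        print(f"intel-extension-for-transformers {found} is installed")
```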

🌟Introduction

Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms. It is particularly effective on 4th Gen Intel Xeon Scalable processors (codenamed Sapphire Rapids). The toolkit provides the below key features and examples:

🌱Getting Started

Below is sample code to enable weight-only low-precision inference. See more examples.

INT4 Inference

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "Intel/neural-chat-7b-v1-1"     # Hugging Face model_id or local model
config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
gen_tokens = model.generate(inputs, max_new_tokens=300)
outputs = tokenizer.batch_decode(gen_tokens)
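For intuition about what `weight_dtype="int4"` implies, the sketch below shows one common form of weight-only quantization: symmetric 4-bit rounding with one scale per group of weights. It is an illustration under stated assumptions, not the toolkit's kernel. The function names, the symmetric scheme, and the default group size are choices made for this example, and the algorithm behind `WeightOnlyQuantConfig` may differ.

```python
from typing import List, Tuple


def quantize_int4_groupwise(weights: List[float], group_size: int = 32) -> Tuple[List[int], List[float]]:
    """Quantize weights to signed 4-bit integers with one scale per group.

    Illustrative only: symmetric quantization, scale = max|w| / 7 per group.
    """
    q_values: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Avoid division by zero when a whole group is zero.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            # Clamp to the signed 4-bit range [-8, 7].
            q_values.append(max(-8, min(7, round(w / scale))))
    return q_values, scales


def dequantize_groupwise(q_values: List[int], scales: List[float], group_size: int = 32) -> List[float]:
    """Recover approximate floating-point weights from integers and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


if __name__ == "__main__":
    weights = [0.03 * i - 1.0 for i in range(64)]
    q, s = quantize_int4_groupwise(weights, group_size=32)
    restored = dequantize_groupwise(q, s, group_size=32)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"groups: {len(s)}, worst absolute error: {worst:.4f}")
```

A smaller group size means more scales to store but a tighter fit to each group's range, which is the trade-off the group-size columns in the validated-models table reflect.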

INT8 Inference

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "Intel/neural-chat-7b-v1-1"     # Hugging Face model_id or local model
config = WeightOnlyQuantConfig(compute_dtype="bf16", weight_dtype="int8")
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
gen_tokens = model.generate(inputs, max_new_tokens=300)
outputs = tokenizer.batch_decode(gen_tokens)
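The two configurations above differ mainly in how many bits each weight occupies. The sketch below gives a rough estimate of weight storage under several assumptions that do not come from the toolkit: a nominal 7 billion parameters, every parameter quantized, one 16-bit scale per group, and no zero-points or other metadata. The function name `weight_storage_gb` is hypothetical.

```python
from typing import Optional


def weight_storage_gb(num_params: float, weight_bits: int,
                      group_size: Optional[int] = None, scale_bits: int = 16) -> float:
    """Estimate weight storage in GB (10^9 bytes).

    group_size=None means no per-group scales (e.g. plain FP32 weights).
    All defaults are assumptions made for this example.
    """
    weight_bytes = num_params * weight_bits / 8
    scale_bytes = 0.0
    if group_size is not None:
        scale_bytes = (num_params / group_size) * scale_bits / 8
    return (weight_bytes + scale_bytes) / 1e9


if __name__ == "__main__":
    params = 7e9  # nominal size assumed for a "7b" model
    for label, bits, group in [("FP32", 32, None), ("INT8", 8, 32), ("INT4", 4, 32)]:
        print(f"{label}: {weight_storage_gb(params, bits, group):.2f} GB")
```

Actual footprints depend on which layers are quantized and on the storage format, so treat these figures as order-of-magnitude guidance only.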

🎯Validated Models

The table below reports the average accuracy of validated models across Lambada (OpenAI), HellaSwag, Winogrande, PIQA, and WikiText. Next-token latency is measured with 32 input tokens and greedy search on a 4th Gen Intel Xeon Scalable processor (codenamed Sapphire Rapids).

| Model | FP32 Accuracy | INT4 Accuracy (Group size 32) | INT4 Accuracy (Group size 128) | Next Token Latency |
|---|---|---|---|---|
| EleutherAI/gpt-j-6B | 0.643 | 0.644 | 0.64 | 21.98ms |
| meta-llama/Llama-2-7b-hf | 0.69 | 0.69 | 0.685 | 24.55ms |
| decapoda-research/llama-7b-hf | 0.689 | 0.682 | 0.68 | 24.84ms |
| EleutherAI/gpt-neox-20b | 0.674 | 0.672 | 0.669 | 80.16ms |
| mosaicml/mpt-7b-chat | 0.672 | 0.67 | 0.666 | 35.84ms |
| tiiuae/falcon-7b | 0.698 | 0.694 | 0.693 | 36.1ms |
| baichuan-inc/baichuan-7B | 0.474 | 0.471 | 0.47 | Coming Soon |
| facebook/opt-6.7b | 0.65 | 0.647 | 0.643 | Coming Soon |
| databricks/dolly-v2-3b | 0.613 | 0.609 | 0.609 | 22.02ms |
| tiiuae/falcon-40b-instruct | 0.756 | 0.757 | 0.755 | Coming Soon |
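To read the table at a glance, the sketch below derives two quantities from a few of its rows: the accuracy change of INT4 relative to FP32, and an approximate single-stream decode rate taken as the reciprocal of next-token latency. The numbers are copied from the table; the helper names are hypothetical, and the tokens-per-second figure is a simple conversion that ignores prompt processing and batching.

```python
from typing import Dict, Optional, Tuple

# (FP32 accuracy, INT4 group 32, INT4 group 128, next-token latency in ms or None)
RESULTS: Dict[str, Tuple[float, float, float, Optional[float]]] = {
    "EleutherAI/gpt-j-6B": (0.643, 0.644, 0.64, 21.98),
    "meta-llama/Llama-2-7b-hf": (0.69, 0.69, 0.685, 24.55),
    "EleutherAI/gpt-neox-20b": (0.674, 0.672, 0.669, 80.16),
    "facebook/opt-6.7b": (0.65, 0.647, 0.643, None),  # latency listed as "Coming Soon"
}


def accuracy_drop(fp32: float, int4: float) -> float:
    """Positive values mean the INT4 model scored lower than FP32."""
    return fp32 - int4


def tokens_per_second(latency_ms: Optional[float]) -> Optional[float]:
    """Convert next-token latency to an approximate single-stream decode rate."""
    if latency_ms is None:
        return None
    return 1000.0 / latency_ms


if __name__ == "__main__":
    for model, (fp32, g32, g128, latency) in RESULTS.items():
        rate = tokens_per_second(latency)
        rate_text = f"{rate:.1f} tok/s" if rate is not None else "n/a"
        print(f"{model}: drop g32={accuracy_drop(fp32, g32):+.3f}, "
              f"g128={accuracy_drop(fp32, g128):+.3f}, {rate_text}")
```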

Find other models like ChatGLM, ChatGLM2, StarCoder... in LLM Runtime

📖Documentation

OVERVIEW
NeuralChat   |   LLM Runtime
NEURALCHAT
Chatbot on Intel CPU   |   Chatbot on Intel GPU   |   Chatbot on Gaudi
Chatbot on Client   |   More Notebooks
LLM RUNTIME
LLM Runtime   |   Streaming LLM   |   Low Precision Kernels   |   Tensor Parallelism
LLM COMPRESSION
SmoothQuant (INT8)   |   Weight-only Quantization (INT4/FP4/NF4/INT8)   |   QLoRA on CPU
GENERAL COMPRESSION
Quantization   |   Pruning   |   Distillation   |   Orchestration
Neural Architecture Search   |   Export   |   Metrics   |   Objectives
Pipeline   |   Length Adaptive   |   Early Exit   |   Data Augmentation
TUTORIALS & RESULTS
Tutorials   |   LLM List   |   General Model List   |   Model Performance

📃Selected Publications/Events

View Full Publication List.

Additional Content

Acknowledgements

💁Collaborations

We welcome any interesting ideas on model compression techniques and LLM-based chatbot development. Feel free to reach out to us; we look forward to collaborating with you on Intel Extension for Transformers!
