Optimizing Components of fastRAG on Intel Hardware

Models can be further optimized through software frameworks to improve latency and throughput. Software packages such as optimum-intel, developed by Intel and partners, are designed to leverage the CPU extensions found in the most recent Intel processors. With the optimum-intel library, transformer-based models can be quantized, sparsified, or improved through knowledge distillation.

Quantization

Quantization reduces both the computational overhead and the memory footprint of inference. It does so by representing model weights and activations with lower-precision data types, such as 8-bit integers (int8), instead of the standard 32-bit floating-point numbers (float32). To facilitate these optimizations, frameworks like the Intel Extension for PyTorch and optimum-intel provide specialized support for the latest Intel CPU features.
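To make the float32-to-int8 idea concrete, here is a minimal pure-Python sketch of symmetric per-tensor quantization. It is an illustration only, not how optimum-intel or the Intel Extension for PyTorch implement quantization; the function names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to int8 codes using one shared scale (symmetric scheme)."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.42, -1.27, 0.05, 0.9]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

for w, c, r in zip(weights, codes, restored):
    print(f"float {w:+.4f} -> int8 {c:+4d} -> float {r:+.4f}")
```

Each restored value differs from the original by at most half the scale, which is the rounding error that quantization trades for a smaller and faster representation.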

Why should we optimize using quantization?

Using fewer bits per value yields a model that needs less memory, may consume less energy, and can run operations such as matrix multiplication faster by using integer arithmetic.
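The memory saving can be estimated with simple arithmetic. The sketch below compares weight storage at float32 and int8; the parameter count is a hypothetical example, and the estimate ignores activations, quantization scales, and any layers left in higher precision.

```python
# Bytes needed to store one value of each data type.
BYTES_PER_PARAM = {"float32": 4, "int8": 1}


def weight_memory_bytes(num_params, dtype):
    """Estimate the storage needed for model weights alone."""
    return num_params * BYTES_PER_PARAM[dtype]


# Hypothetical model size, chosen only for illustration.
num_params = 110_000_000

fp32_bytes = weight_memory_bytes(num_params, "float32")
int8_bytes = weight_memory_bytes(num_params, "int8")
ratio = fp32_bytes / int8_bytes

print(f"float32: {fp32_bytes / 1e6:.0f} MB")
print(f"int8:    {int8_bytes / 1e6:.0f} MB")
print(f"reduction: {ratio:.0f}x")
```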

Available Optimizations

| Optimization | Framework | Backend |
|---|---|---|
| LLM Quantization | `optimum-intel` | CPU |
| Bi-encoder Quantization | `optimum-intel` | CPU |
| Cross-encoder Quantization | `neural-compressor`, `ipex` | CPU |
| LlamaCPP LLMs | `llama_cpp` | CPU |
