High-efficiency floating-point neural network inference operators for mobile, server, and Web
BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
The Tensor Algebra SuperOptimizer for Deep Learning
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
Batch normalization fusion for PyTorch (a conv + BN folding sketch follows the list below)
Optimize the layer structure of Keras models to reduce computation time
A set of tools to make your life easier with TensorRT and ONNX Runtime. This repo is designed for YOLOv3
Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024)
Blog posts, reading reports, and code examples for AGI/LLM-related knowledge
Cross-platform modular neural network inference library, small and efficient
Faster YOLOv8 inference: optimize and export YOLOv8 models for faster inference using OpenVINO and NumPy 🔢
Learn the ins and outs of efficiently serving Large Language Models (LLMs). Dive into optimization techniques, including KV caching and Low-Rank Adapters (LoRA), and gain hands-on experience with Predibase's LoRAX inference server (a minimal KV-cache sketch follows the list below).
[WIP] A template for getting started writing code using GGML
LLM-Rank: a graph-theoretical approach to structured pruning of large language models based on weighted PageRank centrality, as introduced in the accompanying paper.
Your AI Catalyst: inference backend to maximize your model's inference performance
Accelerating LLM inference with techniques like speculative decoding, quantization, and kernel fusion, focusing on implementing state-of-the-art research papers (a simplified speculative-decoding sketch follows the list below).
A constrained expectation-maximization algorithm for feasible graph inference.
Modified inference engine for quantized convolution using product quantization
Batch Partitioning for Multi-PE Inference with TVM (2020)
🤖️ Optimized CUDA Kernels for Fast MobileNetV2 Inference
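
Several of the entries above implement a single, well-defined trick. To make the batch-normalization-fusion entry concrete, here is a minimal sketch of conv + BN folding written against PyTorch's public `nn.Conv2d`/`nn.BatchNorm2d` API. It is an illustrative assumption of how such folding works, not that project's actual code.

```python
# Fold an eval-mode BatchNorm2d into the preceding Conv2d (illustrative sketch).
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a single Conv2d equivalent to conv followed by bn in eval mode."""
    fused = nn.Conv2d(
        conv.in_channels, conv.out_channels, conv.kernel_size,
        stride=conv.stride, padding=conv.padding,
        dilation=conv.dilation, groups=conv.groups, bias=True,
    )
    # Per-output-channel scale: gamma / sqrt(running_var + eps)
    scale = bn.weight.data / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused

# Usage: swap conv followed by bn for the single fused conv at inference time.
conv, bn = nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16)
conv.eval(); bn.eval()
x = torch.randn(1, 3, 32, 32)
assert torch.allclose(fuse_conv_bn(conv, bn)(x), bn(conv(x)), atol=1e-5)
```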
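
Likewise, a minimal NumPy sketch of the KV-caching idea mentioned in the LLM-serving course entry: keys and values for already-generated tokens are stored and reused, so each decode step only projects the newest token. The single-head shapes and toy weights below are assumptions for illustration, not LoRAX's API.

```python
# KV caching for autoregressive decoding: one attention head, toy weights.
import numpy as np

d = 64                                   # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention of one query against cached keys/values."""
    scores = q @ K.T / np.sqrt(d)        # (1, t)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                   # (1, d)

# Decode loop: instead of recomputing K/V for the whole prefix each step,
# append the new token's key/value to the cache and reuse the rest.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
x = rng.standard_normal((1, d))          # embedding of the first token
for step in range(8):
    q, k, v = x @ Wq, x @ Wk, x @ Wv     # O(1) new projections per step
    K_cache = np.vstack([K_cache, k])    # cache grows to O(t) memory
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)
    x = out                              # toy stand-in for the next token's embedding
```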
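
And a simplified sketch of speculative decoding, referenced by the LLM-inference-acceleration entry: a cheap draft model proposes a block of tokens and the expensive target model verifies them in one pass. This greedy-verification variant, along with the toy `draft_logits`/`target_logits` stand-ins, is an assumption for illustration; the published method samples and verifies using the two models' full distributions.

```python
# Greedy speculative decoding sketch: draft K tokens cheaply, verify in one target pass.
import numpy as np

VOCAB, K = 50, 4                          # vocabulary size, draft block length (toy values)
rng = np.random.default_rng(0)

def target_logits(tokens):
    """Expensive model (stand-in): logits for every position in one pass."""
    return rng.standard_normal((len(tokens), VOCAB))

def draft_logits(tokens):
    """Cheap model (stand-in): logits for the last position only."""
    return rng.standard_normal(VOCAB)

def speculative_step(prefix):
    # 1. Draft K tokens autoregressively with the cheap model (greedy).
    draft = list(prefix)
    for _ in range(K):
        draft.append(int(np.argmax(draft_logits(draft))))
    proposed = draft[len(prefix):]
    # 2. Score the whole drafted sequence with ONE target-model pass.
    logits = target_logits(draft)
    # 3. Accept the longest run of proposals the target model agrees with,
    #    correcting the first disagreement with the target's own choice.
    accepted = []
    for i, tok in enumerate(proposed):
        target_choice = int(np.argmax(logits[len(prefix) + i - 1]))
        if target_choice != tok:
            accepted.append(target_choice)
            break
        accepted.append(tok)
    return prefix + accepted

print(speculative_step([1, 2, 3]))
```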