Model | Batch | Hardware | ttft (s) | t/s/u | Target t/s/u | Release |
---|---|---|---|---|---|---|
Falcon7B-decode | 32 | e150 | | 4.2 | 4.4 | |
Falcon7B | 32 | n150 | 0.08 | 16.7 | 26 | v0.51.0-rc24 |
Mistral-7B | 32 | n150 | | 9.9 | 25 | v0.51.0-rc28 |
Mamba-2.8B | 32 | n150 | 0.04 | 12.3 | 41 | v0.51.0-rc26 |
LLaMA-3.1-8B | 1 | n150 | | 8.3 | 23 | v0.51.0-rc28 |
Falcon7B (data parallel) | 256 | LoudBox | 0.11 | 13.4 | 26 | v0.51.0-rc36 |
LLaMA-2-70B (tensor parallel) | 32 | LoudBox | | 10.4 | 20 | v0.51.0-rc36 |
LLaMA-3.1-70B (tensor parallel) | 32 | LoudBox | | 10.4 | 20 | v0.51.0-rc36 |
Falcon40B (tensor parallel) | 32 | LoudBox | | 5.3 | 36 | v0.51.0-rc35 |
Mixtral7Bx8 (tensor parallel) | 32 | LoudBox | 0.19 | 15.7 | 33 | v0.51.0-rc33 |
Falcon7B (data parallel) | 1024 | Galaxy | 0.30 | 4.0 | 26 | v0.51.0-rc30 |

Notes:
- The reported LLM performance is for an input sequence length (number of rows filled in the KV cache) of 128 for all models except Mamba, which can accept any sequence length.
- The reported t/s/u (tokens per second per user) is the throughput of the first token generated after prefill, i.e. 1 / inter-token latency.

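
The relationship in the last note can be sketched in a few lines of Python. This is a minimal illustration, not part of any Tenstorrent tooling: the function names are hypothetical, and the aggregate figure assumes every user in the batch decodes concurrently at the reported per-user rate, which the tables do not state.

```python
def tokens_per_second_per_user(inter_token_latency_s: float) -> float:
    """t/s/u as defined in the notes: 1 / inter-token latency (seconds)."""
    if inter_token_latency_s <= 0:
        raise ValueError("inter-token latency must be positive")
    return 1.0 / inter_token_latency_s


def aggregate_tokens_per_second(tsu: float, batch: int) -> float:
    """Assumption (not from the tables): all users in the batch decode
    concurrently, so total throughput is t/s/u times the batch size."""
    return tsu * batch


# Falcon7B on n150 from the table: 16.7 t/s/u at batch 32.
tsu = 16.7
latency_ms = 1000.0 / tsu  # invert the definition to recover the latency
total = aggregate_tokens_per_second(tsu, 32)

print(f"inter-token latency: {latency_ms:.1f} ms")
print(f"aggregate throughput: {total:.1f} tokens/s")
```

Inverting the definition this way is useful when comparing rows with different batch sizes, since t/s/u alone does not show how much total work the device is doing.
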
Model | Batch | Hardware | fps | Target fps | Release |
---|---|---|---|---|---|
ResNet-50 (224x224) | 20 | e150 | 5,100 | 10,000 | |
ResNet-50 (224x224) | 16 | n150 | 4,100 | 7,000 | |
ResNet-50 (224x224) (data parallel) | 128 | LoudBox | 31,250 | 56,000 | |
ResNet-50 (224x224) (data parallel) | 512 | Galaxy | 54,100 | 224,000 | |
ResNet-50 (224x224) (data parallel) | 1024 | Two Galaxies | 107,000 | 448,000 | |
ViT | 8 | e150 | 860 | 2,000 | |
Stable Diffusion 1.4 (512x512) | 1 | n150 | 0.167 | 0.3 | |

Model | Batch | Hardware | sen/sec | Target sen/sec | Release |
---|---|---|---|---|---|
BERT-Large | 12 | e150 | 370 | 410 | |
BERT-Large | 8 | n150 | 270 | 400 | |
T5 small | | e150 | 140 | | |
Bloom | | e150 | 70 | | |

For the latest model updates and features, please see MODEL_UPDATES.md
- Advanced Performance Optimizations for Models (updated Sept 8th)
- Programming Mesh of Devices (updated Sept 9th)

TT-Metalium is our low-level programming model, enabling kernel development for Tenstorrent hardware.
Get started with simple kernels.
- Matrix Engine (updated Sept 6th)
- Tensor Layouts (updated Sept 6th)
- Data Formats (updated Sept 7th)
- Saturating DRAM Bandwidth (updated Sept 6th)
- Flash Attention on Wormhole (updated Sept 6th)
- CNNs on TT Architectures (updated Sept 6th)