Note: Please refer to the front-page README for the latest verified release for each model.
- Added data parallel demo for a single Galaxy (32 chips)
- Refactored all modules and tests to use ttnn multi-device tensors (see the sketch below)
Note: This feature is available as of release v0.51.0-rc33
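For context, here is a minimal sketch of the multi-device tensor flow, assuming the ttnn mesh-device API (`ttnn.open_mesh_device`, `ttnn.ShardTensorToMesh`, `ttnn.ConcatMeshToTensor`); exact names and signatures may differ between releases, and the shapes are illustrative rather than the demo's actual configuration.

```python
# Minimal sketch: shard one activation tensor across a mesh of devices.
# Assumes the ttnn mesh-device API; names/signatures may vary by release.
import torch
import ttnn

# Open a 1 x 8 mesh (a full Galaxy would be a 32-chip mesh).
mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(1, 8))

torch_activations = torch.randn(1, 1, 32, 4096)

# Shard the hidden dimension across the 8 devices: each device holds a
# [1, 1, 32, 512] slice of the tensor.
activations = ttnn.from_torch(
    torch_activations,
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=mesh_device,
    mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=3),
)

# ... run model ops on all devices in parallel ...

# Gather the shards back into a single host tensor.
result = ttnn.to_torch(
    activations, mesh_composer=ttnn.ConcatMeshToTensor(mesh_device, dim=3)
)

ttnn.close_mesh_device(mesh_device)
```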
- Added multi-batching support to the demo for running multiple batches of users consecutively
- Improved end-to-end performance by optimizing the attention mask used in flash decoding
- Added support for flash decoding
- Updated the demo to support multiple batches of users
- Updated the demo to run the full prefill graph over the prompt instead of feeding it through decode one token at a time
- Added support for decode with 32K context length using flash decoding
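For readers unfamiliar with flash decoding, the plain-PyTorch sketch below illustrates the underlying split-KV idea: attend to chunks of the KV cache independently, then combine the partial results using their log-sum-exp weights. Splitting along the sequence axis is what lets a single-token decode step parallelize over a long (e.g. 32K) cache. This is a conceptual reference only, not the fused on-device ttnn kernel; the function name `flash_decode` and all shapes are hypothetical.

```python
# Conceptual sketch of flash decoding (split-KV attention) in plain PyTorch.
# Illustrates the algorithm only; the actual ttnn kernel is fused and
# runs on-device with its own attention-mask handling.
import math
import torch

def flash_decode(q, k, v, num_splits=4):
    """Single-token decode attention over a long KV cache.

    q: [batch, heads, 1, dim]      (the one new query position)
    k, v: [batch, heads, seq, dim] (the cached keys/values)
    """
    scale = 1.0 / math.sqrt(q.shape[-1])
    partial_outs, partial_lses = [], []

    # 1) Split the KV cache along the sequence axis and attend to each
    #    chunk independently, keeping the log-sum-exp for renormalization.
    for k_chunk, v_chunk in zip(k.chunk(num_splits, dim=2),
                                v.chunk(num_splits, dim=2)):
        scores = q @ k_chunk.transpose(-1, -2) * scale    # [b, h, 1, chunk]
        lse = torch.logsumexp(scores, dim=-1, keepdim=True)
        partial_outs.append(torch.exp(scores - lse) @ v_chunk)
        partial_lses.append(lse)

    # 2) Combine the partial results: weight each chunk's output by its
    #    share of the global softmax mass.
    lses = torch.cat(partial_lses, dim=-1)                # [b, h, 1, splits]
    weights = torch.softmax(lses, dim=-1).unsqueeze(-1)   # [b, h, 1, splits, 1]
    return (torch.stack(partial_outs, dim=3) * weights).sum(dim=3)
```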
- Fused mixture of experts into a single operation using `ttnn.moe`
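The fused op's exact signature is not documented here; as a point of reference, the sketch below shows the unfused gate, top-k routing, and weighted expert-sum computation (Mixtral-style top-2 routing) that a fused MoE operation replaces. The helper `moe_reference` and its shapes are hypothetical.

```python
# Plain-PyTorch reference for the mixture-of-experts step that a fused op
# like ttnn.moe collapses into a single device operation.
import torch

def moe_reference(x, gate_w, experts, top_k=2):
    """x: [tokens, dim]; gate_w: [dim, num_experts];
    experts: list of callables mapping [tokens, dim] -> [tokens, dim]."""
    logits = x @ gate_w                                  # [tokens, num_experts]
    weights, chosen = torch.topk(logits, top_k, dim=-1)  # route each token
    weights = torch.softmax(weights, dim=-1)             # normalize over top-k

    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e, expert in enumerate(experts):
            mask = chosen[:, slot] == e                  # tokens routed to expert e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out
```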
- Added support for LLaMA 3.1 8B
- Runs fast prefill for sequence lengths of up to 512 tokens
- Supports a maximum context length of 8K tokens
- Added support for LLaMA 3.1 70B (new scaled rotary position embeddings; see the sketch below)
- Prefill and decode now support 8K context length with batch size 16
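LLaMA 3.1 rescales the RoPE inverse frequencies so the model can extrapolate beyond its original 8K training context. The sketch below follows Meta's published reference parameters (scale factor 8, low/high frequency factors 1 and 4, original context length 8192); it is an illustration of the scaling rule, not the demo's exact code.

```python
# LLaMA 3.1-style RoPE frequency scaling, per Meta's reference parameters.
import math
import torch

def apply_llama31_rope_scaling(freqs, scale_factor=8.0,
                               low_freq_factor=1.0, high_freq_factor=4.0,
                               old_context_len=8192):
    """Rescale RoPE inverse frequencies for long-context extrapolation."""
    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    new_freqs = []
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            new_freqs.append(freq)                 # high frequencies: unchanged
        elif wavelen > low_freq_wavelen:
            new_freqs.append(freq / scale_factor)  # low frequencies: scaled down
        else:
            # smooth interpolation between the two regimes
            smooth = (old_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            new_freqs.append((1 - smooth) * freq / scale_factor + smooth * freq)
    return torch.tensor(new_freqs, dtype=freqs.dtype, device=freqs.device)
```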
- Added prefill support for 4K context length, using scaled dot product attention
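As a rough illustration of what prefill with scaled dot product attention means, the sketch below scores an entire 4K prompt in one causal attention call using PyTorch's built-in `scaled_dot_product_attention`; the demo runs the equivalent fused op in ttnn on-device, and all shapes here are hypothetical.

```python
# Sketch: prefill attention over a whole 4K prompt in one fused call.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 32, 4096, 128
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# is_causal=True applies the lower-triangular mask so each token attends
# only to earlier tokens, covering the full prompt in a single call.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
assert out.shape == (batch, heads, seq_len, head_dim)
```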