Note: Please refer to the front-page README for the latest verified release for each model.
- Added data parallel demo for a single Galaxy (32 chips)
- Refactored all modules and tests to use ttnn multi-device tensors (see the sketch below)
Note: This feature is available as of release v0.51.0-rc33
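For context, here is a minimal sketch of the multi-device tensor flow, assuming the ttnn mesh-device API (`ttnn.open_mesh_device`, `ttnn.ShardTensorToMesh`, `ttnn.ConcatMeshToTensor`); exact names and signatures may differ between releases, and the shapes are illustrative rather than the demo's actual configuration.

```python
# Minimal sketch: shard one activation tensor across a mesh of devices.
# Assumes the ttnn mesh-device API; names/signatures may vary by release.
import torch
import ttnn

# Open a 1 x 8 mesh (a full Galaxy would be a 32-chip mesh).
mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(1, 8))

torch_activations = torch.randn(1, 1, 32, 4096)

# Shard the hidden dimension across the 8 devices: each device holds a
# [1, 1, 32, 512] slice of the tensor.
activations = ttnn.from_torch(
    torch_activations,
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=mesh_device,
    mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=3),
)

# ... run model ops on all devices in parallel ...

# Gather the shards back into a single host tensor.
result = ttnn.to_torch(
    activations, mesh_composer=ttnn.ConcatMeshToTensor(mesh_device, dim=3)
)

ttnn.close_mesh_device(mesh_device)
```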
- Added multi-batching support to the demo for running multiple batches of users consecutively
- Improved end-to-end performance by optimizing the attention mask used in flash decoding
- Added support for flash decoding
- Updated the demo to support multiple batches of users
- Updated the demo to run the full prefill graph over the prompt instead of feeding it through decode one token at a time
- Added support for decode with 32K context length using flash decoding
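For readers unfamiliar with flash decoding, the plain-PyTorch sketch below illustrates the underlying split-KV idea: attend to chunks of the KV cache independently, then combine the partial results using their log-sum-exp weights. Splitting along the sequence axis is what lets a single-token decode step parallelize over a long (e.g. 32K) cache. This is a conceptual reference only, not the fused on-device ttnn kernel; the function name `flash_decode` and all shapes are hypothetical.

```python
# Conceptual sketch of flash decoding (split-KV attention) in plain PyTorch.
# Illustrates the algorithm only; the actual ttnn kernel is fused and
# runs on-device with its own attention-mask handling.
import math
import torch

def flash_decode(q, k, v, num_splits=4):
    """Single-token decode attention over a long KV cache.

    q: [batch, heads, 1, dim]      (the one new query position)
    k, v: [batch, heads, seq, dim] (the cached keys/values)
    """
    scale = 1.0 / math.sqrt(q.shape[-1])
    partial_outs, partial_lses = [], []

    # 1) Split the KV cache along the sequence axis and attend to each
    #    chunk independently, keeping the log-sum-exp for renormalization.
    for k_chunk, v_chunk in zip(k.chunk(num_splits, dim=2),
                                v.chunk(num_splits, dim=2)):
        scores = q @ k_chunk.transpose(-1, -2) * scale    # [b, h, 1, chunk]
        lse = torch.logsumexp(scores, dim=-1, keepdim=True)
        partial_outs.append(torch.exp(scores - lse) @ v_chunk)
        partial_lses.append(lse)

    # 2) Combine the partial results: weight each chunk's output by its
    #    share of the global softmax mass.
    lses = torch.cat(partial_lses, dim=-1)                # [b, h, 1, splits]
    weights = torch.softmax(lses, dim=-1).unsqueeze(-1)   # [b, h, 1, splits, 1]
    return (torch.stack(partial_outs, dim=3) * weights).sum(dim=3)
```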
- Fused mixture of experts into a single operation using `ttnn.moe`
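The fused op's exact signature is not documented here; as a point of reference, the sketch below shows the unfused gate, top-k routing, and weighted expert-sum computation (Mixtral-style top-2 routing) that a fused MoE operation replaces. The helper `moe_reference` and its shapes are hypothetical.

```python
# Plain-PyTorch reference for the mixture-of-experts step that a fused op
# like ttnn.moe collapses into a single device operation.
import torch

def moe_reference(x, gate_w, experts, top_k=2):
    """x: [tokens, dim]; gate_w: [dim, num_experts];
    experts: list of callables mapping [tokens, dim] -> [tokens, dim]."""
    logits = x @ gate_w                                  # [tokens, num_experts]
    weights, chosen = torch.topk(logits, top_k, dim=-1)  # route each token
    weights = torch.softmax(weights, dim=-1)             # normalize over top-k

    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e, expert in enumerate(experts):
            mask = chosen[:, slot] == e                  # tokens routed to expert e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out
```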
- Added support for LLaMA 3.1 8B
- Runs fast prefill for sequence lengths of up to 512 tokens
- Supports a maximum context length of 8K tokens
- Added support for LLaMA 3.1 70B (new scaled rotary position embeddings; see the sketch below)
- Prefill and decode now support 8K context length with batch size 16
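LLaMA 3.1 rescales the RoPE inverse frequencies so the model can extrapolate beyond its original 8K training context. The sketch below follows Meta's published reference parameters (scale factor 8, low/high frequency factors 1 and 4, original context length 8192); it is an illustration of the scaling rule, not the demo's exact code.

```python
# LLaMA 3.1-style RoPE frequency scaling, per Meta's reference parameters.
import math
import torch

def apply_llama31_rope_scaling(freqs, scale_factor=8.0,
                               low_freq_factor=1.0, high_freq_factor=4.0,
                               old_context_len=8192):
    """Rescale RoPE inverse frequencies for long-context extrapolation."""
    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    new_freqs = []
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            new_freqs.append(freq)                 # high frequencies: unchanged
        elif wavelen > low_freq_wavelen:
            new_freqs.append(freq / scale_factor)  # low frequencies: scaled down
        else:
            # smooth interpolation between the two regimes
            smooth = (old_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            new_freqs.append((1 - smooth) * freq / scale_factor + smooth * freq)
    return torch.tensor(new_freqs, dtype=freqs.dtype, device=freqs.device)
```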
- Added prefill support for 4K context length, using scaled dot product attention
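As a rough illustration of what prefill with scaled dot product attention means, the sketch below scores an entire 4K prompt in one causal attention call using PyTorch's built-in `scaled_dot_product_attention`; the demo runs the equivalent fused op in ttnn on-device, and all shapes here are hypothetical.

```python
# Sketch: prefill attention over a whole 4K prompt in one fused call.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 32, 4096, 128
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# is_causal=True applies the lower-triangular mask so each token attends
# only to earlier tokens, covering the full prompt in a single call.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
assert out.shape == (batch, heads, seq_len, head_dim)
```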