This page is the project tracker for getting halo models such as llama3 and grok-1 working on one or more MI3xx GPUs using SHARK/IREE.
- llama3.1 405B sharded across 8 MI300X GPUs producing correct numerical results (P0)
- llama3.1 405B sharded across 8 MI300X GPUs performant at the level of vLLM PyTorch (Fused Ops Eager Mode) (P1)
(Note: Use llama3.1 8B or 70B to develop and test)
TPn: Tensor Parallel across n GPUs, where large tensors are sharded across the GPUs using sharktank and the scatter/gather to/from the GPUs is expressed in a single MLIR module
(Model is assumed to be llama3.1 in the following table, e.g. "8B FP8" means "llama3.1 8B FP8 model")
Item | 10/18/24 | 10/25/24 | 11/1/24 | 11/8/24 | 11/15/24 |
---|---|---|---|---|---|
Machine and Storage | Two 8x MI300X SPX-mode machines verified working, with how-to-use info added to Nod AI Lab @saienduri (Done: 10/17) | -Install 60TB storage on SharkMi300X (Done: 10/21) -Set up one more 8x air-cooled MI300 machine (SharkMi300X-3) with 60TB added (Done: 10/24) @saienduri | -Set up one more 8x MI300 air-cooled machine (SharkMi300X-4) with 60TB @saienduri -Add 30TB to each of SharkMi300X and SharkMi300X-3 @saienduri | | |
Sharktank Modeling | IREE-compilable 8B FP8 MLIR @dan garvey (Done: 10/17) | -Verify numerics using quant-dequant on CPU vs run on MI300 for 8B FP8 @dan -Get 8B, 70B (Done: 10/23) and 405B FP8 MLIR (ETA: 10/24) and verify (CPU vs MI300) numerics for 70B @dan -Wire up Perplexity flow to run vmfb using iree-run-module (score too high, ETA: 10/24) @archana -Debug 70B running OOM on 1 MI300 @kyle (ETA: 10/25) -Quantized sharding support (ETA: 10/24) @Ian | Regenerate and verify MLIR without decomposition of SDPA for 8B, 70B, 405B for FP16 @kyle | | |
Sharding | 8 CPU core sharded FP16 numerically verified @boian; compilation issue, ETA: 10/25 | 8 GPU sharding for FP16 and FP8 compiling for MI300 @rob/@Ian; sharding verified on CPU, compilation fails, ETA: 10/25 | 8 GPU sharding for FP16 and FP8 numerics verified on MI300 @boian/@rob | | |
IREE code generation | 8B FP16 attention head with dynamic shape generating valid vmfb, ETA: 10/24 @mahesh; FP8 attention (use intrinsic for FP8 effectively), ETA: 10/25 @stanley | Paged Attention @kunwar/@manupa | Perf Tuning (all) | | |
Inference Profiling | Tracy profile 8B FP16 w/ decomposition @kyle (Done: 10/17) | -Tracy profile for 8B FP8 w/ and w/o decomposition @kyle -Tracy profile 405B with 3 attention blocks w/ decomposition @Avi | | | |
Shortfin Serving | llama3.1 8B FP16 IREE-compiled working using shortfin @xida (KV cache is broken, needs help) | | | | |
W/ Serving Inference Performance | llama3.1 8B FP16 IREE-compiled working using shortfin, performance numbers @avi | Performance tuning for sharding @boian/@rob | | | |
Test Automation | -8B FP16 prefill attnhead, decode attnhead, and full-model IREE-compiled perf tests in sharktank CI @avi -8B FP16 IREE-compiled numerics tested using Perplexity @archana | -8B FP8 prefill attnhead, decode attnhead, and full-model IREE-compiled perf tests in sharktank CI @avi -8B FP8 IREE-compiled numerics tested using Perplexity @archana -8 CPU core sharded 8B FP16 numeric test added @boian | 8 GPU sharded 8B FP8 test added @boian | | |
Report dashboard | Show all currently running perf and numeric llama3.1 component and full-model test reports on a page @saienduri | | | | |
Release Packaging/testing | Have a test release with 8B FP16 @chris | Test release with 8B FP8 @chris | | | |
(MI300X GPU, SPX Mode)
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
llama3.1-8B-FP16-decomposed | PASS TP1 mlir gguf irpa | PASS vmfb | PASS | ||
llama3.1-8B-FP16-decomposed-TP8 | PASS (MLIR) | PASS | |||
llama3.1-70B-FP16-decomposed | PASS TP1 mlir gguf irpa | PASS vmfb | FAIL OOM | ||
llama3.1-405B-FP16-decomposed | PASS TP1 mlir gguf | ||||
llama3.1-8B-FP8-decomposed | PASS TP1 mlir irpa | Fails in iree, patch | |||
llama3.1-70B-FP8-decomposed | PASS TP1 mlir irpa | Fails in iree, patch | |||
llama3.1-405B-FP8-decomposed | |||||
llama3.1-8B-FP16 | PASS mlir | Fails in iree, patch | |||
llama3.1-70B-FP16 | PASS mlir | Fails in iree, patch | |||
llama3.1-405B-FP16 | |||||
llama3.1-8B-FP8 | FAIL qkv must have same data type | ||||
llama3.1-70B-FP8 | FAIL qkv must have same data type | ||||
llama3.1-405B-FP8 | FAIL qkv must have same data type | ||||
llama-toy-size-FP32-TP2-CPU | PASS | PASS |
(MI300X GPU, SPX Mode, Time in ms)
Item | 10/25/24 | 11/1/24 | 11/8/24 | 11/15/24 | Target(vLLM-PyTorch) |
---|---|---|---|---|---|
llama3.1-8B-FP16 | |||||
llama3.1-70B-FP16 | |||||
llama3.1-405B-FP16 | |||||
llama3.1-8B-FP8 | |||||
llama3.1-70B-FP8 | |||||
llama3.1-405B-FP8 |
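The latencies in this table are expected to come from benchmarking the IREE-compiled modules; a minimal sketch is shown below, assuming iree-benchmark-module is used. The device URI, entry-point name, and inputs are placeholders that must match the signatures produced by the export step described later on this page:
iree-benchmark-module --device=hip://0 --module=<vmfb file> --parameters=model=<irpa file> --function=prefill_bs1 --input=<tokens> --input=<seq_lens> --input=<seq_block_ids> --input=<cache_state>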
category | issue link | assigned to | status |
---|---|---|---|
iree codegen | 18864 | unassigned | OOM for 70B |
quark quantization | QUARK-71 | unassigned | FP8 matmul should be used in attention |
TBD: Sai, please add a link to the nightly tests that cover any component or the full model of llama3.
Export FP8:
python -m sharktank.examples.export_paged_llm_v1 --irpa-file=native_fp8_e4m3fnuz_llama3_8b.irpa --output-mlir native_fp8_e4m3fnuz_llama3_8b.mlir --no-fake-quant --bs=1 --attention-kernel=torch_sdpa
Compile the exported MLIR to a vmfb for MI300 (gfx942):
iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 <mlir file> -o <vmfb file>
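To sanity-check the compiled module, an invocation along these lines should work (a sketch only: the device URI, entry-point name, and inputs are placeholders that must match the signatures produced by the export step above):
iree-run-module --device=hip://0 --module=<vmfb file> --parameters=model=native_fp8_e4m3fnuz_llama3_8b.irpa --function=prefill_bs1 --input=<tokens> --input=<seq_lens> --input=<seq_block_ids> --input=<cache_state>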
Follow the steps here
In a browser, click on sharkblobs, then click on "Blob containers", and then click on "halo-models".
Or use the command line: first install the az CLI:
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
Then get the account key for the storage account: click on "Storage Accounts" in Azure Services or search for "sharkblobs" in the top search bar, then click on sharkblobs. On the left sidebar, under "Security + networking", click on "Access keys". Copy the account key from there and use it in the following commands. To upload:
az storage blob upload --account-name sharkblobs --container-name sharkblobs --name <azure path, example: halo-models/llama3_8b/tp1/llama.mlir> --file <local_path_on_computer> --account-key <key_retrieved_from_directions_above>
To download:
az storage blob download --account-name sharkblobs --container-name sharkblobs --name <azure path, example: halo-models/llama3_8b/tp1/llama.mlir> --file <local_path_on_computer> --account-key <key_retrieved_from_directions_above>
If you are downloading from "sharkpublic", replace sharkblobs with sharkpublic in the instructions above and get your account access key for sharkpublic. Example:
az storage blob download --account-name sharkpublic --container-name sharkpublic --name ian/llama8b_f16.gguf --file llama8b_f16.gguf --account-key <key string>
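To see what is already uploaded under a given path, a blob listing can be used; the prefix below is only an example:
az storage blob list --account-name sharkblobs --container-name sharkblobs --prefix halo-models/llama3_8b/ --output table --account-key <key string>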
(Note: Do not update this one)
Models | compile | inference (SPX mode) | tracy |
---|---|---|---|
llama3.1-8b-FP16 | PASS | prefill (1746 ms), decode (71.8 ms), commands | prefill decode |
llama3.1-8b-Q4_1 | PASS | prefill (1817 ms), decode (57.3 ms), commands | prefill decode |
llama3.1-8b-Q4_k | PASS | ||
llama3.1-70b-Q4_1 | PASS | prefill (3543 ms), decode (213 ms), commands | prefill decode |
llama2-7b-FP8 | FAIL | ||
grok-1-Q4_1 | PASS | FAIL, out of memory | prefill decode |
(Note: No longer updated)
- Attention Compiler Work
- Dynamic sequence length
- Causal Masking
- Flex attention compilation
- LLaMa 8b prefill and decode
- validated numerically correct
- export
- compiled
- benchmarked
- replicate for larger variants
- Mixtral prefill and decode
- validated numerically correct
- export
- compiled
- benchmarked
- Grok prefill and decode
- validated numerically correct
- export
- compiled
- benchmarked
(Scheduled for deprecation; move any relevant items to the Schedule table at the top)
task | owner | status | next actions |
---|---|---|---|
Sharded LLaMa | boian | In progress | Landing first sharded tests |
Export/Compile LLaMa | kyle | blocked on torch.aten.complex | rob is authoring fix |
LLaMa 8 prefill comparison | rob | layerwise comparison for prefill is normal | handing off tooling to Avi |
LLaMa 8 decode comparison | avi | still investigating cause of numeric issue | reuse rob's tooling to investigate |
FP8 quantized model | dan | finishing results from quark | following up with Giuseppe on new fp8 quantization |
Model evaluation tooling | archana | Perplexity CI nightly running in eager mode | working on getting perplexity with vmfb |
(Note: Update Schedule-Numerics table for llama3.1 artifacts instead of this table (10/20/2024 onwards))
- Small files and MLIR files: check into llm-dev
- Large files: upload to the "halo-models" container on sharkblobs on Azure and put a link to them in the table(s) below
- Very large files: store on the GPU server and note the machine name and file location in the table(s) below
Note: If a link to Azure sharkblobs below gives you an error, either use the az CLI to download (see the section "Accessing sharkblobs on Azure") or click on sharkblobs, then click on "Blob containers", and navigate to the file manually to download it.
Models | FP16 | FP8 | Q4_1 | Q4_K | Attention IRs |
---|---|---|---|---|---|
llama2-7b | irpa mlir | | | | Attention IRs |
llama3-8b | mlir gguf | mlir irpa | mlir gguf | mlir gguf | |
llama3-70b | mlir gguf | mlir irpa | mlir gguf | mlir gguf | |
llama3-405b | mlir gguf | mlir gguf | mlir gguf | ||
grok-1 | mlir gguf | NA | mlir gguf | gguf |
Models | FP16 | FP8 | Q4_1 | Q4_K |
---|---|---|---|---|
llama3.1-8b | ||||
llama3.1-70b | ||||
llama3.1-405b | ||||
grok-1 |