This page is the project tracker for getting halo models such as llama3 and grok-1 working on one or more MI3xx GPUs using SHARK/IREE.
- llama3.1 405B sharded across 8 MI300X GPUs producing correct numerical results (P0)
- llama3.1 405B sharded across 8 MI300X GPUs performant at the level of vLLM PyTorch (Fused Ops Eager Mode) (P1)
(Note: Use llama3.1 8B or 70B to develop and test)
TPn: Tensor Parallel across n GPUs, where large tensors are sharded across the GPUs using sharktank and the scatter/gather to/from the GPUs is expressed in a single MLIR module
(Model is assumed to be llama3.1 in the following table, e.g. "8B FP8" means "llama3.1 8B FP8 model")
Item | 10/18/24 | 10/25/24 | 11/1/24 | 11/8/24 | 11/15/24 |
---|---|---|---|---|---|
Machine and Storage | Two 8x MI300X SPX-mode machines verified working, with how-to-use info added to Nod AI Lab @saienduri (Done: 10/17) | -Install 60TB storage on SharkMi300X (Done: 10/21) -Set up one more 8x air-cooled MI300 machine (SharkMi300X-3) with 60TB added (Done: 10/24) @saienduri | -Set up one more 8x MI300 air-cooled machine (SharkMi300X-4) with 60TB @saienduri -Add 30TB to each of SharkMi300X and SharkMi300X-3 @saienduri | | |
Sharktank Modeling | IREE-compilable 8B FP8 MLIR @dan garvey (Done: 10/17) | -Verify numerics using quant-dequant on CPU vs run on MI300 for 8B FP8 @dan -Get 8B, 70B (Done: 10/23) and 405B FP8 MLIR (ETA: 10/24) and verify (CPU vs MI300) numerics for 70B @dan -Wire up Perplexity flow to run vmfb using iree-run-module (score too high, ETA: 10/24) @archana -Debug 70B running OOM on 1 MI300 @kyle (ETA: 10/25) -Quantized sharding support (ETA: 10/24) @Ian | Regenerate and verify MLIR without decomposition of SDPA for 8B, 70B, 405B for FP16 @kyle | | |
Sharding | 8 CPU core sharded FP16 numerically verified @boian; compilation issue, ETA: 10/25 | 8 GPU sharding for FP16 and FP8 compiling for MI300 @rob/@Ian; sharding verified on CPU, compilation fails, ETA: 10/25 | 8 GPU sharding for FP16 and FP8 numerics verified on MI300 @boian/@rob | | |
IREE code generation | 8B FP16 attention head with dynamic shape generating valid vmfb, ETA: 10/24 @mahesh; FP8 attention (use intrinsic for FP8 effectively), ETA: 10/25 @stanley | Paged Attention @kunwar/@manupa | Perf Tuning (all) | | |
Inference Profiling | Tracy profile 8B FP16 w/ decomposition @kyle (Done: 10/17) | -Tracy profile for 8B FP8 w/ and w/o decomposition @kyle -Tracy profile 405B with 3 attention blocks w/ decomposition @Avi | | | |
Shortfin Serving | llama3.1 8B FP16 IREE-compiled working using shortfin @xida (KV cache is broken, needs help) | | | | |
W/ Serving Inference Performance | llama3.1 8B FP16 IREE-compiled working using shortfin, performance numbers @avi | Performance tuning for sharding @boian/@rob | | | |
Test Automation | -8B FP16 prefill attnhead, decode attnhead, and full-model IREE-compiled perf tests in sharktank CI @avi -8B FP16 IREE-compiled numerics tested using Perplexity @archana | -8B FP8 prefill attnhead, decode attnhead, and full-model IREE-compiled perf tests in sharktank CI @avi -8B FP8 IREE-compiled numerics tested using Perplexity @archana -8 CPU core sharded 8B FP16 numeric test added @boian | 8 GPU sharded 8B FP8 test added @boian | | |
Report dashboard | Show all currently running perf and numeric llama3.1 component and full-model test reports on a page @saienduri | | | | |
Release Packaging/testing | Have a test release with 8B FP16 @chris | Test release with 8B FP8 @chris | | | |
(MI300X GPU, SPX Mode)
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
llama3.1-8B-FP16-decomposed | PASS TP1 mlir gguf irpa | PASS vmfb | PASS | ||
llama3.1-8B-FP16-decomposed-TP8 | PASS (MLIR) | PASS | |||
llama3.1-70B-FP16-decomposed | PASS TP1 mlir gguf irpa | PASS vmfb | FAIL OOM | ||
llama3.1-405B-FP16-decomposed | PASS TP1 mlir gguf | ||||
llama3.1-8B-FP8-decomposed | PASS TP1 mlir irpa | Fails in iree, patch | |||
llama3.1-70B-FP8-decomposed | PASS TP1 mlir irpa | Fails in iree, patch | |||
llama3.1-405B-FP8-decomposed | |||||
llama3.1-8B-FP16 | PASS mlir | Fails in iree, patch | |||
llama3.1-70B-FP16 | PASS mlir | Fails in iree, patch | |||
llama3.1-405B-FP16 | |||||
llama3.1-8B-FP8 | FAIL qkv must have same data type | ||||
llama3.1-70B-FP8 | FAIL qkv must have same data type | ||||
llama3.1-405B-FP8 | FAIL qkv must have same data type | ||||
llama-toy-size-FP32-TP2-CPU | PASS | PASS |
(MI300X GPU, SPX Mode, Time in ms)
Item | 10/25/24 | 11/1/24 | 11/8/24 | 11/15/24 | Target(vLLM-PyTorch) |
---|---|---|---|---|---|
llama3.1-8B-FP16 | |||||
llama3.1-70B-FP16 | |||||
llama3.1-405B-FP16 | |||||
llama3.1-8B-FP8 | |||||
llama3.1-70B-FP8 | |||||
llama3.1-405B-FP8 |
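The latencies in this table are expected to come from benchmarking the IREE-compiled modules; a minimal sketch is shown below, assuming iree-benchmark-module is used. The device URI, entry-point name, and inputs are placeholders that must match the signatures produced by the export step described later on this page:
iree-benchmark-module --device=hip://0 --module=<vmfb file> --parameters=model=<irpa file> --function=prefill_bs1 --input=<tokens> --input=<seq_lens> --input=<seq_block_ids> --input=<cache_state>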
category | issue link | assigned to | status |
---|---|---|---|
iree codegen | 18864 | unassigned | OOM for 70B |
quark quantization | QUARK-71 | unassigned | FP8 matmul should be used in attention |
TBD: Sai, please add a link to the nightly tests that cover any component or the full model of llama3.
Export FP8:
python -m sharktank.examples.export_paged_llm_v1 --irpa-file=native_fp8_e4m3fnuz_llama3_8b.irpa --output-mlir native_fp8_e4m3fnuz_llama3_8b.mlir --no-fake-quant --bs=1 --attention-kernel=torch_sdpa
Compile the exported MLIR to a vmfb for MI300 (gfx942):
iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 <mlir file> -o <vmfb file>
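To sanity-check the compiled module, an invocation along these lines should work (a sketch only: the device URI, entry-point name, and inputs are placeholders that must match the signatures produced by the export step above):
iree-run-module --device=hip://0 --module=<vmfb file> --parameters=model=native_fp8_e4m3fnuz_llama3_8b.irpa --function=prefill_bs1 --input=<tokens> --input=<seq_lens> --input=<seq_block_ids> --input=<cache_state>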
Follow the steps here
In a browser, click on sharkblobs, then click on "Blob containers", and then click on "halo-models".
Or use the command line: first install the az CLI:
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
Then get the account key for the storage account: click on "Storage Accounts" in Azure Services or search for "sharkblobs" in the top search bar, then click on sharkblobs. On the left sidebar, under "Security + networking", click on "Access keys". Copy the account key from there and use it in the following commands. To upload:
az storage blob upload --account-name sharkblobs --container-name sharkblobs --name <azure path, example: halo-models/llama3_8b/tp1/llama.mlir> --file <local_path_on_computer> --account-key <key_retrieved_from_directions_above>
To download:
az storage blob download --account-name sharkblobs --container-name sharkblobs --name <azure path, example: halo-models/llama3_8b/tp1/llama.mlir> --file <local_path_on_computer> --account-key <key_retrieved_from_directions_above>
If you are downloading from "sharkpublic", replace sharkblobs with sharkpublic in the instructions above and get your account access key for sharkpublic. Example:
az storage blob download --account-name sharkpublic --container-name sharkpublic --name ian/llama8b_f16.gguf --file llama8b_f16.gguf --account-key <key string>
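To see what is already uploaded under a given path, a blob listing can be used; the prefix below is only an example:
az storage blob list --account-name sharkblobs --container-name sharkblobs --prefix halo-models/llama3_8b/ --output table --account-key <key string>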
(Note: Do not update this one)
Models | compile | inference (SPX mode) | tracy |
---|---|---|---|
llama3.1-8b-FP16 | PASS | prefill (1746 ms), decode (71.8 ms), commands | prefill decode |
llama3.1-8b-Q4_1 | PASS | prefill (1817 ms), decode (57.3 ms), commands | prefill decode |
llama3.1-8b-Q4_k | PASS | ||
llama3.1-70b-Q4_1 | PASS | prefill (3543 ms), decode (213 ms), commands | prefill decode |
llama2-7b-FP8 | FAIL | ||
grok-1-Q4_1 | PASS | FAIL, out of memory | prefill decode |
(Note: No longer updated)
- Attention Compiler Work
- Dynamic sequence length
- Causal Masking
- Flex attention compilation
- LLaMa 8b prefill and decode
- validated numerically correct
- export
- compiled
- benchmarked
- replicate for larger variants
- Mixtral prefill and decode
- validated numerically correct
- export
- compiled
- benchmarked
- Grok prefill and decode
- validated numerically correct
- export
- compiled
- benchmarked
(Scheduled for deprecation; move any relevant items to the Schedule table at the top)
task | owner | status | next actions |
---|---|---|---|
Sharded LLaMa | boian | In progress | Landing first sharded tests |
Export/Compile LLaMa | kyle | blocked on torch.aten.complex | rob is authoring fix |
LLaMa 8 prefill comparison | rob | layerwise comparison for prefill is normal | handing off tooling to Avi |
LLaMa 8 decode comparison | avi | still investigating cause of numeric issue | reuse rob's tooling to investigate |
FP8 quantized model | dan | finishing results from quark | following up with Giuseppe on new fp8 quantization |
Model evaluation tooling | archana | Perplexity CI nightly running in eager mode | working on getting perplexity with vmfb |
(Note: Update Schedule-Numerics table for llama3.1 artifacts instead of this table (10/20/2024 onwards))
- Small files and MLIR files: check into llm-dev
- Large files: upload to the "halo-models" container on sharkblobs on Azure and put a link to them in the table(s) below
- Very large files: store on the GPU server and note the machine name and file location in the table(s) below
Note: If a link to Azure sharkblobs below gives you an error, either use the az CLI to download (see the section "Accessing sharkblobs on Azure") or click on sharkblobs, then click on "Blob containers", and navigate to the file manually to download it.
Models | FP16 | FP8 | Q4_1 | Q4_K | Attention IRs |
---|---|---|---|---|---|
llama2-7b | irpa mlir | | | | Attention IRs |
llama3-8b | mlir gguf | mlir irpa | mlir gguf | mlir gguf | |
llama3-70b | mlir gguf | mlir irpa | mlir gguf | mlir gguf | |
llama3-405b | mlir gguf | mlir gguf | mlir gguf | ||
grok-1 | mlir gguf | NA | mlir gguf | gguf |
Models | FP16 | FP8 | Q4_1 | Q4_K |
---|---|---|---|---|
llama3.1-8b | ||||
llama3.1-70b | ||||
llama3.1-405b | ||||
grok-1 |