These end-to-end pipelines demonstrate the power of MAX for accelerating common AI workloads, and more. Each supported pipeline can be served via an OpenAI-compatible endpoint.
MAX can also serve most PyTorch-based large language models available on Hugging Face, although not at the same performance as native MAX Graph implementations.
Highly optimized MAX Graph implementations exist for several core model architectures. These include:
- Llama 3.1: A text completion pipeline using the Llama 3.1 model, implemented with the MAX Graph API. This pipeline contains everything needed to run a self-hosted large language model in the `LlamaForCausalLM` family with state-of-the-art serving throughput.
- Mistral: Support for the `MistralForCausalLM` family of text completion models, using the Mistral NeMo 12B model by default. This pipeline has been tuned for performance using the MAX Graph API.
- Replit Code: Code generation via the Replit Code V1.5 3B model, implemented using the MAX Graph API.
- DeepSeek Coder: Code generation via the DeepSeek Coder V1.5 7B model, implemented using the MAX Graph API.
Instructions for running each pipeline, along with all configuration parameters, can be found in its respective subdirectory. A shared driver is used to execute the pipelines.
The easiest way to try out any of the pipelines is with our Magic command-line tool.
- Install Magic on macOS and Ubuntu with this command:

  ```sh
  curl -ssL https://magic.modular.com | bash
  ```

  Then run the `source` command that's printed in your terminal.

  To see the available commands, you can run `magic --help`. Learn more about Magic here.

- Clone the MAX examples repository:

  If you don't already have a local clone of this repository, create one via:

  ```sh
  git clone https://github.com/modularml/max.git
  ```

  The following instructions assume that you're working within this directory; you can change to it after cloning:

  ```sh
  cd max/pipelines/python/
  ```
- Now run one of the text completion demos with any of the following commands:

  ```sh
  magic run llama3 --prompt "I believe the meaning of life is"
  magic run replit --prompt "def fibonacci(n):"
  magic run mistral --prompt "Why is the sky blue?"
  ```
- Host a chat completion endpoint via MAX Serve.

  MAX Serve provides functionality to host performant OpenAI-compatible endpoints using the FastAPI framework.

  You can configure a pipeline to be hosted by using the `--serve` argument. For example:

  ```sh
  magic run llama3 --serve
  ```

  A request can be submitted via a cURL command:

  ```sh
  curl -N http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "modularai/llama-3.1",
      "stream": true,
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
      ]
    }'
  ```
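Because the endpoint is OpenAI-compatible, you can also query it from standard OpenAI client libraries. Below is a minimal sketch using the `openai` Python package (installed separately, e.g. `pip install openai`), assuming the server was started as shown above and is listening on port 8000; adjust the base URL and model name to match your setup.

```python
from openai import OpenAI

# Point the OpenAI client at the locally hosted MAX Serve endpoint.
# The API key is not checked by the local server, but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Stream a chat completion, mirroring the cURL request above.
stream = client.chat.completions.create(
    model="modularai/llama-3.1",
    stream=True,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
)

# Print tokens as they arrive.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```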
Additionally, fine-tuned weights hosted on Hugging Face can be used with one of these optimized pipeline architectures when serving via the `serve` command:

```sh
magic run serve --huggingface-repo-id=modularai/llama-3.1
```
If you provide a repository ID for a Hugging Face large language model that does not currently have an optimized MAX Graph implementation, MAX falls back to serving a PyTorch eager version of the model.
The following table lists the model architectures tested to work with MAX.
| Architecture      | Example Model                            |
|-------------------|------------------------------------------|
| AquilaForCausalLM | BAAI/Aquila-7B                           |
| ChatGLMModel      | THUDM/chatglm3-6b                        |
| GPT2LMHeadModel   | openai-community/gpt2                    |
| GPTJForCausalLM   | EleutherAI/gpt-j-6b                      |
| LlamaForCausalLM  | meta-llama/Llama-3.2-3B-Instruct         |
| LlamaForCausalLM  | Skywork/Skywork-o1-Open-Llama-3.1-8B     |
| LlamaForCausalLM  | deepseek-ai/deepseek-coder-1.3b-instruct |
| PhiForCausalLM    | microsoft/phi-2                          |
| Phi3ForCausalLM   | microsoft/Phi-3-mini-4k-instruct         |
| Qwen2ForCausalLM  | Qwen/Qwen2.5-1.5B-Instruct               |
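For example, to serve one of these models through the PyTorch fallback path, you can pass its repository ID to the same `serve` command shown earlier (microsoft/phi-2 is used here purely as an illustration; any repository ID from the table should work the same way):

```sh
magic run serve --huggingface-repo-id=microsoft/phi-2
```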