Llama 4 herd is here with Day 0 inference support in vLLM

April 5, 2025
vLLM team at Red Hat
Related topics:
Artificial intelligence, Open source
Related products:
Red Hat AI

  • Meet Llama 4 herd–Scout and Maverick
  • What’s new in Llama 4?
  • Day 0 vLLM support for inferencing Llama 4 models
  • Get started with inferencing Llama 4 in vLLM now
  • Conclusion

Today, Meta released the newest version of its Llama model family, Llama 4, enabling developers to build more personalized multimodal experiences. Thanks to our close collaboration with Meta, the vLLM teams at Red Hat and UC Berkeley have enabled Day 0 model support, meaning you can start inferencing Llama 4 with vLLM today. This is a big day for open source AI, as it shows the true power of vLLM and its robust, collaborative community.

The Llama 4 release brings us two models: Llama 4 Scout and Llama 4 Maverick. Both Scout and Maverick come with BF16 weights, and Maverick also comes with an FP8-quantized version on Day 0. The FP8 code in vLLM and Hugging Face was supported by Meta using Red Hat's open source LLM Compressor, a library for quantizing LLMs for faster and more efficient inference with vLLM.
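For context on how such FP8 checkpoints are produced, LLM Compressor exposes a one-shot quantization flow. The sketch below follows the style of the library's quickstart; the small stand-in model and output directory are illustrative assumptions, so check the LLM Compressor docs for the current API:

from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize weights to FP8 with dynamic per-token activation scales on every
# Linear layer except the output head, then write the compressed checkpoint.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # stand-in; swap in your own model
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic",
)

The resulting checkpoint can then be served directly with vllm serve, as shown later in this post.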

Read on to learn about Llama 4 Scout and Llama 4 Maverick, what’s new in the Llama 4 herd, and how to get started with inferencing in vLLM today.

Meet Llama 4 herd–Scout and Maverick

The Llama 4 release comes with two model variations–Llama 4 Scout and Llama 4 Maverick.

Llama 4 Scout 

Llama 4 Scout is a multimodal model with:

  • 17 billion active parameters
  • 16 experts
  • 109 billion total parameters

Scout delivers industry-leading performance for its class, and it fits on a single NVIDIA H100 node. Scout dramatically increases the supported context length, from 128K tokens in Llama 3 to an industry-leading 10 million tokens. This opens up a world of possibilities, including multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over vast codebases.

Llama 4 Maverick 

Llama 4 Maverick is a general purpose LLM with:

  • 17 billion active parameters
  • 128 experts
  • 400 billion total parameters

Maverick offers higher quality at a lower price compared to Llama 3.3 70B. It brings unparalleled, industry-leading performance in image and text understanding with support for 12 languages, enabling the creation of sophisticated AI applications that bridge language barriers. Maverick is great for precise image understanding and creative writing. For developers, it offers state-of-the-art intelligence at high speed, optimized for the best response quality on tone and refusals.

The official release by Meta includes an FP8-quantized version of Llama 4 Maverick 128E, enabling the 128-expert model to fit on a single 8xH100 NVIDIA node, resulting in more performance at lower cost.

What’s new in Llama 4?

The power of mixture of experts (MoE) architecture

Llama 4 Scout and Llama 4 Maverick are the first of Meta’s models to use a mixture of experts (MoE) architecture. In MoE models, a single token activates only a fraction of the total parameters. MoE architectures are more compute-efficient for model training and inference and, given a fixed training FLOPs budget, deliver higher quality models compared to dense architectures.

To break it down, Llama 4 Maverick has 400 billion total parameters, but at inference time each token is routed internally to an "expert," so only 17B parameters need to be processed per token. Furthermore, Llama 4 Maverick with 128 experts comes with FP8 weights, enabling the model to fit on a single 8xH100 node, resulting in faster and more efficient inference.
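To make the routing concrete, here is a minimal, self-contained PyTorch sketch of a top-1 MoE feed-forward layer. The dimensions, top-1 routing, and expert shapes are simplified assumptions for illustration, not Llama 4's actual configuration:

import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    # Illustrative MoE block: the router sends each token to one expert,
    # so only that expert's parameters are touched for that token.
    def __init__(self, d_model=64, d_ff=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # per-token expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)   # (tokens, n_experts)
        top_score, top_idx = scores.max(dim=-1)   # pick one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                   # tokens routed to expert e
            if mask.any():
                out[mask] = top_score[mask, None] * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])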

For high accuracy recovery, Maverick leverages channelwise weight quantization and dynamic per-token activation quantization, applied in a non-uniform manner. The Red Hat team (led by Eliza Wszola) recently added a CUTLASS-based GroupedGEMM kernel to vLLM. Maverick leverages this kernel, which follows on from our existing work with CUTLASS 3.5.1, explained in our blogs vLLM brings FP8 inference to the open source community and Introducing Machete, a mixed-input GEMM kernel optimized for NVIDIA Hopper GPUs.
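To illustrate what channelwise weight scales and dynamic per-token activation scales mean, here is a simplified PyTorch emulation. It rounds values through torch's float8_e4m3fn dtype to mimic FP8 and is purely illustrative; the production path in vLLM runs fused CUTLASS kernels on the GPU:

import torch

FP8_MAX = 448.0  # largest magnitude representable in float8 e4m3fn

def quantize_weight_channelwise(w):
    # Static: one scale per output channel (row), computed once from the weights.
    scale = w.abs().amax(dim=1, keepdim=True) / FP8_MAX
    wq = (w / scale).to(torch.float8_e4m3fn)  # round onto the FP8 grid
    return wq.to(torch.float32), scale        # back to fp32 so CPU matmul works

def quantize_activation_per_token(x):
    # Dynamic: one scale per token (row), computed on the fly at inference time.
    scale = x.abs().amax(dim=1, keepdim=True) / FP8_MAX
    xq = (x / scale).to(torch.float8_e4m3fn)
    return xq.to(torch.float32), scale

w = torch.randn(128, 64)  # (out_features, in_features)
x = torch.randn(10, 64)   # (tokens, in_features)
wq, w_scale = quantize_weight_channelwise(w)
xq, x_scale = quantize_activation_per_token(x)

y = (xq @ wq.T) * x_scale * w_scale.T  # rescale after the low-precision matmul
print((y - x @ w.T).abs().max())       # small quantization error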

Adoption of early fusion multimodality

The Llama 4 models are built with native multimodality, incorporating early fusion to seamlessly integrate text and vision tokens into a unified model backbone. Unlike previous Llama models, Scout and Maverick don’t freeze the text parameters or use separate multimodal parameters while training with images and videos. Early fusion is a major step forward, since it enables the joint pre-training of models with large amounts of unlabeled text, image, and video data.
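As a concrete, heavily simplified picture of early fusion, the sketch below projects image patch features into the same embedding space as text tokens and concatenates both modalities into one sequence for a single shared backbone. Every dimension here is an illustrative assumption:

import torch
import torch.nn as nn

d_model = 64
text_embed = nn.Embedding(1000, d_model)  # token IDs -> shared embedding space
image_proj = nn.Linear(768, d_model)      # patch features -> same space

text_ids = torch.randint(0, 1000, (1, 12))  # 12 text tokens
patches = torch.randn(1, 9, 768)            # 9 image patch feature vectors

# Early fusion: both modalities enter one jointly trained backbone as a
# single sequence, rather than a frozen text model plus a separate vision tower.
sequence = torch.cat([image_proj(patches), text_embed(text_ids)], dim=1)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
print(backbone(sequence).shape)  # torch.Size([1, 21, 64])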

Protecting developers against severe risks

Meta’s hope for Llama 4 is to develop the most helpful, useful models for developers while protecting against and mitigating the most severe risks. Llama 4 models are built with the best practices outlined in Meta’s Developer Use Guide: AI Protections. This includes integrating mitigations at each layer of model development, from pre-training to post-training, along with tunable system-level mitigations that shield developers from adversarial users. In doing so, the Meta team empowers developers to create helpful, safe, and adaptable experiences for their Llama-supported applications.

Day 0 vLLM support for inferencing Llama 4 models

As the leading commercial contributor to vLLM, Red Hat is excited that Meta has selected vLLM to support the immediate inferencing of Llama 4 models. This is no surprise to us. Originally developed at UC Berkeley, vLLM has become the de facto standard for open source inference serving, with 44,000 GitHub stars and approaching one million weekly PyPI installs.

The vLLM community’s close collaboration with Meta during the pre-release process ensures developers can deploy the latest models as soon as they are available.

Get started with inferencing Llama 4 in vLLM now

You can install vLLM seamlessly using pip:

pip install -U vllm

Once installed, you can run a simple command to serve any of the models in the Llama 4 family:

vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 430000

The model in the above example is the FP8-quantized version of Llama 4 Maverick. You can experiment with other Llama 4 model variations by pointing to the appropriate model stub on Hugging Face or to one of Red Hat's quantized models.
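Once the server is running, it exposes vLLM's OpenAI-compatible API, so you can query it with any OpenAI client. For example, in Python (the localhost URL assumes the default serving port):

from openai import OpenAI

# vLLM serves an OpenAI-compatible endpoint on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[{"role": "user", "content": "Explain mixture of experts in two sentences."}],
)
print(response.choices[0].message.content)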

Conclusion

The release of the Llama 4 herd marks a pivotal moment in the world of open source AI. With the combination of a mixture of experts architecture and early fusion multimodality, these models enable developers to build more personalized multimodal experiences.

By partnering with the vLLM community, Meta is ensuring developers can take advantage of Llama 4 models immediately, with a focus on performance and lower deployment costs.

Red Hat is proud to be a top commercial contributor to vLLM, driving these innovations forward and empowering the community with open, efficient, and scalable AI solutions. For more information and further details on getting started with vLLM, visit the GitHub repository.

Learn how to get started with inferencing inside OpenShift AI: Llama 4 Herd is here and already works with Red Hat OpenShift AI

Last updated: April 10, 2025
