Llama 4 herd is here with Day 0 inference support in vLLM

April 5, 2025
vLLM team at Red Hat
Related topics:
Artificial intelligence, Open source
Related products:
Red Hat AI

  • Meet Llama 4 herd–Scout and Maverick
  • What’s new in Llama 4?
  • Day 0 vLLM support for inferencing Llama 4 models
  • Get started with inferencing Llama 4 in vLLM now
  • Conclusion

Today, Meta released the newest version of its Llama model family, Llama 4, enabling developers to build more personalized multimodal experiences. Thanks to our close collaboration with Meta, the vLLM teams at Red Hat and UC Berkeley have enabled Day 0 model support, meaning you can start inferencing Llama 4 with vLLM today. This is a big day for open source AI, as it shows the true power of vLLM and its robust, collaborative community.

The Llama 4 release brings us two models: Llama 4 Scout and Llama 4 Maverick. Both Scout and Maverick come with BF16 weights, and Maverick also comes with an FP8-quantized version on Day 0. The FP8 code in vLLM and Hugging Face was supported by Meta using Red Hat's open source LLM Compressor, a library for quantizing LLMs for faster and more efficient inference with vLLM.
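For context on how such FP8 checkpoints are produced, LLM Compressor exposes a one-shot quantization flow. The sketch below follows the style of the library's quickstart; the small stand-in model and output directory are illustrative assumptions, so check the LLM Compressor docs for the current API:

from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize weights to FP8 with dynamic per-token activation scales on every
# Linear layer except the output head, then write the compressed checkpoint.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # stand-in; swap in your own model
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic",
)

The resulting checkpoint can then be served directly with vllm serve, as shown later in this post.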

Read on to learn about Llama 4 Scout and Llama 4 Maverick, what’s new in the Llama 4 herd, and how to get started with inferencing in vLLM today.

Meet Llama 4 herd–Scout and Maverick

The Llama 4 release comes with two model variations–Llama 4 Scout and Llama 4 Maverick.

Llama 4 Scout 

Llama 4 Scout is a multimodal model with:

  • 17 billion active parameters
  • 16 experts
  • 109 billion total parameters

Scout delivers industry-leading performance for its class, and it fits on a single NVIDIA H100 node. Scout dramatically increases the supported context length, from 128K tokens in Llama 3 to an industry-leading 10 million tokens. This opens up a world of possibilities, including multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over vast codebases.

Llama 4 Maverick 

Llama 4 Maverick is a general purpose LLM with:

  • 17 billion active parameters
  • 128 experts
  • 400 billion total parameters

Maverick offers higher quality at a lower price compared to Llama 3.3 70B. It brings unparalleled, industry-leading performance in image and text understanding with support for 12 languages, enabling the creation of sophisticated AI applications that bridge language barriers. Maverick is great for precise image understanding and creative writing. For developers, it offers state-of-the-art intelligence at high speed, optimized for the best response quality on tone and refusals.

The official release by Meta includes an FP8-quantized version of Llama 4 Maverick 128E, enabling the 128-expert model to fit on a single 8xH100 NVIDIA node, resulting in more performance at lower cost.

What’s new in Llama 4?

The power of mixture of experts (MoE) architecture

Llama 4 Scout and Llama 4 Maverick are the first of Meta’s models to use a mixture of experts (MoE) architecture. In MoE models, a single token activates only a fraction of the total parameters. MoE architectures are more compute-efficient for model training and inference and, given a fixed training FLOPs budget, deliver higher quality models compared to dense architectures.

To break it down, Llama 4 Maverick has 400 billion total parameters, but at inference time each token is routed internally to an "expert," so only 17B parameters need to be processed per token. Furthermore, Llama 4 Maverick with 128 experts comes with FP8 weights, enabling the model to fit on a single 8xH100 node, resulting in faster and more efficient inference.
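To make the routing concrete, here is a minimal, self-contained PyTorch sketch of a top-1 MoE feed-forward layer. The dimensions, top-1 routing, and expert shapes are simplified assumptions for illustration, not Llama 4's actual configuration:

import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    # Illustrative MoE block: the router sends each token to one expert,
    # so only that expert's parameters are touched for that token.
    def __init__(self, d_model=64, d_ff=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # per-token expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)   # (tokens, n_experts)
        top_score, top_idx = scores.max(dim=-1)   # pick one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                   # tokens routed to expert e
            if mask.any():
                out[mask] = top_score[mask, None] * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])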

For high accuracy recovery, Maverick leverages channelwise weight quantization and dynamic per-token activation quantization, applied in a non-uniform manner. The Red Hat team (led by Eliza Wszola) recently added a CUTLASS-based GroupedGEMM kernel to vLLM. Maverick leverages this kernel, which follows on from our existing work with CUTLASS 3.5.1, explained in our blogs vLLM brings FP8 inference to the open source community and Introducing Machete, a mixed-input GEMM kernel optimized for NVIDIA Hopper GPUs.
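To illustrate what channelwise weight scales and dynamic per-token activation scales mean, here is a simplified PyTorch emulation. It rounds values through torch's float8_e4m3fn dtype to mimic FP8 and is purely illustrative; the production path in vLLM runs fused CUTLASS kernels on the GPU:

import torch

FP8_MAX = 448.0  # largest magnitude representable in float8 e4m3fn

def quantize_weight_channelwise(w):
    # Static: one scale per output channel (row), computed once from the weights.
    scale = w.abs().amax(dim=1, keepdim=True) / FP8_MAX
    wq = (w / scale).to(torch.float8_e4m3fn)  # round onto the FP8 grid
    return wq.to(torch.float32), scale        # back to fp32 so CPU matmul works

def quantize_activation_per_token(x):
    # Dynamic: one scale per token (row), computed on the fly at inference time.
    scale = x.abs().amax(dim=1, keepdim=True) / FP8_MAX
    xq = (x / scale).to(torch.float8_e4m3fn)
    return xq.to(torch.float32), scale

w = torch.randn(128, 64)  # (out_features, in_features)
x = torch.randn(10, 64)   # (tokens, in_features)
wq, w_scale = quantize_weight_channelwise(w)
xq, x_scale = quantize_activation_per_token(x)

y = (xq @ wq.T) * x_scale * w_scale.T  # rescale after the low-precision matmul
print((y - x @ w.T).abs().max())       # small quantization error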

Adoption of early fusion multimodality

The Llama 4 models are built with native multimodality, incorporating early fusion to seamlessly integrate text and vision tokens into a unified model backbone. Unlike previous Llama models, Scout and Maverick don’t freeze the text parameters or use separate multimodal parameters while training with images and videos. Early fusion is a major step forward, since it enables the joint pre-training of models with large amounts of unlabeled text, image, and video data.
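As a concrete, heavily simplified picture of early fusion, the sketch below projects image patch features into the same embedding space as text tokens and concatenates both modalities into one sequence for a single shared backbone. Every dimension here is an illustrative assumption:

import torch
import torch.nn as nn

d_model = 64
text_embed = nn.Embedding(1000, d_model)  # token IDs -> shared embedding space
image_proj = nn.Linear(768, d_model)      # patch features -> same space

text_ids = torch.randint(0, 1000, (1, 12))  # 12 text tokens
patches = torch.randn(1, 9, 768)            # 9 image patch feature vectors

# Early fusion: both modalities enter one jointly trained backbone as a
# single sequence, rather than a frozen text model plus a separate vision tower.
sequence = torch.cat([image_proj(patches), text_embed(text_ids)], dim=1)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
print(backbone(sequence).shape)  # torch.Size([1, 21, 64])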

Protecting developers against severe risks

Meta’s hope for Llama 4 is to develop the most helpful, useful models for developers while protecting against and mitigating the most severe risks. Llama 4 models are built with the best practices outlined in Meta’s Developer Use Guide: AI Protections. This includes integrating mitigations at each layer of model development, from pre-training to post-training, along with tunable system-level mitigations that shield developers from adversarial users. In doing so, the Meta team empowers developers to create helpful, safe, and adaptable experiences for their Llama-supported applications.

Day 0 vLLM support for inferencing Llama 4 models

As the leading commercial contributor to vLLM, Red Hat is excited that Meta has selected vLLM to support the immediate inferencing of Llama 4 models. This is no surprise to us. Originally developed at UC Berkeley, vLLM has become the de facto standard for open source inference serving, with 44,000 GitHub stars and approaching one million weekly PyPI installs.

The vLLM community’s close collaboration with Meta during the pre-release process ensures developers can deploy the latest models as soon as they are available.

Get started with inferencing Llama 4 in vLLM now

You can install vLLM seamlessly using pip:

pip install -U vllm

Once installed, you can run a simple command to serve any of the models in the Llama 4 family:

vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 430000

The model in the above example is the FP8-quantized version of Llama 4 Maverick. You can experiment with other Llama 4 model variations by pointing to the appropriate model stub on Hugging Face or to one of Red Hat's quantized models.
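Once the server is running, it exposes vLLM's OpenAI-compatible API, so you can query it with any OpenAI client. For example, in Python (the localhost URL assumes the default serving port):

from openai import OpenAI

# vLLM serves an OpenAI-compatible endpoint on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[{"role": "user", "content": "Explain mixture of experts in two sentences."}],
)
print(response.choices[0].message.content)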

Conclusion

The release of the Llama 4 herd marks a pivotal moment in the world of open source AI. With the combination of a mixture of experts architecture and early fusion multimodality, these models enable developers to build more personalized multimodal experiences.

By partnering with the vLLM community, Meta is ensuring developers can take advantage of Llama 4 models immediately, with a focus on performance and lower deployment costs.

Red Hat is proud to be a top commercial contributor to vLLM, driving these innovations forward and empowering the community with open, efficient, and scalable AI solutions. For more information and further details on getting started with vLLM, visit the GitHub repository.

Learn how to get started with inferencing inside OpenShift AI: Llama 4 Herd is here and already works with Red Hat OpenShift AI

Last updated: April 10, 2025
