Summary:
- Meta released its newest version of the Llama model family this weekend: Llama 4
- As a leading commercial contributor to the vLLM project, Red Hat collaborated with the Meta team to help enable Day Zero support for Llama 4 inference with vLLM
- Furthermore, the Red Hat OpenShift AI team has enabled our customers to experiment with Llama 4 using the latest release of vLLM inside their OpenShift AI environments.
Read on to get started with inferencing the Llama 4 Herd inside OpenShift AI!
Over the April 5 weekend, Meta released its newest version of the Llama model family, Llama 4, enabling developers to build more personalized multimodal experiences. Thanks to our close collaboration with Meta, the vLLM team from Red Hat and UC Berkeley enabled Day 0 model support in vLLM, meaning you can start inferencing Llama 4 with vLLM today. This is a big day for open source AI, as it shows the true power of vLLM and its robust, collaborative community.
What’s more, the Red Hat OpenShift AI team has enabled our customers to experiment with Llama 4 inside their OpenShift AI environments starting today.
Read on to learn about the Llama 4 Herd and how to get started with inferencing Llama 4 with vLLM inside OpenShift AI today.
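If you already have a GPU node with a Llama 4-capable vLLM build (0.8.3 or later), a minimal offline-inference sketch looks like the following. The model choice, parallelism, and context length are illustrative assumptions and should be adjusted to your hardware:

# Minimal vLLM offline inference sketch (assumes vLLM >= 0.8.3 with Llama 4 support
# and a node with enough GPU memory for the chosen model).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # illustrative model choice
    tensor_parallel_size=8,       # spread the model across 8 GPUs
    max_model_len=128000,         # keep the context modest for a first test
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the Llama 4 release in two sentences."], params)
print(outputs[0].outputs[0].text)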
Meet the Llama 4 Herd: Scout and Maverick
The Llama 4 release brings us two models: Llama 4 Scout and Llama 4 Maverick. Both Scout and Maverick come with BF16 weights, and Maverick is additionally available in an FP8-quantized version on Day 0. The FP8 support in vLLM and Hugging Face was enabled by Meta using Red Hat’s open source LLM Compressor, a library for quantizing LLMs for faster and more efficient inference with vLLM.
Llama 4 Scout
Llama 4 Scout is a multimodal model with:
- 17 billion active parameters
- 16 experts
- 109 billion total parameters.
Scout delivers industry-leading performance for its class, and it fits on a single NVIDIA H100 node. Scout dramatically increases the supported context length from 128K tokens in Llama 3 to an industry-leading 10 million tokens. This opens up a world of possibilities, including multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over vast codebases.
Llama 4 Maverick
Llama 4 Maverick is a general purpose LLM with:
- 17 billion active parameters
- 128 experts
- 400 billion total parameters
Maverick offers higher quality at a lower price compared to Llama 3.3 70B. It brings unparalleled, industry-leading performance in image and text understanding with support for 12 languages, enabling the creation of sophisticated AI applications that bridge language barriers. Maverick is great for precise image understanding and creative writing. For developers, it offers state-of-the-art intelligence with high speed, optimized for the best response quality on tone and refusals.
The official release by Meta includes an FP8-quantized version of Llama 4 Maverick 128E, enabling the 128-expert model to fit on a single NVIDIA 8xH100 node, resulting in more performance at lower cost.
What’s New in Llama 4?
The Power of Mixture of Experts (MoE) Architecture
Llama 4 Scout and Llama 4 Maverick are the first of Meta’s models that use a mixture of experts (MoE) architecture. In MoE models, a single token activates only a fraction of the total parameters. MoE architectures are more compute efficient for model training and inference and, given a fixed training FLOPs budget, deliver higher quality models compared to dense architectures.
To break it down, Llama 4 Maverick has 400 billion total parameters, but at inference time each token is routed to an "expert," so only 17B parameters need to be processed per token. Furthermore, Llama 4 Maverick with 128 experts comes with FP8 weights, enabling the model to fit on a single 8xH100 node, resulting in faster and more efficient inference.
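To make the routing idea concrete, here is a small, self-contained sketch of top-k expert routing. The layer sizes and top_k value are illustrative and far smaller than Llama 4's actual configuration; this is not Meta's implementation, just the general MoE pattern:

# Toy mixture-of-experts layer: each token is routed to only top_k of the experts,
# so only a fraction of the total parameters is used per token.
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, hidden=64, num_experts=16, top_k=1):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts)   # scores each expert per token
        self.experts = nn.ModuleList(
            [nn.Linear(hidden, hidden) for _ in range(num_experts)]
        )
        self.top_k = top_k

    def forward(self, x):                              # x: [tokens, hidden]
        scores = self.router(x)                        # [tokens, num_experts]
        weights, idx = torch.topk(scores.softmax(dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(8, 64)
print(ToyMoE()(tokens).shape)                          # torch.Size([8, 64])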
For high accuracy recovery, Maverick leverages channelwise weight quantization and dynamic per-token activation quantization, applied in a non-uniform manner using LLM Compressor. The Red Hat team (led by Eliza Wszola) recently added a CUTLASS-based kernel for GroupedGEMM in vLLM. Maverick leverages this kernel, a follow-on to the existing work we have done with CUTLASS 3.5.1, explained in our blogs vLLM brings FP8 inference to the open source community and Introducing Machete, a mixed-input GEMM kernel optimized for NVIDIA Hopper GPUs.
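To illustrate what those two granularities mean, here is a conceptual sketch of channelwise weight scales versus dynamic per-token activation scales. This is not LLM Compressor's implementation; it only simulates the FP8 dynamic range in plain floats:

# Conceptual sketch of the two quantization granularities mentioned above.
# A real pipeline would round and cast to torch.float8_e4m3fn; here we only
# compute the scales to show where they come from.
import torch

FP8_MAX = 448.0  # max representable magnitude of float8 e4m3

def quantize_weights_channelwise(w):
    # One scale per output channel (row of the weight matrix), computed offline.
    scale = w.abs().amax(dim=1, keepdim=True) / FP8_MAX        # [out_channels, 1]
    return torch.clamp(w / scale, -FP8_MAX, FP8_MAX), scale

def quantize_activations_per_token(x):
    # One scale per token (row of the activation matrix), computed at runtime.
    scale = x.abs().amax(dim=1, keepdim=True) / FP8_MAX        # [tokens, 1]
    return torch.clamp(x / scale, -FP8_MAX, FP8_MAX), scale

w = torch.randn(32, 64)   # illustrative weight matrix
x = torch.randn(8, 64)    # illustrative batch of token activations
wq, w_scale = quantize_weights_channelwise(w)
xq, x_scale = quantize_activations_per_token(x)

# Dequantized matmul approximates the original float result.
y_approx = (xq * x_scale) @ (wq * w_scale).T
print(torch.allclose(y_approx, x @ w.T, atol=1e-3))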
Adoption of Early Fusion Multimodality
The Llama 4 models are built with native multimodality, incorporating early fusion to seamlessly integrate text and vision tokens into a unified model backbone. Unlike previous Llama models, Scout and Maverick don’t freeze the text parameters or use separate multimodal parameters while training with images and videos. Early fusion is a major step forward, since it enables the joint pre-training of models with large amounts of unlabeled text, image, and video data.
Protecting Developers Against Severe Risks
Meta’s plan for Llama 4 is to develop the most helpful, useful models for developers while protecting against and mitigating the most severe risks. Llama 4 models are built with the best practices outlined in Meta’s Developer Use Guide: AI Protections. This includes integrating mitigations at each layer of model development from pre-training to post training and tunable system-level mitigations that shield developers from adversarial users. In doing so, the Meta team empowers developers to create helpful, safe and adaptable experiences for their Llama supported applications.
Llama 4 in OpenShift AI: Stable? Latest? Why choose?
Red Hat OpenShift AI comes in three distinct release types: Fast, Stable, and EUS. However, thanks to its flexibility, we are able to incorporate some of the latest developments in AI into it, regardless of version!
In the following sections, we will explore how any of our Red Hat OpenShift customers can get a preview of what the future updates will look like in OpenShift AI, and more specifically, those that will enable the deployment of the recently released Llama 4 models. This process extends the supported footprint of OpenShift AI with very recently released upstream community releases, which is a bit more involved than simply deploying the stable and supported releases of vLLM that are included out of the box. A future OpenShift AI release will include updated vLLM out of the box to enable Llama 4 deployment.
Llama 4 on OpenShift AI Deployment - Ingredients and high-level steps
In order to deploy Llama 4 in an OpenShift AI cluster, you will need these key ingredients:
- An OpenShift Cluster with OpenShift AI (Version 2.13 or above).
- Fast enough access to the Internet (more specifically, to HuggingFace).
- A node in your cluster with:
- Enough VRAM to host the model of your choice. We detail here the use of the Maverick FP8 model, which will use about 75GB of VRAM on each of 8xH100 GPUs (see the rough sizing sketch after this list).
- 200 GB of RAM and 16 CPUs.
- Around 500 GB of free space, or more, on the root partition to support unpacking the model to a Kubernetes emptyDir volume.
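As a rough sanity check on the VRAM figure above, here is a back-of-the-envelope sizing sketch. The overhead figure is an illustrative assumption, not a measurement:

# Back-of-the-envelope VRAM estimate for Llama 4 Maverick FP8 on 8 GPUs.
# The KV-cache/overhead figure below is an illustrative assumption.
total_params = 400e9          # total parameters in Maverick
bytes_per_param = 1           # FP8 weights are 1 byte each
num_gpus = 8

weights_per_gpu_gb = total_params * bytes_per_param / num_gpus / 1e9
kv_cache_and_overhead_gb = 25  # assumed budget for KV cache, activations, CUDA graphs

print(f"weights per GPU:  ~{weights_per_gpu_gb:.0f} GB")
print(f"estimated total:  ~{weights_per_gpu_gb + kv_cache_and_overhead_gb:.0f} GB per GPU")
# -> ~50 GB of weights plus overhead, in line with the ~75 GB per H100 noted above.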
Note that for our example code, we’re specifically using a node with 8xNVIDIA H100s and have completed the prerequisite steps of installing and configuring Node Feature Discovery and the Certified NVIDIA GPU Operator. If you are using different GPUs, you may have to make changes to some of the manifests we detail here, including potentially using a different vLLM runtime image.
Also note that this blog post should not be construed as a statement of support for the model, or for using the upstream vLLM image as a ServingRuntime, but rather as a recipe for how our customers can achieve their goals of Private AI on Red Hat’s platform, including running the very latest models.
At a very high level, the process to do so has four broad steps:
- Add a Custom Runtime Definition (in YAML)
- Download the model(s) to Object Storage
- Deploy the model
- Use a method of interacting with the deployed model!
Llama 4 on OpenShift AI Deployment - Detailed steps
Adding a Custom Serving Runtime to Red Hat OpenShift AI
As of this writing, the version of the vLLM serving runtime provided with OpenShift AI is not (yet!) compatible with Llama 4. But fear not! That is exactly why we make allowance for the addition of Custom Runtimes in our OpenShift AI platform!
Because this is customer-written or community-supported code, we support your ability to add a custom runtime, but we do not provide support for the custom runtime itself.
Here are the steps to follow:
- Log in to your OpenShift AI Dashboard as an OpenShift AI user.
- Navigate to Settings -> Serving Runtimes
- Click "Add Serving Runtime"
- Select Single-Model serving platform, and REST
- Click Start from Scratch
- Paste the following YAML code into that text box:
---
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
    openshift.io/display-name: Custom vLLM 0.8.3 Runtime - 2025-04-04
  labels:
    opendatahub.io/dashboard: "true"
  name: vllm-runtime-2025-04-04
spec:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"
  containers:
    - args:
        - --port=8080
        - --model=/mnt/models
        - --served-model-name={{.Name}}
        - --max-model-len=430000
        - --tensor-parallel-size=8
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      image: quay.io/vllm/vllm@sha256:14e86c0d58faaf94d3cf0ef77b64b8dff70c5cbb4a3529dabdfc61362681c0c6
      name: kserve-container
      ports:
        - containerPort: 8080
          protocol: TCP
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
- Click “Create”
- Confirm Runtime has been created:
Downloading the model to an Object Storage Bucket
While you’re not required to explicitly use object storage to serve models in OpenShift AI, it does make things a bit simpler in terms of scalability and flexibility. It’s also nice to have a static copy that’s local to the cluster after a lengthy download process, so you don’t have to reach out to the internet each time you need to restart your model.
With CLI/Code
For the CLI warriors out there, we’re sure you already have a preferred way of downloading models from HuggingFace and uploading them into Object Storage, so we won’t be too descriptive about that method.
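If you do want a scripted starting point, here is a minimal sketch using huggingface_hub and boto3. The bucket name, endpoint, local directory, and environment variable names are placeholders to replace with your own:

# Download a Llama 4 checkpoint from Hugging Face, then upload it to an S3 bucket.
# Bucket name, endpoint URL, and local directory below are placeholders.
import os
import boto3
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    local_dir="/tmp/llama4-maverick",     # needs several hundred GB of free space
    token=os.environ["HF_TOKEN"],         # gated model: requires an accepted license
)

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["AWS_S3_ENDPOINT"],          # e.g. your MinIO/ODF endpoint
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

for root, _, files in os.walk(local_dir):
    for name in files:
        path = os.path.join(root, name)
        key = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8/" + os.path.relpath(path, local_dir)
        s3.upload_file(path, "models", key)              # "models" is a placeholder bucket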
With a custom Workbench
For those slightly less used to dealing with models, HuggingFace, and object storage, we do have a small community-provided Workbench image that we created in part for this purpose.
We'll give high-level instructions on how to use it to download the model from HuggingFace to object storage.
In case you don't have readily available object storage, this tutorial can help you deploy your own for quick prototyping.
Here is how to add the Custom Workbench Image:
- Log in to the OpenShift AI Dashboard as an OpenShift or OpenShift AI Admin
- Navigate to Settings, then to Notebook Images
- Click Import new image
- Enter the image location as: quay.io/rh-aiservices-bu/odh-tec:1.2.1
- Enter the Name as: Custom: ODH-TEC v1.2.1
- Click Import
- Confirm proper creation:
And now, here is how to use this ODH-TEC Workbench to help download the models into object storage
- In OpenShift AI, Create a new project called "llama4"
- In that project, create an Object Storage Connection that points to your bucket
- Create a connection of type "S3"
- Fill out all the required fields:
- Confirm it worked:
- Now that the connection for our bucket exists, we will create a Workbench that uses it.
- Here are the steps:
- Navigate to the Project, and Create a Workbench:
- Give your workbench a name, and select the previously added Custom Image:
- Note that this workbench does not need to have too much storage:
- Attach an existing connection:
- Select the previously created connection and attach it:
- Create the workbench:
- Wait for the workbench to start:
- Once it's started, click on the "open" link:
- Authorize access by clicking “Allow selected permissions:”
- Accept the disclaimer:
- Navigate to Settings, then HuggingFace Settings, and paste your HuggingFace token.
- After that, click Test Connection, and if it works, click Save HuggingFace Settings.
- Once done, navigate back to S3 Tools, Object Browser, select your bucket, and then click Import HF Model:
- In the popup, paste the path to the model on HuggingFace, in this example meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8:
- The model download will start when you click Import:
- Now, wait patiently for everything to be downloaded. Your network performance and the speed of your chosen S3 provider’s backing store will impact how long it takes.
Deploy the Model
- Ensure you are logged into OpenShift AI Dashboard
- Ensure you are in the project created earlier, called llama4
- Select "Single-model model serving"
- Click "Deploy model"
- Name: llama4-maverick
- Select the newly created runtime called "Custom vLLM 0.8.3 Runtime - 2025-04-04"
- Select 1 replica
- Custom size:
- CPUs: requested 8, limit 16
- Memory: requested 256GB, limit 384GB
- Accelerator: NVIDIA H100
- Number of accelerators: 8
- Check the boxes to expose the model through a Route, and enable token authentication, as shown below.
- Use the same existing data connection you used for the ODH TEC workbench as the model source:
- State the path: /meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
- Configure the parameters, overriding the served-model-name to set the real one: --served-model-name=meta-llama/Llama-4-Maverick-17B-128E
- Hit the "Deploy" button.
- Wait… This will take a while, especially if you have slow storage on the node.
- If you want to watch the pod’s logs to see the model being downloaded from S3 and loaded into memory, toggle over to the OpenShift Console.
- In the navigation bar on the left, in the “Workloads” section, select “Pods”.
- Navigate to the project you created above, llama4, using the Project pulldown at the top.
- Click on the name of the Pod that starts with “llama4-maverick-predictor”.
- Select “Logs” in the tabs at the top.
- If you just deployed the model, you can use the pulldown to select the Init Container named “storage-initializer” to see the model being downloaded from object storage.
- It may also be useful to look at the Events tab, if the vLLM image is still pulling (it’s quite large)
- The logs in the kserve-container Container will show the model being loaded and served by vLLM.
- When the Pod shows ready, or the Model Server shows a green checkbox on the right, your model is serving and ready to receive requests.
At this point, the model is technically running. But if a model is alone in the forest and no-one sends any questions to it, is it really running?
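Before moving on to a chat interface, you can send a quick test request against the OpenAI-compatible endpoint exposed by vLLM. Here is a minimal sketch; the route URL and token are placeholders for the external endpoint and token from your own deployment:

# Quick smoke test against the model's OpenAI-compatible endpoint.
# Replace base_url and api_key with the external route and token
# from your llama4-maverick deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://llama4-maverick-llama4.apps.yourcluster.com/v1",  # placeholder route
    api_key="your-token-here",                                          # placeholder token
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E",   # must match --served-model-name
    messages=[{"role": "user", "content": "In one sentence, what can you do?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)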
(Bonus) Use an AnythingLLM workbench to access Llama 4
AnythingLLM is an open source tool for interacting with language models, and it includes an embedded vector database for drag-and-drop RAG queries. It’s usually installed as a desktop application, but it’s built using the Chromium-based Electron framework, so under the hood it’s actually a website.
- Add another custom workbench image in your environment (by heading to “Settings” and “Notebook images” in the RHOAI user interface as an administrator).
- Enter the image location as: quay.io/rh-aiservices-bu/anythingllm-workbench:1.7.5
- Enter the Name as: Custom: AnythingLLM 1.7.5
- AnythingLLM has a concept of a “default” chat provider, and this setting is configurable. We often want to have some of this configuration automated, rather than clicking around in AnythingLLM to configure the chat provider, so we can use OpenShift AI’s Data Connections concept as a way to provide those configurations.
- Head to “Connection types,” still inside that left hand navigation bar:
- Click “Create connection type” near the top of the screen:
- Enter the Connection type name as: “AnythingLLM Credentials - Generic OpenAI”
- Enter the connection type description as: “This Connection Type allows you to store your LLM serving runtime API details.”
- In the Category box, select URI
- Click “Add section heading”:
- In the “Add section heading” popup, enter “AnythingLLM Config” as the name of the heading and click “Add.”
- On the right side of the section heading showing at the bottom now, hit “Add field.”
- Enter the following information:
- Name: LLM Provider Type
- Environment variable: LLM_PROVIDER
- Type: Text - Short
- Default value: generic-openai
- Click “Add”
- Repeat this process with the following additional fields:
- Name: Base URL
  - Environment variable: GENERIC_OPEN_AI_BASE_PATH
  - Type: Text - Short
  - Default value: https://llama4-maverick.yourcluster.com:443/v1
  - Field is required: Checked
- Name: API Key
  - Environment variable: GENERIC_OPEN_AI_API_KEY
  - Type: Text - Hidden
  - Default value: your-token-here
  - Field is required: Checked
- Name: Chat Model Name
  - Environment variable: GENERIC_OPEN_AI_MODEL_PREF
  - Type: Text - Short
  - Default value: llama4-maverick
  - Field is required: Checked
- Name: Model Token Limit
  - Environment variable: GENERIC_OPEN_AI_MODEL_TOKEN_LIMIT
  - Type: Numeric
  - In Advanced settings:
    - Unit: Tokens
    - Lower threshold: 1
    - Upper threshold: 128000
  - Default value: 4096
  - Field is required: Checked
- Name: Embedding Provider
  - Environment variable: EMBEDDING_ENGINE
  - Type: Text - Short
  - Default value: native
  - Field is required: Checked
- Name: Vector Database Provider
  - Environment variable: VECTOR_DB
  - Type: Text - Short
  - Default value: lancedb
  - Field is required: Checked
- Name: Disable Telemetry
  - Environment variable: DISABLE_TELEMETRY
  - Type: Boolean
  - Checkbox label: Disables sending anonymous Telemetry to AnythingLLM
  - Default value: Checkbox is selected
- Your fields should look like the following:
- Create a Data Connection that specifies for AnythingLLM to connect to your deployed model:
- In one tab, access the OpenShift AI dashboard to recover the model connection details.
- From the Data Science Projects menu, select your llama4 Project
- Head to the models tab to see the model you deployed above
- Click on the “Internal and external endpoint details” button and copy the URL from the External section (to simplify networking setup for our chat application)
- Use the down arrow to the left of the model name
- You’ll want to be able to come back to the “Token authentication” section here to copy the token for your AnythingLLM Data Connection.
- In another tab, in the same Project, create a new connection for AnythingLLM. Select the “AnythingLLM Credentials - Generic OpenAI” type and name it maverick-vllm.
- Fill in the Base URL and API key from the other tab, where we’re looking at the served model.
- Update the Chat Model Name to “meta-llama/Llama-4-Maverick-17B-128E” to align with our configuration for the model name in the Model serving deployment.
- Leave all other settings as the default and select “Create.”
- Create a new workbench instance using the custom image and the Data Connection:
Deployment size can be minimal:
Storage can be minimal too:
Ensure you configure the Data Connection:
- Confirm that everything works as expected by opening your new Workbench. After having created a Workspace, you can begin to chat with the model!
Where to go from here?
Head over to the OpenShift AI Community Slack channel, #rhoai, hosted on the Red Hat OpenShift Service on AWS (ROSA) community workspace. We’d love to chat with you about your experience working with upstream model serving runtimes and the new Llama 4 models!
The release of the Llama 4 herd marks a pivotal moment in the world of open source AI. With the combination of Mixture of Experts architecture and early fusion of multimodality, these models are enabling developers to build more personalized multimodal experiences.
By partnering with the vLLM community and with Red Hat, Meta is ensuring developers can take advantage of Llama 4 models immediately, with a focus on performance and lower deployment costs.
Red Hat is proud to be a top contributor to vLLM, driving these innovations forward and empowering both the community and our customers with open, efficient, and scalable AI solutions.
To learn more about vLLM, join our bi-weekly vLLM office hours where we provide regular project updates and dig into leading topics around inference acceleration.