

Building more efficient AI with vLLM ft. Nick Hill

Technically Speaking Team
Artificial intelligence

We've seen AI deliver complex results in seconds, but there's a performance paradox at its core: the powerful GPUs that run these models are often underutilized. This core inefficiency, caused by memory bottlenecks during the inference process, drives up costs and limits what's possible with the technology. This is the world of inference optimization, and it's where open source communities are quietly shaping the future.

In this episode, Red Hat CTO Chris Wright talks to Nick Hill, a key contributor to the vLLM open source project, about the innovations directly tackling this problem. They dive into PagedAttention, the technique that virtually eliminates memory fragmentation by changing how the KV cache is managed, and discuss how it, combined with speculative decoding, maximizes GPU throughput. This is a systems-level look at making powerful AI practical and performant at scale. Tune in to better understand the foundational technology that will power the next wave of enterprise innovation.
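To make the paging idea concrete, here is a minimal Python sketch of a block-based KV cache: each request draws fixed-size blocks from a shared pool and records them in a per-sequence block table, so memory is allocated on demand rather than reserved up front. The names and sizes used here (BlockAllocator, Sequence, a block_size of 16 tokens) are illustrative assumptions for this sketch, not vLLM's actual API.

# Illustrative sketch of paged KV-cache block allocation.
# Names and sizes are hypothetical; this is not vLLM's implementation.

class BlockAllocator:
    """Hands out fixed-size KV-cache blocks from a shared pool."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                 # tokens per block
        self.free_blocks = list(range(num_blocks))   # physical block ids

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks one request's logical-to-physical block table."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new block only when the current one is full, so waste
        # is bounded by at most one partially filled block per sequence.
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        # Return all blocks to the shared pool when the request finishes.
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table.clear()


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=8, block_size=16)
    seq = Sequence(allocator)
    for _ in range(40):           # generate 40 tokens
        seq.append_token()
    print(seq.block_table)        # 3 blocks cover 40 tokens (16 + 16 + 8)
    seq.release()

Because blocks need not be contiguous, a finished request's blocks can be returned to the pool and reused immediately by other requests, which is what keeps fragmentation and over-reservation low.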


About the show

Technically Speaking

What’s next for enterprise IT? No one has all the answers, but CTO Chris Wright knows the tech experts and industry leaders who are working on them.