Accelerating LLM Inference with vLLM
OVERVIEW
| EXPERIENCE | In Person |
| --- | --- |
| TYPE | Breakout |
| TRACK | Data Science and Machine Learning |
| INDUSTRY | Enterprise Technology |
| TECHNOLOGIES | AI/Machine Learning, GenAI/LLMs |
| SKILL LEVEL | Intermediate |
| DURATION | 40 min |
vLLM is an open-source, highly performant engine for LLM inference and serving, developed at UC Berkeley. vLLM has been widely adopted across the industry, with 12K+ GitHub stars and 150+ contributors worldwide. Since its initial release, the vLLM team has improved performance by more than 10x. This session will cover key topics in LLM inference performance, including paged attention and continuous batching. We will then focus on the innovations we have made to vLLM and the technical challenges behind them, including speculative decoding, prefix caching, disaggregated prefill, and multi-accelerator support. The session will conclude with industry case studies of vLLM and future roadmap plans.

Takeaways:
- vLLM is an open-source engine for LLM inference and serving, providing state-of-the-art performance and an accelerator-agnostic design.
- By focusing on production-readiness and extensibility, vLLM's design choices have led to new systems insights and rapid community adoption.
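For attendees who want a concrete starting point before the session, the sketch below shows offline batched inference with vLLM's Python API. It is a minimal illustration, not material from the talk: the model name, sampling settings, and the `enable_prefix_caching` flag are assumptions chosen for the example, and availability of that flag depends on the installed vLLM version.

```python
# Minimal sketch of offline batched inference with vLLM (illustrative only).
# The model name, sampling parameters, and enable_prefix_caching flag are
# assumptions; check the docs for the version of vLLM you have installed.
from vllm import LLM, SamplingParams

# The engine handles paged attention and continuous batching internally.
# enable_prefix_caching (where supported) lets requests that share a prompt
# prefix reuse the corresponding KV-cache blocks.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

prompts = [
    "The future of LLM inference is",
    "The future of LLM serving is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches the prompts together and returns one result per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

The same engine also backs vLLM's OpenAI-compatible HTTP server (e.g., `python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m`), which is the usual entry point for production serving.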
SESSION SPEAKERS
Zhuohan Li
PhD Student
UC Berkeley / vLLM
Cade Daniel
Software Engineer
Anyscale