SESSION

Scalable and High Performance LLM Inference with Apache Spark™

OVERVIEW

EXPERIENCE: In Person
TYPE: Breakout
TRACK: Generative AI
INDUSTRY: Enterprise Technology, Media and Entertainment
TECHNOLOGIES: AI/Machine Learning, Apache Spark, GenAI/LLMs
SKILL LEVEL: Intermediate
DURATION: 40 min

In recent years, the Apache Spark™ distributed runtime has emerged as a powerful platform for ML training and inference. From barrier execution mode to TorchDistributor in Spark 3.4, multiple gaps have been bridged to better serve the ML community. One critical feature for ML workloads on Spark is the ability to run tasks with GPU acceleration. In this session, we will present our Spark platform, which leverages mixed compute resources (GPU/CPU) to power large-scale LLM inference efficiently. We will discuss scenarios where data processing and model inference can be combined in a single Spark pipeline to improve performance and cost efficiency. We will share our novel approach to, and production experience with, stage-level resource scheduling, both with dynamic resource allocation enabled and disabled. We will also dive into advanced solutions for scaling batch inference on Spark with NVIDIA Triton Inference Server and vLLM, and how to tackle common production challenges at scale.
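
To make the mixed GPU/CPU pattern concrete, below is a minimal PySpark sketch of stage-level resource scheduling (this is an illustration, not code from the session; the input path and the preprocess/inference functions are hypothetical placeholders). CPU-bound data preparation runs under the default resource profile, and GPUs are requested only for the inference stage:

    from pyspark.sql import SparkSession
    from pyspark.resource import (ExecutorResourceRequests,
                                  TaskResourceRequests,
                                  ResourceProfileBuilder)

    spark = SparkSession.builder.appName("mixed-cpu-gpu-llm-inference").getOrCreate()
    sc = spark.sparkContext

    def preprocess(line):
        # Hypothetical CPU-side cleanup/tokenization step.
        return line.strip().lower()

    def run_inference(partition):
        # Hypothetical GPU-side step; a real job would load a model once
        # per partition and score records in batches.
        for prompt in partition:
            yield (prompt, len(prompt))

    # CPU-heavy preparation runs under the default (CPU-only) resource profile.
    prepared = sc.textFile("data/prompts.txt").map(preprocess)

    # Request one GPU per executor and per task for the inference stage only.
    ereqs = ExecutorResourceRequests().cores(8).resource("gpu", 1)
    treqs = TaskResourceRequests().cpus(8).resource("gpu", 1)
    gpu_profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

    # Only this stage is scheduled onto GPU executors.
    results = prepared.withResources(gpu_profile).mapPartitions(run_inference)
    print(results.collect())

Whether Spark acquires new GPU executors for that stage or reuses existing ones depends on whether dynamic resource allocation is enabled, which is the on/off trade-off the abstract refers to.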
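For the batch-inference side, one common pattern is a sketch assuming Spark 3.4's predict_batch_udf together with vLLM's offline LLM API (the model name and batch size here are placeholder choices): the model is loaded once per executor Python process and prompts are scored in batches.

    import numpy as np
    from pyspark.sql import SparkSession
    from pyspark.ml.functions import predict_batch_udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("vllm-batch-inference").getOrCreate()

    def make_predict_fn():
        # Runs once per executor Python process; predict_batch_udf caches
        # the returned function, so the engine is reused across batches.
        from vllm import LLM, SamplingParams
        llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model
        params = SamplingParams(temperature=0.0, max_tokens=128)

        def predict(prompts):
            # prompts is a numpy array holding one batch of input strings.
            outputs = llm.generate(prompts.tolist(), params)
            return np.array([o.outputs[0].text for o in outputs])

        return predict

    generate = predict_batch_udf(make_predict_fn,
                                 return_type=StringType(),
                                 batch_size=32)

    df = spark.createDataFrame([("Summarize Apache Spark in one sentence.",)],
                               ["prompt"])
    df.select(generate("prompt").alias("completion")).show(truncate=False)

A Triton Inference Server deployment would follow the same shape, with the predict function issuing requests to the server instead of running the engine in-process.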

SESSION SPEAKERS

Chenya Zhang

Tech Lead / Engineering Manager
Apple

Sam Huang

Apple Inc.