Scalable and High Performance LLM Inference with Apache Spark™
OVERVIEW
| EXPERIENCE | In Person |
| --- | --- |
| TYPE | Breakout |
| TRACK | Generative AI |
| INDUSTRY | Enterprise Technology, Media and Entertainment |
| TECHNOLOGIES | AI/Machine Learning, Apache Spark, GenAI/LLMs |
| SKILL LEVEL | Intermediate |
| DURATION | 40 min |
In recent years, the Apache Spark™ distributed runtime has emerged as a powerful platform for ML training and inference. From barrier execution mode to TorchDistributor in Spark 3.4, multiple gaps have been bridged to better serve the ML community. One critical capability for ML workloads on Spark is the ability to run tasks with GPU acceleration. In this session, we will present our Spark platform, which leverages mixed compute resources (GPU/CPU) to power large-scale LLM inference efficiently. We will discuss scenarios where data processing and model inference can be combined in a single Spark pipeline to improve performance and cost efficiency. We will share our novel approach and production experience with stage-level resource scheduling, both with dynamic resource allocation enabled and disabled. We will also dive into advanced solutions for scaling batch inference on Spark with NVIDIA Triton Inference Server and vLLM, and how to tackle common production challenges at scale.
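To make the pattern concrete, here is a minimal PySpark sketch of the kind of single pipeline the abstract describes: CPU-only preprocessing followed by a stage-level resource profile that requests GPUs only for the vLLM inference stage. This is not the speakers' production code; the paths, model id, and core/GPU counts are illustrative assumptions.

```python
# Hypothetical sketch of the pipeline pattern described above, not the
# speakers' code. Assumes Spark 3.4+, a GPU-discoverable cluster, and vLLM
# installed on the executors; paths, model id, and amounts are placeholders.
from pyspark.sql import SparkSession
from pyspark.resource import (
    ExecutorResourceRequests,
    ResourceProfileBuilder,
    TaskResourceRequests,
)

spark = SparkSession.builder.appName("spark-llm-batch-inference").getOrCreate()

# CPU stage: ordinary DataFrame work (filtering, prompt templating).
prompts = (
    spark.read.parquet("s3://bucket/raw_docs")                # placeholder path
         .selectExpr("concat('Summarize: ', text) AS prompt")
)

# Stage-level resource profile: ask for one GPU per executor and pin one task
# per GPU, so only the inference stage is scheduled onto GPU resources. This
# works with dynamic allocation on; task-level requests are also honored with
# it off as of Spark 3.4.
ereqs = ExecutorResourceRequests().cores(8).resource("gpu", 1)
treqs = TaskResourceRequests().cpus(8).resource("gpu", 1)
gpu_profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

def infer_partition(rows):
    """Load the model once per partition and stream generations back."""
    from vllm import LLM, SamplingParams  # imported on the executor

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")    # placeholder model
    sampling = SamplingParams(temperature=0.0, max_tokens=128)
    batch = [row.prompt for row in rows]
    for out in llm.generate(batch, sampling):
        yield (out.prompt, out.outputs[0].text)

# GPU stage: attach the profile via the RDD stage-level scheduling API.
results = (
    prompts.rdd
           .withResources(gpu_profile)
           .mapPartitions(infer_partition)
)
results.toDF(["prompt", "completion"]).write.parquet("s3://bucket/completions")
```

Keeping preprocessing and inference in one job avoids materializing intermediate tables between separate CPU and GPU clusters, which is the kind of consolidation the session's performance and cost-efficiency claims refer to.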
SESSION SPEAKERS
Chenya Zhang
Tech Lead / Engineering Manager
Apple
Sam Huang
Apple Inc.