Session

Real-Time Mode Technical Deep Dive: How We Built Sub-300 Millisecond Streaming Into Apache Spark™

Overview

Experience: In Person
Type: Breakout
Track: Data Engineering and Streaming
Industry: Energy and Utilities, Media and Entertainment, Financial Services
Technologies: Apache Spark, DLT, LakeFlow
Skill Level: Advanced
Duration: 40 min

Real-time mode is a new low-latency execution mode for Apache Spark™ Structured Streaming. It can consistently provide p99 latencies less than 300 milliseconds for a broad set of stateless and stateful streaming queries. Our talk focuses on the technical aspects of making this possible in Spark.

We’ll dive into the core architecture that enables these dramatic latency improvements, including a concurrent stage scheduler and a non-blocking shuffle. We’ll explore how we maintained Spark’s fault-tolerance guarantees, and we’ll also share specific optimizations we made to our streaming SQL operators.

These architectural improvements have already enabled Databricks customers to build workloads with latencies up to 10x lower than before. Early adopters in our Private Preview have successfully implemented real-time enrichment pipelines and feature engineering for machine learning, use cases that were previously impossible at these latencies.

Session Speakers

Jerry Peng

Staff Software Engineer
Databricks

Neil Ramaswamy

Software Engineer
Databricks