Airstream: Spark Streaming At Airbnb

AirStream is a real-time stream computation framework built on top of Spark Streaming and Spark SQL. It allows engineers and data scientists at Airbnb to easily leverage Spark Streaming and Spark SQL to get real-time insights and to build real-time feedback loops. Multiple mission-critical applications have been built on top of it. In this talk, we will start with an overview of AirStream and then go over a few production use cases, such as real-time ingestion pipelines for the data warehouse and computing derived data for online data products. We will discuss how AirStream integrates with the rest of our big data ecosystem, including Kafka, HBase, and Hive, and share a series of lessons learned from that integration. Takeaways include scaling multiple streaming jobs that consume from a single Kafka cluster, managing the life cycles and checkpoints of streaming jobs, and best practices for using HBase as stateful storage.

About Liyin Tang

Liyin Tang is a software engineer on the Data Infrastructure team at Airbnb. Before Airbnb, he worked at Facebook and Dropbox. He focuses on building highly available and reliable storage services and helping those services scale in the face of exponential data growth. Mr. Tang joined the HBase PMC in 2013 and has also contributed to other Apache projects, including HDFS and Hive. Recently, he has been building a streaming infrastructure to power real-time data products at Airbnb. He holds a master's degree in computer science from the University of Southern California.

About Jingwei Lu

Jingwei Lu is currently a member of the Data Infrastructure team at Airbnb. He was previously a tech lead on the Facebook data infrastructure team, in charge of query processing and language for the Bumblebee project (a Hive/Hadoop replacement). Prior to Facebook, he redesigned the runtime of SCOPE (Microsoft's equivalent of Hive) at Microsoft. He spent 10 years on the Microsoft SQL Server engine team building a commercial relational database engine.