Jingwei Lu

Member of the Data Infrastructure Team, Airbnb

Jingwei Lu is currently a member of the Data Infrastructure team at Airbnb. He was previously a tech lead on Facebook's data infrastructure team, in charge of query processing and language for the Bumblebee project (a Hive/Hadoop replacement). Prior to Facebook, he redesigned the runtime of SCOPE (Microsoft's equivalent of Hive) at Microsoft. He also spent 10 years on the Microsoft SQL Server engine team, building a commercial relational database engine.



Building Data Product Based on Apache Spark at Airbnb
Summit 2017

Building data products requires a Lambda Architecture to bridge batch and streaming processing. AirStream is a framework built on top of Apache Spark that allows users to easily build data products at Airbnb. It has shown that Spark is effective in production for mission-critical data products. On the streaming side, hear how AirStream integrates Spark Streaming with multiple ecosystems, such as HBase, Elasticsearch, MySQL, DynamoDB, Memcache and Redis. On the batch side, learn how to apply the same computation logic in Spark over large data sets from Hive and S3. The speakers will also go through a few production use cases and share several best practices for managing Spark jobs in production. Session hashtag: #SFeco5

Airstream: Spark Streaming At Airbnb
Summit 2016

AirStream is a real-time stream computation framework built on top of Spark Streaming and Spark SQL. It allows engineers and data scientists at Airbnb to easily leverage Spark Streaming and SQL to get real-time insights and to build real-time feedback loops. Multiple mission-critical applications have been built on top of it. In this talk, we will start with an overview of AirStream, and then go over a few production use cases, such as real-time ingestion pipelines for the data warehouse and computing derived data for online data products. We will discuss how AirStream is integrated with our big data ecosystem, including Kafka, HBase and Hive, and share a series of lessons learned from that. Takeaways include scaling multiple streaming jobs while consuming from a single Kafka cluster, managing the life cycles and checkpoints of streaming jobs, and best practices for using HBase as stateful storage.

Learn more:
  • Streaming - Getting Started with Apache Spark on Databricks
  • Diving into Apache Spark Streaming’s Execution Model
  • Making Apache Spark the Fastest Open Source Streaming Engine