Liyin Tang is a software engineer on the Data Infrastructure team at Airbnb. Before Airbnb, he worked at Facebook and Dropbox. He focuses on building highly available and reliable storage services and helping those services scale in the face of exponential data growth.
Mr. Tang joined the HBase PMC in 2013 and has also contributed to other Apache projects, including HDFS and Hive. Recently, he has been building a streaming infrastructure to power realtime data products at Airbnb.
He holds a master’s degree in computer science from the University of Southern California.
Building data products requires a Lambda Architecture to bridge batch and stream processing. AirStream is a framework built on top of Apache Spark that allows users to easily build data products at Airbnb. It has proved that Spark is impactful and useful in production for mission-critical data products. On the streaming side, hear how AirStream integrates multiple ecosystems with Spark Streaming, such as HBase, Elasticsearch, MySQL, DynamoDB, Memcache and Redis. On the batch side, learn how to apply the same computation logic in Spark over large data sets from Hive and S3. The speakers will also go through a few production use cases and share several best practices on how to manage Spark jobs in production. Session hashtag: #SFeco5
AirStream is a realtime stream computation framework built on top of Spark Streaming and Spark SQL. It allows engineers and data scientists at Airbnb to easily leverage Spark Streaming and SQL to get realtime insights and to build realtime feedback loops. Multiple mission-critical applications have been built on top of it. In this talk, we will start with an overview of AirStream, and then go over a few production use cases, such as realtime ingestion pipelines for the data warehouse and computing derived data for online data products. We will discuss how AirStream is integrated into our big data ecosystem, including Kafka, HBase and Hive, and share a series of lessons learned from that. Takeaways include scaling multiple streaming jobs while consuming from a single Kafka cluster, managing streaming jobs' life cycles and checkpoints, and best practices for leveraging HBase as stateful storage.