Zhan Zhang - Databricks

Zhan Zhang

Software Engineer, Facebook

Zhan Zhang is a Software Engineer at Facebook, where he works in the Data Infra group, focusing on large-scale distributed systems, especially Apache Spark in production. He obtained his PhD from the University of Florida and is interested in distributed systems and large-scale machine learning. Zhan is an active contributor to several Apache projects, including Apache Spark, YARN, and HBase, and has presented his work at Hadoop Summit (Dublin, 2016), HBaseCon (San Francisco, 2016), and Spark Summit (San Francisco, 2017).


Migrating Apache Hive Workload to Apache Spark: Bridge the Gap

Summit 2018

At Spark Summit 2017, we described our framework to migrate production Hive workload to Spark with minimal user intervention. After a year of migration, Spark now powers an important part of our batch processing workload. The migration framework supports syntax compatibility analysis, offline/online shadowing, and data validation. In this session, we first introduce new features and improvements in the migration framework that support bucketed tables and increase automation. Next, we dive deep into the top technical challenges we encountered and how we addressed them. We improved the syntax compatibility between Hive and Spark from around 51% to 85% by identifying and developing the top missing features, fixing incompatible UDFs, and implementing a UDF testing framework. In addition, we developed reliable join operators to improve Spark stability in production when leveraging optimizations such as ShuffledHashJoin. Finally, we share an update on our overall migration effort and examples of migration wins. For example, we were able to migrate one of the most complicated workloads at Facebook from Hive to Spark with more than a 2.5X performance gain.

Session hashtag: #Exp4SAIS
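As a rough illustration of the data validation step mentioned above, the following sketch compares the output of a Hive run with the output of a shadow Spark run of the same query. It is not the framework described in the talk: the function names (`row_fingerprint`, `validate_shadow_output`) and the approach of comparing order-independent row fingerprints are assumptions made for illustration only.

```python
import hashlib
from collections import Counter


def row_fingerprint(row):
    """Hash one row (hypothetical helper).

    NULLs are encoded as a sentinel byte so they cannot collide with
    the string 'None'; other values use repr(), so 1 and 1.0 differ.
    """
    encoded = "\x1f".join("\x00" if v is None else repr(v) for v in row)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def validate_shadow_output(hive_rows, spark_rows):
    """Compare two result sets as multisets, ignoring row order.

    Assumption: both engines are expected to produce exactly the same
    rows, including duplicates, but not necessarily in the same order.
    """
    hive_counts = Counter(row_fingerprint(r) for r in hive_rows)
    spark_counts = Counter(row_fingerprint(r) for r in spark_rows)

    # Counter subtraction keeps only positive differences.
    missing = sum((hive_counts - spark_counts).values())
    extra = sum((spark_counts - hive_counts).values())

    return {
        "match": missing == 0 and extra == 0,
        "missing_in_spark": missing,
        "extra_in_spark": extra,
    }


if __name__ == "__main__":
    hive = [(1, "a"), (2, None), (2, None)]
    spark = [(2, None), (1, "a"), (2, None)]
    print(validate_shadow_output(hive, spark))
```

Comparing multisets of fingerprints rather than ordered lists reflects the fact that two engines can legitimately return the same rows in different orders. At production scale the same idea would more plausibly be expressed as aggregate checksums computed inside each engine, which this sketch does not attempt.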