Xin Yao is a tech lead at Pinterest. Xin is passionate about building and scaling distributed systems. Xin was priorly at Facebook Spark team, focusing on improving spark performance and efficiency. Before that, Xin worked as a Senoir Software Engineer at Hulu, where he maintained spark, built and scaled multiple batch/realtime pipelines. Xin received his master from Beijing University of Posts and Telecommunications in 2013.
In this presentation we want to share our experience in migrating Spark workload for one of the most critical clusters inside Pinterest. This includes two important changes in the software stack. First, the storage layer is changed from HDFS to S3. Second, the resource scheduler is switched from Mesos to YARN. We will share our motivation of the migration, experiences in resolving several technical challenges such as s3 performance, s3 consistency, s3 access control to match the feature and performance of HDFS. We make changes in job submission to address the differences in Mesos and Yarn. In the meantime, we optimized the Spark performance by profiling and select the most suitable EC2 instance type. After all, we achieved good performance results and a smooth migration process.
Over the years, Facebook has used Hive as the primary query engine to be used by our data engineers. Since Hive uses SQL-like query language called HQL, the list of built-in User Defined Functions (UDFs) did not always satisfy our customer requirements and as a result, an extensive list of custom UDFs was developed over time. As we started migrating pipelines from Hive to Spark SQL, a number of custom UDFs appeared incompatible with Spark, and many others showed bad performance. In this talk will first take a deep dive into how Hive UDFs work with Spark. We will then share what challenges we overcame on the way to support 99.99% of the custom UDFs in Spark.