Daniel Dai is currently working on data processing platform in Pinterest. He is PMC member for Apache Hive and Pig. He has a PhD in Computer Science with specialization in computer security, data mining and distributed computing from University of Central Florida. He is interested in data processing, distributed system and cloud computing.
In this presentation we want to share our experience in migrating Spark workload for one of the most critical clusters inside Pinterest. This includes two important changes in the software stack. First, the storage layer is changed from HDFS to S3. Second, the resource scheduler is switched from Mesos to YARN. We will share our motivation of the migration, experiences in resolving several technical challenges such as s3 performance, s3 consistency, s3 access control to match the feature and performance of HDFS. We make changes in job submission to address the differences in Mesos and Yarn. In the meantime, we optimized the Spark performance by profiling and select the most suitable EC2 instance type. After all, we achieved good performance results and a smooth migration process.