From HDFS to S3: Migrate Pinterest Apache Spark Clusters - Databricks

In this presentation, we share our experience migrating the Spark workload of one of the most critical clusters at Pinterest. The migration involved two major changes to the software stack. First, the storage layer changed from HDFS to S3. Second, the resource scheduler switched from Mesos to YARN. We will cover our motivation for the migration and how we resolved several technical challenges, such as S3 performance, S3 consistency, and S3 access control, in order to match the features and performance of HDFS. We also changed job submission to address the differences between Mesos and YARN. In the meantime, we optimized Spark performance by profiling and by selecting the most suitable EC2 instance type. In the end, we achieved good performance results and a smooth migration process.

About Daniel Dai


Daniel Dai currently works on the data processing platform at Pinterest. He is a PMC member for Apache Hive and Pig. He has a PhD in Computer Science, with specializations in computer security, data mining, and distributed computing, from the University of Central Florida. He is interested in data processing, distributed systems, and cloud computing.

About Xin Yao


Xin Yao is a Software Engineer on the Spark team at Facebook. Before Facebook, Xin worked as a Senior Software Engineer at Hulu, where he built the real-time ETL pipeline and scaled the data warehouse. Xin received his master's degree from Beijing University of Posts and Telecommunications in 2013.