Daniel Dai

Software Engineer, Pinterest

Daniel Dai is currently working on data processing platform in Pinterest. He is PMC member for Apache Hive and Pig. He has a PhD in Computer Science with specialization in computer security, data mining and distributed computing from University of Central Florida. He is interested in data processing, distributed system and cloud computing.

Past sessions

Pinterest is moving all batch processing to Apache Spark, which includes a large amount of legacy ETL workflows written in Cascading/Scalding. In this talk, we will share the challenges and solutions we experienced during this migration, which includes the motivation of the migration, how to fill the semantic gap between different engines, the difficulty dealing with thrift objects widely used in Pinterest, how we improve Spark accumulators, how to tune the Spark performance after migration using our innovative Spark profiler, and also the performance improvements and cost saving we have achieved after the migration.

In this session watch:
Daniel Dai, Software Engineer, Pinterest
Zirui Li, Developer, Pinterest


Summit 2020 From HDFS to S3: Migrate Pinterest Apache Spark Clusters

June 25, 2020 05:00 PM PT

In this presentation we want to share our experience in migrating Spark workload for one of the most critical clusters inside Pinterest. This includes two important changes in the software stack. First, the storage layer is changed from HDFS to S3. Second, the resource scheduler is switched from Mesos to YARN. We will share our motivation of the migration, experiences in resolving several technical challenges such as s3 performance, s3 consistency, s3 access control to match the feature and performance of HDFS. We make changes in job submission to address the differences in Mesos and Yarn. In the meantime, we optimized the Spark performance by profiling and select the most suitable EC2 instance type. After all, we achieved good performance results and a smooth migration process.