Zirui Li is a software engineer on Pinterest’s Spark team. He has been building Pinterest’s Spark platform and developing Spark functionality. He focuses mainly on Pinterest’s in-house PySpark platform, Spark History Server, Spark tools, and Spark optimization. He holds a master’s degree in Computational Data Science from Carnegie Mellon University.
May 27, 2021 03:15 PM PT
Pinterest is moving all batch processing to Apache Spark, including a large number of legacy ETL workflows written in Cascading/Scalding. In this talk, we will share the challenges we faced during this migration and how we addressed them: the motivation for the migration, how we bridged the semantic gap between the engines, the difficulty of handling the Thrift objects widely used at Pinterest, how we improved Spark accumulators, and how we tuned Spark performance after migration using our innovative Spark profiler. We will also cover the performance improvements and cost savings we have achieved since the migration.