Yucai Yu is a big data architect in eBay's Data Services and Solutions (DSS) group, leading Spark optimization for the data warehouse. He is an active contributor to Apache Spark and is experienced in system performance analysis and tuning. Prior to eBay, he worked at Intel, where he researched cutting-edge hardware technologies to accelerate big data and cloud workloads.
October 3, 2018 05:00 PM PT
eBay is migrating its 30 PB MPP database to Apache Spark. Today, more than 15,000 ETL jobs run each day on a Spark cluster of over 1,000 nodes, processing petabyte-scale data, and these numbers are growing quickly. Optimization is critical during the migration: cluster resources are usually under heavy pressure, and a well-optimized system can fit more jobs into the same limited capacity.
In this session, we will talk about the top performance challenges we encountered and how we addressed them. Every month more batch jobs move to Spark, putting pressure on cluster resources, especially memory capacity. When we dove into the top 10 memory-intensive queries, we found that improper Spark configuration, such as executor memory and shuffle partition counts, led to serious memory waste. We will share a unified configuration solution based on adaptive execution, a joint effort between Intel and eBay, which saved us half of the memory and a huge amount of manual tuning effort.
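As a rough illustration of what "adaptive" replaces, hand-tuned jobs pin a static shuffle partition count, while the adaptive-execution prototype sizes post-shuffle partitions from runtime statistics. The property names below follow the Intel adaptive-execution work of that era and the values are placeholders, not the settings used at eBay:

```properties
# Illustrative spark-defaults fragment (names from the Intel adaptive
# execution prototype; values are placeholders, not eBay's settings).

# Enable adaptive query execution.
spark.sql.adaptive.enabled=true

# Coalesce shuffle partitions at runtime toward a target input size
# per partition, instead of relying on one hand-tuned static number.
spark.sql.adaptive.shuffle.targetPostShuffleInputSize=67108864

# A generous static upper bound; adaptive execution shrinks the
# actual partition count per stage based on shuffle statistics.
spark.sql.shuffle.partitions=2000
```

The key design point is that a single generous upper bound works for many jobs at once, which is what makes a unified configuration possible.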
Next, we have some very large historical tables. To process them efficiently we need both bucketing and partitioning, but this approach often produces a huge number of small files when the bucket count is large. At eBay, we combine Spark SQL's bucketing feature with Parquet's min-max index to implement indexed bucket tables, which show very good performance: some important cases gain a 2.5x improvement.
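The min-max idea can be sketched outside Spark: each Parquet row group carries min/max statistics per column in the file footer, so a reader can skip whole row groups whose value range cannot satisfy the predicate. A toy Python model of that skipping (the function names are illustrative, not eBay's implementation; real Parquet readers do this natively):

```python
# Toy model of Parquet row-group min/max skipping (illustrative only).

def build_row_groups(rows, group_size):
    """Split data into row groups, recording per-group min/max stats."""
    groups = []
    for i in range(0, len(rows), group_size):
        chunk = rows[i:i + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups

def scan_equal(groups, key):
    """Read only row groups whose [min, max] range can contain `key`."""
    hits, groups_read = [], 0
    for g in groups:
        if g["min"] <= key <= g["max"]:      # min-max index check
            groups_read += 1
            hits.extend(r for r in g["rows"] if r == key)
        # else: the whole row group is skipped without being read
    return hits, groups_read

# Bucketing plus sorting clusters values, so most groups are skippable.
groups = build_row_groups(sorted(range(1000)), group_size=100)
hits, groups_read = scan_equal(groups, 437)
print(hits, groups_read)   # -> [437] 1  (only 1 of 10 groups read)
```

The payoff depends on clustering: if the data were unsorted, every group's min/max range could overlap the key and nothing would be skipped, which is why bucketing and sorting matter for the index to be effective.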
Finally, data skew is very common in a large data warehouse, and some weird OOMs are caused by it. We will root-cause them and show an improved join algorithm for generic skewed-join handling based on runtime transformation.
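One common runtime transformation for skewed joins is key salting: replicate the build-side rows of a hot key under N sub-keys and scatter the probe-side rows among them, so a single hot key no longer lands in one oversized task. A minimal Python sketch of the idea (hypothetical helper, assumed parameters; the talk's actual Spark-level algorithm may differ):

```python
import random

def salt_join(left, right, hot_keys, n_salts=4, seed=0):
    """Hash join where hot-key rows are spread across n_salts sub-keys.
    `left`/`right` are lists of (key, value) pairs; returns (k, l, r)."""
    rng = random.Random(seed)

    # Build side: replicate each hot-key row under every salted sub-key,
    # so any salted probe row can still find its match.
    table = {}
    for k, v in right:
        salts = range(n_salts) if k in hot_keys else (0,)
        for s in salts:
            table.setdefault((k, s), []).append(v)

    # Probe side: scatter hot-key rows randomly across the salts;
    # in Spark this spreads one hot key over n_salts reduce tasks.
    out = []
    for k, v in left:
        s = rng.randrange(n_salts) if k in hot_keys else 0
        for w in table.get((k, s), []):
            out.append((k, v, w))
    return out

left = [("a", 1), ("a", 2), ("a", 3), ("b", 4)]
right = [("a", "x"), ("b", "y")]
# Every ("a", *) row still joins with ("a", "x"), plus ("b", 4, "y");
# the salt changes only where the work runs, not the join result.
print(sorted(salt_join(left, right, hot_keys={"a"})))
```

The trade-off is build-side replication (n_salts copies of hot-key rows) in exchange for evening out the probe-side workload, which is why such a rewrite is best applied at runtime, only to keys that statistics show to be skewed.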
Session hashtag: #SAISExp11