Jun Guo

Head of Data Engine Team, Bytedance

Jun Guo is in charge of the data engine team at Bytedance. His team focuses on data warehouse architecture development and optimization for an EB-scale data platform. Spark SQL is one of the team's most important engines, processing hundreds of petabytes of data each day. Prior to Bytedance, he worked for Cisco and eBay, where he focused on data platform and data warehouse infrastructure optimization.



Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle

Summit 2020

Bucketing is commonly used in Hive and Spark SQL to improve performance by eliminating Shuffle in Join or group-by-aggregate scenarios. This makes it ideal for the many write-once, read-many datasets at Bytedance.

However, Spark SQL bucketing has various limitations:

  1. The bucketing mechanism in Spark SQL is different from the one in Hive, so migration from Hive to Spark SQL is expensive;
  2. Spark SQL bucketing requires sorting at read time, which greatly degrades performance;
  3. When Spark writes data to a bucketed table, it can generate tens of millions of small files, which HDFS does not handle well;
  4. Bucket joins are triggered only when the two tables have the same number of buckets;
  5. The bucket key set must be identical to the join key set or grouping key set.

Over the last year, we have added a series of optimizations to Apache Spark to eliminate the above limitations so that the new bucketing mechanism covers more scenarios. The new bucketing also makes Hive to Spark SQL migration smoother.
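To make the limitations above concrete, here is a minimal sketch of how stock Spark SQL bucketing is declared with the standard DataFrameWriter API. The table and column names are hypothetical; the bucket count and bucket keys must line up on both sides for the shuffle-free join to apply.

```scala
// Sketch of stock Spark SQL bucketing (table/column names are hypothetical).
// Both tables need the same number of buckets (limitation 4), and the join
// keys must exactly match the bucket keys (limitation 5) for the join to
// plan without a shuffle.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Write both sides bucketed by the join key into the same number of buckets.
spark.table("raw_orders")
  .write
  .bucketBy(1024, "user_id")
  .sortBy("user_id")   // sorts within each output file; Spark may still
                       // sort again at read time (limitation 2)
  .saveAsTable("orders_bucketed")

spark.table("raw_users")
  .write
  .bucketBy(1024, "user_id")
  .sortBy("user_id")
  .saveAsTable("users_bucketed")

// With matching bucketing on both sides, the join keys equal the bucket
// keys, so the plan should show no Exchange (shuffle) on the join inputs.
val joined = spark.sql(
  """SELECT o.*, u.country
    |FROM orders_bucketed o
    |JOIN users_bucketed u ON o.user_id = u.user_id""".stripMargin)
joined.explain()
```

Note that each writer task emits one file per bucket it touches, which is how a wide write can fan out into tens of millions of small files (limitation 3).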

As a direct consequence of these efforts, we have witnessed over 90% growth in queries that leverage bucketing across the entire data warehouse at Bytedance. In this talk, we present how we designed and implemented a new bucketing mechanism that solves all of the above limitations and significantly improves join and group-by-aggregate performance.