Jun Guo is in charge of data engine team at Bytedance. His team is focusing on data warehouse architecture development and optimization for a EB level data platform. Spark SQL is one of the most important engine in this team and Spark SQL process hundreds of PB of data each day. Prior to Bytedance, he worked for Cisco and eBay, where he focused on data platform and data warehouse infrastructure optimization.
Bucketing is commonly used in Hive and Spark SQL to improve performance by eliminating Shuffle in Join or group-by-aggregate scenario. This is ideal for a variety of write-once and read-many datasets at Bytedance.
However, Spark SQL bucketing has various limitations:
As a direct consequence of these efforts, we have witnessed over 90% growth in queries that leverage bucketing cross the entire data warehouse at Bytedance. In this talk, we present how we design and implement a new bucketing mechanism to solve all the above limitations and improve join and group-by-aggregate performance significantly.