Ke Sun

Senior Engineer, Bytedance

Senior engineer on big data from ByteDance with multiple years of experience using & developing Spark. Responsible for performance & feature improvement of SparkSQL and storage space improvement in Bytedance. Focused in providing an easy way for internal users to accelerate ETL job performance without too much human intervention.

Past sessions

Parquet is a very popular column based format. Spark can automatically filter useless data using parquet file statistical data by pushdown filters, such as min-max statistics. On the other hand, Spark user can enable Spark parquet vectorized reader to read parquet files by batch. These features improve Spark performance greatly and save both CPU and IO. Parquet is the default data format of data warehouse in Bytedance. In practice, we find that parquet pushdown filters work poorly resulting in reading too much unnecessary data for statistical data has no discrimination across parquet row groups(column data is out of order when writing to parquet files by ETL jobs).

Over the last year, we've added a series of optimizations in Spark to improve parquet pushdown performance. We developed a feature named LocalSort adding a sort step by some columns when writing parquet files resulting in obvious discrimination of statistical data across parquet row groups and higher compression ratio(according to history queries automatically and no need to modify ETL jobs). Furthermore, we developed a feature named Prewhere. Prewhere parquet reader selects low overhead columns from pushdown filters and reads batch data of these columns and filters data using pushdown filters and skips unnecessary batch of other projection columns. As a direct consequence of these efforts, we've achieved 30% average query improvement, 40% storage improvement of some tables and only 5% overhead.

In this talk, we'll take a deep dive into the internals of LocalSort and Prewhere, describe use-cases where LocalSort/Prewhere is useful, touch upon some work to automatically suggest sort columns based on history queries.

Speakers: Ke Sun and Jun Guo