Spark SQL is a very effective distributed SQL engine for OLAP and is widely adopted in Baidu production for many internal BI projects. However, Baidu has also faced many challenges at large scale, including tuning the shuffle parallelism for thousands of jobs, inefficient execution plans, and data skew.
In this talk, we will explore Intel and Baidu’s joint efforts to address these large-scale challenges and offer an overview of an adaptive execution mode we implemented for Baidu’s Big SQL platform, which is based on Spark SQL. At runtime, adaptive execution can change the execution plan to use a better join strategy and handle skewed joins automatically. It can also change the number of reducers to better fit the data scale. In general, adaptive execution decreases the effort involved in tuning SQL query parameters and improves execution performance by choosing a better execution plan and parallelism at runtime.
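To make the reducer adjustment concrete, here is a minimal sketch of one way a runtime could pick the number of reducers from observed shuffle partition sizes: merge adjacent small partitions until a target size is reached. This is an illustration under stated assumptions, not the implementation presented in the talk; the function name `coalesce_partitions`, the 64 MB target, and the sample sizes are all hypothetical.

```python
from typing import List


def coalesce_partitions(sizes: List[int], target_bytes: int) -> List[List[int]]:
    """Group adjacent shuffle partition indices so each group stays at or
    under target_bytes where possible. A single partition larger than the
    target gets a group of its own. (Hypothetical helper for illustration.)"""
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups


MB = 1024 * 1024
# Assumed per-partition output sizes observed after the map stage.
sizes = [10 * MB, 20 * MB, 30 * MB, 70 * MB, 5 * MB, 5 * MB]
groups = coalesce_partitions(sizes, target_bytes=64 * MB)
print(groups)       # [[0, 1, 2], [3], [4, 5]]
print(len(groups))  # 3 reducers instead of the original 6
```

Each group would be read by one reducer, so the reducer count follows the actual data volume rather than a fixed, hand-tuned shuffle parallelism.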
We’ll also share our experience of using adaptive execution in Baidu’s production cluster with thousands of servers, where adaptive execution helps to improve the performance of some complex queries by 200%. After further analysis, we found that several special scenarios in Baidu data analysis can benefit from the optimization of choosing a better join type. We got a 2x performance improvement in a scenario where the user wanted to analyze the costs of 1000+ advertisers from both the web and mobile sides, with each side having a full information table of 10 TB of Parquet files per day. We are now writing probe jobs to detect more such scenarios among our users’ current daily jobs. We are also considering exposing a strategy interface to upper-layer users, based on the detailed metrics collected from adaptive execution mode.
Session hashtag: #Exp5SAIS
Carson Wang is a big data software engineer at Intel, focusing on developing and improving new big data technologies. He is an active open source contributor to the Apache Spark and Alluxio projects. He is also a core developer and maintainer of HiBench, an open source big data micro-benchmark suite. Prior to Intel, Carson worked for Microsoft on Windows Azure.
Yuanjian is a senior engineer and the lead of the distributed computing team at Baidu. He and his team develop and support the internal MapReduce and Spark platforms at Baidu. He is also a Spark contributor. Prior to that, he worked on real-time computing and distributed tracing systems.