Ron Hu is a Senior Staff Software Engineer at LinkedIn where he works on building a big data analytics platform based on Apache Spark. Prior to joining LinkedIn, he used to work at Teradata, MarkLogic, and Huawei with a focus on parallel database systems and search engines. Ron holds a PhD in Computer Science from University of California, Los Angeles.
May 27, 2021 03:15 PM PT
Apache Spark has the ‘speculative execution’ feature to handle the slow tasks in a stage due to environment issues like slow network, disk etc. If one task is running slowly in a stage, Spark driver can launch a speculation task for it on a different host. Between the regular task and its speculation task, Spark system will later take the result from the first successfully completed task and kill the slower one.
When we first enabled the speculation feature for all Spark applications by default on a large cluster of 10K+ nodes at LinkedIn, we observed that the default values set for Spark’s speculation configuration parameters did not work well for LinkedIn’s batch jobs. For example, the system launched too many fruitless speculation tasks (i.e. tasks that were killed later). Besides, the speculation tasks did not help shorten the shuffle stages. In order to reduce the number of fruitless speculation tasks, we tried to find out the root cause, enhanced Spark engine, and tuned the speculation parameters carefully. We analyzed the number of speculation tasks launched, number of fruitful versus fruitless speculation tasks, and their corresponding cpu-memory resource consumption in terms of gigabytes-hours. We were able to reduce the average job response times by 13%, decrease the standard deviation of job elapsed times by 40%, and lower total resource consumption by 24% in a heavily utilized multi-tenant environment on a large cluster. In this talk, we will share our experience on enabling the speculative execution to achieve good job elapsed time reduction at the same time keeping a minimal overhead.
June 5, 2018 05:00 PM PT
Apache Spark 2.2 shipped with a state-of-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL values, max/min, avg/max length, etc.) to improve the quality of query execution plans. Skewed data distributions are often inherent in many real world applications. In order to deal with skewed distributions effectively, we added equal-height histograms to Apache Spark 2.3. Leveraging reliable statistics and histogram helps Spark make better decisions in picking the most optimal query plan for real world scenarios.
In this talk, we'll take a deep dive into how Spark's Cost-Based Optimizer estimates the cardinality and size of each database operator. Specifically, for skewed distribution workload such as TPC-DS, we will show histogram's impact on query plan change, hence leading to performance gain.
Session hashtag: #DevSAIS13