Priyanka Gomatam is a Software Engineer in Azure Data at Microsoft. Prior to Microsoft, she worked on building an AI and ML Platform at Visa. Priyanka has completed her Masters in Computer Science from The Pennsylvania State University. Her interests lie in building cloud-scale Analytics-as-a-Service platforms and big data systems.
June 25, 2020 05:00 PM PT
Queries in production workloads and interactive data analytics are often overlapping, i.e., multiple queries share parts of the computation. These redundancies increase the processing time and total cost for the user. To reuse computations, many big data processing systems support materialized views. However, it is challenging to manually select common computations in the workload given the size and evolving nature of the query workloads. In this talk, we will present Spark Cruise, an automatic computation reuse system developed for Spark. It can automatically detect overlapping computations in the past query workload and enable automatic materialization and reuse in future Spark SQL queries.
SparkCruise requires no active involvement from the user as the materialization and reuse is applied automatically in the background as part of query processing. We can perform all these steps without changing the Spark code, thus demonstrating the extensibility of Spark SQL engine. Spark Cruise has shown to improve the overall runtime of TPC-DS queries by 30%. Our talk will be divided into three parts. First, we will explain the end-to-end system design with focus on how we added workload awareness to the Spark query engine. Then, we will demonstrate all the steps including analysis, feedback, materialization, and reuse on a live Spark cluster. Finally, we will show the workload insights notebook. This Python notebook displays the information from query plans of the workload in a flat table. This table helps the users and administrators to understand the characteristics of their workloads and the cost/benefit tradeoff of enabling SparkCruise.