Abhishek Roy is a Research Engineer in the Gray Systems Lab (GSL) at Microsoft. His research focuses on improving the performance of big data query engines in the cloud. Before joining GSL, he was a graduate researcher in the database group at the University of Massachusetts Amherst. As part of his dissertation, he collaborated with the New York Genome Center to build a big data platform for genomic data pipelines. Ultimately, he hopes that his work will make data processing faster and more resource-efficient for end users.
Queries in production workloads and interactive data analytics often overlap, i.e., multiple queries share parts of their computation. These redundancies increase processing time and total cost for the user. To reuse computations, many big data processing systems support materialized views. However, manually selecting common computations is challenging given the size and evolving nature of query workloads. In this talk, we will present SparkCruise, an automatic computation reuse system developed for Spark. It automatically detects overlapping computations in the past query workload and enables automatic materialization and reuse in future Spark SQL queries.
SparkCruise requires no active involvement from the user, as materialization and reuse are applied automatically in the background as part of query processing. We perform all these steps without changing the Spark code, thus demonstrating the extensibility of the Spark SQL engine. SparkCruise has been shown to improve the overall runtime of TPC-DS queries by 30%. Our talk will be divided into three parts. First, we will explain the end-to-end system design, with a focus on how we added workload awareness to the Spark query engine. Then, we will demonstrate all the steps, including analysis, feedback, materialization, and reuse, on a live Spark cluster. Finally, we will show the workload insights notebook. This Python notebook displays the information from the query plans of the workload in a flat table, which helps users and administrators understand the characteristics of their workloads and the cost/benefit tradeoff of enabling SparkCruise.
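To make the idea of detecting overlapping computations concrete, here is a minimal Python sketch. It is not SparkCruise's actual algorithm or API: the `PlanNode` class, the `signature` and `overlapping_subplans` functions, and the toy query plans are all hypothetical, chosen only to illustrate one common approach, which is to hash every subtree of each query plan and count how many queries contain each hash.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanNode:
    """Hypothetical, simplified query-plan operator (not a Spark class)."""
    op: str
    children: tuple = ()


def signature(node):
    """Hash of an operator together with the signatures of its inputs."""
    child_sigs = ",".join(signature(c) for c in node.children)
    return hashlib.sha256(f"{node.op}({child_sigs})".encode()).hexdigest()


def subtrees(node):
    """Yield the node itself and every node below it."""
    yield node
    for child in node.children:
        yield from subtrees(child)


def overlapping_subplans(workload, min_queries=2):
    """Return (subplan, query_count) pairs shared by at least min_queries plans."""
    counts = Counter()
    example = {}
    for plan in workload:
        seen = set()  # count each signature at most once per query
        for node in subtrees(plan):
            sig = signature(node)
            if sig not in seen:
                seen.add(sig)
                counts[sig] += 1
                example.setdefault(sig, node)
    return [(example[s], n) for s, n in counts.items() if n >= min_queries]


# Toy workload: two queries that share the same scan and filter.
scan = PlanNode("Scan(sales)")
shared_filter = PlanNode("Filter(year = 2020)", (scan,))
q1 = PlanNode("Aggregate(sum(amount))", (shared_filter,))
q2 = PlanNode("Project(customer_id)", (shared_filter,))

candidates = overlapping_subplans([q1, q2])
```

In this toy workload, `candidates` contains the shared scan and the shared filter, each seen in two queries. A real system would additionally have to weigh the cost of materializing each candidate against the expected savings before selecting any of them as a view.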