Amogh Margoor is currently leading the Spark efforts at Qubole. His area of interests are Compilers, Big Data Engines and SQL Optimizers. At Qubole, prior to Spark he has worked on the internals of Presto and Hive to help them scale to thousands of nodes on cloud. Specifically, he has worked on join optimizations like Dynamic Filtering and Cost based Join Reordering, Join Distribution types, SQL Optimizer work for supporting Materialized Views and OLAP Cubes, Auto-Tuning Big Data Stack, infrastructure for AIR (data intelligence and QDS internal Data warehouse), Object Store (S3) optimizations for cloud etc.
At Qubole, users run Spark at scale on cloud (900+ concurrent nodes). At such scale, for efficiently running SLA critical jobs, tuning Spark configurations is essential. But it continues to be a difficult undertaking, largely driven by trial and error. In this talk, we will address the problem of auto-tuning SQL workloads on Spark. The same technique can also be adapted for non-SQL Spark workloads. In our earlier work, we proposed a model based on simple rules and insights. It was simple yet effective at optimizing queries and finding the right instance types to run queries.
However, with respect to auto tuning Spark configurations we saw scope of improvement. On exploration, we found previous works addressing auto-tuning using Machine learning techniques. One major drawback of the simple model is that it cannot use multiple runs of query for improving recommendation, whereas the major drawback with Machine Learning techniques is that it lacks domain specific knowledge. Hence, we decided to combine both techniques. Our auto-tuner interacts with both models to arrive at good configurations.
Once user selects a query to auto tune, the next configuration is computed from models and the query is run with it. Metrics from event log of the run is fed back to models to obtain next configuration. Auto-tuner will continue exploring good configurations until it meets the fixed budget specified by the user. We found that in practice, this method gives much better configurations compared to configurations chosen even by experts on real workload and converges soon to optimal configuration.
In this talk, we will present a novel ML model technique and the way it was combined with our earlier approach. Results on real workload will be presented along with limitations and challenges in productionizing them.  Margoor et al,'Automatic Tuning of SQL-on-Hadoop Engines' 2018,IEEE CLOUD