Enterprises across all sectors have invested heavily in big data infrastructure (Hadoop, Impala, Spark, Kafka, etc.) to turn data into insights and insights into business value. Clusters are getting bigger and more complex, and they serve more and more data scientists and engineers. As a result, it is increasingly challenging for Data Ops teams to operate and maintain these clusters to meet business requirements and performance SLAs. For instance, a single SQL query may fail or take a long time to complete for various reasons, such as SQL-level inefficiencies, data skew, missing or stale statistics, and pool-level resource configurations. A single resource-hogging query could in turn impact the entire application stack on that cluster.
Cluster-wide tuning is a critical capability for scaling application performance. Examples include tuning the default application configurations so that all applications benefit from the change, tuning pool-level resource allocations, and identifying wide-impact issues such as slow nodes and too many small files, among many others. Cluster-level tuning requires considering more factors than tuning a single application and carries a risk of significantly worsening cluster performance; yet it is often done via trial and error with educated guesswork, if attempted at all.
We employ machine learning and AI techniques to make cluster-level tuning easier, more data-driven, and more accurate. This talk will describe our methodology for learning from various sources of data, such as the workload, the cluster and pool resources, and the metastore, and for recommending cluster defaults for application and pool resource configurations. We will also present a case study in which a customer applied Unravel tuning recommendations and achieved a 114% increase in the number of applications running per day while using 47% fewer vCore-Hours and 15% fewer containers.
Eric Chu is the VP Data Insights at Unravel Data, where he leads the effort to automatically identify and improve inefficiencies in big data application performance. Previously, he was instrumental in managing a 1,500-node cluster that ran over 4 million Hadoop applications at Rocket Fuel. Before that, he designed and implemented Microsoft's first online database management service, which made it easy for database admins to manage databases on Microsoft SQL Azure. Eric received his PhD in Computer Science, specializing in data management, from the University of Wisconsin-Madison and has presented at multiple research conferences such as SIGMOD and VLDB.