Teaching Apache Spark Clusters to Manage Their Workers Elastically

DevOps engineers have applied a great deal of creativity and energy to inventing tools that automate infrastructure management, in the service of deploying capable and functional applications. For data-driven applications running on Apache Spark, the details of instantiating and managing the backing Spark cluster can distract from the application logic. In the spirit of DevOps, automating Spark cluster management tasks lets engineers focus their attention on application code that provides value to end users.
Using OpenShift Origin as a laboratory, we implemented a platform where Apache Spark applications create their own clusters and then dynamically manage their own scale via host-platform APIs. This makes it possible to launch a fully elastic Spark application with little more than the click of a button.

We will present a live demo of turn-key deployment for elastic Apache Spark applications, and share what we’ve learned about developing Spark applications that manage their own resources dynamically with platform APIs.

The audience for this talk will be anyone looking for ways to streamline their Apache Spark cluster management, reduce the workload for Spark application deployment, or create self-scaling elastic applications. Attendees can expect to learn about leveraging APIs in the Kubernetes ecosystem that enable application deployments to manipulate their own scale elastically.

About Erik Erlandson

Erik Erlandson is a Software Engineer at Red Hat, where he investigates analytics use cases and scalable deployments for Apache Spark in the cloud. He also consults on internal data science and analytics projects. Erik is a contributor to Apache Spark and other open source projects in the Spark ecosystem, including the Spark on Kubernetes community project, Algebird and Scala.

About Trevor McKay

Trevor McKay is a Senior Software Engineer at Red Hat with a background in distributed computing and big data processing. He has been a core contributor to the OpenStack data processing project (Sahara), which manages Apache Hadoop, Spark, and Storm deployments. He continues to focus on finding ways to make analytics easily accessible for developers and end users.