Running Spark Inside Docker Containers: From Workload to Cluster

Download Slides

This presentation describes the journey we went through in containerizing Spark workload into multiple elastic Spark clusters in a multi-tenant kubernetes environment. Initially we deployed Spark binaries onto a host-level filesystem, and then the Spark drivers, executors and master can transparently migrate to run inside a Docker container by automatically mounting host-level volumes. In this environment, we do not need to prepare a specific Spark image in order to run Spark workload in containers. We then utilized Kubernetes helm charts to deploy a Spark cluster. The administrator could further create a Spark instance group for each tenant. A Spark instance group, which is akin to the Spark notion of a tenant, is logically an independent kingdom for a tenant’s Spark applications in which they own dedicated Spark masters, history server, shuffle service and notebooks. Once a Spark instance group is created, it automatically generates its image and commits to a specified repository. Meanwhile, from Kubernetes’ perspective, each Spark instance group is a first-class deployment and thus the administrator can scale up/down its size according to the tenant’s SLA and demand. In a cloud-based data center, each Spark cluster can provide a Spark as a service while sharing the Kubernetes cluster. Each tenant that is registered into the service gets a fully isolated Spark instance group. In an on-prem Kubernetes cluster, each Spark cluster can map to a Business Unit, and thus each user in the BU can get a dedicated Spark instance group. The next step on this journey will address the resource sharing across Spark instance groups by leveraging new Kubernetes’ features (Kubernetes31068/9), as well as the Elastic workload containers depending on job demands (Spark18278). Demo:
Session hashtag: #EUeco5

About Haohai Ma

Haohai Ma is a Software Architect at Spectrum Computing, IBM. In the past 10 years, he is the expertise on HPC, SOA, MapReduce, Yarn, Spark , Docker and Kubernetes. He is one of the major contributors of IBM BigData offerings. Haohai holds PhD in Computer Science from Peking University

About Khalid Ahmed

Khalid Ahmed is an STSM, Chief Architect of Infrastructure Software at IBM Spectrum Computing. He works on the design and architecture of large scale grid and cloud computing systems with focus on scheduling, resource, workload and data management. In over 20 years at industry experience he has worked in a number of roles including development, product management and architecture. His latest interests include big data systems, container technology and data center operating system concepts. Khalid has a M.A.Sc from University of Toronto.