Tom Phelan is co-founder and chief architect of BlueData. Prior to BlueData, Tom was an early employee at VMware and, as senior staff engineer, was a key member of the ESX storage architecture team. During his 10-year stint at VMware, he designed and developed the ESX storage I/O load-balancing subsystem and modular “pluggable storage architecture.” He went on to lead teams working on many key storage initiatives, such as the cloud storage gateway and vFlash. Earlier, he was a member of the original team at Silicon Graphics that designed and implemented XFS, the first commercially available 64-bit file system.
Today, almost any application can be "Dockerized." However, there are special challenges when deploying a distributed application such as Spark on containers. This session will describe how to overcome these challenges in deploying Spark on Docker containers, with several practical tips and techniques for running Spark in a container environment. Containers are typically used to run non-distributed applications on a single host. Running a distributed application in a secure multi-host container environment raises significant real-world enterprise requirements that need to be addressed. There are also decisions to be made about the tools and infrastructure. For example, there are a number of different container managers, orchestration frameworks, and resource schedulers available today, including Mesos, Kubernetes, Docker Swarm, and more. Each has its own strengths and weaknesses, and each has unique characteristics that may make it suitable, or unsuitable, for Spark. Understanding these differences is critical to the successful deployment of Spark on Docker containers. This session will describe the work done by the BlueData engineering team to run Spark inside containers on a distributed platform, including the evaluation of various orchestration frameworks and lessons learned. You will learn how to apply practical networking and storage techniques to achieve high performance and agility in a distributed container environment.
Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale production environments poses interesting challenges, especially when deploying distributed Big Data applications like Apache Spark. These challenges include container lifecycle management, smart scheduling for optimal resource utilization, network configuration and security, and performance. At BlueData, we’re “all in” on Docker containers – with a specific focus on Spark applications. We’ve learned first-hand how to address these challenges for Fortune 500 enterprises and government organizations that want to deploy Big Data workloads using Docker. In this session, you’ll learn how to securely network Docker containers across multiple hosts. We’ll discuss ways to achieve high availability across distributed Big Data applications and hosts in your data center. Since we’re talking about very large volumes of data, performance is a key factor, so we’ll discuss some of the storage options we explored and implemented at BlueData to achieve near bare-metal I/O performance for Spark using Docker. We’ll share our lessons learned as well as some tips and tricks on how to Dockerize your Big Data applications in a reliable, scalable, and high-performance environment.