There is growing interest in running Apache Spark natively on Kubernetes. We will explain the design idioms, architecture, and internal mechanics of Spark orchestration on Kubernetes, as well as the ongoing work of the community. Since data for Spark analytics is often stored in HDFS, we will also explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as data locality and security through Kubernetes constructs such as secrets and RBAC rules.
Session hashtag: #SAISDD5
Ilan Filonenko is a member of the Data Science Infrastructure team at Bloomberg, where he has designed and implemented distributed systems at both the application and infrastructure level. He is one of the principal contributors to Spark on Kubernetes, primarily focusing on the effort to enable secure HDFS interaction and non-JVM support. Previously, Ilan was an engineering consultant and technical lead in various startups and research divisions across multiple industry verticals, including medicine, hospitality, finance, and music. Ilan currently researches algorithmic, software, and hardware techniques for high-performance machine learning, with a focus on optimizing stochastic algorithms and model management.