Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark
Project Hydrogen is a major Apache Spark initiative to bring state-of-the-art AI and Big Data solutions together. It comprises three major projects: 1) barrier execution mode, 2) optimized data exchange, and 3) accelerator-aware scheduling. A basic implementation of barrier execution mode was merged into Apache Spark 2.4.0, and the community is working on the latter two. In this talk, we will present progress updates on Project Hydrogen and discuss the next steps. First, we will review the barrier execution mode implementation from Spark 2.4.0. It enables developers to embed distributed training jobs properly on a Spark cluster. We will demonstrate distributed AI integrations built on top of it, e.g., Horovod and Distributed TensorFlow. We will also discuss the technical challenges of implementing those integrations and future work. Second, we will give updates on accelerator-aware scheduling and how it will help accelerate your Spark training jobs. We will also outline ongoing work on optimized data exchange.
Xingbo Jiang is a software engineer at Databricks, where he investigates use cases for Spark Core and Spark SQL. Xingbo is an active contributor to Apache Spark. His areas of interest include distributed systems, databases, and data warehouses.