TensorFlow On Spark: Scalable TensorFlow Learning on Spark Clusters

In recent releases, TensorFlow has been enhanced for distributed learning and HDFS access. Outside the Google cloud, however, users still need a dedicated cluster for TensorFlow applications. Several community projects wire TensorFlow onto Apache Spark clusters. While these approaches are a step in the right direction, they support only synchronous distributed learning and don’t allow TensorFlow servers to communicate with each other directly. This session will introduce a new framework, TensorFlowOnSpark, for scalable TensorFlow learning, which will be open sourced in Q1 2017. The framework makes it easy to experiment with algorithm designs and supports scalable training and inference on Spark clusters. It supports all TensorFlow functionality, including synchronous and asynchronous learning, model and data parallelism, and TensorBoard. It provides architectural flexibility in how data is ingested into TensorFlow (pushing vs. pulling) and in the network protocol used for server-to-server communication (gRPC or RDMA). Its Python API makes integration with existing Spark libraries such as MLlib easy. The speakers will walk through multiple examples to outline these key capabilities and share benchmark results on scalability. Learn how, with a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application. You’ll also leave with tangible takeaways on how this framework makes it easy to run deep learning in the cloud or on premises.

Session hashtag: #SFdev9

About Andy Feng

Andy Feng is VP of Architecture at Yahoo, leading the architecture and design of big data and machine learning initiatives. He has architected major platforms for personalization, ads serving, NoSQL, and cloud infrastructure.

About Lee Yang

Lee Yang is a Principal Engineer at Yahoo, working on large-scale systems and machine learning platforms.