Methods that scale with available computation are the future of AI. Distributed deep learning is one such method that enables data scientists to massively increase their productivity by (1) running parallel experiments over many devices (GPUs/TPUs/servers) and (2) massively reducing training time by distributing the training of a single network over many devices. Apache Spark is a key enabling platform for distributed deep learning, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end pipeline. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows to build distributed deep learning applications.
We will analyse the different frameworks for integrating Spark with Tensorflow, from Horovod to TensorflowOnSpark to Databrick’s Deep Learning Pipelines. We will also look at where you will find the bottlenecks when training models (in your frameworks, the network, GPUs, and with your data scientists) and how to get around them. We will look at how to use Spark Estimator model to perform hyper-parameter optimization with Spark/TensorFlow and model-architecture search, where Spark executors perform experiments in parallel to automatically find good model architectures.
The talk will include a live demonstration of training and inference for a Tensorflow application embedded in a Spark pipeline written in a Jupyter notebook on the Hops platform. We will show how to debug the application using both Spark UI and Tensorboard, and how to examine logs and monitor training. The demo will be run on the Hops platform, currently used by over 450 researchers and students in Sweden, as well as at companies such as Scania and Ericsson.
Session hashtag: #SAISDL2
Jim Dowling is the CEO of Logical Clocks AB, as well as an Associate Professor at KTH Royal Institute of Technology in Stockholm, and a Senior Researcher at SICS RISE. He is the lead architect of Hops Hadoop, the world's most fastest and most scalable Hadoop distribution and only Hadoop platform with support for GPUs as a resource. His research concentrates on building systems support for machine learning at scale. He is a regular speaker at Big Data and AI industry conferences, and blogs at O'Reilly on AI.