The freedom to iterate quickly on distributed deep learning tasks is crucial for smaller companies seeking to gain competitive advantage and market share from big tech giants. HorovodRunner brings this process to relatively accessible Spark clusters. There have been, however, no benchmark tests on HorovodRunner per se, and only very limited scalability benchmark tests on Horovod, its predecessor, which requires custom-built GPU clusters. For the first time, we show that Databricks’ HorovodRunner achieves a significant lift in scaling efficiency for convolutional neural network (CNN, hereafter) based tasks on both GPU and CPU clusters.
We also implemented the Rectified Adam optimizer for the first time in HorovodRunner. In addition to showing test results, we will also discuss lessons we learned about how to do it, such as:
Dr. Jing Pan is a Sr. Staff Data Scientist/User Experience Researcher at eHealth Inc. She oversees all customer-facing modeling projects and technical evaluations of third-party services and/or mergers and acquisitions. She is passionate about the productionization of deep learning models on Spark clusters. In 2019, she was the first in the world to apply the Rectified Adam optimizer on HorovodRunner-enabled Spark clusters for distributed deep learning training. In 2017, at Fanatics Inc., she was perhaps the first in the world to serve a deep learning model trained with Keras in a distributed fashion on Spark worker nodes.
Wendao Liu received his master's degree from the prestigious Drexel University's LeBow College of Business. He is a Ph.D. student in Business Administration and at the same time works at eHealth, Inc. as a full-stack data scientist. With his rare combination of business mindset and strong technical skills, he can not only tackle data issues but also leverage data to drive business performance. He identifies business opportunities, optimizes product performance, and provides recommendations. He builds an end-to-end customer unification data product, which leverages machine learning techniques to provide reliable linkage across disparate systems.