Dr. Jing Pan is a Sr. Staff Data Scientist/User Experience Researcher at eHealth Inc. She oversees all customer-facing modeling projects and technical evaluations of third-party services and of mergers and acquisitions. She is passionate about the productionization of deep learning models on Spark clusters. In 2019, she was the first in the world to apply the Rectified Adam optimizer on HorovodRunner-enabled Spark clusters for distributed deep learning training. In 2017, at Fanatics Inc., she was perhaps the first in the world to serve a deep learning model trained with Keras in a distributed fashion on Spark worker nodes.
June 23, 2020 05:00 PM PT
The freedom to iterate quickly on distributed deep learning tasks is crucial for smaller companies seeking to gain a competitive advantage and market share from tech giants. HorovodRunner brings this process to relatively accessible Spark clusters. There have been, however, no benchmark tests on HorovodRunner per se, and only very limited scalability benchmark tests on Horovod, its predecessor, which requires custom-built GPU clusters. For the first time, we show that Databricks' HorovodRunner achieves a significant lift in scaling efficiency for convolutional neural network (CNN)-based tasks on both GPU and CPU clusters.
We also implemented the Rectified Adam optimizer in HorovodRunner for the first time. In addition to showing test results, we will discuss lessons we learned about how to do this, such as: