Ph.D. student in the Computer Science and Engineering Department at the University of California, San Diego, advised by Prof. Arun Kumar. Research interests focus on machine learning systems that make data science easier and faster. Has been working on data systems for video analytics and deep learning model selection.
June 25, 2020 05:00 PM PT
Deep neural networks (deep nets) are revolutionizing many machine learning (ML) applications. But there is a major bottleneck to broader adoption: the pain of model selection. This empirical process involves exploring deep net architectures and hyper-parameters, often requiring hundreds of trials. Alas, most ML systems, including Spark, focus on training one model at a time, which reduces throughput and raises costs; some also sacrifice reproducibility. We present Cerebro, a system that raises deep net model selection throughput at scale without raising resource costs and without sacrificing reproducibility or accuracy. Cerebro uses a novel parallel SGD execution strategy from our research, which we call model hopper parallelism.* It is also general enough to work on top of Spark. This talk is about Cerebro and its integration into Spark.
First, we will review the state-of-the-art research in this active field and introduce Cerebro, covering its core intuitions, design, and architecture. Then we will demonstrate with experiments that, in terms of resource efficiency (memory/storage usage, runtime, and communication costs), standalone Cerebro can outperform existing systems, including Horovod and task parallelism. Finally, we will showcase the integration of Cerebro into Spark by using and extending the Spark APIs, and describe the challenges and technical effort required to achieve this. We will then present experiments showing that Cerebro on Spark can be a more resource-efficient choice for Spark users working on deep learning model selection.
*Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems. Supun Nakandala, Yuhao Zhang, and Arun Kumar. ACM SIGMOD 2019 DEEM Workshop.