I am currently a third-year Ph.D. student in the Computer Science and Engineering Department at the University of California, San Diego. My research interests lie broadly at the intersection of Systems and Machine Learning, an emerging area increasingly referred to as Systems for ML. In this space, I operate as a data management researcher. Taking inspiration from classical data management techniques, I build new abstractions, algorithms, and systems to improve the efficiency, scalability, and usability of machine learning workloads. I am also interested in large-scale applied ML, which opens up new systems challenges.
Deep neural networks (deep nets) are revolutionizing many machine learning (ML) applications. But there is a major bottleneck to broader adoption: the pain of model selection. This empirical process involves exploring deep net architectures and hyper-parameters, often requiring hundreds of trials. Alas, most ML systems, including Spark, focus on training one model at a time, which reduces throughput and raises costs; some also sacrifice reproducibility. We present Cerebro, a system that raises deep net model selection throughput at scale without raising resource costs and without sacrificing reproducibility or accuracy. Cerebro uses a novel parallel SGD execution strategy from our research, which we call model hopper parallelism.* It is also general enough to work on top of Spark. This talk is about Cerebro and its integration into Spark.
First, we will review the state-of-the-art research in this active field and introduce Cerebro, going over its core intuitions, design, and architecture. Then we will demonstrate with experiments that, in terms of resource efficiency (memory/storage usage, runtime, and communication costs), standalone Cerebro can outperform existing systems, including Horovod and task parallelism. Finally, we will showcase the integration of Cerebro into Spark by using and extending the Spark APIs, and describe the challenges and technical effort required to achieve this. We will close with experiments showing that Cerebro on Spark can be a more resource-efficient choice for Spark users working on deep learning model selection.
*Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems. Supun Nakandala, Yuhao Zhang, and Arun Kumar. ACM SIGMOD 2019 DEEM Workshop.