Deep learning has been adopted in many domains, and model training takes different forms depending on the scenario. In some cases, such as recommendation and graph learning, the model can grow beyond the memory of a single node in pursuit of better performance. Existing all-reduce, data-parallel training is then no longer a good fit. In this session, we introduce a new solution for training large-scale deep learning models on Apache Spark by combining two frameworks: Angel and BigDL. BigDL is a Spark-native distributed deep learning framework that provides fast operator implementations on Spark CPU clusters; it loads the model graph and optimizes it for execution. However, BigDL's distributed training is built on the Spark block manager and still follows the all-reduce, data-parallel pattern, so it cannot handle models larger than a single node's memory. We bring in Angel to address this. Angel is a flexible and powerful parameter server for large-scale machine learning: the model is stored distributed across the Angel parameter-server cluster. Large models often consume sparse data, and Angel provides flexible synchronization mechanisms that exploit this sparsity during training, as well as fine-grained parameter fetch/update to make communication more efficient.
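To illustrate the idea, here is a minimal toy sketch of the parameter-server pattern described above: the model lives on the server, and each worker pulls and pushes only the sparse slice of parameters its minibatch touches. The class and method names are hypothetical and do not reflect Angel's actual API.

```python
# Toy sketch of fine-grained, sparse parameter fetch/update.
# NOTE: ToyParameterServer and worker_step are illustrative only,
# not Angel's real interface.
import numpy as np

class ToyParameterServer:
    """Stores a (potentially huge) embedding table, one row per feature id."""
    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.table = {}  # feature id -> weight vector, created lazily

    def fetch(self, ids):
        # Fine-grained pull: only the requested rows cross the network.
        return {i: self.table.setdefault(i, self.rng.normal(size=self.dim))
                for i in ids}

    def update(self, grads, lr=0.1):
        # Fine-grained push: only rows with non-zero gradient are sent.
        for i, g in grads.items():
            self.table[i] -= lr * g

def worker_step(ps, sparse_batch, lr=0.1):
    """One data-parallel step over a sparse minibatch of feature ids."""
    active = sorted(set(sparse_batch))
    weights = ps.fetch(active)                      # pull active rows only
    grads = {i: 2.0 * weights[i] for i in active}   # toy gradient of ||w||^2
    ps.update(grads, lr=lr)                         # push active rows only

ps = ToyParameterServer(dim=4)
worker_step(ps, sparse_batch=[3, 17, 3, 42])
```

Because a sparse minibatch activates only a handful of feature ids, this pattern moves a few rows per step instead of synchronizing the full model, which is what makes training feasible when the model exceeds a single node's memory.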
Fitz Wang, Ph.D., is a senior software researcher at Tencent. He is a TAC member of the Linux Foundation Artificial Intelligence (LF AI) and the owner of the Angel project. He has years of experience in distributed machine learning and big data.