It is commonly assumed that Spark cannot support large-scale machine learning well, because the model size may exceed the memory limit of a single node. However, we have developed an algorithm that can train logistic regression and softmax models with 1 trillion dimensions on the standard version of Spark in 15 minutes (with 500 million training samples). To achieve this, we proposed a new optimization algorithm, carefully chose an appropriate distribution method, and applied model compression techniques.
Zhang Xiatian is the chief data scientist of TalkingData. He has long been engaged in machine learning research and has published dozens of research papers. He also has extensive experience in applications of machine learning, such as recommender systems and computational advertising. Xiatian manages the Data Science Center of TalkingData, which develops new technologies and applications to support the core business and explores future directions for TalkingData. He previously worked at IBM China Research Institute, the Tencent data platform, and Huawei Noah's Ark Lab.