Rui Liu is a director of engineering at Novumind Inc. He has recently worked on deep learning system and data pipeline. Before joining Novumind, he was a system software engineer at HP Vertica advanced R&D Lab, where he worked on Vertica query optimizer, storage engine, and Vertica/Spark integration. Previously, he was a software engineer at the data analytic infrastructure group of LinkedIn, where his responsibility was around building a computation engine for pan-LinkedIn data. Previously, he was postdoctoral research fellow at the database group of the University of Waterloo, building databases for cloud environment. Earlier, he was a researcher at HP Labs China, where his research area was to facilitate data intensive computations within and over database engines.
Spark provides a wide range of functionalities for data ingestion and preprocessing for deep learning systems. However, the training process of deep learning has different properties from data processing systems. It is a computation and communication intensive process. The back-propagation algorithm needs GPU to achieve acceptable speed; and the SGD algorithm demands low latency synchronization of weights at very high frequencies. On the contrary, big data processes are usually bounded by IO or network throughput, and optimized for data locality, minimization of data IO and shuffling. Our Spark deep learning system is designed to leverage the advantages of the two worlds, Spark and high-performance computing. We designed GPU data frame that allows programmers to launch training processes from regular Spark applications, seamlessly integrating the training with the Spark data processing pipeline, and having the full control of the training process from the Spark application. Beneath the GPU data frame, our training process is running outside of Spark as a service, which is highly optimized for GPU computing, NUMA binding and RDMA. The training data is shared from Spark off-heap buffer to our training process in Apache Arrow format and ZERO-copied. The training results and the deep learning model, can be directly used by the Spark application for follow-up processing. Session hashtag: #DLSAIS15
We present a Vertica data connector for Spark that integrates with Spark's Datasource API so that data from Vertica can be efficiently loaded into Spark. A simple JDBC Spark Datasource that loads all data of a Vertica table into Spark is not optimal, because it does not take advantage of the pre-processing capabilities of Vertica to reduce the amount of data to be transferred or leverage parallel processing effectively. Our solution connects the computational pipelines of Spark and Vertica in an optimized way, and not only utilizes parallel channels for data movement, but also (a) pushes computation down into Vertica when appropriate, and (b) maintains data-locality when transferring data between the two systems. Operations on structured data (such as those operations expressed in Spark SQL) can be processed by Vertica, Spark, or a combination of both. The connector controls Vertica's table data flowing through query execution plans in parallel; the data is then transferred into Spark's pipeline. Our push-down optimizations identify opportunities to reduce the data volume transferred by allowing Vertica to pre-process the filter, project, join and group-by operators before passing the data into Spark. When using a simple connection scheme, parallel connections to Vertica quickly saturate the network bandwidth of the database nodes, becoming the bottleneck due to inter-Vertica-node shuffling. This happens because each Spark task pulls a specific partition (range) of the input data, which is typically scattered across the Vertica cluster. To address this, we have devised an innovative solution to reduce data movement within Vertica. Using a consistent hash ring, the connector guarantees that there is no unnecessary data shuffling inside Vertica, minimizing network bandwidth and optimizing the data flow from Vertica to Spark. Each query execution plan inside Vertica only targets the data (i.e., segment of the data) that is local to each node.Additional Reading: