Edward Ma is a software engineer at HPE Vertica. He has been a core contributor to several products for data analytics and machine learning, such as HPE Distributed R, and maintains close ties with the community for R, for which he has developed several packages, such as vertica.dplyr and ddR. He currently works to improve the symbiosis between Spark and Vertica. He holds a Master’s degree from Carnegie Mellon University.
We present a Vertica data connector for Spark that integrates with Spark's Datasource API so that data from Vertica can be efficiently loaded into Spark. A simple JDBC Spark Datasource that loads all data of a Vertica table into Spark is not optimal, because it does not take advantage of the pre-processing capabilities of Vertica to reduce the amount of data to be transferred or leverage parallel processing effectively. Our solution connects the computational pipelines of Spark and Vertica in an optimized way, and not only utilizes parallel channels for data movement, but also (a) pushes computation down into Vertica when appropriate, and (b) maintains data-locality when transferring data between the two systems. Operations on structured data (such as those operations expressed in Spark SQL) can be processed by Vertica, Spark, or a combination of both. The connector controls Vertica's table data flowing through query execution plans in parallel; the data is then transferred into Spark's pipeline. Our push-down optimizations identify opportunities to reduce the data volume transferred by allowing Vertica to pre-process the filter, project, join and group-by operators before passing the data into Spark. When using a simple connection scheme, parallel connections to Vertica quickly saturate the network bandwidth of the database nodes, becoming the bottleneck due to inter-Vertica-node shuffling. This happens because each Spark task pulls a specific partition (range) of the input data, which is typically scattered across the Vertica cluster. To address this, we have devised an innovative solution to reduce data movement within Vertica. Using a consistent hash ring, the connector guarantees that there is no unnecessary data shuffling inside Vertica, minimizing network bandwidth and optimizing the data flow from Vertica to Spark. Each query execution plan inside Vertica only targets the data (i.e., segment of the data) that is local to each node.Additional Reading: