I have over 21 years of experience in software development across the domains of optical networking, finance, internet, distributed computing, and big data. My current interests are data pipelines, machine learning, and GPUs.
Of late, Graphics Processing Units (GPUs) have become popular for their extraordinarily low price per FLOP, owing to their very high core counts (typically hundreds on PCs and laptops, thousands on server-class GPUs). This metric has improved considerably in the past few years and has left CPUs far behind. This "embarrassingly parallel" computing power is generally exploited through the CUDA and OpenCL programming models (and, of late, Java/Scala bindings as well) and suits classes of problems and algorithms that parallelize well on this platform.

While Spark distributes computation across nodes in the form of RDD partitions, within a partition computation is performed on CPU cores. Using GPU-based computation in Spark can therefore vastly reduce cluster size, and with it cost. The computational model of GPUs is efficient for contiguous data, so the columnar data layout of DataFrames is ideal for offloading some of the computation to GPUs. We propose that the within-partition implementation of some DataFrame APIs, such as filter, selectExpr, groupBy, join, and agg, be changed to a parallel GPU-based one. In this session, we present the performance gains on filter and selectExpr.

An earlier effort to exploit GPUs for Spark was presented in SparkCL, where users develop Spark code against SparkCL APIs. Our work complements that effort by automatically letting some of the internal computation of DataFrames happen on GPUs, without any code change by users.
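The mapping from a columnar filter to data-parallel GPU work can be sketched as follows. This is a plain-Python stand-in, not Spark or CUDA code: each loop iteration models what one GPU thread would compute over one element of a contiguous column, and the function name `columnar_filter` is purely illustrative, not an internal Spark API.

```python
# Illustrative sketch (assumed names, not Spark internals): how a filter
# over a contiguous column maps to data-parallel work on a GPU.

def columnar_filter(column, predicate):
    """Return the selection mask and the compacted column.

    On a GPU this would be a map step (one thread evaluates the
    predicate per element) followed by stream compaction (an exclusive
    prefix sum over the mask gives each surviving element its output
    index). Here both steps are written sequentially for clarity.
    """
    # Step 1: map -- embarrassingly parallel; each element is independent.
    mask = [predicate(x) for x in column]
    # Step 2: compaction -- gather the elements whose mask bit is set.
    kept = [x for x, keep in zip(column, mask) if keep]
    return mask, kept

# Example: the equivalent of df.filter("age > 30") on an 'age' column.
ages = [25, 42, 37, 18, 55]
mask, kept = columnar_filter(ages, lambda a: a > 30)
```

Because the column is stored contiguously, the map step reads memory in a coalesced pattern, which is exactly the access pattern GPUs are efficient at; a row-oriented layout would scatter each column's values across memory and lose that benefit.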