A Data Frame Abstraction Layer for SparkR

Download Slides

The data frame is a fundamental construct in R programming and is one of the primary reasons why R has become such a popular language for data analysis. In Spark 1.3, SparkSQL received its own implementation of the data frame concept which extends the SchemaRDD with additional methods that enable more intuitive data manipulation. The similarities between R’s data frame and its analogue in SparkSQL represent a clear opportunity for development within the SparkR project. A natural, and proven, path forward is to create R bindings to the existing Spark DataFrame API and extend it to work with R’s traditional data frame syntax. With DataFrame support, we expose new functionality that allows R programmers to quickly become proficient at working with large, distributed datasets in Spark . In this talk, we describe how this was accomplished and provide a demonstration of the enhanced functionality in SparkR.