The SparkR project provides language bindings and runtime support to enable users to run scalable computation from R using Apache Spark. SparkR has an active set of contributors from many companies, and a number of recent developments have improved performance and usability. Some of the improvements include: (a) a new R-to-JVM bridge that enables easy deployment on YARN clusters, (b) serialization and deserialization routines that enable integration with other Spark components such as ML Pipelines, (c) a complete RDD API, with support for DataFrames coming, and (d) performance improvements for various operations, including shuffles.
This talk will present an overview of the project, outline some of these technical contributions, and discuss new features planned for the next year. We will also present a demo showcasing how SparkR can be used to seamlessly process large datasets on a cluster directly from the R console.
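To give a flavor of the console workflow the demo covers, a minimal word-count session against SparkR's RDD API might look like the sketch below. The master URL and input path are placeholders, and the function names assume the SparkR RDD API of the time (`sparkR.init`, `textFile`, `flatMap`, `reduceByKey`, `collect`); this is an illustrative sketch, not a definitive script, and it requires a running Spark cluster.

```r
library(SparkR)

# Connect to a Spark cluster; the master URL here is a placeholder.
sc <- sparkR.init(master = "spark://example-host:7077", appName = "WordCount")

# Distribute a text file across the cluster (hypothetical HDFS path).
lines <- textFile(sc, "hdfs:///data/logs.txt")

# Split each line into words, then pair each word with a count of 1.
words <- flatMap(lines, function(line) strsplit(line, " ")[[1]])
pairs <- lapply(words, function(word) list(word, 1L))

# Sum counts per word in parallel (2 partitions), then fetch results to R.
counts <- reduceByKey(pairs, "+", 2L)
head(collect(counts))
```

Each transformation above is lazy; work is only launched on the cluster when `collect` pulls the results back into the local R session.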