SparkR Tutorial at useR 2016

AMPLab and Databricks gave a tutorial on SparkR at the useR conference, which was held from June 27 to June 30 at Stanford. In this blog post, we give a high-level overview of the tutorial, along with pointers to the training material and some findings from a survey we conducted during the tutorial.

Part I: Data Exploration

The first part of the tutorial was about big data exploration with SparkR. We started with a presentation introducing SparkR. It gave an overview of the SparkR architecture and introduced three types of machine learning that are possible with SparkR:

  • Big Data, Small Learning
  • Partition, Aggregate
  • Large Scale Machine Learning

The hands-on exercise started with a brief overview of Databricks Workspace. We used R Notebooks in Databricks Community Edition to run R and SparkR commands. Community Edition is a free service that supports running Spark in Scala, Python, and R.

Participants started by importing the first notebook into their workspace. As you can see in this notebook, we started by reading the Million Songs dataset as an Apache Spark DataFrame and visually explored it with two techniques:

  • Summarizing and visualizing
  • Sampling and visualizing

The notebook introduces both techniques with practical examples and ends with a few exercises.
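To make the difference between the two techniques concrete, here is a minimal sketch in plain Python rather than SparkR. It is not the notebook's code: the synthetic song records and the helper names (`make_songs`, `summarize_by_decade`, `sample_rows`) are hypothetical stand-ins. Summarizing reduces the full dataset to a small table of aggregates that is cheap to plot, while sampling keeps a small random subset of the raw rows.

```python
import random
from collections import Counter


def make_songs(n, seed=0):
    """Generate synthetic song records (a stand-in for the real dataset)."""
    rng = random.Random(seed)
    return [
        {"year": rng.randint(1960, 2010), "duration": rng.uniform(120.0, 360.0)}
        for _ in range(n)
    ]


def summarize_by_decade(songs):
    """Summarizing: aggregate every row into one count per decade."""
    counts = Counter((song["year"] // 10) * 10 for song in songs)
    return dict(sorted(counts.items()))


def sample_rows(songs, fraction, seed=0):
    """Sampling: keep each raw row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [song for song in songs if rng.random() < fraction]


songs = make_songs(10_000)
decade_counts = summarize_by_decade(songs)   # small table, ready for a bar chart
sample = sample_rows(songs, fraction=0.01)   # small subset, ready for a scatter plot
```

In both cases only the small result needs to be brought to the local session for plotting; the full dataset is touched once, where it lives.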

Part II: Advanced Analytics

In the second part of the tutorial, we introduced the machine learning algorithms that are available in SparkR. These include the SparkML algorithms that are exposed to R users through a natural R interface. For example, SparkR users can take advantage of a distributed GLM implementation in the same way they would use the existing glmnet package. We also introduced two powerful new APIs that have been added to SparkR in Apache Spark 2.0:

  • dapply, used for applying an R function to every partition of a Spark DataFrame in parallel
  • spark.lapply, used for parallelizing R functions across multiple machines/workers
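
The two patterns can be sketched in plain Python as an analogy. This is not SparkR code: threads on a single machine stand in for Spark workers, and every name below (`split_into_partitions`, `apply_per_partition`, `parallel_lapply`, `add_minutes`) is a hypothetical helper made up for illustration. The first helper mirrors the idea behind dapply (run one function on each partition of a table), and the second mirrors the idea behind spark.lapply (run one function on each element of a list).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Split a list of rows into `num_partitions` round-robin chunks."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def apply_per_partition(partitions, func):
    """dapply-like: run `func` on each partition in parallel, then combine."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(func, partitions))
    return [row for part in results for row in part]


def parallel_lapply(items, func):
    """spark.lapply-like: run `func` on each list element in parallel."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(func, items))


def add_minutes(partition):
    """Example per-partition function: derive a new column for each row."""
    return [{**row, "minutes": row["seconds"] / 60} for row in partition]


rows = [{"seconds": s} for s in (120, 180, 240, 300)]
partitions = split_into_partitions(rows, 2)
converted = apply_per_partition(partitions, add_minutes)
squares = parallel_lapply([1, 2, 3, 4], lambda x: x * x)
```

The per-partition function receives a whole chunk of rows at once, so it can use vectorized or batch logic, whereas the list-based helper is a better fit for independent tasks such as training one model per parameter setting.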

The second notebook again used the Million Songs dataset to do K-Means clustering and also built a predictive model using GLM. Like the first part, it ends with a few exercises for further practice.

Survey Results

A chart showing the distribution of the number of survey participants by job title.

Here is a short summary of the survey responses. More than half of the attendees were data scientists, and about 20% were students. When asked about their use cases for R, everyone listed “data cleaning and wrangling.” The majority (~80%) also included “data exploration” and “predictive analytics” among their uses for R. A large majority of participants indicated that they load their data into R from the local filesystem. Loading from an RDBMS was second in popularity, at 60%.

The majority of participants were dplyr users, and about 60% indicated that they prefer the hadleyverse for data cleaning and wrangling. When asked how they communicate their findings, the most popular method was publishing R plots in slides and documents, followed closely by sharing R Markdown files.

Survey results outlining how familiar users were with SparkR.

More than half of the attendees had never used SparkR or MLlib, and about 25% were actively considering both. We hope this tutorial was helpful to the attendees.

What’s Next?

If you want to try these notebooks, do the following:

  1. Sign up for Databricks Community Edition
  2. Import SparkR tutorials part-1 and part-2 into Databricks Community Edition