Jonghyun Bae

M.S./Ph.D. student, Sungkyunkwan University

Jonghyun Bae is an M.S./Ph.D. student in the Department of Electrical and Computer Engineering at Sungkyunkwan University (SKKU), Korea. He is an author of RIGHT (R Interactive Graphics via HTml), an R package for interactive data visualization based on HTML canvas and JavaScript, which was sponsored by Google Summer of Code (GSoC) in 2014. He received his B.S. degree in Semiconductor Systems Engineering from SKKU in 2015. His research interests include parallel processing and cloud computing.

SESSIONS

ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization

ggplot2 is one of the most popular data visualization packages for R, and it makes it easy to produce high-quality graphs from data stored in an R data.frame. However, ggplot2 is not suitable for big data visualization, for three reasons. First, the maximum data size it can handle is limited by physical memory, because the R Virtual Machine (RVM) attempts to keep the entire data.frame in memory. Second, even if the data set fits in memory, importing it from a file often takes a long time, partly due to the overhead of format conversion. Finally, ggplot2 does not effectively utilize the abundant computing resources offered by today's parallel/distributed machines, as the package itself is not parallelized.

In this presentation, we introduce ggplot2.SparkR, an R package for scalable visualization of big data represented as a Spark DataFrame. ggplot2.SparkR is an extension to the original ggplot2 package and can seamlessly handle both R data.frame and Spark DataFrame inputs with no modifications to the original API.

When invoked, a plot function in ggplot2.SparkR first checks the type of the input data. If the input is a Spark DataFrame, the heavyweight data processing stages are offloaded to the Spark backend using the SparkR API, and the final results are collected and coerced into an R data.frame. Otherwise, the input data goes through the original data processing path of ggplot2 on the RVM. In both cases, a common backend stage for plotting then draws the graph, which preserves the same look and feel. ggplot2.SparkR requires no additional training for existing R users who are already familiar with ggplot2, and it allows them to benefit from the distributed processing capabilities of Spark for efficient visualization of big data. To demonstrate this, we plan to show a demo with a detailed comparison between ggplot2 and ggplot2.SparkR graphics.
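The type-based dispatch described above can be illustrated with a minimal sketch. This is not the ggplot2.SparkR implementation and does not use R, ggplot2, or the SparkR API; it is a plain Python stand-in, and every name in it (`LocalFrame`, `DistributedFrame`, `count_by_value`, `compute_bar_stats`, `render_bars`, `bar_plot`) is hypothetical. It only shows the shape of the idea: choose a processing path from the input type, reduce the data to a small summary, and hand that summary to one shared drawing stage.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LocalFrame:
    """Hypothetical stand-in for an in-memory R data.frame (one column)."""
    values: List[str]


@dataclass
class DistributedFrame:
    """Hypothetical stand-in for a Spark DataFrame, split into partitions."""
    partitions: List[List[str]]

    def count_by_value(self) -> Dict[str, int]:
        # Stand-in for work offloaded to a cluster: each partition is
        # counted separately, then the small partial results are merged.
        total: Counter = Counter()
        for part in self.partitions:
            total.update(Counter(part))
        return dict(total)


def compute_bar_stats(data) -> Dict[str, int]:
    """Stage 1: pick the processing path based on the input type."""
    if isinstance(data, DistributedFrame):
        # Heavy aggregation happens on the backend; only the summary
        # is brought back to the caller.
        return data.count_by_value()
    if isinstance(data, LocalFrame):
        # Original in-memory path.
        return dict(Counter(data.values))
    raise TypeError(f"unsupported input type: {type(data).__name__}")


def render_bars(stats: Dict[str, int]) -> str:
    """Stage 2: shared drawing stage, identical for both paths."""
    return "\n".join(f"{key} {'#' * n}" for key, n in sorted(stats.items()))


def bar_plot(data) -> str:
    """Single entry point: callers use the same call for either input."""
    return render_bars(compute_bar_stats(data))
```

Because both paths reduce to the same summary before drawing, calling `bar_plot` on a `LocalFrame` and on a `DistributedFrame` holding the same values yields the same text chart, which mirrors the goal of an identical look and feel for both input types.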