Sangoh Jeong

Senior Manager, SK Telecom

Sangoh Jeong works for SK Telecom (SKT) in Korea, where he’s in charge of a SparkR project and is involved in other projects related to Operational Intelligence for Cloud systems. He received his Ph.D. in Electrical Engineering from Stanford University in 2006. Prior to joining SKT, he worked for Samsung Information Systems America, HP Labs, LookSmart, and Ricoh Innovations in California, and for LG Electronics and Samsung Electronics in Korea. His research interests include machine learning, Big Data analytics, IoT, and computer vision. He’s also interested in Spark MLlib.


ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization

ggplot2 is one of the most popular data visualization packages for R; it makes it easy to produce high-quality graphs from data held in an R data.frame. However, ggplot2 is not suitable for big data visualization, for three reasons. First, the maximum data size it can handle is limited by physical memory, since the R Virtual Machine (RVM) attempts to keep the entire data.frame in memory. Second, even if the data set fits in memory, importing it from a file often takes a long time, partly due to the overhead of format conversion. Finally, ggplot2 does not make effective use of the abundant computing resources offered by today's parallel and distributed machines, as the package itself is not parallelized. In this presentation, we introduce ggplot2.SparkR, an R package for scalable visualization of big data held in a Spark DataFrame. ggplot2.SparkR is an extension to the original ggplot2 package and seamlessly handles both R data.frame and Spark DataFrame inputs with no modifications to the original API. When invoked, a plot function in ggplot2.SparkR first checks the type of its input data. If the input is a Spark DataFrame, the heavyweight data processing stages are offloaded to the Spark backend through the SparkR API, and the final results are collected and coerced into an R data.frame. Otherwise, the input data goes through ggplot2's original data processing path on the RVM. In both cases, a common backend plotting stage then draws the graph, preserving the same look and feel. ggplot2.SparkR requires no additional training for existing R users who are already familiar with ggplot2, and it lets them benefit from the distributed processing capabilities of Spark for efficient visualization of big data. To demonstrate this, we plan to show a demo with a detailed comparison between ggplot2 and ggplot2.SparkR graphics.
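
The dispatch flow described in the abstract (check the input type, offload the aggregation for distributed data, then hand a small summary to a shared plotting stage) can be illustrated with a minimal sketch. It is written in Python with stand-in types and is not the ggplot2.SparkR implementation: `LocalFrame`, `DistributedFrame`, `bar_plot`, and the helper functions are all hypothetical names invented for this illustration, and the "distributed" path simply loops over in-memory partitions instead of calling Spark.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LocalFrame:
    """Hypothetical stand-in for an in-memory R data.frame (one column)."""
    rows: List[str]


@dataclass
class DistributedFrame:
    """Hypothetical stand-in for a Spark DataFrame, split into partitions."""
    partitions: List[List[str]]


def count_local(frame: LocalFrame) -> Dict[str, int]:
    # Original path: the whole data set is aggregated in local memory.
    return dict(Counter(frame.rows))


def count_distributed(frame: DistributedFrame) -> Dict[str, int]:
    # Offloaded path: each partition is aggregated independently (the part
    # a cluster would run in parallel); only the small per-partition
    # summaries are merged and returned to the caller.
    total: Counter = Counter()
    for partial in map(Counter, frame.partitions):
        total.update(partial)
    return dict(total)


def render_bars(counts: Dict[str, int]) -> List[str]:
    # Common plotting stage: identical for both paths, so the output looks
    # the same regardless of where the aggregation ran.
    return [f"{key} {'#' * n}" for key, n in sorted(counts.items())]


def bar_plot(data) -> List[str]:
    # Check the input type first, then pick the processing path.
    if isinstance(data, DistributedFrame):
        counts = count_distributed(data)
    elif isinstance(data, LocalFrame):
        counts = count_local(data)
    else:
        raise TypeError(f"unsupported input type: {type(data).__name__}")
    return render_bars(counts)


if __name__ == "__main__":
    local = LocalFrame(["a", "b", "a"])
    dist = DistributedFrame([["a", "b"], ["a"]])
    print(bar_plot(local))
    print(bar_plot(dist))
```

In this sketch both calls produce the same bars, because only the aggregation step differs between the two paths; the caller uses one function name for either input type, which mirrors the "no modifications to the original API" goal stated above.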