During the last year we have been using Apache Spark to perform analytics on large volumes of sensor data. These applications need to be executed on a daily basis, therefore, it was essential for us to understand Spark resource utilization. We found it cumbersome to manually consume and efficiently inspect the CSV files for the metrics generated at the Spark worker nodes. Although using an external monitoring system like Ganglia would automate this process, we were still plagued with the inability to derive temporal associations between system-level metrics (e.g. CPU utilization) and job-level metrics (e.g. job or stage ID) as reported by Spark. For instance, we were not able to trace back the root cause of a peak in HDFS Reads or CPU usage to the code in our Spark application causing the bottleneck. To overcome these limitations we developed SparkOscope. Taking advantage of the job-level information available through the existing Spark Web UI and to minimize source-code pollution, we use the existing Spark Web UI to monitor and visualize job-level metrics of a Spark application (e.g. completion time). More importantly, we extend the Web UI with a palette of system-level metrics of the server/VM/container that each of the Spark job’s executor ran on. Using SparkOScope, the user can navigate to any completed application and identify application-logic bottlenecks by inspecting the various plots providing in-depth timeseries for all relevant system-level metrics related to the Spark executors, while also easily associating them with stages, jobs and even source code lines incurring the bottleneck. Github: https://github.com/ibm-research-ireland/sparkoscope Demo: https://ibm.box.com/s/vyaedlyb444a4zna1215c7puhxliqxdg
Yiannis Gkoufas works as a Research Software Engineer in IBM Research and Development in Dublin since December 2012. He received his Bachelor's and Master's degrees in Athens University of Economics and Business. He has been working mainly with Java-based technologies on the backend. In the past few years he has been exploring Hadoop-related frameworks for large batch data processing.