Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up with the Joneses (and save the planet)


Analyzing and comparing your energy consumption with that of other consumers provides healthy peer pressure and useful insight, leading to energy conservation and a better bottom line. We helped GridPocket (http://www.gridpocket.com/), a smart grid company developing energy management applications for electricity, water, and gas utilities, implement high-scale anonymized energy comparison queries at an order of magnitude lower cost and higher performance than was previously possible.

IoT use cases like GridPocket's are swamping our planet with data and drive demand for analytics over extremely scalable, low-cost storage. Enter Spark SQL over Object Storage: highly scalable, low-cost storage that provides RESTful APIs to store and retrieve objects and their metadata. The key performance indicators (KPIs) for query performance and cost are the number of bytes shipped from Object Storage to Spark and the number of REST requests incurred.

We propose Pluggable Spark SQL Filters, which extend the existing Spark SQL partitioning mechanism with the ability to dynamically filter out irrelevant objects during query execution. Our approach handles any data format supported by Spark SQL (Parquet, JSON, CSV, etc.), and unlike pushdown-compatible formats such as Parquet, which require touching each object to determine its relevance, it avoids accessing irrelevant objects altogether. We developed a pluggable interface for developing and deploying Filters, and implemented GridPocket's filter, which screens objects according to their metadata, for example geo-spatial bounding boxes describing the area covered by an object's data points. This drastically lowers both KPIs, since there is no need to ship the entire dataset from Object Storage to Spark if you are only comparing yourself with your neighborhood. We demonstrate GridPocket analytics notebooks, report on our implementation and the resulting 10-20x speedups, explain how to implement a Pluggable File Filter, and show how we applied this approach to other use cases.
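To make the idea concrete, here is a minimal sketch of what a metadata-based object filter could look like. The names below (ObjectFilter, GeoBoundingBoxFilter, the min-lat/max-lat/min-lon/max-lon metadata keys) are hypothetical illustrations, not the actual interface presented in the talk: a filter inspects an object's user metadata, such as a bounding box attached when the object was written, and decides whether the object can contain rows relevant to the query, so irrelevant objects are never fetched from Object Storage.

// Illustrative sketch only; names and metadata keys are assumptions,
// not the actual Pluggable Spark SQL Filters API.

/** Metadata attached to an object in Object Storage, e.g. user-defined
  * key/value pairs returned by a HEAD request. */
case class ObjectMetadata(name: String, userMetadata: Map[String, String])

/** A pluggable filter decides, from metadata alone, whether an object can
  * contain rows relevant to the current query. */
trait ObjectFilter extends Serializable {
  def isRelevant(meta: ObjectMetadata): Boolean
}

/** Simple lat/lon bounding box with an intersection test. */
case class BoundingBox(minLat: Double, maxLat: Double,
                       minLon: Double, maxLon: Double) {
  def intersects(other: BoundingBox): Boolean =
    minLat <= other.maxLat && maxLat >= other.minLat &&
    minLon <= other.maxLon && maxLon >= other.minLon
}

/** Geo-spatial filter: keep only objects whose bounding box (stored as
  * metadata when the object was written) intersects the query's area. */
class GeoBoundingBoxFilter(queryArea: BoundingBox) extends ObjectFilter {
  override def isRelevant(meta: ObjectMetadata): Boolean = {
    val m = meta.userMetadata
    val box = for {
      minLat <- m.get("min-lat").map(_.toDouble)
      maxLat <- m.get("max-lat").map(_.toDouble)
      minLon <- m.get("min-lon").map(_.toDouble)
      maxLon <- m.get("max-lon").map(_.toDouble)
    } yield BoundingBox(minLat, maxLat, minLon, maxLon)
    // Objects without bounding-box metadata are kept, to preserve correctness.
    box.forall(_.intersects(queryArea))
  }
}

object FilterDemo {
  def main(args: Array[String]): Unit = {
    // Query area roughly covering a Paris neighborhood (illustrative values).
    val filter = new GeoBoundingBoxFilter(BoundingBox(48.0, 48.2, 2.2, 2.5))
    val objects = Seq(
      ObjectMetadata("meters/paris-2017-10.csv",
        Map("min-lat" -> "48.1", "max-lat" -> "48.9",
            "min-lon" -> "2.2",  "max-lon" -> "2.5")),
      ObjectMetadata("meters/nice-2017-10.csv",
        Map("min-lat" -> "43.6", "max-lat" -> "43.8",
            "min-lon" -> "7.1",  "max-lon" -> "7.3")))
    // Only objects that pass the filter would be listed for Spark SQL,
    // so irrelevant objects are never shipped from Object Storage.
    objects.filter(filter.isRelevant).foreach(o => println(o.name))
  }
}

In this sketch only the Paris object survives the filter, so a query comparing a user with their neighborhood would ship a single object rather than the whole dataset, which is what drives the reported drop in bytes transferred and REST requests.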
Session hashtag: #EUres2

About Paula Ta-Shma

Dr. Paula Ta-Shma is a Research Scientist in the IBM Cloud and Data Technologies group. She is currently working on cloud storage infrastructure for the Internet of Things and leads several related research efforts. She led IBM efforts in the EU-funded COSMOS project as well as various other research projects such as Continuous Data Protection. Her work has been presented at multiple industry conferences, including the Apache Spark Summit, the OpenStack Summit, and IBM InterConnect, as well as academic conferences such as FAST and SYSTOR. She holds M.Sc. and Ph.D. degrees in computer science from the Hebrew University.

About Guy Gerson

Guy Gerson is a Research Staff Member in the IBM Cloud and Data Technologies group. He has been working on the adoption of cloud storage systems as part of large-scale Internet of Things analytics architectures based on Spark. His work includes the design of smart end-to-end data pipelines aimed at minimizing cluster resource consumption and overall processing time. Guy holds a B.Sc. degree in computer science from the Technion – Israel Institute of Technology.