Paula Ta-Shma

Research Staff Member, IBM

Dr. Paula Ta-Shma is a Research Staff Member in the Cloud & Data Technologies group at IBM Research – Haifa and is responsible for a group of research efforts in the area of Hybrid Data, with a particular focus on high performance, secure and cost efficient data stores and processing engines. She is particularly interested in performant SQL analytics over Object Storage and leads work on Data Skipping whose work is now integrated into multiple IBM products and services. Previously she led projects in areas such as cloud storage infrastructure for IoT and Continuous Data Protection. Prior to working at IBM Dr. Ta-Shma worked at several companies on Database Management Systems including Informix Software Inc. where she worked on Apache Derby. She holds M.Sc. and PhD degrees in computer science from the Hebrew University of Jerusalem.

Past sessions

The Weather Company (TWC) collects weather data across the globe at the rate of 34 million records per hour, and the TWC History on Demand application serves that historical weather data to users via an API, averaging 600,000 requests per day. Users are increasingly consuming large quantities of historical data to train analytics models, and require efficient asynchronous APIs in addition to existing synchronous ones which use ElasticSearch. We present our architecture for asynchronous data retrieval and explain how we use Spark together with leading edge technologies to achieve an order of magnitude cost reduction while at the same time boosting performance by several orders of magnitude and tripling weather data coverage from land only to global.

We use IBM Cloud SQL Query, a serverless SQL service based on Spark, which supports a library of built-in geospatial analytics functions, as well as (geospatial) data skipping, which uses metadata to skip over data irrelevant to queries using those functions. We cover best practices we applied including adoption of the Parquet format, use of a multi-tenant Hive Metastore in SQL Query, continuous ingestion pipelines and periodic geospatial data layout and indexing, and explain the relative importance of each one. We also explain how we implemented geospatial data skipping for Spark, cover additional performance optimizations that were important in practice, and analyze the performance acceleration and cost reductions we achieved.

Speakers: Paula Ta-Shma and Erik Goepfert

COSMOS is a platform for developing IoT applications focusing on smart city use cases, ranging from intelligent transportation systems to smart energy management. A central challenge is to analyze large historical datasets from heterogeneous IoT devices and provide near real-time solutions. COSMOS meets this challenge using a generic integration of multiple Spark libraries and other open source components. Spark MLlib algorithms such as clustering and regression are utilized to gain insight from the data and provide intelligent, proactive solutions. The ML analysis accesses historical data via Spark SQL - this data is continuously collected, annotated with metadata and stored in an OpenStack Swift archive in Parquet format. Data access is optimized using the Spark SQL Data Frame APIs which support projection and selection pushdown, and we allow more selective pushdown than partitioning based approaches by implementing metadata search for Swift and using it for selection pushdown. The generic nature and wide applicability of our component integration patterns is demonstrated by two IoT use-case scenarios. The first involves the Madrid transportation system, where traffic data from over 3000 fixed monitoring locations is available. We analyze the historical data using k-means clustering and provide parameters for a CEP engine to infer complex events such as congestion or bad traffic in near real-time. We also provide a proactive approach for intelligent traffic management by predicting traffic parameters using regression mechanisms. The same integration approach is demonstrated on a second use-case scenario for smart energy management which infers office occupancy state from electricity consumption.

Analyzing and comparing your energy consumption with that of other consumers provides healthy peer pressure and useful insight leading to energy conservation and impacting the bottom line. We helped GridPocket (, a smart grid company developing energy management applications for electricity water and gas utilities, implement high scale anonymized energy comparison queries with an order of magnitude lower cost and higher performance than was previously possible. IoT use cases like that of GridPocket are swamping our planet with data, and drive demand for analytics on extremely scalable and low cost storage. Enter Spark SQL over Object Storage: highly scalable and low cost storage which provides RESTful APIs to store and retrieve objects and their metadata. Key performance indicators (KPIs) of query performance and cost are the number of bytes shipped from Object Storage to Spark and the number of incurred REST requests. We propose Pluggable Spark SQL Filters, which extend the existing Spark SQL partitioning mechanism with an ability to dynamically filter irrelevant objects during query execution. Our approach handles any data format supported by Spark SQL (Parquet, JSON, csv etc.), and unlike pushdown compatible formats such as Parquet which require touching each object to determine its relevance, it avoids accessing irrelevant objects altogether. We developed a pluggable interface for developing and deploying Filters, and implemented GridPocket's filter which screens objects according to their metadata, for example geo-spatial bounding boxes which describe the area covered by an object's data points. This leads to drastically lower KPIs since there is no need to ship the entire dataset from Object Storage to Spark if you are only comparing yourself with your neighborhood. We demonstrate GridPocket analytics notebooks, report on our implementation and resulting 10-20x speedups, explain how to implement a Pluggable File Filter, and how we applied this to other use cases.
Session hashtag: #EUres2