The Weather Company (TWC) collects weather data across the globe at the rate of 34 million records per hour, and the TWC History on Demand application serves that historical weather data to users via an API, averaging 600,000 requests per day. Users are increasingly consuming large quantities of historical data to train analytics models, and require efficient asynchronous APIs in addition to existing synchronous ones which use ElasticSearch. We present our architecture for asynchronous data retrieval and explain how we use Spark together with leading edge technologies to achieve an order of magnitude cost reduction while at the same time boosting performance by several orders of magnitude and tripling weather data coverage from land only to global.
We use IBM Cloud SQL Query, a serverless SQL service based on Spark, which supports a library of built-in geospatial analytics functions, as well as (geospatial) data skipping, which uses metadata to skip over data irrelevant to queries using those functions. We cover best practices we applied including adoption of the Parquet format, use of a multi-tenant Hive Metastore in SQL Query, continuous ingestion pipelines and periodic geospatial data layout and indexing, and explain the relative importance of each one. We also explain how we implemented geospatial data skipping for Spark, cover additional performance optimizations that were important in practice, and analyze the performance acceleration and cost reductions we achieved.
Speakers: Paula Ta-Shma and Erik Goepfert
Dr. Paula Ta-Shma is a Research Staff Member in the Cloud & Data Technologies group at IBM Research - Haifa and is responsible for a group of research efforts in the area of Hybrid Data, with a particular focus on high performance, secure and cost efficient data stores and processing engines. She is particularly interested in performant SQL analytics over Object Storage and leads work on Data Skipping whose work is now integrated into multiple IBM products and services. Previously she led projects in areas such as cloud storage infrastructure for IoT and Continuous Data Protection. Prior to working at IBM Dr. Ta-Shma worked at several companies on Database Management Systems including Informix Software Inc. where she worked on Apache Derby. She holds M.Sc. and PhD degrees in computer science from the Hebrew University of Jerusalem.
The Weather Company, an IBM Business
Erik Goepfert is a Senior Software Engineer for IBM, focusing primarily on historical weather solutions as part of The Weather Company, an IBM Business. At The Weather Company, he works on the History on Demand application which ingests large amounts of weather data and serves it to users via an API. The data is then used by clients primarily for machine learning and data analytics. Previously he worked in the transportation industry, writing software to integrate mobile LiDAR, 3D pavement technology, imaging, and geospatial data collection equipment and data processing software for Mandli Communications. Erik received his Bachelor’s degree in Computer Science from University of Wisconsin, Milwaukee.