Erik Goepfert is a Senior Software Engineer at IBM, focusing primarily on historical weather solutions as part of The Weather Company, an IBM Business. At The Weather Company, he works on the History on Demand application, which ingests large amounts of weather data and serves it to users via an API. Clients use the data primarily for machine learning and data analytics. Previously he worked in the transportation industry at Mandli Communications, writing software to integrate mobile LiDAR, 3D pavement technology, imaging, and geospatial data collection equipment with data processing software. Erik received his Bachelor’s degree in Computer Science from the University of Wisconsin, Milwaukee.
November 17, 2020 04:00 PM PT
The Weather Company (TWC) collects weather data across the globe at a rate of 34 million records per hour, and the TWC History on Demand application serves that historical weather data to users via an API, averaging 600,000 requests per day. Users are increasingly consuming large quantities of historical data to train analytics models, and they require efficient asynchronous APIs in addition to the existing synchronous ones, which use Elasticsearch. We present our architecture for asynchronous data retrieval and explain how we use Spark together with leading-edge technologies to cut costs by an order of magnitude while boosting performance by several orders of magnitude and tripling weather data coverage from land only to global.
We use IBM Cloud SQL Query, a serverless SQL service based on Spark. It supports a library of built-in geospatial analytics functions as well as geospatial data skipping, which uses metadata to skip over data that is irrelevant to queries using those functions. We cover the best practices we applied, including adoption of the Parquet format, use of a multi-tenant Hive Metastore in SQL Query, continuous ingestion pipelines, and periodic geospatial data layout and indexing, and we explain the relative importance of each one. We also explain how we implemented geospatial data skipping for Spark, cover additional performance optimizations that were important in practice, and analyze the performance acceleration and cost reductions we achieved.
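To make the idea of geospatial data skipping concrete, here is a minimal Python sketch of one common form of it: keeping a bounding box per stored object as metadata and reading only the objects whose box overlaps the query region. This is an illustration under assumptions, not the implementation used in IBM Cloud SQL Query. The names `BoundingBox`, `ObjectMetadata`, and `objects_to_scan`, the object paths, and the coordinates are all hypothetical, and the sketch ignores regions that cross the antimeridian.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BoundingBox:
    """Latitude/longitude extent in degrees (hypothetical helper type)."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def intersects(self, other: "BoundingBox") -> bool:
        # Two boxes overlap only if their ranges overlap on both axes.
        return (
            self.min_lat <= other.max_lat
            and other.min_lat <= self.max_lat
            and self.min_lon <= other.max_lon
            and other.min_lon <= self.max_lon
        )


@dataclass(frozen=True)
class ObjectMetadata:
    """Per-object index entry: where the object is and what area it covers."""
    path: str
    bounds: BoundingBox


def objects_to_scan(index: List[ObjectMetadata], query: BoundingBox) -> List[str]:
    """Return only the objects that could contain rows inside the query box."""
    return [entry.path for entry in index if entry.bounds.intersects(query)]


# Hypothetical metadata index for three Parquet objects.
index = [
    ObjectMetadata("weather/part-0.parquet", BoundingBox(40.0, 45.0, -90.0, -85.0)),
    ObjectMetadata("weather/part-1.parquet", BoundingBox(30.0, 35.0, -100.0, -95.0)),
    ObjectMetadata("weather/part-2.parquet", BoundingBox(42.0, 48.0, -89.0, -80.0)),
]

# Hypothetical query region; part-1 lies entirely outside it and is skipped.
query = BoundingBox(42.5, 43.5, -88.5, -87.5)
selected = objects_to_scan(index, query)
```

The sketch also suggests why data layout matters alongside the index: the more tightly each object's rows are clustered geographically, the smaller its bounding box and the more objects a query can skip.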
Speakers: Paula Ta-Shma and Erik Goepfert