Rashmina Menon is a Senior Data Engineer at GumGum, a computer vision company. She is passionate about building distributed, scalable systems and end-to-end data pipelines that surface meaningful data through machine learning and reporting applications.
GumGum receives around 30 billion programmatic inventory impressions each day, amounting to 25 TB of data. An inventory impression is the real estate available for showing ads on a publisher page. By generating near-real-time inventory forecasts based on campaign-specific targeting rules, GumGum enables its account managers to set up successful future campaigns. This talk highlights the data pipelines and architecture that help the company achieve a forecast response time of less than 30 seconds at this scale. Spark jobs efficiently sample the inventory impressions using AMIND sampling and write the sampled data to Delta Lake, and we will discuss best practices and techniques for using Delta Lake efficiently. GumGum caches the data on the cluster using Databricks Delta caching, which accelerates reads and keeps I/O time to a minimum, and the talk details the advantages of Delta caching over conventional Spark caching. We will also describe how GumGum provides time series forecasting with zero downtime for end users, using auto ARIMA and sinusoids to capture the trends in the inventory data. In short, the talk covers AMIND sampling, storing the sampled data in Delta Lake, Databricks Delta caching for efficient reads and cluster use, and time series forecasting.
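
As a rough illustration of why sampling makes fast forecast queries possible, the sketch below estimates how many impressions match a targeting rule from a small sample. It deliberately uses plain uniform sampling, not AMIND sampling, and is not GumGum's implementation; the impression fields, the rule, and the 1% rate are all hypothetical values chosen for the example.

```python
import random

def sample_impressions(impressions, rate, seed=0):
    """Keep each impression independently with probability `rate` (uniform sampling)."""
    rng = random.Random(seed)
    return [imp for imp in impressions if rng.random() < rate]

def estimate_matching(sample, rate, rule):
    """Count sampled impressions that satisfy `rule`, then scale up by 1 / rate."""
    matches = sum(1 for imp in sample if rule(imp))
    return matches / rate

# Synthetic impressions with hypothetical fields, for illustration only.
_rng = random.Random(42)
impressions = [
    {
        "country": _rng.choice(["US", "CA", "GB"]),
        "device": _rng.choice(["mobile", "desktop"]),
    }
    for _ in range(200_000)
]

def rule(imp):
    """Hypothetical campaign targeting rule: US traffic on mobile devices."""
    return imp["country"] == "US" and imp["device"] == "mobile"

rate = 0.01  # assumed sampling rate
sample = sample_impressions(impressions, rate)

true_count = sum(1 for imp in impressions if rule(imp))
estimate = estimate_matching(sample, rate, rule)

print(f"sample size: {len(sample)}")
print(f"true matches: {true_count}, estimated matches: {estimate:.0f}")
```

The query scans only the sample, so its cost scales with the sample size rather than with the full day's data, at the price of some estimation error.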