Pre-aggregation is a powerful analytics technique as long as the measures being computed are reaggregable. Counts reaggregate with SUM, minimums with MIN, maximums with MAX, etc. The odd one out is distinct counts, which are not reaggregable. Traditionally, the non-reaggregability of distinct counts leads to an implicit restriction: whichever system computes distinct counts has to have access to the most granular data and touch every row at query time. Because of this, in typical analytics architectures, where fast query response times are required, raw data has to be duplicated between Spark and another system such as an RDBMS. This talk is for everyone who computes or consumes distinct counts and for everyone who doesn’t understand the magical power of HyperLogLog (HLL) sketches. We will break through the limits of traditional analytics architectures using the advanced HLL functionality and cross-system interoperability of the spark-alchemy open-source library, whose capabilities go beyond what is possible with OSS Spark, Redshift or even BigQuery. We will uncover patterns for 1000x gains in analytic query performance without data duplication and with significantly less capacity. We will explore real-world use cases from Swoop’s petabyte-scale systems, improve data privacy when running analytics over sensitive data, and even see how a real-time analytics frontend running in a browser can be provisioned with data directly from Spark.
Sim Simeonov is the founding CTO of Swoop, a startup that brings the power of search advertising to content. Previously, Sim was the founding CTO at Ghostery, the platform for safe & fast digital experiences, and Thing Labs, a social media startup acquired by AOL. Earlier, Sim was vice president of emerging technologies and chief architect at Macromedia (now Adobe) and chief architect at Allaire, one of the first Internet platform companies. He blogs at blog.simeonov.com, tweets as @simeons and lives in the Greater Boston area with his wife, son and an adopted dog named Tye.