Pre-aggregation is a powerful analytics technique as long as the measures being computed are aggregable. Counts reaggregate with SUM, minimums with MIN, maximums with MAX, and so on. The odd one out is distinct counts, which are not aggregable: the same value can appear in more than one pre-aggregated group, so summing per-group distinct counts overcounts. Traditionally, the non-reaggregability of distinct counts leads to an implicit restriction: whichever system computes distinct counts has to have access to the most granular data and touch every row at query time. Because of this, typical analytics architectures that require fast query response times have to duplicate raw data between Spark and another system, such as an RDBMS.
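To make the reaggregation problem concrete, here is a minimal sketch in plain Python. The rows, the column meanings and the `pre_aggregate` helper are hypothetical and only illustrate the point; they do not come from Spark or any other system mentioned in this abstract.

```python
# Hypothetical event rows: (day, user_id, amount).
rows = [
    ("mon", "u1", 5), ("mon", "u2", 3), ("mon", "u1", 7),
    ("tue", "u1", 2), ("tue", "u3", 9),
]

def pre_aggregate(rows):
    """Roll the raw rows up to one summary record per day."""
    groups = {}
    for day, user, amount in rows:
        g = groups.setdefault(
            day, {"count": 0, "min": amount, "max": amount, "users": set()}
        )
        g["count"] += 1
        g["min"] = min(g["min"], amount)
        g["max"] = max(g["max"], amount)
        g["users"].add(user)
    # Keep only the distinct count, as a classic pre-aggregate would,
    # and drop the underlying set of users.
    return {
        day: {
            "count": g["count"],
            "min": g["min"],
            "max": g["max"],
            "distinct_users": len(g["users"]),
        }
        for day, g in groups.items()
    }

daily = pre_aggregate(rows)

# Counts, minimums and maximums reaggregate correctly from the daily rollup.
total_count = sum(d["count"] for d in daily.values())  # 5
total_min = min(d["min"] for d in daily.values())      # 2
total_max = max(d["max"] for d in daily.values())      # 9

# Distinct counts do not: u1 is active on both days and is counted twice.
naive_distinct = sum(d["distinct_users"] for d in daily.values())  # 4
true_distinct = len({user for _, user, _ in rows})                 # 3
```

Getting the correct answer of 3 requires going back to the raw rows, which is exactly the restriction described above.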
This talk is for everyone who computes or consumes distinct counts and for everyone who doesn’t understand the magical power of HyperLogLog (HLL) sketches. We will break through the limits of traditional analytics architectures using the advanced HLL functionality and cross-system interoperability of the spark-alchemy open-source library, whose capabilities go beyond what is possible with OSS Spark, Redshift or even BigQuery. We will uncover patterns for 1000x gains in analytic query performance without data duplication and with significantly less capacity. We will explore real-world use cases from Swoop’s petabyte-scale systems, improve data privacy when running analytics over sensitive data, and even see how a real-time analytics frontend running in a browser can be provisioned with data directly from Spark.
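For readers new to HLL, the following is a minimal sketch of why HLL sketches are reaggregable. It is a textbook-style HyperLogLog written for illustration, not spark-alchemy's implementation or API: the `HLL` class, its method names, the SHA-1 hash and the precision `p=12` are all assumptions made for this example.

```python
import hashlib
import math

class HLL:
    """Illustrative HyperLogLog sketch (hypothetical, not spark-alchemy's API)."""

    def __init__(self, p=12):
        self.p = p                    # number of bits used as the register index
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from the value (SHA-1 is an arbitrary choice here).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Reaggregation step: the element-wise maximum of the registers
        # equals the sketch of the union of the two inputs.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)        # small-range (linear counting) correction
        return raw

# Two hypothetical daily partitions with overlapping user ids.
monday, tuesday = HLL(), HLL()
for user_id in range(0, 30000):
    monday.add(user_id)
for user_id in range(20000, 50000):
    tuesday.add(user_id)

# Merge the pre-aggregated sketches without touching the raw rows again.
merged = monday.merge(tuesday)
merged_estimate = merged.estimate()               # true distinct count is 50000

# Reference sketch built directly from all the raw values.
union_sketch = HLL()
for user_id in range(0, 50000):
    union_sketch.add(user_id)
```

Because merging is just a register-wise maximum, the merged sketch is identical to one built from the raw data, so sketches can be stored in pre-aggregated tables and combined at query time. The price is an approximate answer, with a relative error that shrinks as the number of registers grows.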
Sim Simeonov is an entrepreneur, investor and startup mentor. He is the founding CTO of Swoop and IPM.ai, startups that use privacy-preserving AI to improve patient outcomes. Previously, Sim was the founding CTO of Evidon (CrownPeak) and Thing Labs (AOL) and a founding investor in Veracode (Broadcom). Before that, Sim was a venture capitalist at General Catalyst and Polaris Partners, where he helped start five companies in which the firms invested. Before his days as an investor, Sim was vice president of emerging technologies and chief architect at Macromedia (now Adobe).