In the analysis of big data, there are often problem queries that don’t scale, either because they require huge compute resources to generate exact results or because they don’t parallelize well. Examples include count distinct, quantiles, most frequent items, joins, matrix computations, and graph analysis. Algorithms that can produce approximate answers with accuracy guarantees for these problem queries are a required toolkit for modern analysis systems that need to process massive amounts of data quickly. For interactive queries there may be no other viable alternative, and in the case of real-time streams, these specialized algorithms, called stochastic, streaming, sublinear algorithms, or ‘sketches’, are the only known solution. This technology has helped Yahoo reduce data processing times from days to hours or minutes on a number of its internal platforms, and has enabled subsecond queries on real-time platforms that would have been infeasible without sketches. This talk provides a short introduction to sketching and to DataSketches, an open source library of a core set of these algorithms designed for large production analysis and AI systems.
Lee Rhodes is a Distinguished Architect at Verizon Media (Yahoo). In 2012, Lee started the DataSketches project, which has been widely adopted into many of Yahoo's data analysis systems. In October 2015, the DataSketches project was open-sourced, and it is now being migrated to the Apache Software Foundation as a top-level project dedicated to production-quality sketch implementations. Lee's educational background includes an MS EE from Stanford and a bachelor's degree in Physics. Lee has been awarded over 15 patents and is a co-author of some key papers in the field of streaming algorithms.