Approximate Computing for Stream Analytics in Apache Spark

Approximate computing has recently emerged as a promising computing paradigm that enables a systematic trade-off between output accuracy and computation efficiency. It is based on the observation that, for many practical applications, an approximate result is acceptable in place of an exact one. The idea is to compute over a subset of the input data rather than the entire input, which makes execution more efficient. Unfortunately, state-of-the-art systems for approximate computing, such as BlinkDB and ApproxHadoop, are primarily geared towards batch analytics, where the input data remains unchanged during the course of sampling. As a result, these systems cannot be deployed for stream analytics, where new data arrives continuously as an unbounded stream. In this talk, we will present the design of StreamApprox, a Spark-based stream analytics system for approximate computing. StreamApprox implements an online stratified reservoir sampling algorithm in Spark Streaming to produce approximate output with rigorous error bounds.
Session hashtag: #EUres5
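
To make the sampling idea in the abstract concrete, here is a minimal sketch of per-stratum reservoir sampling over a stream, written in plain Python rather than Spark Streaming. It is not StreamApprox's implementation: the class name `StratifiedReservoir`, its `capacity` parameter, and the `estimate_sum` helper are hypothetical names chosen for illustration, and the sketch assumes each stream item arrives tagged with its stratum (for example, its source). It omits the error-bound computation mentioned in the abstract.

```python
import random
from collections import defaultdict


class StratifiedReservoir:
    """Illustrative per-stratum reservoir sampler (hypothetical, not StreamApprox code)."""

    def __init__(self, capacity, seed=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity          # max sampled items kept per stratum
        self.samples = defaultdict(list)  # stratum -> sampled values
        self.seen = defaultdict(int)      # stratum -> number of items observed
        self.rng = random.Random(seed)

    def add(self, stratum, value):
        """Process one stream item in O(1) time without buffering the stream."""
        self.seen[stratum] += 1
        reservoir = self.samples[stratum]
        if len(reservoir) < self.capacity:
            # The reservoir is not full yet, so keep every item.
            reservoir.append(value)
        else:
            # Keep the new item with probability capacity / seen,
            # overwriting a uniformly chosen slot.
            j = self.rng.randrange(self.seen[stratum])
            if j < self.capacity:
                reservoir[j] = value

    def estimate_sum(self):
        """Estimate the sum over all items by scaling each stratum's sample."""
        total = 0.0
        for stratum, reservoir in self.samples.items():
            # Each sampled item stands in for seen / sampled original items.
            weight = self.seen[stratum] / len(reservoir)
            total += weight * sum(reservoir)
        return total
```

Because each stratum keeps its own fixed-size reservoir, a low-volume sub-stream is not drowned out by a high-volume one, and memory use stays bounded by the number of strata times `capacity` regardless of how long the stream runs.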

About Do Quoc Le

Do is a Ph.D. student at the Systems Engineering Group of TU Dresden, co-supervised by Prof. Dr. Christof Fetzer and Prof. Dr. Pramod Bhatotia. His research interests include big data analytics, approximate computing, and distributed systems. During his Ph.D., he has been lucky to have a fruitful internship and collaboration with Bell Labs. Prior to joining TU Dresden, he received his Master's degree in computer science from Pohang University of Science and Technology (POSTECH), Korea, in 2012 under the supervision of Prof. Dr. James Won-Ki Hong. After receiving his Master's degree, he also worked at the R&D center of DASAN Networks in Seoul, Korea.