Barzan Mozafari is an Assistant Professor of Computer Science and Engineering at the University of Michigan, Ann Arbor, where he leads a research group designing the next generation of scalable databases using advanced statistical models. Prior to that, he was a Postdoctoral Associate at MIT. He earned his Ph.D. in Computer Science from UCLA in 2011. His research career has led to many successful open-source projects, including CliffGuard (the first robust framework for database tuning), DBSeer (the first automated database diagnosis tool), and BlinkDB (the first massively parallel approximate query engine). More: http://web.eecs.umich.edu/~mozafari/files/mozafari-cv.pdf
Spark 2.0 ushers in huge advances to support real time use cases. With "structured streaming", a single unified API enables querying and combines streams and static data frames. Most streaming applications are stateful computations that need to build and maintain state incrementally. For instance, one application may maintain counters or incrementally build a predictive model. Another application may derive insight by correlating coarse grained stream windows with historical data in order to detect anomalies. The model state may undergo changes and must be transactionally consistent. While Spark's new API for state management in streams is promising, it isn't designed to maintain, say, a complex model or reference data from enterprise RDBs. Many applications will continue to use external stores (SQL, NoSQL or in-memory DB). However, for many scenarios (e.g., in IoT) this approach can be challenging, due to excessive serialization/deserialization, slow scan/aggregation performance in row-oriented database, and the difficulty in enforcing exactly-once semantics. Without sufficient care, an application may easily fail to keep up with the incoming stream. In this talk, we will walk through a few common use case patterns ingesting streams via the new "structured streaming" APIs and study the different options for managing state - Spark 2.0's new streaming state API, using external in-memory/NoSQL stores, or an in-memory database that runs collocated with the spark executors (i.e., sharing the same memory space). We will discuss the pros and cons of each approach and share our experimental benchmark results. We will also compare this approach to other popular stream processing approaches - Samza, Storm and Flink. Finally, we explore the benefits of using probabilistic data structures to manage infinite streams and approximation strategies for real-time analytics over with high-velocity streams using limited resources.