Jags is CTO for Snappydata – a spark based startup. Previously, Jags was the Chief Architect for “fast data” products at Pivotal and served in the extended leadership team of the company. At Pivotal and previously at VMWare, he led the technology direction for GemFire. He helped lead the company strategy for data services, and worked closely with customers to help them be successful. Jags is recognized for his expertise in distributed systems and databases and is a frequent speaker on “distributed data”. He has a Bachelors degree in computer science and a masters degree in management.
Spark 2.0 ushers in huge advances to support real time use cases. With "structured streaming", a single unified API enables querying and combines streams and static data frames. Most streaming applications are stateful computations that need to build and maintain state incrementally. For instance, one application may maintain counters or incrementally build a predictive model. Another application may derive insight by correlating coarse grained stream windows with historical data in order to detect anomalies. The model state may undergo changes and must be transactionally consistent. While Spark's new API for state management in streams is promising, it isn't designed to maintain, say, a complex model or reference data from enterprise RDBs. Many applications will continue to use external stores (SQL, NoSQL or in-memory DB). However, for many scenarios (e.g., in IoT) this approach can be challenging, due to excessive serialization/deserialization, slow scan/aggregation performance in row-oriented database, and the difficulty in enforcing exactly-once semantics. Without sufficient care, an application may easily fail to keep up with the incoming stream. In this talk, we will walk through a few common use case patterns ingesting streams via the new "structured streaming" APIs and study the different options for managing state - Spark 2.0's new streaming state API, using external in-memory/NoSQL stores, or an in-memory database that runs collocated with the spark executors (i.e., sharing the same memory space). We will discuss the pros and cons of each approach and share our experimental benchmark results. We will also compare this approach to other popular stream processing approaches - Samza, Storm and Flink. Finally, we explore the benefits of using probabilistic data structures to manage infinite streams and approximation strategies for real-time analytics over with high-velocity streams using limited resources.