Ruhollah Farchtchi is Chief Technologist and Vice President of Zoomdata Labs at Zoomdata. He has over 15 years of experience in enterprise data management, architecture and systems integration. Prior to Zoomdata, Ruhollah held management positions at BearingPoint, Booz-Allen and Unisys. He holds an M.S. in Information Technology from George Mason University.
Much of the discussion on real-time data today focuses on the machine processing of that data. But helping humans visualize real-time streams is just as important. Visualizing real-time data introduces new UX and usability challenges for any developer embedding analytics into applications, especially when the target end users are business users rather than data scientists. Self-service, interactive analysis with subsecond responses to ad hoc queries: these are the new UX requirements for any enterprise visualizing real-time data. Streaming data also lends itself to new ways of interacting with the stream itself, such as pausing, rewinding and replaying it. This talk is a case study in how and why Zoomdata built a "Data DVR" capability using Spark and Spark Streaming. We will describe the required user experience, the overall architecture and the specific use of Spark and Spark Streaming, along with the design considerations that led us to choose Spark Streaming over alternatives like Storm. We will show how end users configure the real-time increment and a historical retention window without writing any code themselves. We will also show how pause, rewind and replay are implemented in Spark and how the solution supports both real-time and historical analysis in the same architecture. Attendees will walk away with knowledge of Spark Streaming and of how users can work interactively with streaming data. They will also become familiar with the challenges of a lambda architecture and of providing a consistent analytic experience over streaming and historical data.
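To make the pause, rewind and replay interaction concrete, here is a minimal sketch in plain Python. It does not use Spark and is not Zoomdata's implementation; it only illustrates the idea of a retention-bounded buffer with a movable playhead. The class name `DataDvr`, its methods, and the parameters `retention_seconds` and `increment_seconds` are all hypothetical names chosen for this illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, List, Optional, Tuple


@dataclass
class DataDvr:
    """Toy in-memory stand-in for a "Data DVR" (hypothetical, not Zoomdata's code)."""

    retention_seconds: float  # assumed: how much history stays available for rewind
    increment_seconds: float  # assumed: width of one real-time increment
    _buffer: Deque[Tuple[float, Any]] = field(default_factory=deque, init=False)
    # None means the view follows the live stream.
    _playhead: Optional[float] = field(default=None, init=False)

    def ingest(self, timestamp: float, event: Any) -> None:
        """Buffer an event (even while paused) and evict events past retention."""
        self._buffer.append((timestamp, event))
        cutoff = timestamp - self.retention_seconds
        while self._buffer and self._buffer[0][0] < cutoff:
            self._buffer.popleft()

    def pause(self) -> None:
        """Freeze the view at the newest buffered timestamp."""
        if self._playhead is None and self._buffer:
            self._playhead = self._buffer[-1][0]

    def rewind(self, seconds: float) -> None:
        """Move the playhead back, clamped to the oldest retained event."""
        self.pause()
        if self._playhead is not None:
            self._playhead = max(self._buffer[0][0], self._playhead - seconds)

    def go_live(self) -> None:
        """Drop the playhead so the view follows the newest data again."""
        self._playhead = None

    def next_increment(self) -> List[Any]:
        """Return one increment of events: the latest if live, else a replay step."""
        if not self._buffer:
            return []
        if self._playhead is None:
            start = self._buffer[-1][0] - self.increment_seconds
            return [event for ts, event in self._buffer if ts > start]
        start, end = self._playhead, self._playhead + self.increment_seconds
        self._playhead = end  # advance so the next call replays the next increment
        return [event for ts, event in self._buffer if start <= ts < end]


dvr = DataDvr(retention_seconds=10, increment_seconds=2)
for t in range(15):
    dvr.ingest(float(t), f"e{t}")
dvr.rewind(6)
print(dvr.next_increment())  # ['e8', 'e9']
```

In this sketch ingestion keeps running while the view is paused, so a replay can catch up to data that arrived in the meantime. Rewinding is limited by the retention window, because older events have already been evicted.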
One of the key challenges in working with real-time and streaming data is that the format used to capture data is not necessarily the optimal format for ad hoc analytic queries. For example, Avro is a convenient and popular serialization format that is great for initially bringing data into HDFS. Avro has native integration with Flume and other tools, which makes it a good choice for landing data in Hadoop. But columnar file formats, such as Parquet and ORC, are much better optimized for ad hoc queries that aggregate over a large number of similar rows.
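The difference between the two layouts can be sketched in plain Python. This is only an illustration of row-oriented versus column-oriented storage, not of the actual Avro or Parquet encodings, which also involve schemas, compression and metadata. The sample records and the helper names `to_columns`, `sum_rows` and `sum_column` are made up for the example, and the conversion assumes every record has the same fields.

```python
from typing import Any, Dict, List

# Row-oriented layout: each record is stored whole, which suits appending
# events one at a time as they arrive.
rows: List[Dict[str, Any]] = [
    {"ts": 1, "region": "east", "bytes": 120},
    {"ts": 2, "region": "west", "bytes": 80},
    {"ts": 3, "region": "east", "bytes": 200},
    {"ts": 4, "region": "west", "bytes": 50},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot records into one list per field (assumes identical fields per record)."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_rows(records: List[Dict[str, Any]], name: str) -> int:
    """Aggregate over rows: every whole record is visited to read one field."""
    return sum(record[name] for record in records)


def sum_column(columns: Dict[str, List[Any]], name: str) -> int:
    """Aggregate over a column: only the values of that one field are read."""
    return sum(columns[name])


columns = to_columns(rows)
print(sum_rows(rows, "bytes"), sum_column(columns, "bytes"))  # 450 450
```

Both functions return the same total, but the columnar version reads only the `bytes` values. On disk, that is the property that lets an aggregate query skip the fields it does not need.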