One of the biggest challenges in data science is to build a continuous data application which delivers results rapidly and reliably. Spark Streaming offers a powerful solution for real-time data processing. However, the challenge remains in how to connect them with various continuous and real-time data sources, guaranteeing the responsiveness and reliability of data applications.
In this talk, Nan and Arijit will summarize their experiences learned from serving the real-time Spark-based data analytic solutions on Azure HDInsight. Their solution seamlessly integrates Spark and Azure EventHubs which is a hyper-scale telemetry ingestion service enabling users to ingress massive amounts of telemetry into the cloud and read the data from multiple applications using publish-subscribe semantics.
They’ll will cover three topics: bridging the gap of data communication model in Spark and data source, accommodating Spark to rate control and message addressing of data source, and the co-design of fault tolerance Mechanisms. This talk will share the insights on how to build continuous data applications with Spark and boost more availabilities of connectors for Spark and different real-time data sources.
Session hashtag: #SFdev12
Nan Zhu is a Software Engineer from Microsoft, where he works on serving Spark Streaming/Structured Streaming on Azure HDInsight. He is a contributor of Apache Spark (known as CodingCat) and also serves as the committee member of Distributed Machine Learning Community (DMLC) and Apache MxNet (incubator).
Arijit Tarafdar is a Senior Software Engineer with the Big Data group in Microsoft. He is currently working on bringing Spark to different Azure Cloud services and platforms. Prior to this he was with the High Performance Computing team in Microsoft where he worked on productizing DryadLinQ. He has worked on large scale distributed systems and published original research on graph coloring algorithms.