Now a principal engineering manager leading a team working on Office365’s monitoring infrastructure and its advanced analytics pipeline based on Spark/Kafka/Cassandra. The goal is to enable intelligent monitoring and analytics for Office365’s service quality and user experience by using Spark as the primary big data computing engine and integrating Microsoft’s years of experience in ML and advanced streaming analytics. Prior to Microsoft, was a senior engineering manager at Autodesk working on its cloud platform, with a focus on building its real-time service monitoring infrastructure on Spark.
While building an intelligent monitoring and alerting system for Office365 service quality and user experience on top of Spark Streaming, we needed to use event application time for the majority of our monitoring logic, mostly aggregates and temporal joins over windows of different event types, for repeatability and cross-signal correlation. Native Spark Streaming supports only wall-clock windowing operators, which is insufficient for most of our scenarios. Therefore, the Office365 team and the Azure Stream Analytics team have been working together to create a set of temporal operators (e.g. reorder, aggregate, and temporal join, all by event application time) on top of Spark Streaming to express our complex monitoring logic at scale. The Azure Stream Analytics team has worked for years on advanced streaming programming models and implementations, while the Office365 team has a strong need to scale its monitoring and alerting infrastructure for service quality and user experience by leveraging an open source stack (Kafka/Spark/Cassandra). In this presentation, we'll present the core concepts and streaming programming model of the temporal operators, their intuitive APIs and implementation using Spark Streaming's native operators, the operator composition mechanism that reduces I/O cost and maximizes performance, and their applications and production pipeline operations against a huge volume of Office365 service signals. Key takeaways include an understanding of the temporal streaming programming model we are proposing and its primary usage scenarios, how we implement this model on top of Spark Streaming in an intuitive way, and the challenges and lessons learned in running, tuning, and optimizing the production pipeline against large volumes of streaming data in Office365 online services.
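To make the distinction between wall-clock and event application time concrete, here is a minimal sketch in plain Python (no Spark) of two of the operators named above: a reorder step followed by a tumbling aggregate keyed by event time. This is not the API from the talk. The names `Event`, `reorder`, `tumbling_sum`, and the `max_delay` parameter are hypothetical, and the policy of dropping events that arrive behind the watermark is an assumption made for illustration.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Event(NamedTuple):
    event_time: int  # application timestamp in seconds (hypothetical field)
    key: str
    value: float


def reorder(batches: Iterable[List[Event]], max_delay: int) -> Iterator[Event]:
    """Yield events in event-time order across micro-batches.

    An event is released once the watermark (newest event time seen minus
    max_delay) reaches it. Events arriving behind the watermark are dropped,
    which is an assumed policy for this sketch.
    """
    buffer: list = []  # min-heap ordered by (event_time, arrival sequence)
    watermark = float("-inf")
    max_seen = float("-inf")
    seq = 0
    for batch in batches:
        for ev in batch:
            if ev.event_time < watermark:
                continue  # too late to be placed in order
            heapq.heappush(buffer, (ev.event_time, seq, ev))
            seq += 1
            max_seen = max(max_seen, ev.event_time)
        watermark = max(watermark, max_seen - max_delay)
        while buffer and buffer[0][0] <= watermark:
            yield heapq.heappop(buffer)[2]
    # End of input: flush whatever is still buffered, in order.
    while buffer:
        yield heapq.heappop(buffer)[2]


def tumbling_sum(events: Iterable[Event], window: int) -> Dict[Tuple[int, str], float]:
    """Sum values per (window start, key), with windows assigned by event time."""
    totals: Dict[Tuple[int, str], float] = defaultdict(float)
    for ev in events:
        start = ev.event_time - ev.event_time % window
        totals[(start, ev.key)] += ev.value
    return dict(totals)


if __name__ == "__main__":
    # Three micro-batches whose arrival order differs from event-time order.
    batches = [
        [Event(12, "err", 1.0), Event(3, "err", 1.0)],
        [Event(8, "err", 1.0), Event(21, "err", 1.0)],
        [Event(5, "err", 1.0)],  # arrives after the watermark has passed it
    ]
    ordered = list(reorder(batches, max_delay=10))
    print([e.event_time for e in ordered])
    print(tumbling_sum(ordered, window=10))
```

Because windows are assigned from `event_time` rather than arrival time, replaying the same events in the same batches produces the same aggregates, which is the repeatability property the abstract asks for. A wall-clock window would instead group events by the batch in which they happened to arrive.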