IoT is rapidly emerging as a leading area for Apache Spark applications. In this talk, I describe our experience and lessons learned leveraging Apache Spark for real-time IoT applications. I start by describing our data analysis pipeline, in which real-time streams are collected from edge devices, gateways, or other clouds and then processed by Spark Streaming applications. These applications in turn generate derived streams for further processing, compute data aggregates, or trigger other real-time events. I illustrate how Spark’s ability to use similar code for both stream and batch processing can simplify a number of data management issues.
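As a rough illustration of the stream/batch code reuse described above, the following plain-Python sketch (deliberately not Spark code) applies one pure transformation both to a full historical batch and to the same data split into micro-batches. The record fields, the `flag_overheating` function, and the 80-degree limit are hypothetical names and values chosen for the example, not part of the iobeam platform.

```python
from typing import Dict, Iterable, Iterator, List


def flag_overheating(readings: Iterable[Dict], limit_c: float = 80.0) -> List[Dict]:
    """Pure per-record transformation: keep readings above the limit
    and attach a derived field. It holds no state between calls, so it
    gives the same rows whether it sees all data at once or in chunks."""
    return [
        {**r, "excess_c": r["temp_c"] - limit_c}
        for r in readings
        if r["temp_c"] > limit_c
    ]


def micro_batches(readings: List[Dict], size: int) -> Iterator[List[Dict]]:
    """Split a list into consecutive chunks to imitate micro-batches."""
    for start in range(0, len(readings), size):
        yield readings[start:start + size]


# Hypothetical sensor history (device id, event timestamp, temperature).
history = [
    {"device": "a", "ts": 1, "temp_c": 75.0},
    {"device": "b", "ts": 2, "temp_c": 91.5},
    {"device": "a", "ts": 3, "temp_c": 84.0},
    {"device": "b", "ts": 4, "temp_c": 60.0},
]

# Batch path: process the whole history in one call.
batch_result = flag_overheating(history)

# Streaming path: process the same data chunk by chunk with the same function.
stream_result = [
    row
    for chunk in micro_batches(history, size=2)
    for row in flag_overheating(chunk)
]
```

Because the transformation is stateless, both paths yield the same rows. Functions that depend on ordering or on state carried across micro-batches do not get this property for free.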
Following that, I examine some of Apache Spark’s tradeoffs and challenges: for example, implementing functions that rely on the order of streaming data (particularly across micro-batches in Spark Streaming); ensuring idempotency for actions with external effects during recovery or data backfill; and supporting cost-effective multi-tenancy. In addition, unlike many Internet and Web services, where data is generated and processed entirely within the same datacenter, IoT applications face unique problems related to their widely distributed, resource-constrained device endpoints. Sensors and devices have differing duty cycles and intermittent connectivity, so data is often delayed or misaligned, which complicates correctness in Spark Streaming applications. I discuss what we’ve done to overcome these issues. These insights arise from our experience at iobeam using Apache Spark as a key part of our hosted data analysis and stream processing platform for IoT. Rather than requiring developers to solve these data management headaches in an ad hoc fashion, the iobeam platform simplifies operations by handling these common, challenging scenarios.
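To make two of these challenges concrete, here is a minimal plain-Python sketch (not Spark code, and not a description of how iobeam handles them) of one common approach: aggregate by event time rather than arrival time, and make the write to the external store idempotent so that replays during recovery or backfill have no extra effect. The `WindowStore` class, the 60-second window, and the assumption that `(device, ts)` uniquely identifies a reading are all hypothetical choices for illustration.

```python
from collections import defaultdict

WINDOW_S = 60  # assumed event-time window length, in seconds


class WindowStore:
    """In-memory stand-in for an external sink.

    Aggregates are keyed by (device, window_start) using the event
    timestamp, so a late reading still lands in the window it belongs to.
    Each reading is applied at most once, so replays are harmless.
    """

    def __init__(self):
        self.seen = set()                 # (device, ts) pairs already applied
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def apply(self, device, ts, value):
        """Apply one reading; return False if it was a duplicate."""
        event_id = (device, ts)           # assumption: unique per reading
        if event_id in self.seen:
            return False
        self.seen.add(event_id)
        key = (device, ts - ts % WINDOW_S)  # bucket by event time
        self.sums[key] += value
        self.counts[key] += 1
        return True

    def mean(self, device, window_start):
        key = (device, window_start)
        return self.sums[key] / self.counts[key]


store = WindowStore()

# Micro-batch 1: readings arrive in order.
for device, ts, value in [("a", 5, 10.0), ("a", 65, 30.0)]:
    store.apply(device, ts, value)

# Micro-batch 2: one late reading (ts=50) and one replayed reading (ts=65).
results = [store.apply(d, t, v) for d, t, v in [("a", 50, 20.0), ("a", 65, 30.0)]]
```

The late reading updates the first window after the fact, while the replayed reading is ignored. A real deployment would also need to bound how long the set of seen events is retained, which this sketch omits.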
Mike Freedman is the Co-Founder/CEO of iobeam, a data analysis platform for IoT. iobeam makes it easy to deploy applications on data from connected devices, seamlessly handling scalability, reliability, security, and other infrastructure challenges. He's also a Professor of Computer Science at Princeton University. His research broadly focuses on distributed systems, networking, and security. He built and operated CoralCDN, and previously co-founded Illuminics Systems (acquired by Quova in 2006). His research at Stanford helped form the basis for the OpenFlow / software-defined networking (SDN) architecture. Honors include a Presidential Early Career Award (PECASE), Sloan Fellowship, NSF CAREER Award, ONR Young Investigator Award, and DARPA CSSG membership.