Today, I joined Databricks, the company behind Apache Spark, as a Spark Community Evangelist. In the past, I've worked as an individual contributor at various tech companies in numerous engineering roles, and more recently as a developer and community advocate.

With immense pride and pleasure—and at a pivotal juncture in Spark's trajectory as the most active open source Apache project—I take on this new role at Databricks, where embracing our growing Spark community is paramount, contributions from the community are valued, and keeping Spark simple and accessible for everyone is sacred.

Pursuing simplicity and ubiquity

"Spark is a developer's delight" is a common refrain heard among Spark's developer community. Since its inception the vision—the guiding North Star—to make big data processing simple at scale has not faded. In fact, each subsequent release of Apache Spark, from 1.0 to 1.6, seems to have adhered to that guiding principle—in its architecture, in its consistency and parity of APIs across programming languages, and in its unification of major library components built atop the Spark core that can handle a shared data abstraction such as RDDs, DataFrames, or Datasets.

Since Spark's early days, its creators have embraced Alan Kay's principle that "simple things should be simple, complex things possible." Not surprisingly, the Spark team articulated and reiterated that commitment to the community at Spark Summit NY 2016: the keynotes and the release roadmap attest to that vision of simplicity and accessibility, so everyone can get the "feel of Spark."

And to help you get that "feel of Spark," this notebook demonstrates the ease and simplicity with which you can use Spark on Databricks: no need to provision nodes, no need to manage clusters; it is all done for you, and all free with Databricks Community Edition.

In this notebook, I use both the DataFrame (introduced in 1.3) and Dataset (previewed in 1.6) APIs to show how you can quickly process structured data (JSON) with an inherent, inferred schema, intuitively compose relational expressions, and finally issue Spark SQL queries against a table. By using the notebook's myriad plotting options, you can visualize results for presentation and narration. Even better, you can save these plots as dashboards.
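To give a flavor of that workflow, here is a minimal Spark 1.6-style sketch in Scala. The JSON path and the column names (device_id, device_name, battery_level, c02_level, cca3) are illustrative assumptions, not the notebook's exact schema.

```scala
import org.apache.spark.sql.SQLContext

// Case class for the typed Dataset view; fields are assumed for illustration.
case class DeviceData(device_id: Long, device_name: String,
                      battery_level: Long, c02_level: Long, cca3: String)

val sqlContext = new SQLContext(sc) // already provided for you in a Databricks notebook
import sqlContext.implicits._

// Read JSON; Spark infers the schema from the data itself.
val df = sqlContext.read.json("/path/to/devices.json")
df.printSchema()

// DataFrame API: compose relational expressions.
val lowBattery = df.select("device_name", "battery_level")
                   .where($"battery_level" < 2)

// Dataset API (previewed in 1.6): a typed view over the same data.
val ds = df.select("device_id", "device_name", "battery_level", "c02_level", "cca3")
           .as[DeviceData]
val highC02 = ds.filter(d => d.c02_level > 1400)

// Spark SQL: register a temporary table and query it.
df.registerTempTable("iot_devices")
sqlContext.sql(
  """SELECT cca3, count(*) AS device_count
     FROM iot_devices
     GROUP BY cca3
     ORDER BY device_count DESC""").show(10)
```

The same flow runs unchanged in a Databricks notebook, where the query results can also be rendered with the built-in plotting options.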

In this second part, I have augmented the device dataset with additional attributes, such as GeoIP locations (an idea borrowed from the AdTech Sample Notebook), as well as additional device attributes on which we can log alerts, for instance device battery levels or CO2 levels. I uploaded close to 200K devices, curtailed from the original 2M entries, as a smaller dataset for rapid prototyping.
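As a rough sketch of what such an alert query might look like (reusing df and the implicits from the snippet above), the thresholds and the ip, cca3, battery_level, and c02_level column names are again assumptions:

```scala
// Flag devices whose (assumed) battery_level or c02_level crosses a threshold.
val alerts = df.where($"battery_level" <= 1 || $"c02_level" >= 1400)
               .select("device_name", "ip", "cca3", "battery_level", "c02_level")

// Aggregate alerting devices by the GeoIP-derived country code; in a Databricks
// notebook, display() renders the result with the built-in plotting options,
// which can then be saved to a dashboard.
display(alerts.groupBy("cca3").count().orderBy($"count".desc))
```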

Again, all code is available on GitHub:

Besides importing the Databricks Notebook from here into your Databricks account, you can also watch its screencast.

To get access to Databricks Community Edition, join the waitlist now!