Geospatial data is pervasive, and spatial context is a very rich signal of user intent and relevance in search and targeted advertising, as well as an important variable in many predictive analytics applications. For example, when a user searches for “canyon hotels”, without location awareness the top result or sponsored ads might be for hotels in the town of Canyon, TX. However, if they are near the Grand Canyon, the top results or ads should be for nearby hotels. Thus, a search term combined with location context allows for much more relevant results and ads. Similarly, a variety of other predictive analytics problems can leverage location as context. Leveraging spatial context in a predictive analytics application requires us to parse geospatial datasets at scale, join them with target datasets that contain point-in-space information, and answer geometric queries efficiently. In this talk, we describe the motivation for and the internals of an open source library that we are building for geospatial analytics, using Spark SQL, DataFrames, and Catalyst as the underlying engine. We outline how we leverage Catalyst’s pluggable optimizer to execute spatial joins efficiently and how Spark SQL’s powerful operators allow us to express geometric queries in a natural DSL, and we discuss some of the geometric algorithms that we implemented in the library. We also describe the Python bindings that we expose, leveraging PySpark’s Python integration.
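As a rough illustration of the kind of geometric query and spatial join mentioned above, the following plain-Python sketch pairs points with the polygons that contain them. It is not the library's API and does not use Spark: the function names (`point_in_polygon`, `naive_spatial_join`), the region names, and the coordinates are all hypothetical, and the ray-casting test and nested-loop join are stand-ins chosen for clarity, not the algorithms the library implements.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # vertices in order; the last vertex connects back to the first


def point_in_polygon(point: Point, polygon: Polygon) -> bool:
    """Ray-casting test: cast a horizontal ray to the right of the point
    and toggle `inside` each time the ray crosses a polygon edge."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge is relevant only if it straddles the horizontal line at y.
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def naive_spatial_join(
    points: Dict[str, Point], regions: Dict[str, Polygon]
) -> List[Tuple[str, str]]:
    """Nested-loop join: pair each point id with every region that contains it.
    This is O(points x regions); an optimizer would aim to prune most pairs."""
    return [
        (point_id, region_name)
        for point_id, pt in points.items()
        for region_name, poly in regions.items()
        if point_in_polygon(pt, poly)
    ]


# Hypothetical data: two overlapping square regions and three points.
regions = {
    "region_a": [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)],
    "region_b": [(3.0, 3.0), (6.0, 3.0), (6.0, 6.0), (3.0, 6.0)],
}
points = {"p1": (1.0, 1.0), "p2": (3.5, 3.5), "p3": (10.0, 10.0)}

matches = naive_spatial_join(points, regions)
print(matches)
```

The nested loop compares every point against every polygon, which is exactly the cost that makes spatial joins expensive at scale and motivates handling them in the query optimizer.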
I am the Product Manager for open source efforts at Databricks. My prior experience includes roles as Spark and Data Science Architect at Hortonworks and as Principal Research Scientist at Yahoo, focused on large-scale data mining and machine learning for search and display advertising. I am an Apache Spark PMC Member and Committer.