Large Scale Geospatial Indexing and Analysis on Apache Spark

SafeGraph is a data company, just a data company, that aims to be the source of truth for data on physical places. We focus on creating high-precision geospatial data sets about the places where people spend time and money. We have business listings, building footprint data, and foot traffic insights for over 7 million places across multiple countries and regions.

In this talk, we will examine the challenges of geospatial processing at large scale. We will look at open-source frameworks like Apache Sedona (incubating) and its key improvements over conventional technology, including spatial indexing and partitioning. We will explore spatial data structures, data formats, and open-source indexing systems like H3. We will illustrate how all of these fit together in a cloud-first architecture running on Databricks, Delta, MLflow, and AWS. We will walk through examples of geospatial analysis with complex geometries and practical use cases of spatial queries. Lastly, we will discuss how this is augmented by machine learning modeling, human-in-the-loop (HITL) annotation, and quality validation.

About Felix Cheung

Felix is the VP of Engineering at SafeGraph, bringing over 20 years of engineering and 7 years of data experience. He led teams in Uber's Data Platform and was pivotal in rebuilding their open-source program. Previously he spent time at Microsoft and startups. Felix is a strong proponent of open source; as a Member of the Apache Software Foundation, he works on Apache Spark (data) and Apache Zeppelin (notebook). He also helps mentor 6 projects in the Apache Incubator, including the geospatial project Apache Sedona, and is leading Apache Superset (visualization) to graduation.