Large Scale Geospatial Indexing and Analysis on Apache Spark

May 26, 2021 05:00 PM (PT)


SafeGraph is a data company — just a data company — that aims to be the source of truth for data on physical places. We are focused on creating high-precision geospatial data sets specifically about places where people spend time and money. We have business listings, building footprint data, and foot traffic insights for over 7 million places across multiple countries and regions.

In this talk, we will inspect the challenges of geospatial processing at large scale. We will look at open-source frameworks like Apache Sedona (incubating) and its key improvements over conventional technology, including spatial indexing and partitioning. We will explore spatial data structures, data formats, and open-source indexing systems like H3. We will illustrate how all of these fit together in a cloud-first architecture running on Databricks, Delta, MLflow, and AWS. We will explore examples of geospatial analysis with complex geometries and practical use cases of spatial queries. Lastly, we will discuss how this is augmented by Machine Learning modeling, Human-in-the-loop (HITL) annotation, and quality validation.

In this session watch:
Felix Cheung, VP of Engineering, SafeGraph


Transcript

Felix Cheung: Hello, everyone. Welcome to my talk. Today, I’m going to be talking about large scale geospatial indexing and analysis on Apache Spark. I’m Felix, VP engineering here at SafeGraph, a data company.
So about myself, like I said, I’m VP of Engineering at SafeGraph. Before that I was at Uber running a bunch of data platform teams. I’m currently deeply involved in the Apache Software Foundation, where I’m a member. I’m also on the project management committees for Apache Spark, Zeppelin, Superset and the Apache Incubator. I am, among other things, a mentor of the Apache Sedona incubating project, which I’m going to be covering today. Aside from that, I’m also mentoring another six projects.
So today, we’re going to be diving into a number of different topics. To begin with, we’re going to talk about geospatial data, and then we’re going to dive somewhat deeper into distributed processing methodology. Next, we’re going to move on to a number of use cases. We’re going to cover the [inaudible] and how that could be implemented. And then we’re going to wrap up by talking a little bit about the overall architecture here at SafeGraph.
So first, geospatial. SafeGraph is just a data company. It’s fully remote. It was founded in 2014, and the founders have deep connections and experience with data and privacy. Our mission is to power innovation through open access to geospatial data, and we aim to be the source of truth for physical places.
One of the key products we have, SafeGraph Patterns, provides a window into consumer behavior. It gives a very accurate, aggregated foot-traffic view built from millions and millions of anonymized devices. Along with it, we have eight million points of interest: places and businesses where people are spending time. And lastly, we provide [inaudible] in an easy-to-consume form, especially as CSV file downloads, among the other delivery methods we support.
When we think about SafeGraph products, there are generally three different suites. On the left-hand side, you see Core Places. These are the places where people are spending time or money. In the middle, you see Geometry, which is the building footprint data: the spatial hierarchy and relationships between these businesses, and the bounding polygons around them. And last but not least, on the right-hand side, you see the Patterns product, which, as I described, is about the foot-traffic panel. All of these can be joined with third-party data through a concept called PlaceKey, which I’m going to cover a little bit later today.
So what are the common use cases for something like this? On the left-hand side, you see very popular use cases around equity due diligence. Moving further to the right, you see retail and real estate, where site selection is extremely popular. In addition to that, you also have competitive intelligence: looking at who is popular, which businesses are your competitors in this space. Moving further right, you see a bit more of the conventional marketing and advertising use cases, including visit attribution and location-based advertising, and on the far right, you see the conventional GIS professional type of use cases.
But going further, what is geospatial data anyway? The term geospatial describes data that represent features and objects on the surface of the Earth. These are individual records with locational information tied to them, such as coordinates, addresses, city or postal code. But you can’t really have the where without the what and the who. So oftentimes these are all closely tied together, and you have demographic information attached to the location as well. They are oftentimes looked at together.
What are the key challenges of geospatial data, though? Well, for one, the Earth is really large. It has over 197 million square miles of surface area, so just computing where something is can be expensive, and scaling that computation is actually one of the most challenging aspects. Secondly, there is a lack of a source of truth. There isn’t really a true list of all the possible places you can find on Earth, so oftentimes a practitioner in this space has to cross-check and reference everything to really know what is the freshest, most correct dataset that is representative of what is going on.
And what is going on is really the most interesting, most challenging part of it. As you know, the world is changing; things are closing, opening and evolving pretty quickly. Oftentimes having the right data means having fresh data. There are also real-world constraints you have to work with, like how far apart things are. You can’t have two businesses stacked on top of each other, right? There are all sorts of things you need to consider when you’re processing geospatial data.
So now, processing. What does processing mean? Historically there are a number of tools, frameworks and services you can use. On the left-hand side, you see ArcGIS, a very popular GIS tool. There’s the alternative QGIS on the right-hand side. And if you are a big fan of databases, there’s PostGIS, a very popular GIS extension to Postgres.
However, there are a number of limitations. Oftentimes these are single-machine implementations. Even if you are connecting to them through a distributed system, it takes a lot of work to connect, there is often a limit on the number of connections you can have, and it takes time to execute a query. So in the last three years, there’s been an advance of new techniques and strategies for processing geospatial data, one being parallel execution, another using GPUs to accelerate.
So today I’m going to talk about distributed execution, and to do that, I want to run through a quick introduction to Apache Sedona (incubating). It started in 2015 as GeoSpark at Arizona State University. It’s a cluster computing system for processing geospatial data, built by extending Apache Spark.
What are the key components of Apache Sedona? Well, first of all, it has extensions to the Spark RDD construct. It also has support for spatial SQL, so you can do spatial queries with it. It supports complex geometries and trajectories. It has spatial indexing and spatial partitioning. It also supports a variety of coordinate reference systems. Last but not least, it has support for high resolution map generation. Here’s an example: what you see here is a visualization of billions of objects using a component called SedonaViz. It has functions like ST_Pixelize and ST_Render that generate images like this.
So what are the key advances of Sedona? Well, it’s not limited to the three I have here, but I’d like to focus on these three in this talk: geospatial query, indexing and partitioning. With all of these, what we’ve seen with Sedona is 2x-10x faster execution and a 50% reduction in memory consumption compared to other Spark-based geospatial systems.
So what is spatial SQL? SQL is highly popular; if you know SQL, this is very intuitive and comes naturally to you. It’s an extension to SQL syntax implementing two open standards. The first one is SQL/MM Part 3: Spatial. The second one is OGC Simple Features for SQL. It also supports geometry data types, like points, lines, multilines and polygons. And with that, you can express relationships between the geometry data types. What you see at the bottom here is a great example of it: “Find all the superheroes within the city of Gotham.”
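As a minimal sketch, that query can be written against Sedona’s spatial SQL once the functions are registered; the `city` and `superhero` tables and their `geom` columns are illustrative, not SafeGraph data:

```scala
// Register Sedona's spatial SQL functions on an existing SparkSession
// (Sedona 1.x API; older GeoSpark releases used a different registrator class).
import org.apache.sedona.sql.utils.SedonaSQLRegistrator

SedonaSQLRegistrator.registerAll(spark)

// "Find all the superheroes within the city of Gotham."
val heroesInGotham = spark.sql("""
  SELECT s.name
  FROM city c, superhero s
  WHERE ST_Contains(c.geom, s.geom)
    AND c.name = 'Gotham'
""")
```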
Another thing I want to dive into is spatial SQL optimization. I won’t go into the specific details here, but there are a number of innovations and optimizations in the Sedona framework. They include range query, join query, KNN, KNN join, and optimized spatial join strategies. It also supports a variety of data formats: well-known text (WKT), WKB, GeoJSON, Shapefile, and so on. And it supports parsing a geometry specification, like a polygon definition, directly from a string.
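A hedged sketch of parsing geometries from those serialized formats in spatial SQL; the `raw_places` table and its columns are illustrative:

```scala
// Turn serialized geometry columns into geometry objects with Sedona SQL constructors.
val withGeom = spark.sql("""
  SELECT
    ST_GeomFromWKT(wkt_string)         AS geom_from_wkt,
    ST_GeomFromGeoJSON(geojson_string) AS geom_from_geojson
  FROM raw_places
""")
```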
Now we move on to spatial indexing. What is a spatial index anyway? In simple terms, you divide a surface up into areas and regions, and then you build the tree you see at the bottom. It allows you to do super fast lookups, basically by walking the tree. By implementing R-tree and quad-tree indexes, the framework has seen a substantial improvement in local query performance. If you look at the reference, which is the bar in red, compared to the R-tree and quad-tree implementations, you see that it’s roughly a third of the execution time.
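In Sedona’s RDD API, building such a local index is essentially a one-liner; this is a minimal sketch assuming a SpatialRDD named `spatialRDD` has already been loaded (Sedona 1.x):

```scala
import org.apache.sedona.core.enums.IndexType

// Build a quad-tree (R-tree is also available) on each partition of the SpatialRDD.
// The second argument is false because we index the raw, not-yet-partitioned RDD here.
spatialRDD.buildIndex(IndexType.QUADTREE, false)
```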
And then you have spatial partitioning. As you have probably figured out by this point, partitioning is essential to distributed processing. The strategy with Sedona is really focused on spatial proximity. The logic is as follows. In step one, it takes a sample of the dataset, all the data points. In step two, it uses these sample data points to build the indexing tree we just talked about. Then the leaf nodes at the bottom of that tree become the global partitioning mapping. There are different strategies in terms of which tree you employ for the partitioning scheme: it could be quad-tree, uniform grid, KDB-tree, R-tree, and so on. Below is an example of what it looks like. Different strategies will be useful for different use cases.
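In code, a brief sketch of that sample-then-partition flow in the Sedona RDD API; `pointRDD` and `polygonRDD` are illustrative, already-loaded SpatialRDDs:

```scala
import org.apache.sedona.core.enums.GridType

// analyze() computes the dataset boundary and approximate count needed for sampling.
pointRDD.analyze()

// Sample the data, build a KDB-tree over the sample, and use its leaves as the partition grid.
pointRDD.spatialPartitioning(GridType.KDBTREE)

// Reuse the exact same partitioner on the second dataset so matching records co-locate.
polygonRDD.spatialPartitioning(pointRDD.getPartitioner)
```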
And really the goal here is to put spatial partitioning and indexing together. What that means is we start with a bigger area. In this case, we take the continental US, segment it out and cut it into bounding boxes. Then, using these bounding regions, we build the high-level tree, the global index, which happens on the driver side of the Spark job execution. Further on, we look at each of these boxes and regions, and the smaller green tree gets built on the executor itself as the Spark job executes. As a result, you get a hierarchical spatial indexing effect, leveraging the same tree that was used for partitioning in the earlier step.
But this is not the only form of indexing, and a common strategy is using something like H3. What is H3? H3 is a system that was created at Uber a couple of years ago and open sourced there. It is a hierarchical hexagonal tiling system. The idea is as follows: at the bottom here you can see the smaller hexagons within a bigger hexagon in red, and all of them make up the surface that covers the entire Earth, which you see on the right-hand side. So why is that interesting? A really big part of it is the bucketing you can do with it. If you look at the map on the right-hand side, you see data points falling inside hexagons. So instead of looking at individual points separately, you can bucketize them into a single hexagon bin; you basically roll up all the numbers and look at everything together per hexagon.
One of the advantages of hexagons is also equal distance. You can see in the middle here that it doesn’t matter which direction you go, you travel exactly the same distance when you move to a neighbor. So there’s traversal, neighboring, truncation, or even filling a region or an area; there are different ways you can do geospatial analysis using a system like this. Let’s talk about truncation a little more. Truncation is a great way to take a small hex and basically zoom out from that area, looking at the parent hexagon and so on; the function in H3 is called h3ToParent. Another approach is something like kRing. What that means is you take the hexagon in the middle and map out all the surrounding hexagons around it. This is a great way to look at proximity and do vicinity analysis. So leveraging H3 is perfect for these kinds of things.
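A small sketch of those two operations using the Java H3 bindings (v3-style API; the coordinates and resolutions below are arbitrary examples):

```scala
import com.uber.h3core.H3Core

val h3 = H3Core.newInstance()

// Index a latitude/longitude at resolution 9 (in H3 v4 this call is named latLngToCell).
val cell: Long = h3.geoToH3(37.7749, -122.4194, 9)

// Truncation: zoom out to the coarser parent cell at resolution 7.
val parent: Long = h3.h3ToParent(cell, 7)

// kRing: the cell plus all hexagons within 2 rings, useful for vicinity analysis.
val neighbors: java.util.List[java.lang.Long] = h3.kRing(cell, 2)
```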
The reason I want to talk about H3 is that it’s actually used within SafeGraph, and it’s also the foundation of the concept of PlaceKey. So, I mentioned PlaceKey. The goal of PlaceKey is to provide an easy way for any geospatial dataset to reference the same physical place, using a universal identifier, to manage and handle challenges like address mismatches, fuzzy matching, all that stuff. The encoding you see on the right-hand side, the “where” part, leverages the H3 index, encoded in a form that is easy to read and more legible.
Moving further on, we’re going to talk about use cases. One of the primary use cases here at SafeGraph is visit attribution. Think about this particular use case: you have a time series of GPS signals from a person or device going to a particular store. Now, if you take the trivial approach of intersecting a radius with the surrounding buildings, you would imagine that in the bottom case you’re going to wrongly attribute the visit to the lower building here at the bottom.
The way we deal with this at SafeGraph is a three-step approach. Step one, we do clustering on the time series geospatial dataset. Then we apply a geospatial join between that set of clusters and the building footprints. And lastly, we take the result of the join and move forward with a prediction. Every single day, we have more than two terabytes of data coming through to process, using this approach to compute visit attribution. Here I’m moving on to a handcrafted implementation of this. JTS is a very popular framework for geospatial purposes. The way you would use it is to feed it a WKT-format string. So we’re using a WKT reader here to read that data and generate a geometry object.
And then, secondly, we build an STR-tree. We take all these geometries that we loaded and generated, insert them into the tree, and then broadcast the tree from the driver to all the workers. Subsequently we leverage the broadcast tree in a UDF, processing the data in that distributed fashion. Now you can imagine there are a lot of challenges with that: we’re basically taking a very big, large dataset, building the tree in one spot, and having to distribute it all around.
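A condensed sketch of that handcrafted flow with JTS and Spark, not the exact SafeGraph code; `footprintWkts` and the column names are illustrative, and the footprints are assumed small enough to collect on the driver:

```scala
import org.locationtech.jts.geom.{Coordinate, Geometry, GeometryFactory}
import org.locationtech.jts.index.strtree.STRtree
import org.locationtech.jts.io.WKTReader
import org.apache.spark.sql.functions.udf

// 1. Parse each footprint's WKT string into a JTS Geometry and insert it into an STR-tree.
val reader = new WKTReader()
val tree   = new STRtree()
footprintWkts.foreach { wkt =>                 // footprintWkts: Seq[String], held on the driver
  val geom = reader.read(wkt)
  tree.insert(geom.getEnvelopeInternal, geom)
}
tree.build()

// 2. Broadcast the built tree from the driver to every executor.
val treeBc = spark.sparkContext.broadcast(tree)

// 3. Use the broadcast tree inside a UDF to test which footprints contain a GPS point.
val geomFactory = new GeometryFactory()
val containsPoint = udf { (lon: Double, lat: Double) =>
  val point = geomFactory.createPoint(new Coordinate(lon, lat))
  treeBc.value.query(point.getEnvelopeInternal)
    .toArray.map(_.asInstanceOf[Geometry])
    .exists(_.contains(point))
}
```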
So imagine you don’t have to do that, and you can basically write the right SQL query that does the same thing for you. Within here, you see in the middle, you take two geometries and ask whether one is within the other using ST_Within. You can also use ST_Intersects if it’s not exactly containment but more of an intersection. So you can tell it’s much easier using that SQL/MM Part 3 extension syntax. To add to that, there’s also a spatial join you can write. The way this works is what we described earlier; there are multiple steps. First, you take the dataset and use sampling to analyze it and create the tree structure. Then, using that tree structure, you say, “I’m going to specify a tree type I want to apply for partitioning.” Then you take the same partitioning scheme and apply it to the second dataset. And lastly, what you see at the bottom here is a spatial join between these two datasets.
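For comparison, a hedged sketch of that containment test expressed as Sedona spatial SQL, where the optimizer plans the spatial join for you; the `clusters` and `buildings` tables and their columns are illustrative:

```scala
// Join GPS clusters to building footprints purely in SQL; Sedona's optimizer
// recognizes the ST_Within predicate and executes an optimized spatial join.
val attributed = spark.sql("""
  SELECT c.cluster_id, b.poi_id
  FROM clusters c
  JOIN buildings b
    ON ST_Within(c.geom, b.footprint)
""")
```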
Another interesting and important use case we have is computing geometry overlap. We take the large number of geometries we have, and we want to process all of them and detect overlapping polygons. The very similar problem is: how do we do this analysis at scale? We employ all of this as a form of automatic QA, and we can also do ad hoc analysis, like geospatial distribution analysis.
Here’s one example of it. What you see at the top here is we take the polygons from the dataset. We do some processing [inaudible] like the POIs that we’re looking for right now. Then we take some of these SQL functions to take the WKT definition, convert it into a geometry object, and compute the area from it. Then we take the dataset and join it using the ID, computing and detecting whether there are any intersections between the two datasets. And lastly, we use that information to compute the ratio of the overlapping area. That’s the result we’re looking for in this kind of analysis.
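A hedged sketch of that overlap-ratio computation in spatial SQL; the `footprints` table, its columns, and the ratio definition are illustrative:

```scala
// Convert WKT footprints to geometries, self-join on intersection, and compute
// the overlap ratio for every pair of distinct polygons.
val overlaps = spark.sql("""
  WITH polys AS (
    SELECT id,
           ST_GeomFromWKT(wkt)          AS geom,
           ST_Area(ST_GeomFromWKT(wkt)) AS area
    FROM footprints
  )
  SELECT a.id AS id_a,
         b.id AS id_b,
         ST_Area(ST_Intersection(a.geom, b.geom)) / LEAST(a.area, b.area) AS overlap_ratio
  FROM polys a
  JOIN polys b
    ON ST_Intersects(a.geom, b.geom) AND a.id < b.id
""")
```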
So lastly, moving on to architecture. This is what it looks like. On the left-hand side, you see a bunch of datasets landing from a variety of different sources. We’re leveraging Delta Lake for [inaudible] and update capability. We take all of that and process it using Apache Spark on Databricks. At the same time, there are models that are part of the pipeline; we use Spark for training, and we use humans for understanding long-tail and edge cases and performing annotation on the dataset, which feeds into the training process. With Python and MLflow, we manage model deployment and model performance tracking. The result of that pipeline feeds into an automatic QA system, also using Spark, which is hooked up to Slack so we get notifications when things happen. At the same time, we always have human-in-the-loop analysis to qualitatively check all the results coming from the system. The result is then pushed to the customer cloud you see on the right-hand side.
So lastly, I want to point to the SafeGraph blog. As you see here, we have very interesting information about how we’re processing data. You see on the left-hand side what sort of data sourcing we have and our expansion to new regions like the United Kingdom. We also have a number of interesting use cases, whether it’s investment research, risk modeling, market analysis, market penetration, competitive intelligence, or just how far people are traveling during a pandemic. Last but not least, I really want to say that we’re actively hiring, with a lot of interesting problems to work on. Please go to our website, SafeGraph.com/careers, to find more information. Thank you for your time. I’d love your feedback.

Felix Cheung

Felix is the VP of Engineering at SafeGraph, bringing over 20 years of engineering and 7 years of data experience. He led teams in Uber's Data Platform and was pivotal in rebuilding their open-source ...