Businesses want to understand both the physical world around them and how people interact with the physical world. Where should I build my next coffee shop? How far away are my 3 closest coffee competitors? How far are people traveling to get to my stores? Which other brands do people visit before and after they visit mine?
These questions are no doubt interesting to urban planners, advertisers, hedge funds, and brick-and-mortar businesses. But they also hint at an interesting technical problem: machine learning at scale. Building a dataset that answers these questions comprehensively is a hard computational problem because it requires heavy machine learning at significant scale. Throughout this post, we’ll tell you exactly how we solved it.
First, some context on SafeGraph. Our goal is to be the one-stop shop for anyone seeking to understand the physical places around them: restaurants, airports, colleges, salons, and the list goes on.
To serve this mission, we create datasets that represent the world around us. One such dataset is our Core Places product, a listing of more than 5 million businesses around the country, complete with rich information like category and open hours. It is complemented by Geometry, a supplementary dataset that associates each place with a geofence indicating the building’s physical footprint. Below are SafeGraph’s 3 main datasets, each of which tells a different story about the physical world around us. This post is about how we built Patterns, a dataset about the physical places around us and how humans interact with them.

Ultimately, we wanted Patterns to be keyed by safegraph_place_id, which is our canonical identifier for each place in our dataset. Each row would be a unique physical place, and we planned to compute a set of columns for each place which collectively described how people interacted with it. Some examples of columns we compute include the number of visitors, the hours throughout the day in which the place is most popular, and a list of other brands that people visited before or after visiting the place in question.
A few columns of our Patterns dataset, from shop.safegraph.com. Use the coupon code data4databricksers for $100 in free points-of-interest, building footprint, and foot-traffic insights data.
To build this mapping between places and visit statistics, we first needed to build an internal dataset that associated our anonymous, internal GPS feed with the physical places in Places, using our dataset of geofences. Once we had a dataset that associated clusters of GPS points with our safegraph_place_id key (we call each association a visit), we could simply “roll up” the data by that key to build Patterns. In many ways, the core technical challenge came down to building this dataset of visits.
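To make the roll-up step concrete, here is a minimal in-memory sketch in Python. It assumes each visit has already been attributed to a safegraph_place_id and reduces the visits to one row per place. The record layout, the sample IDs, and the column names (visit_count, visitor_count, popular_hour) are hypothetical and chosen only for illustration; they are not the actual Patterns schema, and a production pipeline would run this kind of aggregation on a distributed engine rather than in a single process.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical visit records: (safegraph_place_id, device_id, visit_start).
# The IDs and timestamps are made up for illustration.
visits = [
    ("sg:coffee-1", "device-a", datetime(2020, 1, 6, 8, 15)),
    ("sg:coffee-1", "device-b", datetime(2020, 1, 6, 8, 40)),
    ("sg:coffee-1", "device-a", datetime(2020, 1, 7, 17, 5)),
    ("sg:salon-9", "device-c", datetime(2020, 1, 6, 13, 30)),
]


def roll_up(visits):
    """Aggregate visits into one summary row per safegraph_place_id."""
    visit_counts = Counter()          # total visits per place
    visitors = defaultdict(set)       # distinct devices per place
    hours = defaultdict(Counter)      # visits per hour of day, per place

    for place_id, device_id, start in visits:
        visit_counts[place_id] += 1
        visitors[place_id].add(device_id)
        hours[place_id][start.hour] += 1

    return {
        place_id: {
            "visit_count": visit_counts[place_id],
            "visitor_count": len(visitors[place_id]),
            # Hour of day (0-23) with the most visit starts.
            "popular_hour": hours[place_id].most_common(1)[0][0],
        }
        for place_id in visit_counts
    }


patterns = roll_up(visits)
```

Each key of patterns plays the role of a row in the final dataset, and each entry in the inner dictionary plays the role of a column. The hard part, as noted above, is producing the visits themselves; once they exist, the aggregation is a straightforward group-by on the key.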
We started with three core ingredients: (1) Places, a dataset of points of interest around the US, (2) Geometry, the physical building footprints for those places, and (3) a daily, anonymized feed of GPS data sourced from apps.
