This notebook was produced as a collaboration between SafeGraph and Databricks.
We’ve created this Databricks notebook (.dbc download here) and published this blog so that you can hit the ground running with SafeGraph data from AWS Data Exchange in Databricks. For ready-to-run code, please see the companion Databricks notebook.
To see the full SafeGraph dataset, visit the SafeGraph Data Bar.
Learn more - register now for this webinar: Building Reliable Data Pipelines for Machine Learning at SafeGraph
This blog will show you:
The first half of this notebook shows how to read, load, and prepare the data. The second half shows how to answer analytics questions using Spark SQL.
Questions? Get in touch with us at [email protected].
SafeGraph is a geospatial data company focused on understanding the physical world. SafeGraph Patterns is a dataset of 3.6MM commercial brick-and-mortar points-of-interest (POI) in the USA and includes anonymized counts of how many people visit these POI each month. The visitor counts are derived from an anonymized panel (a sample of the population that is measured longitudinally) of ~35MM mobile devices (e.g., smartphones) in the USA.
SafeGraph Patterns is designed to answer questions like:
Protecting individual consumer privacy is at the core of the SafeGraph mission:
"SafeGraph’s mission is to make the world’s data open for innovation while protecting individuals privacy." - SafeGraph Vision and Values
The devices in the panel are fully anonymized; no identity or demographic information exists for devices in the panel, and individual device-level data is not present in SafeGraph products. The aggregated form of SafeGraph Patterns helps to ensure the protection of individuals' privacy, while also providing actionable data for statistical analysis and data science. For all the details on SafeGraph Patterns, see the SafeGraph Patterns docs.
Databricks is a unified analytics platform that enables data science, data engineering, and business analytics teams to derive value from data at scale, easily and collaboratively.
At its core, the Databricks platform is powered by Apache Spark and Delta Lake in a cloud-native architecture, which gives users virtually unlimited horsepower to acquire, clean, transform, combine, and analyze data sets within minutes from a notebook interface, in their language of choice (Python, Scala, SQL, R).
Because Databricks is a managed platform, customers do not have to become big data DevOps gurus to power their analytical needs, which reduces the administrative burden, costs, and risks of their data-driven projects.
Delta Lake, as also featured in the SafeGraph notebooks below, brings unique capabilities to the Databricks platform:
To demonstrate the power of SafeGraph data inside Databricks, we are highlighting three datasets from SafeGraph currently available for free in AWS Data Exchange.
Follow these steps to subscribe to SafeGraph datasets in AWS Data Exchange:
Once you have SafeGraph data loaded into Databricks, many answers about consumer behavior are at your fingertips.
To see these implemented in a Databricks notebook, check out the accompanying Demo Notebook.
With a few lines of code you can examine the relative popularity of individual Starbucks locations, as well as the average popularity across Starbucks nationwide. Each safegraph_place_id is a unique Starbucks location. The x-axis shows each hour of the day (local time) from midnight (0) to 11pm (23). The y-axis reflects how many visits happen in each hour, summed across all the days of the month, as a percent of total visits for the entire month. (Note: visits that cross hour boundaries are counted in multiple hours, so the total % across all hours may add up to more than 100%.)
We see that although traffic certainly ramps up during the morning, peak traffic is actually around 12pm and 1pm.
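The hourly normalization described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's code: the function names and the toy numbers are hypothetical, and it assumes each POI row provides 24 hourly visit counts plus a monthly visit total.

```python
from typing import List


def hourly_share(popularity_by_hour: List[int], total_visits: int) -> List[float]:
    """Return each hour's visits as a percent of the month's total visits."""
    if len(popularity_by_hour) != 24:
        raise ValueError("expected 24 hourly counts (hours 0 through 23)")
    if total_visits <= 0:
        raise ValueError("total_visits must be positive")
    return [100.0 * count / total_visits for count in popularity_by_hour]


def average_profile(profiles: List[List[float]]) -> List[float]:
    """Average several 24-hour profiles hour by hour (e.g., across locations)."""
    return [sum(hour_values) / len(profiles) for hour_values in zip(*profiles)]


# Toy numbers for illustration only; they are not SafeGraph data.
toy_counts = [0] * 6 + [10, 20, 30, 30, 35, 40, 45, 45, 30, 25, 20, 15, 10, 5] + [0] * 4
# The monthly total is smaller than sum(toy_counts) because a visit that spans
# an hour boundary is counted once in each hour it touches.
toy_total_visits = 300

shares = hourly_share(toy_counts, toy_total_visits)
peak_hour = shares.index(max(shares))  # first hour with the highest share
total_percent = sum(shares)            # can exceed 100 because of double counting
```

In this toy profile the shares sum to more than 100%, which is the double-counting effect noted above rather than an error.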
We can ask the same question about which days of the week are popular.
Looking at 20 random Starbucks examples, we see that on average no days are strongly preferred over others. However, some POI do show interesting weekend vs weekday differences.
We can examine one of these POI and compare it to the national average.
This data shows that, on average nationally, the busiest days of the week at Starbucks are Wednesdays and Thursdays, although this is a mild preference. In contrast, safegraph_place_id sg:68513387500e48eb87d719207d058309 shows a very different pattern and is significantly less popular during the weekends compared to weekdays.
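One simple way to quantify the weekday vs weekend contrast is to compare the average weekend share with the average weekday share. The sketch below is an assumption-laden illustration: it assumes per-day visit counts are available as a mapping from day name to count, and the helper names and toy numbers are hypothetical, not taken from the notebook.

```python
from typing import Dict

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def day_shares(popularity_by_day: Dict[str, int]) -> Dict[str, float]:
    """Return each day's visits as a percent of the week's total visits."""
    total = sum(popularity_by_day.get(day, 0) for day in DAYS)
    if total <= 0:
        raise ValueError("no visits recorded")
    return {day: 100.0 * popularity_by_day.get(day, 0) / total for day in DAYS}


def weekend_to_weekday_ratio(shares: Dict[str, float]) -> float:
    """Average weekend share divided by average weekday share (1.0 means no preference)."""
    weekday_avg = sum(shares[day] for day in DAYS[:5]) / 5
    weekend_avg = sum(shares[day] for day in DAYS[5:]) / 2
    return weekend_avg / weekday_avg


# Toy numbers for illustration only; they are not SafeGraph data.
toy_national = {"Monday": 140, "Tuesday": 142, "Wednesday": 148, "Thursday": 148,
                "Friday": 142, "Saturday": 140, "Sunday": 140}
toy_campus_poi = {"Monday": 200, "Tuesday": 210, "Wednesday": 220, "Thursday": 210,
                  "Friday": 160, "Saturday": 30, "Sunday": 20}

national_ratio = weekend_to_weekday_ratio(day_shares(toy_national))
campus_ratio = weekend_to_weekday_ratio(day_shares(toy_campus_poi))
```

A ratio near 1.0 corresponds to the mild national preference, while a ratio far below 1.0 flags a location whose traffic collapses on weekends.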
To visualize where this POI is located, you can read the (latitude, longitude) from the SafeGraph dataset and search for it in Google Maps. It turns out that this particular Starbucks is located on the campus of the Boston University School of Law. Presumably the fact that classes are not held during weekends is causing this very large weekday vs weekend difference.
SafeGraph reports the median distance travelled (from the home census block group) for each POI. Using this we can construct a histogram of Starbucks locations, showing how far people travel to visit Starbucks.
This data shows that most Starbucks locations draw visitors who live less than 10 kilometers away. However, there is a long, thin tail of Starbucks locations where the median distance from home is hundreds of kilometers. These locations are likely in high-tourist or high-commute areas (such as an airport) where most visitors do not live nearby.
The columns related_same_month_brand and related_same_day_brand report an index of how frequently visitors to a POI also visit other brands (relative to the average visitor rate to that brand).
Here we look at which other brands are frequently visited by Starbucks customers. The larger the index, the more frequently Starbucks customers visit that brand.
Although Starbucks is a national chain, cross-brand shopping is highly influenced by local geography. Here we show the top 5 cross-shopping brands for Starbucks customers in California, New York, and Texas. Only McDonald's is in the top 5 in all 3 states.
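The histogram above amounts to bucketing one median distance per location. Here is a minimal sketch in plain Python, assuming the median distance is supplied in meters and may be missing for some POI; the function name, the 10 km bin width, and the toy values are illustrative choices rather than the notebook's implementation.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def distance_histogram(distances_m: Iterable[Optional[float]],
                       bin_width_km: float = 10.0) -> Dict[float, int]:
    """Count locations per distance bin; keys are the lower bin edge in km."""
    counts: Counter = Counter()
    for distance in distances_m:
        if distance is None:  # skip POI with no reported distance
            continue
        bin_index = int((distance / 1000.0) // bin_width_km)
        counts[bin_index * bin_width_km] += 1
    return dict(sorted(counts.items()))


# Toy values (meters) for illustration only; they are not SafeGraph data.
toy_distances = [2500, 4800, 7200, 9100, 12500, 31000, 450000, None]
histogram = distance_histogram(toy_distances)
```

With these toy values most locations fall in the first bin (under 10 km), and a single location lands in the far tail at 450 km.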
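A per-state ranking like this can be approximated by averaging the brand index across a state's locations and keeping the highest values. The sketch below assumes each POI row yields a state code and a mapping from brand name to index; averaging over all POI in the state is one plausible aggregation, not necessarily the one used in the notebook, and the brand names other than McDonald's and all numbers are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_related_brands(rows: Iterable[Tuple[str, Dict[str, float]]],
                       top_n: int = 5) -> Dict[str, List[Tuple[str, float]]]:
    """Rank related brands per region by their mean index across that region's POI."""
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    poi_counts: Dict[str, int] = defaultdict(int)
    for region, related in rows:
        poi_counts[region] += 1
        for brand, index in related.items():
            totals[region][brand] += index
    result = {}
    for region, brand_totals in totals.items():
        # A brand absent from a POI contributes 0 to that POI's share of the mean.
        averaged = {brand: total / poi_counts[region] for brand, total in brand_totals.items()}
        ranked = sorted(averaged.items(), key=lambda item: (-item[1], item[0]))
        result[region] = ranked[:top_n]
    return result


# Toy rows for illustration only; they are not SafeGraph data.
toy_rows = [
    ("CA", {"McDonald's": 40, "Brand A": 30}),
    ("CA", {"McDonald's": 20, "Brand B": 50}),
    ("TX", {"McDonald's": 35, "Brand C": 10}),
]
top_brands = top_related_brands(toy_rows, top_n=2)
```

Ties are broken alphabetically by brand name so that the ranking is deterministic.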
You can use SafeGraph data from AWS Data Exchange in Databricks to analyze the customer demographics of individual POI or brands. For a deep dive on the methodology, along with a more complete statistical analysis, see this workbook.
Here we analyze Starbucks customer demographics along the race dimension, using data available from SafeGraph in AWS Data Exchange.
This analysis could be repeated for any demographic information tracked by the Census, and reported at the census block group level. That includes Ethnicity, Educational Attainment, Household Income, and much, much more.
To do this analysis we will use:
The y-axis shows the % of total visitors for each demographic segment.
The baseline demographics of the United States are shown as a reference. SafeGraph Patterns shows interesting differences between the census area demographics of Starbucks customers and those of the overall USA population.
Importantly, these differences are not due to geographic sampling bias in the SafeGraph dataset. It is true that the SafeGraph dataset has some small geographic biases. For a full report see "What about bias in the SafeGraph dataset?". However, we are able to measure and correct the small effects of sampling bias in the SafeGraph dataset as part of the cbg_adjust_factor calculation. If the differences observed were due solely to geographic sampling bias in the SafeGraph dataset, then they would disappear after the correction. The differences that remain cannot be attributed to sampling bias. For a thorough discussion on this methodology, see A Workbook to Analyze Demographic Profiles from SafeGraph Patterns Data.
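To make the correction step concrete, here is a rough sketch of how a per-census-block-group adjustment factor could be applied before aggregating demographics. It is an assumption-heavy illustration and not the workbook's exact method: it assumes visitor counts are available per home census block group, that cbg_adjust_factor is a per-CBG multiplier that up-weights under-sampled areas, and that census data gives each CBG's population fraction per segment. The function name, CBG ids, segment names, and numbers are all hypothetical.

```python
from collections import defaultdict
from typing import Dict


def demographic_profile(visitors_by_cbg: Dict[str, float],
                        cbg_adjust_factor: Dict[str, float],
                        cbg_demographics: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Estimate the percent of visitors per demographic segment after reweighting."""
    weighted: Dict[str, float] = defaultdict(float)
    adjusted_total = 0.0
    for cbg, visitors in visitors_by_cbg.items():
        if cbg not in cbg_adjust_factor or cbg not in cbg_demographics:
            continue  # skip CBGs that cannot be adjusted or described
        adjusted = visitors * cbg_adjust_factor[cbg]  # correct for panel sampling rate
        adjusted_total += adjusted
        for segment, fraction in cbg_demographics[cbg].items():
            weighted[segment] += adjusted * fraction
    if adjusted_total == 0:
        raise ValueError("no visitors with usable census data")
    return {segment: 100.0 * value / adjusted_total for segment, value in weighted.items()}


# Toy inputs for illustration only; they are not SafeGraph or Census data.
toy_visitors = {"cbg_1": 10, "cbg_2": 30}
toy_factors = {"cbg_1": 2.0, "cbg_2": 1.0}  # cbg_1 is under-sampled in this toy panel
toy_census = {
    "cbg_1": {"segment_x": 0.5, "segment_y": 0.5},
    "cbg_2": {"segment_x": 0.2, "segment_y": 0.8},
}

corrected = demographic_profile(toy_visitors, toy_factors, toy_census)
uncorrected = demographic_profile(toy_visitors, {"cbg_1": 1.0, "cbg_2": 1.0}, toy_census)
```

Comparing the corrected and uncorrected profiles shows how much of an observed difference the reweighting removes; whatever remains after correction is what the argument above attributes to something other than sampling bias.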
Want to get more SafeGraph data?