SafeGraph is the source of truth for data on non-residential physical places. We provide points of interest, building footprint, and foot traffic data for over 7 million locations in the US, UK, and Canada.
The rapid growth of business brings several challenges to the tech stack of SafeGraph . Comparing to the other companies which serve customers with apps and online services, offering a dataset as a product places unprecedented challenges on data versioning, data quality as well as the reliability and efficiency of data processing. Additionally, scaling our engineers to catch up with the rapidly increasing customer demands requires a sophisticatedly designed internal toolkit. The toolkit not only needs to bring the state-of-the-art ML/Data infra technology to SafeGraph but also to be minimally intrusive to avoid interrupt the product development.
In this talk, we are presenting the experiences we got from building the data platform and internal toolkit in SafeGraph: First, we will introduce the integration experience of Delta Lake + MLFlow in SafeGraph. Our integration not only improves the tracking and debuggability of our data processing stack, but also significantly improves the productivity of our ML engineers and most importantly with the minimum interruption to their existing workflow. Second, we will cover how we improve the observability of Spark applications and the manageability of our Databricks-based data processing platform with a set of internal tools. Finally, we will share the story on how we identify the bottleneck of Spark as well as Scala standard library and then we scale our Spark applications to handle the complicated data ingestion scenario.
Nan is a Software Engineer in SafeGraph, where he works on building the data platform to support the scaling of business of this data company. He also serves as the PMC member of XGBoost, one of the most popular machine learning libraries.