Nan Zhu

Software Engineer, SafeGraph

Nan is a Software Engineer at SafeGraph, where he builds the data platform that supports the scaling of the company's data business. He also serves as a PMC member of XGBoost, one of the most popular machine learning libraries.

Past sessions

Summit 2021 Building Source of Truth Place Data at Scale

May 26, 2021 04:25 PM PT

SafeGraph is the source of truth for data on non-residential physical places. We provide points of interest, building footprint, and foot traffic data for over 7 million locations in the US, UK, and Canada.

The rapid growth of the business brings several challenges to SafeGraph's tech stack. Compared with companies that serve customers through apps and online services, offering a dataset as a product places unprecedented demands on data versioning, data quality, and the reliability and efficiency of data processing. Additionally, scaling our engineering team to keep up with rapidly increasing customer demand requires a carefully designed internal toolkit. The toolkit needs to bring state-of-the-art ML/data infrastructure technology to SafeGraph while remaining minimally intrusive, so that it does not interrupt product development.

In this talk, we present what we learned from building the data platform and internal toolkit at SafeGraph. First, we will introduce our experience integrating Delta Lake and MLflow at SafeGraph. The integration improves the tracking and debuggability of our data processing stack and significantly improves the productivity of our ML engineers, all with minimal interruption to their existing workflows. Second, we will cover how a set of internal tools improves the observability of Spark applications and the manageability of our Databricks-based data processing platform. Finally, we will share how we identified bottlenecks in Spark and in the Scala standard library, and then scaled our Spark applications to handle complicated data ingestion scenarios.

In this session watch:
Nan Zhu, Software Engineer, SafeGraph


One of the biggest challenges in data science is building a continuous data application that delivers results rapidly and reliably. Spark Streaming offers a powerful solution for real-time data processing. However, the challenge remains of how to connect it with various continuous and real-time data sources while guaranteeing the responsiveness and reliability of data applications.
In this talk, Nan and Arijit will summarize the lessons they learned from serving real-time Spark-based data analytics solutions on Azure HDInsight. Their solution seamlessly integrates Spark with Azure EventHubs, a hyper-scale telemetry ingestion service that lets users ingest massive amounts of telemetry into the cloud and read the data from multiple applications using publish-subscribe semantics.

They will cover three topics: bridging the gap between the data communication models of Spark and the data source, adapting Spark to the rate control and message addressing of the data source, and the co-design of fault tolerance mechanisms. The talk shares insights on how to build continuous data applications with Spark and aims to encourage more connectors between Spark and different real-time data sources.
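As a rough illustration of the rate control and message addressing topic, the following is a minimal sketch in plain Python. It is not the speakers' implementation and does not use the Spark or Azure EventHubs APIs. It assumes that each partition of the source is addressed by a monotonically increasing sequence number and that each micro-batch may read at most a fixed number of events per partition. The names `OffsetRange`, `plan_batch`, and `max_events_per_partition` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class OffsetRange:
    """Half-open range of sequence numbers to read from one partition."""
    partition: int
    from_seq: int   # first sequence number to read (inclusive)
    until_seq: int  # one past the last sequence number to read (exclusive)

    @property
    def count(self) -> int:
        return self.until_seq - self.from_seq


def plan_batch(
    consumed: Dict[int, int],
    latest: Dict[int, int],
    max_events_per_partition: int,
) -> Dict[int, OffsetRange]:
    """Plan the next micro-batch under a per-partition rate cap.

    consumed: next unread sequence number for each partition.
    latest:   one past the newest sequence number available in each partition.
    """
    if max_events_per_partition < 0:
        raise ValueError("max_events_per_partition must be non-negative")

    ranges = {}
    for partition, start in consumed.items():
        # Treat a partition with no reported progress as having nothing new.
        available = latest.get(partition, start)
        # Read up to the cap, but never past what the source already holds.
        until = min(available, start + max_events_per_partition)
        # Guard against a stale 'latest' value that lags behind 'consumed'.
        ranges[partition] = OffsetRange(partition, start, max(until, start))
    return ranges


if __name__ == "__main__":
    plan = plan_batch(
        consumed={0: 100, 1: 250},
        latest={0: 5000, 1: 260},
        max_events_per_partition=1000,
    )
    for offset_range in plan.values():
        print(offset_range, offset_range.count)
```

Fixing each batch's ranges before reading means a failed batch can be replayed over exactly the same ranges, which is one way a rate-control scheme and a fault tolerance scheme can be designed together.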

Session hashtag: #SFdev12

XGBoost (https://github.com/dmlc/xgboost) is a library designed and optimized for tree boosting. XGBoost attracts users from a broad range of organizations in both industry and academia, and more than half of the winning solutions in machine learning challenges hosted at Kaggle adopt XGBoost. Although it is one of the most popular machine learning systems, XGBoost is only one component in a complete data analytics pipeline. The data ETL, exploration, and serving functionality is built on top of more general data processing frameworks, like Apache Spark. As a result, users have to build a communication channel between Apache Spark and XGBoost (usually through HDFS) and face difficulties and inconveniences in data navigation and in application development and deployment.

We, the Distributed (Deep) Machine Learning Community, developed XGBoost4J-Spark (https://github.com/dmlc/xgboost/tree/master/jvm-packages), which seamlessly integrates Apache Spark and XGBoost. The communication channel between Spark and XGBoost is built on RDDs, DataFrames, and Datasets, all of which are standard data interfaces in Spark. Additionally, XGBoost can be embedded into a Spark MLlib pipeline and tuned through the tools provided by MLlib. In this talk, I will cover the motivation, history, design philosophy, and implementation details of XGBoost4J-Spark, as well as its use cases. I expect this talk to share insights on building a heterogeneous data analytics pipeline based on Spark and other data intelligence frameworks, and to prompt further discussion on this topic.

Session hashtag: #SFeco11

Learn more:

  • Install and Use XGBoost
  • Building Complex Data Pipelines with Unified Analytics Platform
  • Databricks’ Data Pipelines: Journey And Lessons Learned
  • Unified Data Analytics Platform