Nan Zhu - Databricks

Nan Zhu

Software Engineer, Microsoft

Nan Zhu is a Software Engineer from Microsoft, where he works on serving Spark Streaming/Structured Streaming on Azure HDInsight. He is a contributor of Apache Spark (known as CodingCat) and also serves as the committee member of Distributed Machine Learning Community (DMLC) and Apache MxNet (incubator).



Building Continuous Application with Structured Streaming and Real-Time Data SourceSummit 2017

One of the biggest challenges in data science is to build a continuous data application which delivers results rapidly and reliably. Spark Streaming offers a powerful solution for real-time data processing. However, the challenge remains in how to connect them with various continuous and real-time data sources, guaranteeing the responsiveness and reliability of data applications. In this talk, Nan and Arijit will summarize their experiences learned from serving the real-time Spark-based data analytic solutions on Azure HDInsight. Their solution seamlessly integrates Spark and Azure EventHubs which is a hyper-scale telemetry ingestion service enabling users to ingress massive amounts of telemetry into the cloud and read the data from multiple applications using publish-subscribe semantics. They'll will cover three topics: bridging the gap of data communication model in Spark and data source, accommodating Spark to rate control and message addressing of data source, and the co-design of fault tolerance Mechanisms. This talk will share the insights on how to build continuous data applications with Spark and boost more availabilities of connectors for Spark and different real-time data sources. Session hashtag: #SFdev12

Building a Unified Data Pipeline with Apache Spark and XGBoostSummit 2017

XGBoost ( is a library designed and optimized for tree boosting. XGBoost attracts users from a broad range of organizations in both industry and academia, and more than half of the winning solutions in machine learning challenges hosted at Kaggle adopt XGBoost. While being one of the most popular machine learning systems, XGBoost is only one of the components in a complete data analytic pipeline. The data ETL/exploration/serving functionalities are built up on top of more general data processing frameworks, like Apache Spark. As a result, users have to build a communication channel between Apache Spark and XGBoost (usually through HDFS) and face the difficulties/inconveniences in data navigating and application development/deployment. We (Distributed (Deep) Machine Learning Community) develop XGBoost4J-Spark (, which seamlessly integrates Apache Spark and XGBoost. The communication channel between Spark and XGBoost is established based on RDDs/DataFrame/Datasets, all of which are standard data interfaces in Spark. Additionally, XGBoost can be embedded into Spark MLLib pipeline and tuned through the tools provided by MLLib. In this talk, I will cover the motivation/history/design philosophy/implementation details as well as the use cases of XGBoost4J-Spark. I expect that this talk will share the insights on building a heterogeneous data analytic pipeline based on Spark and other data intelligence frameworks and bring more discussions on this topic. Session hashtag: #SFeco11 Learn more:

  • Install and Use XGBoost
  • Building Complex Data Pipelines with Unified Analytics Platform
  • Databricks’ Data Pipelines: Journey And Lessons Learned