Building a Unified Data Pipeline with Apache Spark and XGBoost

XGBoost is a library designed and optimized for tree boosting. It attracts users from a broad range of organizations in both industry and academia, and more than half of the winning solutions in machine learning challenges hosted at Kaggle adopt XGBoost. Although it is one of the most popular machine learning systems, XGBoost is only one component of a complete data analytics pipeline. Data ETL, exploration, and serving are built on top of more general data processing frameworks such as Apache Spark. As a result, users have to build a communication channel between Apache Spark and XGBoost (usually through HDFS) and face difficulties in navigating data and in developing and deploying applications.

We, the Distributed (Deep) Machine Learning Community, develop XGBoost4J-Spark, which seamlessly integrates Apache Spark and XGBoost. The communication channel between Spark and XGBoost is built on RDDs, DataFrames, and Datasets, all of which are standard data interfaces in Spark. Additionally, XGBoost can be embedded into a Spark MLlib pipeline and tuned with the tools MLlib provides. In this talk, I will cover the motivation, history, design philosophy, and implementation details of XGBoost4J-Spark, as well as its use cases. I expect this talk to share insights on building a heterogeneous data analytics pipeline based on Spark and other data intelligence frameworks, and to prompt more discussion on this topic.

Session hashtag: #SFeco11

Learn more:

  • Install and Use XGBoost
  • Building Complex Data Pipelines with Unified Analytics Platform
  • Databricks’ Data Pipelines: Journey And Lessons Learned
  • Unified Data Analytics Platform

    About Nan Zhu

    Nan is a Software Engineer at SafeGraph, where he works on building the data platform that supports the company's business as it scales. He also serves as a PMC member of XGBoost, one of the most popular machine learning libraries.