Apache Spark for Machine Learning and Data Science

DB 301


This 3-day course is primarily for data scientists but is directly applicable to analysts, architects, software engineers, and technical managers interested in a thorough, hands-on overview of Apache Spark and its applications to Machine Learning.

The course covers the fundamentals of Apache Spark, including Spark’s architecture and internals, the core APIs for using Spark, SQL and other high-level data access tools, and Spark’s streaming capabilities, with a heavy focus on Spark’s machine learning APIs.

Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them after the class ends; all examples are guaranteed to run in the environment in which the class was taught (Azure Databricks or Databricks Community Edition).

Learning Objectives

After taking this class, students will be able to:

  • Understand when and where to use Spark
  • Articulate the difference between an RDD, DataFrame, and Dataset
  • Explain supervised vs unsupervised machine learning, and typical applications of both
  • Build a Machine Learning Pipeline using a combination of Transformers and Estimators
    • Save/Restore Models
    • Apply models to streaming data
  • Perform hyperparameter tuning with cross-validation
  • Analyze Spark query performance using the Spark UI
  • Train models with 3rd party libraries such as XGBoost
  • Perform hyperparameter search in parallel using single node algorithms such as scikit-learn
  • Gain familiarity with Decision Trees, Random Forests, Gradient Boosted Trees, Linear Regression, Collaborative Filtering, and K-Means
  • Explain options for putting models into production


Topics

  • Spark Overview
  • In-depth discussion of Spark SQL and DataFrames, including:
    • RDD vs DataFrame vs Dataset API
    • Spark SQL
    • Data Aggregation
    • Column Operations
    • The Functions API: date/time, string manipulation, aggregation
    • Caching and caching storage levels
    • Use of the Spark UI to analyze behavior and performance
  • Overview of Spark internals
    • Cluster Architecture
    • How Spark schedules and executes jobs and tasks
    • Shuffling, shuffle files, and performance
    • The Catalyst query optimizer
  • Spark Structured Streaming
    • Sources and sinks
    • Structured Streaming APIs
    • Windowing & Aggregation
    • Checkpointing & Watermarking
    • Reliability and Fault Tolerance
  • In-depth overview of Spark’s MLlib Pipeline API for Machine Learning
    • Build machine learning pipelines for both supervised and unsupervised learning
    • Transformer/Estimator/Pipeline API
    • Use transformers to perform pre-processing on a dataset prior to training
    • Train analytical models with Spark ML’s DataFrame-based estimators including Decision Trees, Random Forests, Gradient Boosted Trees, Linear Regression, K-Means, and Alternating Least Squares
    • Tune hyperparameters via cross-validation and grid search
    • Evaluate model performance
  • MLflow
    • Track and benchmark model performance
  • 3rd Party Library Integrations
    • XGBoost
    • How to distribute single-node algorithms (like scikit-learn) with Spark
      • Spark-sklearn: Perform scikit-learn hyperparameter search in parallel
  • Production Discussion
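
To make the "Transformer/Estimator/Pipeline API" topic above more concrete, here is a minimal plain-Python sketch of the pattern. It is not the Spark MLlib API: the class names (`Centerer`, `LeastSquares`, `Pipeline`, and their models) are hypothetical and exist only for illustration. What it mirrors is the idea that an Estimator's `fit()` returns a fitted model, that the model is a Transformer, and that a pipeline fits its stages in order while passing the transformed data along.

```python
# Plain-Python sketch of the Transformer / Estimator / Pipeline pattern.
# Every class here is hypothetical; none of this is Spark MLlib code.

class Centerer:
    """Estimator: learns the mean of column x."""
    def fit(self, rows):
        mean = sum(r["x"] for r in rows) / len(rows)
        return CentererModel(mean)

class CentererModel:
    """Transformer: adds x_centered = x - mean to every row."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, rows):
        return [{**r, "x_centered": r["x"] - self.mean} for r in rows]

class LeastSquares:
    """Estimator: one-feature least-squares fit on x_centered."""
    def fit(self, rows):
        n = len(rows)
        mean_x = sum(r["x_centered"] for r in rows) / n
        mean_y = sum(r["label"] for r in rows) / n
        cov = sum((r["x_centered"] - mean_x) * (r["label"] - mean_y) for r in rows)
        var = sum((r["x_centered"] - mean_x) ** 2 for r in rows)
        slope = cov / var
        return LeastSquaresModel(slope, mean_y - slope * mean_x)

class LeastSquaresModel:
    """Transformer: adds a prediction column."""
    def __init__(self, slope, intercept):
        self.slope, self.intercept = slope, intercept
    def transform(self, rows):
        return [{**r, "prediction": self.intercept + self.slope * r["x_centered"]}
                for r in rows]

class Pipeline:
    """Estimator made of stages; fit() fits each stage in order."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; plain transformers are used as-is.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)  # later stages see transformed data
            fitted.append(model)
        return PipelineModel(fitted)

class PipelineModel:
    """Transformer: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

train = [{"x": 1.0, "label": 2.0}, {"x": 2.0, "label": 4.0}, {"x": 3.0, "label": 6.0}]
model = Pipeline([Centerer(), LeastSquares()]).fit(train)
scored = model.transform([{"x": 4.0}])
```

The fitted `PipelineModel` is the single object that would be saved, restored, or applied to new data, which is why the pre-processing steps and the model travel together.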


  • Duration: 3 Days
  • Hours: 9:00 a.m. – 5:00 p.m.

Target Audience

Data scientists, analysts, architects, software engineers, and technical managers with experience in machine learning who want to adapt traditional machine learning tasks to run at scale using Apache Spark.


Prerequisites

  • Some familiarity with Apache Spark is helpful but not required.
  • Some familiarity with Machine Learning and Data Science concepts is highly recommended but not required.
  • Basic programming experience in an object-oriented or functional language is required. The class can be taught concurrently in Python and Scala.

Lab Requirements

  • A computer or laptop
  • Chrome or Firefox web browser (Internet Explorer and Safari are not supported)
  • Internet access with unfettered connections to the following domains:
    1. * - required
    2. * - highly recommended
    3. - required
    4. - helpful but not required

Course Syllabus

Module Lecture Hands-on

Spark Overview

  • Overview of Databricks
  • Spark Capabilities
  • Spark Ecosystem
  • Basic Spark Components
  • Databricks Lab Environment
  • Working with Notebooks
  • Spark Clusters and Files

Spark SQL and DataFrames

  • Use of Spark SQL
  • Use of DataFrames / Datasets
  • Reading & Writing Data
  • DataFrame, Dataset and SQL APIs
  • Catalyst Query Optimization
  • Tungsten
  • ETL
  • Creating DataFrames
  • Querying with DataFrames
  • Querying with SQL
  • ETL with DataFrames
  • Caching
  • Visualization

Spark Internals

  • Jobs, Stages, and Tasks
  • Partitions and Shuffling
  • Job Performance
  • Visualizing SQL Queries
  • Observing Task Execution
  • Understanding Performance
  • Measuring Memory Use

Machine Learning

  • Spark MLlib Pipeline API
  • Built-in Featurizing and Algorithms
  • Cross-Validation and Grid Search for Hyperparameter Tuning
  • Evaluation Metrics
  • Data Partitioning Strategies
  • Spark integration with Scikit-learn
  • NLP/Text Classification with Logistic Regression
  • Decision Tree vs. Random Forest
  • Data imputation with Alternating Least Squares
  • Clustering with K-Means
  • Neural Networks
  • Spark-sklearn
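
As a rough illustration of the "Cross-Validation and Grid Search for Hyperparameter Tuning" item above, the sketch below runs k-fold cross-validation over a small grid in plain Python. Spark MLlib provides `CrossValidator` and `ParamGridBuilder` for this purpose; the code here does not use them. The toy one-parameter ridge model, the helper names, and the data are all assumptions chosen to keep the example self-contained.

```python
# Plain-Python sketch of grid search with k-fold cross-validation.
# The model, helper names, and data are hypothetical, for illustration only.

def fit_ridge(points, reg):
    """Fit y = w * x with L2 penalty `reg`; returns the weight w."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + reg)

def rmse(points, w):
    """Root mean squared error of predictions w * x."""
    return (sum((y - w * x) ** 2 for x, y in points) / len(points)) ** 0.5

def k_folds(points, k):
    """Yield (train, test) splits; fold i holds out every k-th point."""
    for i in range(k):
        test = points[i::k]
        train = [p for j, p in enumerate(points) if j % k != i]
        yield train, test

def cross_validate(points, grid, k=3):
    """Return the grid value with the lowest mean held-out RMSE."""
    scores = {}
    for reg in grid:
        fold_errors = [rmse(test, fit_ridge(train, reg))
                       for train, test in k_folds(points, k)]
        scores[reg] = sum(fold_errors) / len(fold_errors)
    best = min(scores, key=scores.get)
    return best, scores

# Noiseless toy data y = 2x, so no regularization should score best.
data = [(float(x), 2.0 * x) for x in range(1, 10)]
best_reg, scores = cross_validate(data, grid=[0.0, 1.0, 10.0])
```

Each (grid value, fold) pair is an independent fit, which is what makes this kind of search straightforward to run in parallel.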

Structured Streaming

  • Streaming Sources and Sinks
  • Structured Streaming APIs
  • Windowing & Aggregation
  • Checkpointing
  • Watermarking
  • Reliability and Fault Tolerance
  • Reading from TCP
  • Continuous Visualization
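
The "Windowing & Aggregation" and "Watermarking" items above can be sketched in plain Python as follows. This is a simplified model and not the Structured Streaming API: the function name, the window and delay values, and the per-event watermark update are all assumptions (Spark advances its watermark between micro-batches rather than after every record). The sketch counts events per tumbling event-time window and drops an event once its window has fallen entirely behind the watermark.

```python
from collections import defaultdict

# Plain-Python sketch of tumbling-window counts with a watermark.
# Simplified for illustration; names and parameter values are hypothetical.

def windowed_counts(events, window=10, delay=5):
    """events: iterable of (event_time, key) tuples in arrival order."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0
    for event_time, key in events:
        # Watermark trails the latest event time seen so far by `delay`.
        watermark = max_event_time - delay
        window_start = (event_time // window) * window
        if window_start + window <= watermark:
            # The window closed before the watermark: the event is too late.
            dropped.append((event_time, key))
            continue
        counts[(window_start, key)] += 1
        max_event_time = max(max_event_time, event_time)
    return dict(counts), dropped

# (8, "a") arrives out of order but within the allowed delay; (9, "a") does not.
events = [(1, "a"), (4, "a"), (12, "b"), (8, "a"), (27, "b"), (9, "a")]
counts, dropped = windowed_counts(events)
```

The delay is a trade-off: a larger value tolerates later data but keeps window state around longer.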

Graph Processing with GraphFrames

  • Basic Graph Analysis
  • GraphFrames API
  • GraphFrames ETL
  • PageRank and Label Propagation with GraphFrames
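
For readers unfamiliar with PageRank, the idea behind the last item above can be sketched in a few lines of plain Python. This is not the GraphFrames API; the function name, the damping factor of 0.85, the fixed iteration count, and the tiny edge list are assumptions made for illustration.

```python
# Plain-Python sketch of iterative PageRank on a small directed graph.
# Not GraphFrames code; names, parameters, and data are hypothetical.

def pagerank(edges, damping=0.85, iterations=50):
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start from a uniform distribution over all nodes.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives the same "teleport" share.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node with no out-links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# "a" is linked from both "c" and "d"; nothing links to "d".
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
ranks = pagerank(edges)
```

In this toy graph, `a` ends up with the highest rank and `d`, which has no inbound links, keeps only the teleport share.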