INSTRUCTOR-LED

Apache Spark Programming

DB 105

Overview

This 3-day course is equally applicable to data engineers, data scientists, analysts, architects, software engineers, and technical managers interested in a thorough, hands-on overview of Apache Spark.

The course covers the fundamentals of Apache Spark including Spark’s architecture and internals, the core APIs for using Spark, SQL and other high-level data access tools, as well as Spark’s streaming capabilities and machine learning APIs. The class is a mixture of lecture and hands-on labs.

Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them after the class with the free Databricks Community Edition offering; the examples are designed to run in that environment.

Learning Objectives

After taking this class, students will be able to:

  • Use the core Spark APIs to operate on data
  • Articulate and implement typical use cases for Spark
  • Build data pipelines and query large data sets using Spark SQL and DataFrames
  • Analyze Spark jobs using the administration UIs inside Databricks
  • Create Structured Streaming jobs
  • Work with graph data using the GraphFrames APIs
  • Understand how a Machine Learning pipeline works
  • Understand the basics of Spark’s internals

Topics

  • Spark Overview
  • In-depth discussion of Spark SQL and DataFrames, including:
    • The DataFrames/Datasets API
    • Spark SQL
    • Data Aggregation
    • Column Operations
    • The Functions API: date/time, string manipulation, aggregation
    • Joins & Broadcasting
    • User Defined Functions
    • Caching and caching storage levels
    • Use of the Spark UI to analyze behavior and performance
  • In-depth discussion of Spark internals
    • Cluster Architecture
    • The Catalyst query optimizer
    • The Tungsten in-memory data format
    • How Spark schedules and executes jobs and tasks
    • Shuffling, shuffle files, and performance
    • How various data sources are partitioned
    • How Spark handles data reads and writes
  • Spark Structured Streaming
    • Sources and sinks
    • Structured Streaming APIs
    • Windowing & Aggregation
    • Checkpointing & Watermarking
    • Reliability and Fault Tolerance
    • Kafka Integration
  • Overview of Spark’s MLlib Pipeline API for Machine Learning
    • Transformer/Estimator/Pipeline API
    • Perform feature preprocessing
    • Evaluate and apply ML models
  • Graph processing with GraphFrames
    • Transform DataFrames into a graph
    • Perform graph analysis, including Label Propagation, PageRank, and ShortestPaths

Details

  • Duration: 3 Days
  • Hours: 9:00 a.m. – 5:00 p.m.

Target Audience

Data engineers, analysts, architects, data scientists, software engineers, and technical managers who want to learn the fundamentals of programming with Apache Spark, how to streamline their big data processing, build production Spark jobs, and understand and debug running Spark applications.

Prerequisites

  • Some familiarity with Apache Spark is helpful but not required.
  • Knowledge of SQL is helpful.
  • Basic programming experience in an object-oriented or functional language is required. The class is taught in both Python and Scala.

Lab Requirements

  • A computer or laptop
  • Chrome or Firefox web browser (Internet Explorer and Safari are not supported)
  • Internet access with unfettered connections to the following domains:
    1. *.databricks.com - required
    2. *.slack.com - highly recommended
    3. spark.apache.org - required
    4. drive.google.com - helpful but not required

Course Syllabus

Each module below combines lecture topics with hands-on lab exercises.

Spark Overview

  • Databricks Overview
  • Spark Capabilities
  • Spark Ecosystem
  • Basic Spark Components
  • Databricks Lab Environment
  • Working with Notebooks
  • Spark Clusters and Files

Spark SQL and DataFrames

  • Use of Spark SQL
  • Use of DataFrames / Datasets
  • Reading from CSV, JSON, JDBC, Parquet Files & more
  • Writing Data
  • DataFrame, Dataset and SQL APIs
  • Aggregations
  • SQL Joins with DataFrames
  • Broadcasting
  • Catalyst Query Optimization
  • Tungsten
  • ETL
  • Creating DataFrames
  • Querying with DataFrames and SQL
  • ETL with DataFrames
  • Caching
  • Visualization

Spark Internals

  • Jobs, Stages and Tasks
  • Partitions and Shuffling
  • Job Performance
  • Visualizing SQL Queries
  • Observing Task Execution
  • Understanding Performance
  • Measuring Memory Use

Structured Streaming

  • Streaming Sources and Sinks
  • Structured Streaming APIs
  • Windowing and Aggregation
  • Checkpointing
  • Watermarking
  • Reliability and Fault Tolerance
  • Reading from TCP
  • Reading from Kafka
  • Continuous Visualization

Machine Learning

  • Spark ML Pipeline API
  • Built-in Featurizers and Algorithms
  • Featurization
  • Building a Machine Learning Pipeline

Graph Processing with GraphFrames

  • Basic Graph Analysis
  • GraphFrames API
  • GraphFrames ETL
  • PageRank and Label Propagation with GraphFrames