INSTRUCTOR-LED

Apache Spark Overview

DB 100

Overview

This 1-day course is for data engineers, analysts, architects, data scientists, software engineers, IT operations staff, and technical managers interested in a brief, hands-on overview of Apache Spark.

The course provides an introduction to the Spark architecture, some of the core APIs for using Spark, Spark SQL and other high-level data access tools, and Spark's streaming and machine learning APIs. The class is a mixture of lecture and hands-on labs.

Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering after the class ends; the examples are designed to run in that environment.

Objectives

After taking this class, students will be able to:

  • Use a subset of the core Spark APIs to operate on data
  • Articulate and implement simple use cases for Spark
  • Build data pipelines and query large data sets using Spark SQL and DataFrames
  • Create Structured Streaming jobs
  • Understand how a machine learning pipeline works
  • Understand the basics of Spark's internals

Topics

  • Spark Overview
  • Introduction to Spark SQL and DataFrames, including:
    • Reading & Writing Data
    • The DataFrames/Datasets API
    • Spark SQL
    • Caching and storage levels (see the sketch after this list)
  • Overview of Spark internals
    • Cluster Architecture
    • How Spark schedules and executes jobs and tasks
    • Shuffling, shuffle files, and performance
    • The Catalyst query optimizer
  • Spark Structured Streaming
    • Sources and sinks
    • Structured Streaming APIs
    • Windowing & Aggregation
    • Checkpointing & Watermarking
    • Reliability and Fault Tolerance
  • Overview of Spark’s MLlib Pipeline API for Machine Learning
    • The Transformer/Estimator/Pipeline API
    • Feature preprocessing
    • Evaluating and applying ML models
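
As a taste of the caching topic above, here is a minimal PySpark sketch of cache() versus persist() with an explicit storage level; the dataset path is an invented placeholder, not course material.

    # A sketch only: the Parquet path below is an invented placeholder.
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("caching-demo").getOrCreate()

    df = spark.read.parquet("/data/events.parquet")  # hypothetical dataset

    # cache() uses the DataFrame default storage level (memory and disk)
    df.cache()
    df.count()  # an action materializes the cache

    # persist() accepts an explicit storage level
    df.unpersist()
    df.persist(StorageLevel.DISK_ONLY)
    df.count()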

Details

  • Duration: 1 Day
  • Hours: 9:00 a.m. – 5:00 p.m.

Target Audience

Data engineers, analysts, architects, data scientists, software engineers, and technical managers who want a quick introduction to using Apache Spark to streamline their big data processing, build production Spark jobs, and understand and debug running Spark applications.

Prerequisites

  • Some familiarity with Apache Spark is helpful but not required.
  • Knowledge of SQL is helpful.
  • Basic programming experience in an object-oriented or functional language is highly recommended but not required. The class can be taught concurrently in Python and Scala.

Lab Requirements

  • A computer or laptop
  • Chrome or Firefox web browser; Internet Explorer and Safari are not supported
  • Internet access with unfettered connections to the following domains:
    1. *.databricks.com - required
    2. *.slack.com - highly recommended
    3. spark.apache.org - required
    4. drive.google.com - helpful but not required

Course Syllabus

Each module pairs lecture topics with hands-on lab work:

Apache Spark Overview

Lecture:

  • Overview of Databricks
  • Spark Capabilities
  • Spark Ecosystem
  • Basic Spark Components

Hands-on:

  • Databricks Lab Environment
  • Working with Notebooks (see the sketch below)
  • Spark Clusters and Files
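
For a flavor of this module's lab, here is a minimal PySpark sketch of getting a SparkSession and reading a file into a DataFrame. In the Databricks notebook environment a session named spark is already provided; the file path is an illustrative assumption.

    # A sketch only: in a Databricks notebook `spark` is pre-created,
    # and the file path below is an invented placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("overview-lab").getOrCreate()
    print(spark.version)  # the Spark version the cluster is running

    df = spark.read.text("/databricks-datasets/README.md")
    df.show(5, truncate=False)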

Apache Spark SQL and DataFrames

Lecture:

  • Use of Spark SQL
  • Use of DataFrames / Datasets
  • Reading & Writing Data
  • DataFrame, Dataset, and SQL APIs
  • Catalyst Query Optimization
  • ETL

Hands-on:

  • Creating DataFrames
  • Querying with DataFrames
  • Querying with SQL
  • ETL with DataFrames (see the sketch below)
  • Visualization
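
A minimal sketch of the kind of DataFrame and SQL ETL this module walks through; the CSV path and its timestamp/amount columns are invented placeholders, not course data.

    # A sketch only: the CSV path and its `timestamp`/`amount` columns
    # are invented placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-lab").getOrCreate()

    # Extract: read a CSV file with a header, inferring the schema
    sales = (spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv("/data/sales.csv"))

    # Transform with the DataFrame API: total sales per day
    daily = (sales
             .withColumn("date", F.to_date("timestamp"))
             .groupBy("date")
             .agg(F.sum("amount").alias("total")))

    # The same query expressed in Spark SQL
    sales.createOrReplaceTempView("sales")
    daily_sql = spark.sql("""
        SELECT to_date(timestamp) AS date, SUM(amount) AS total
        FROM sales
        GROUP BY to_date(timestamp)
    """)

    # Load: write the result as Parquet
    daily.write.mode("overwrite").parquet("/data/daily_sales")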

Structured Streaming

Lecture:

  • Streaming Sources and Sinks
  • Structured Streaming APIs
  • Windowing & Aggregation
  • Checkpointing
  • Watermarking
  • Reliability and Fault Tolerance

Hands-on:

  • Reading from TCP (see the sketch below)
  • Continuous Visualization
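
A minimal sketch of the "Reading from TCP" lab pattern: a socket source, a windowed aggregation with a watermark, and a checkpointed console sink. The host, port, and checkpoint path are invented placeholders.

    # A sketch only: host, port, and checkpoint path are invented
    # placeholders; the socket source is meant for testing, not production.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("streaming-lab").getOrCreate()

    # Source: lines of text from a TCP socket, with an event timestamp
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .option("includeTimestamp", "true")
             .load())

    # Windowed word count; the watermark bounds how much state is kept
    words = lines.select(F.explode(F.split("value", " ")).alias("word"),
                         "timestamp")
    counts = (words
              .withWatermark("timestamp", "1 minute")
              .groupBy(F.window("timestamp", "30 seconds"), "word")
              .count())

    # Sink: updated counts to the console, with progress checkpointed
    query = (counts.writeStream
             .outputMode("update")
             .format("console")
             .option("checkpointLocation", "/tmp/checkpoints/wordcount")
             .start())

    query.awaitTermination()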

Machine Learning

Lecture:

  • Spark MLlib Pipeline API
  • Built-in Featurization and Algorithms

Hands-on:

  • Featurization
  • Building a Machine Learning Pipeline (see the sketch below)
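
A minimal sketch of the Transformer/Estimator/Pipeline pattern this module covers, using a tiny invented dataset; it is an illustration, not the course's lab notebook.

    # A sketch only: the tiny inline dataset is invented for illustration.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import HashingTF, Tokenizer
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ml-lab").getOrCreate()

    train = spark.createDataFrame([
        ("spark is fast", 1.0),
        ("hello world", 0.0),
        ("spark streaming and sql", 1.0),
        ("goodbye world", 0.0),
    ], ["text", "label"])

    # Transformers featurize the raw text into vectors
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="features")

    # An Estimator: fit() returns a fitted Transformer (the model)
    lr = LogisticRegression(maxIter=10)

    pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
    model = pipeline.fit(train)

    # Apply the fitted pipeline, then evaluate it
    predictions = model.transform(train)
    auc = BinaryClassificationEvaluator().evaluate(predictions)
    print(f"Training AUC: {auc:.3f}")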