Apache® Spark Tuning and Best Practices

DATABRICKS SPARK 110

Overview

This 1-day course is primarily for data engineers, software engineers, DevOps engineers, IT operations staff, and team leads, but it is directly applicable to analysts, architects, data scientists, and technical managers interested in troubleshooting and optimizing Apache Spark applications.

This course provides a deeper understanding of how to tune Spark applications, covering general best practices, anti-patterns to avoid, and other measures that help optimize and troubleshoot Spark applications and queries.

Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue using them with the free Databricks Community Edition after the class ends; all examples are designed to run in that environment.

Learning Objectives

After taking this class, students will be able to:

  • Understand the role of memory in Spark applications
  • Properly use broadcast variables, and in particular broadcast joins, to improve the performance of DataFrame operations
  • Explain how the Catalyst Query Optimizer works to increase query performance
  • Better manage Spark’s partitioning and shuffling behavior
  • Properly size a Spark cluster for different kinds of workloads

Topics

  • Spark Memory Usage
    • Using the Spark UI and Spark logs to determine how much memory your application is using
    • Understanding how Tungsten (used by DataFrames and Datasets) dramatically improves memory use compared to the RDD API
    • Why it’s important that DataFrames never be partially cached, even if it means spilling the cache to disk (see the caching sketch after this list)
    • The benefits of co-located data
    • Tuning JVM garbage collection for Spark
  • Broadcast Variables
    • How broadcast variables can affect performance
    • Why broadcast joins are useful
    • How to force Spark to do a broadcast join (see the broadcast-join sketch after this list)
    • When not to force a broadcast join
  • Catalyst
    • Avoiding Catalyst anti-patterns, such as Cartesian products and partially cached DataFrames (see the Catalyst sketch after this list)
    • Efficient use of the Datasets API within a query plan
    • Understanding how Dataset encoders affect Catalyst optimizations
    • How and when to write a custom Catalyst optimizer rule
  • Tuning Shuffling
    • When does shuffling occur?
    • Understanding the relationship between repartitioning and shuffling
    • Understanding shuffling’s impact on network I/O
    • Narrow vs. wide transformations
    • Spark configuration settings that affect shuffling (see the shuffle sketch after this list)
  • Cluster Sizing
    • How memory constraints affect how you should size your disks
    • The impact of properly defined schemas on memory use
    • Hardware provisioning
      • How to decide how much memory to allocate to each machine
      • Network considerations
      • How to decide how many CPU cores each machine will need
    • FIFO scheduler vs. fair scheduler (see the scheduler sketch below)
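
As background for the memory topics above, here is a minimal PySpark sketch of persisting a DataFrame at a storage level that spills to disk rather than leaving it partially cached. The DataFrame and its column name are hypothetical; only persist(), StorageLevel, and the Spark UI Storage tab are actual Spark features.

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("caching-sketch").getOrCreate()

    # Hypothetical DataFrame; substitute your own source.
    df = spark.range(10_000_000).withColumnRenamed("id", "user_id")

    # MEMORY_AND_DISK keeps the whole DataFrame available by spilling
    # partitions that do not fit in memory to disk, instead of leaving
    # the DataFrame partially cached.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()  # persist() is lazy; an action materializes the cache.

    # The Storage tab of the Spark UI shows how much of the DataFrame
    # sits in memory versus on disk.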
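For the broadcast-join topic, here is a minimal sketch of forcing a broadcast join with the broadcast() hint; the orders and customers DataFrames are illustrative stand-ins.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("broadcast-sketch").getOrCreate()

    # Illustrative stand-ins for a large fact table and a small
    # dimension table.
    orders = spark.range(1_000_000).withColumnRenamed("id", "customer_id")
    customers = spark.range(100).withColumnRenamed("id", "customer_id")

    # The broadcast() hint ships the small table to every executor,
    # avoiding a shuffle of the large table.
    joined = orders.join(broadcast(customers), "customer_id")
    joined.explain()  # Look for BroadcastHashJoin in the physical plan.

    # Without the hint, Spark broadcasts automatically only when a table
    # is smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by
    # default).

Forcing a broadcast of a table too large to fit in executor memory is one of the "when not to force" cases the class covers.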
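For the Catalyst topics, a sketch showing how explain() exposes the plans Catalyst produces; inspecting the physical plan is the usual way to spot anti-patterns such as an accidental Cartesian product. The DataFrames are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("catalyst-sketch").getOrCreate()

    a = spark.range(10).withColumnRenamed("id", "k")
    b = spark.range(10).withColumnRenamed("id", "k")

    # extended=True prints the parsed, analyzed, and optimized logical
    # plans as well as the physical plan Catalyst selects.
    a.join(b, "k").explain(True)

    # A join without a join condition degenerates into a Cartesian
    # product, which is visible in the physical plan.
    a.crossJoin(b).explain()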
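For the shuffle topics, a sketch contrasting a narrow transformation with a wide one, along with the main configuration setting for shuffle parallelism; the column names are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("shuffle-sketch").getOrCreate()

    df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

    # Narrow transformation: each output partition depends on a single
    # input partition, so no shuffle occurs.
    filtered = df.filter(F.col("id") > 100)

    # Wide transformation: rows must be regrouped by key across
    # partitions, which triggers a shuffle (an Exchange node in the
    # physical plan).
    counts = df.groupBy("bucket").count()
    counts.explain()

    # The number of partitions a shuffle produces is controlled by
    # spark.sql.shuffle.partitions (200 by default).
    spark.conf.set("spark.sql.shuffle.partitions", "64")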
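Finally, for the FIFO-vs-fair scheduler item, a sketch of enabling the fair scheduler when building a session; Spark schedules concurrent jobs FIFO by default.

    from pyspark.sql import SparkSession

    # FAIR lets concurrent jobs share executor resources rather than
    # queueing behind one another (the FIFO default).
    spark = (SparkSession.builder
             .appName("scheduler-sketch")
             .config("spark.scheduler.mode", "FAIR")
             .getOrCreate())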

Details

  • Duration: 1 Day

Target Audience

Data engineers, software engineers, DevOps engineers, IT operations staff, and team leads with experience on an Apache Spark project who want to improve the general performance of their applications.

Prerequisites

  • Applicable experience with Apache Spark projects.
  • A strong understanding of the DataFrame/Dataset APIs.
  • Basic programming experience in an object-oriented or functional language is required. The class can be taught concurrently in Python and Scala.

Lab Requirements

  • A computer or laptop
  • Chrome or Firefox web browser; Internet Explorer and Safari are not supported
  • Internet access with unfettered connections to the following domains:
    1. *.databricks.com - required
    2. *.slack.com - highly recommended
    3. spark.apache.org - required
    4. drive.google.com - helpful but not required