INSTRUCTOR-LED

Apache Spark Tuning and Best Practices

DB 110

Overview

This 3-day course is primarily for software engineers but is directly applicable to analysts, architects, and data scientists interested in a deep dive into tuning Spark applications, developing best practices, and avoiding many of the common pitfalls of Spark application development.

This course is a lab-intensive workshop in which students implement various best practices while inducing, diagnosing, and then fixing various performance problems. The course continues with numerous instructor-led coding challenges in which students refactor existing code, applying the best practices they have learned to improve overall performance. It concludes with a full-day workshop in which students work individually or in teams to complete a typical, full-scale data migration of a poorly maintained dataset.

Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering after the class ends; all examples are guaranteed to run in that environment.

Learning Objectives

After taking this class, students will be able to:

  • Shortcut investigations by developing common-sense intuition as to the root cause of various performance issues.
  • Diagnose & fix various storage-related performance issues, including:
    • The tiny files problem
    • Malformed partitions
    • Unpartitioned data
    • Overpartitioned data
    • Incorrectly typed data
  • Identify when and when not to cache data
  • Articulate the performance ramifications of different caching strategies
  • Diagnose & fix common coding mistakes that lead to de-optimization
  • Optimize joins via broadcasting, pruning, and pre-joining
  • Apply tips and tricks for:
    • Investigating the file system
    • Diagnosing partition skew
    • Developing and distributing utility functions
    • Developing micro-benchmarks & avoiding related pitfalls
    • Rapid ETL development for extremely large datasets
    • Working in shared cluster environments
  • Develop different strategies for testing ETL components
  • Rapidly develop insights on otherwise costly datasets
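
As a rough illustration of the "diagnosing partition skew" objective above, the following Python sketch summarizes how unevenly records are spread across partitions. It is not taken from the course materials. The function name `skew_report` and the sample sizes are hypothetical, and the input is assumed to be a plain list of per-partition record counts (in PySpark, one way to obtain such a list is `df.rdd.glom().map(len).collect()`, which can be expensive on large datasets).

```python
from statistics import mean, median


def skew_report(partition_sizes):
    """Summarize the spread of record counts across partitions.

    `partition_sizes` is assumed to be a list of per-partition record counts.
    """
    if not partition_sizes:
        raise ValueError("need at least one partition")
    avg = mean(partition_sizes)
    largest = max(partition_sizes)
    return {
        "partitions": len(partition_sizes),
        "min": min(partition_sizes),
        "median": median(partition_sizes),
        "max": largest,
        # Ratio of the largest partition to the average partition;
        # 1.0 means perfectly even, larger values suggest skew.
        # Reported as 0.0 when every partition is empty.
        "max_over_mean": largest / avg if avg else 0.0,
    }


# Hypothetical sizes: four similar partitions and one much larger one.
sizes = [1000, 1100, 950, 1050, 9000]
report = skew_report(sizes)
print(report)
```

A ratio well above 1.0, as in this made-up sample, hints that one task will run far longer than the others; what threshold counts as a problem depends on the workload.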

Topics

  • Coding Exercises
  • Partitioning
    • Explore the effects of different partitioning strategies
    • Diagnose performance problems related to improperly partitioned data
    • Explore different solutions to fixing mal-partitioned data
    • Work with and understand on-disk partitioning strategies
  • Caching
    • Develop tips and tricks for caching data on shared clusters
    • Explore the ramifications of different caching strategies
    • Learn why caching is one of the most common performance problems
    • Develop intuitions as to when and when not to cache data
    • Use caching as an aid to troubleshooting
  • Joins
    • Explore different options for optimizing joins
    • Work with broadcast joins
    • Explore different options for avoiding joins
  • Utility Functions
    • Diagnosing performance problems
    • Caching data
    • Benchmarking
    • Common ETL tasks
    • Discuss deployment strategies for utility functions
  • Testing strategies
    • Strategies for testing transformations
    • Developing test datasets for unit tests
  • De-optimization
    • Explore common coding practices that induce de-optimization
    • Review solutions for avoiding de-optimization
    • Review of the Catalyst Optimizer and its role in optimizing applications

Details

  • Duration: 3 days

Target Audience

Data engineers, analysts, architects, data scientists, and software engineers who want to further their skills by learning how to develop high-performance Spark applications through the use of best practices and by diagnosing and troubleshooting common performance problems.

Prerequisites

  • Proficiency with Apache Spark's DataFrames API is helpful but not required.
  • Intermediate to advanced programming experience in Python or Scala is required.
  • Several weeks of experience in developing Apache Spark applications is preferred.
  • The class can be taught concurrently in Python and Scala.

Lab Requirements

  • A computer or laptop
  • Chrome or Firefox web browser (Internet Explorer and Safari are not supported)
  • Internet access with unfettered connections to the following domains:
    1. *.databricks.com - required
    2. *.slack.com - highly recommended
    3. spark.apache.org - required
    4. drive.google.com - helpful but not required