Apache Spark Training - Databricks


Spark + AI Summit 2019 features a number of one-day training workshops that include a mix of instruction and hands-on exercises to help you improve your Apache Spark™ and data engineering skills. Learn how to leverage Apache Spark as your unified analytics engine for building data pipelines and machine learning. Expand your data science skills by better understanding the machine learning lifecycle with MLflow, or dive into a deep learning tutorial with Keras and TensorFlow.

The training workshops are offered as add-ons to the Conference Pass.

Students will need to bring their own laptops with Chrome or Firefox browser and access to *.databricks.com.

Building Data Pipelines for Apache Spark™ With Databricks Delta

Delta is Databricks’ next-gen engine built on top of Apache Spark. This course is for data engineers, architects, data scientists and software engineers who want to use Databricks Delta for building pipelines for data lakes with high data reliability and performance. The course will cover typical data reliability and performance challenges that data lakes face and teach how to address them using Delta. The course ends with a capstone project building a complete data pipeline using Databricks Delta.


Prerequisites

  • Completion of the Getting Started with Apache Spark™ SQL, Getting Started with Apache Spark™ DataFrames, or ETL Part 1 course, or equivalent knowledge


Topics Covered Include

  • Creating Delta tables for a data lake
  • Appending records to a Databricks Delta table
  • Performing UPSERTs of data into existing Databricks Delta tables
  • Reading and writing streaming data into a data lake
  • Optimizing a data pipeline and optimization best practices
  • Architecture
    • Comparison with Lambda architecture
    • Getting streaming Wikipedia data into a data lake via Kafka broker
    • Writing streaming data into a raw table
    • Cleaning up bronze data and generating normalized query tables
    • Creating summary tables of key business metrics
    • Creating plots/dashboards of business metrics
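The UPSERT topic above (in Delta, a MERGE INTO that updates matching rows and inserts new ones) can be illustrated conceptually in plain Python, with a dict keyed by record id standing in for the Delta table. The table contents and field names below are made up for illustration:

```python
# Conceptual sketch of Delta's UPSERT (MERGE INTO) semantics: rows in
# the incoming batch update existing rows with the same key and are
# inserted otherwise. A dict keyed by "id" stands in for the Delta
# table; the data here is hypothetical.

def upsert(table, updates, key="id"):
    merged = {row[key]: row for row in table}
    for row in updates:
        # Merge onto any existing row so unchanged fields survive.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())

current = [
    {"id": 1, "name": "alice", "visits": 3},
    {"id": 2, "name": "bob", "visits": 1},
]
incoming = [
    {"id": 2, "visits": 5},                    # matched key -> UPDATE
    {"id": 3, "name": "carol", "visits": 1},   # new key     -> INSERT
]

result = upsert(current, incoming)
```

In Databricks Delta itself, the equivalent operation is a SQL `MERGE INTO ... WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ...` statement run against a Delta table, which is one of the hands-on exercises in this course.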

Data Science with Apache Spark™

The Data Science with Apache Spark workshop will show you how to use Apache Spark to perform exploratory data analysis (EDA), develop machine-learning pipelines, and use the APIs and algorithms available in the Spark MLlib DataFrames API. It is designed for software developers, data analysts, data engineers, and data scientists.

This workshop will also cover parallelizing machine-learning algorithms at a conceptual level. Taking a pragmatic approach, the workshop will focus on using Apache Spark for data analysis and building models using MLlib, while limiting time spent on machine-learning theory and the internal workings of Spark.

We will work through examples that show you how to apply Apache Spark to iterate faster and develop models on massive datasets. This workshop will provide tools to be productive using Spark on practical data-analysis tasks and machine-learning problems. After completing this workshop, you should be comfortable using DataFrames, the DataFrames MLlib API, and related documentation. These building blocks will enable you to use Apache Spark to solve a variety of data-analysis and machine-learning tasks.


  • Programming experience in Python or Scala
  • Background in data science (recommended)
  • Basic knowledge of Spark DataFrames (recommended)

Brief conceptual reviews of data science techniques are given before each technique is used. Labs and demos are available in both Python and Scala.


Topics Covered Include

  • Data Cleansing and Exploratory Data Analysis (EDA)
  • Feature Extraction and Transformation using MLlib
  • MLlib Pipelines: Transformers and Estimators
  • Model Parallel vs Data Parallel
  • Linear Regression, Decision Trees, Random Forests and ensembles
  • MLflow to track model results
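The Transformer/Estimator pattern behind MLlib Pipelines can be sketched in plain Python. The class and method names below mirror the pattern only, not the actual MLlib API: an Estimator's fit() returns a fitted Transformer, and a Pipeline fits its Estimator stages in order, feeding each stage the output of the previous one.

```python
# Plain-Python sketch of MLlib's Pipeline abstraction: a Transformer
# maps a dataset to a dataset; an Estimator is fit on a dataset and
# returns a Transformer (the fitted model); a Pipeline chains stages.
# These classes are illustrative only, not the real MLlib API.

class AddOne:                               # a simple Transformer
    def transform(self, data):
        return [x + 1 for x in data]

class MinMaxScaler:                         # an Estimator
    def fit(self, data):
        lo, hi = min(data), max(data)
        class Fitted:                       # the fitted Transformer
            def transform(self, data):
                return [(x - lo) / (hi - lo) for x in data]
        return Fitted()

class PipelineModel:                        # result of fitting a Pipeline
    def __init__(self, stages):
        self.stages = stages
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        fitted = []
        for stage in self.stages:
            # Estimators are fit to produce Transformers;
            # plain Transformers pass through unchanged.
            stage = stage.fit(data) if hasattr(stage, "fit") else stage
            fitted.append(stage)
            data = stage.transform(data)    # feed the next stage
        return PipelineModel(fitted)

model = Pipeline([AddOne(), MinMaxScaler()]).fit([0, 1, 4])
scaled = model.transform([0, 1, 4])
# scaled == [0.0, 0.25, 1.0]
```

In MLlib the same shape appears as `pyspark.ml.Pipeline`, whose `fit()` returns a `PipelineModel` that applies every fitted stage in sequence.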

Hands-on Deep Learning with Keras, TensorFlow, and Apache Spark™

This course is aimed at the practicing data scientist who is eager to get started with deep learning, along with software engineers and technical managers interested in a thorough, hands-on overview of deep learning and its integration with Apache Spark.

The course covers the fundamentals of neural networks and how to build distributed TensorFlow models on top of Spark DataFrames. Throughout the class, you will use Keras, TensorFlow, Deep Learning Pipelines, and Horovod to build and tune models. This course is taught entirely in Python.

Each topic includes lecture content along with hands-on labs in the Databricks notebook environment.


Prerequisites

  • Python (numpy and pandas)
  • Background in data science (recommended)
  • Basic knowledge of Spark DataFrames


Topics Covered Include

  • Intro to Neural Networks with Keras
    • Neural network architectures
    • Activation functions
    • Evaluation metrics
    • Batch sizes, epochs, etc.
  • MLflow
    • Reproducible ML/DL
  • Convolutional Neural Networks
    • Convolutions
    • Batch Normalization
    • Max Pooling
    • ImageNet Architectures
  • Deep Learning Pipelines
    • Model inference at scale
  • Horovod
    • Distributed TensorFlow training
    • Ring all-reduce
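Horovod's ring all-reduce can be illustrated with a toy single-process simulation, with plain Python lists standing in for each worker's gradient tensor. The worker count, vector size, and values below are made up; the real Horovod implementation exchanges chunks over MPI or NCCL, but the chunk-passing pattern is the same: a reduce-scatter phase followed by an all-gather, each taking N−1 steps.

```python
# Toy simulation of ring all-reduce: n workers each hold a gradient
# vector; after the call every worker holds the element-wise sum of
# all vectors. Each vector is split into n chunks, and in every step
# each worker passes one chunk to its ring neighbour, so per-step
# traffic stays constant regardless of the number of workers.

def ring_allreduce(workers):
    n = len(workers)
    size = len(workers[0])
    chunk = size // n                      # assumes size divisible by n
    bounds = [(i * chunk, (i + 1) * chunk) for i in range(n)]

    # Phase 1, reduce-scatter: after n-1 steps, worker i holds the
    # fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        sends = []
        for i in range(n):                 # snapshot all sends first
            c = (i - step) % n
            lo, hi = bounds[c]
            sends.append((c, list(workers[i][lo:hi])))
        for i, (c, data) in enumerate(sends):
            dst = (i + 1) % n              # ring neighbour
            lo, _ = bounds[c]
            for k, v in enumerate(data):
                workers[dst][lo + k] += v  # reduce (sum) into receiver

    # Phase 2, all-gather: circulate each fully reduced chunk around
    # the ring so every worker ends up with all of them.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            c = (i + 1 - step) % n
            lo, hi = bounds[c]
            sends.append((c, list(workers[i][lo:hi])))
        for i, (c, data) in enumerate(sends):
            dst = (i + 1) % n
            lo, hi = bounds[c]
            workers[dst][lo:hi] = data     # overwrite with reduced chunk

# Three "workers", each with a six-element gradient:
workers = [[1] * 6, [2] * 6, [3] * 6]
ring_allreduce(workers)
# Every worker now holds the element-wise sum [6, 6, 6, 6, 6, 6].
```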

Apache Spark™ Tuning and Best Practices Featuring Databricks Delta

This one-day course is for data engineers, analysts, architects, dev-ops, and team leads interested in troubleshooting and optimizing Apache Spark applications. It covers troubleshooting, tuning, best practices, anti-patterns to avoid, and other measures to help tune and troubleshoot Spark applications and queries. It also introduces Delta, Databricks’ next-generation engine built on top of Apache Spark, and shows how it can help you build robust data pipelines.

Each topic includes lecture content, along with hands-on use of Spark through an elegant Web-based notebook environment. Inspired by tools such as IPython/Jupyter, notebooks allow attendees to code jobs, data analysis queries, and visualizations using their own Spark cluster, accessed through a Web browser. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering; all examples are guaranteed to run in that environment. Alternatively, each notebook can be exported as source code and run within any Databricks environment.


Prerequisites

  • Project experience with Apache Spark
  • Apache Spark™ Programming – DB 105 or equivalent
  • Basic programming experience in an object-oriented or functional language is required. The class will be taught using a mixture of Python and Scala.


Topics Covered Include

  • The role of memory in Spark applications
  • How to use broadcast variables and, in particular, broadcast joins to increase the performance of DataFrame operations
  • The Catalyst query optimizer
  • How to tune Spark’s partitioning and shuffling behavior
  • How best to size a Spark cluster for different kinds of workflows
  • How Delta can help with performance
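The broadcast-join item above can be sketched outside Spark: when one side of a join is small, it is shipped whole to every executor as a hash map and probed while scanning the large side, so the large table never has to be shuffled by the join key. Below is a toy single-process illustration with made-up tables (in PySpark the hint is `pyspark.sql.functions.broadcast`):

```python
# Toy illustration of a broadcast (map-side) hash join. The small
# table becomes a hash map that, in Spark, would be shipped to every
# executor; each partition of the large table is then joined locally,
# so the large side avoids a shuffle on the join key.

def broadcast_join(large, small, key):
    lookup = {row[key]: row for row in small}   # the "broadcast" side
    joined = []
    for row in large:                           # streamed, row by row
        match = lookup.get(row[key])
        if match is not None:                   # inner-join semantics
            extra = {k: v for k, v in match.items() if k != key}
            joined.append({**row, **extra})
    return joined

orders = [                       # large fact table (hypothetical data)
    {"order_id": 10, "cust_id": 1, "amount": 20.0},
    {"order_id": 11, "cust_id": 2, "amount": 35.5},
    {"order_id": 12, "cust_id": 9, "amount": 5.0},   # no matching customer
]
customers = [                    # small dimension table
    {"cust_id": 1, "name": "alice"},
    {"cust_id": 2, "name": "bob"},
]

result = broadcast_join(orders, customers, "cust_id")
```

Spark applies the same strategy automatically below the `spark.sql.autoBroadcastJoinThreshold` size, which is one of the tuning knobs examined in this course.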

Apache Spark™ Programming + Databricks Delta

This one-day course is for data engineers, analysts, architects, data scientists, software engineers, IT operations, and technical managers interested in a brief hands-on overview of Apache Spark™ and building data pipelines with Databricks Delta, the next-gen unified analytics engine.

The course provides an introduction to the Spark architecture, some of the core APIs for using Spark, SQL and other high-level data access tools, as well as Spark’s streaming capabilities, machine learning APIs and the use of Databricks Delta in building pipelines. The class is a mixture of lecture and hands-on labs.

Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering after the class ends; all examples are guaranteed to run in that environment.


Topics Covered Include

  • Using a subset of the core Spark APIs to operate on data
  • Articulating and implementing simple use cases for Spark
  • Building data pipelines and querying large data sets using Spark SQL and DataFrames
  • Creating Structured Streaming jobs
  • Understanding how a Machine Learning pipeline works
  • Understanding the basics of Spark’s internals
  • Introduction to Building Robust Data Pipelines using Delta

Half-day Prep-course + Databricks Certification Exam

This half-day lecture is for anyone seeking to learn more about the different certifications offered by Databricks, including the Databricks Certified Developer for Apache Spark 2.x and our new exam, the Databricks Certified Associate for Apache Spark 2.4.

It includes test-taking strategies, sample questions, preparation guidelines, and exam requirements for all certifications. The primary goal of this course is to help potential applicants understand the breadth and depth of knowledge on which individuals will be tested and to provide guidelines on how to prepare for the exam.

Attendees who select the prep course will have the option to take either exam after the course is completed.


Topics Covered Include

  • Introduction to the Databricks environment
  • Test-taking strategies specific to the certification exam
  • Sample exam questions
  • Suggested study materials, including books, MOOCs, self-paced courses, instructor-led training, etc.
  • Exam-specific prerequisites
  • One attempt at the certification exam

This is not a programming course.

Please Note: Attending the certification prep course should NOT, by itself, be considered sufficient preparation for any certification exam offered by Databricks.

Databricks Certification Exams

Databricks Certified Developer for Apache Spark 2.x validates your overall knowledge of Apache Spark and shows employers that you are up-to-date with this fast-moving Apache project, whose significant features and enhancements are rolled out rapidly. For more information, see the Databricks Certified Developer: Apache Spark™ 2.X.

Databricks Certified Associate for Apache Spark 2.4 validates your knowledge of the core components of the DataFrames API and confirms that you have a rudimentary knowledge of the Spark architecture. For more information, see the Databricks Certified Associate for Apache Spark 2.4.

On-Site Testing
A testing room will be available from 11:45 a.m. – 5:00 p.m. on Wednesday and Thursday during the Summit. When registering, select which day you would like to take your exam. Entrance to the room will be on a rolling basis: as a seat becomes available, we will let the next person in.


Why Get Certified

  • Apache Spark is the gold standard of big-data tools and technologies; a certified professional can expect great pay packages.
  • Databricks certifications validate to employers your expertise in developing Spark applications in a production environment.
  • Databricks certifications will help you keep up-to-date with the latest enhancements in the Spark ecosystem.
  • As a Databricks certified professional, you can become an integral part of the growing Spark developer community.
  • Databricks certifications will enable you to meet global standards required to ensure compatibility between Spark applications and distributions.

Machine Learning in Production: MLflow and Model Deployment

In this course, data scientists and engineers learn best practices for putting machine-learning models into production. It starts with managing experiments, projects, and models using MLflow, then explores various deployment options, including batch predictions, Spark Streaming, and REST APIs. Finally, it covers monitoring machine-learning models once they have been deployed into production.


Prerequisites

  • Python (numpy, pandas, sklearn)
  • Background in machine learning and data science
  • Basic knowledge of Spark DataFrames (recommended)

Topics Covered Include

  • Course Overview and Setup
  • Experiment Tracking
  • Packaging ML Projects
  • Model Management
  • Batch Deployment
  • Streaming Deployment
  • REST Deployment
  • Monitoring
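The experiment-tracking idea above can be sketched outside MLflow: each training run records its parameters and metrics, and the collection of runs can then be queried for the best one. Below is a toy stand-in for the pattern behind `mlflow.log_param` and `mlflow.log_metric`; the class, run data, and metric values are all hypothetical:

```python
# Toy sketch of experiment tracking in the spirit of MLflow: each run
# stores its parameters and metrics, and runs can be compared after
# the fact. The real MLflow Tracking API also handles artifacts,
# projects, and packaged model formats.

class ExperimentTracker:
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        self.runs.append({"params": params, "metrics": metrics})

    def best_run(self, metric, higher_is_better=False):
        # Pick the run with the best value for the given metric.
        key = lambda run: run["metrics"][metric]
        return (max if higher_is_better else min)(self.runs, key=key)

tracker = ExperimentTracker()
tracker.log_run({"alpha": 0.01}, {"rmse": 0.92})
tracker.log_run({"alpha": 0.10}, {"rmse": 0.85})
tracker.log_run({"alpha": 1.00}, {"rmse": 1.10})

best = tracker.best_run("rmse")   # lower RMSE is better
```

In the course itself, this bookkeeping is done with MLflow's tracking server, so runs persist across sessions and feed directly into the deployment options listed above.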