Apache Spark Training - Databricks


Spark + AI Summit 2018 features a number of 1-day training workshops that include a mix of instruction and hands-on exercises to help you improve your Apache Spark™ skills.

Training is offered as an add-on to the Conference Pass.

Students will need to bring their own laptop with the Chrome or Firefox browser and unfettered access to databricks.com.

Data Science with Apache Spark™ (SOLD OUT)

The Data Science with Apache Spark workshop will show how to use Apache Spark to perform exploratory data analysis (EDA), develop machine learning pipelines, and work with the algorithms available in Spark MLlib's DataFrame-based API. It is designed for software developers, data analysts, data engineers, and data scientists.

It will also cover parallelizing machine learning algorithms at a conceptual level. The workshop will take a pragmatic approach, with a focus on using Apache Spark for data analysis and building models using MLlib, while limiting the time spent on machine learning theory and the internal workings of Spark.

We will work through examples using public datasets that will show you how to apply Apache Spark to help you iterate faster and develop models on massive datasets. This workshop will provide you with tools to be productive using Spark on practical data analysis tasks and machine learning problems. After completing this workshop you should be comfortable using DataFrames, the DataFrames MLlib API, and related documentation. These building blocks will enable you to use Apache Spark to solve a variety of data analysis and machine learning tasks.


Some experience coding in Python or Scala and a basic understanding of data science topics and terminology are recommended. Experience using Spark and familiarity with the concept of a DataFrame is helpful.

Brief conceptual reviews of data science techniques will be performed before the techniques are used. Labs and demos will be available in both Python and Scala.


  • Extract, Transform, Load (ETL) and Exploratory Data Analysis (EDA)
  • DataFrames
  • Feature Extraction and Transformation using MLlib
  • MLlib Pipelines: Transformers and Estimators
  • Cross validation
  • Model Parallelism vs. Data Parallelism
  • Clustering, Classification, and Regression
  • Logistic Regression, Decision Trees, Random Forests and ensembles, and Deep Learning Pipelines
  • Evaluation Metrics
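One topic from the list above, cross validation, is easy to sketch concretely. The snippet below is a minimal pure-Python illustration of k-fold cross validation — the same idea MLlib's CrossValidator applies to Pipelines at cluster scale — using a toy "model" (predicting the mean of the training fold) so it runs without Spark. The data and fold count here are illustrative, not from the course materials.

```python
# Minimal k-fold cross validation sketch (pure Python, no Spark).
# MLlib's CrossValidator applies the same idea to Pipelines at scale.

def k_fold_splits(data, k):
    """Yield (train, validation) pairs, one per fold."""
    fold_size = len(data) // k
    for i in range(k):
        start, end = i * fold_size, (i + 1) * fold_size
        validation = data[start:end]
        train = data[:start] + data[end:]
        yield train, validation

def evaluate(data, k=3):
    """Score a toy 'model' (predict the training mean) by MSE per fold."""
    scores = []
    for train, validation in k_fold_splits(data, k):
        prediction = sum(train) / len(train)  # "fit" on the training fold
        mse = sum((y - prediction) ** 2 for y in validation) / len(validation)
        scores.append(mse)
    return sum(scores) / len(scores)          # average score across folds

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
avg_mse = evaluate(data, k=3)
print(avg_mse)  # → 6.25
```

Averaging the score across folds, rather than trusting a single train/test split, is what makes cross validation a reliable way to compare models and hyperparameters.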

Understand and Apply Deep Learning with Keras, TensorFlow & Apache Spark (SOLD OUT)

Instructor: Adam Breindel

This Deep Learning workshop introduces the conceptual background for, and implementation of, key neural network architectures. We will see how and why deep learning has become such an important and popular technology, and how it is similar to and different from other machine learning models as well as earlier attempts at neural networks.

We’ll see how deep learning models can be used to enhance your traditional business analytics, in addition to covering the famous cases like image recognition, language processing, and autonomous agents. Most of our models will be built with the Keras API/Library, but we’ll also take a look at “what’s under the hood” with TensorFlow. But we won’t just hack demos: our goal is to develop an intuition for the key concepts and issues at play in deep learning.

The class will also feature a discussion about using Apache Spark for training and inference, and other deployment / operational concerns. Along the way, we’ll hopefully explain enough ideas and terminology that you’ll be comfortable going further with deep learning on your own!


Familiarity with the basics of Python and with common ideas and techniques in machine learning / predictive analytics is assumed. You should be familiar with classification vs. regression problems, supervised vs. unsupervised learning, the bias-variance tradeoff, and common evaluation metrics like RMSE, precision, and recall.

No prior deep learning knowledge, vector calculus, or Spark experience is required.


  • Neural nets before and after the 2006 revolution
  • Perceptrons and Deep Feed-Forward Networks
  • Capturing information and choosing error functions
  • Convolutional Networks
  • How networks are trained and what can go wrong
  • Recurrent Networks
  • Reinforcement Learning
  • Deep Learning inference at scale with Apache Spark
  • Approaches to distributed training, including Spark
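As a taste of the "Perceptrons and Deep Feed-Forward Networks" topic, here is a minimal sketch of the classic perceptron learning rule in plain Python (no Keras or TensorFlow required), training a single neuron to compute logical AND. The data and hyperparameters are illustrative, not taken from the course:

```python
# Minimal perceptron sketch: one neuron learns logical AND.
# This is the pre-2006 building block; deep networks stack many such
# units and train them with backpropagation instead of this rule.

def step(x):
    """Heaviside step activation: the original perceptron nonlinearity."""
    return 1 if x >= 0 else 0

# Training data for AND: inputs (x1, x2) -> label
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights = [0, 0]
bias = 0
lr = 1  # learning rate (integer, so the arithmetic stays exact)

for _ in range(20):  # a few passes over the data suffice for AND
    for (x1, x2), label in data:
        prediction = step(weights[0] * x1 + weights[1] * x2 + bias)
        error = label - prediction       # perceptron learning rule
        weights[0] += lr * error * x1
        weights[1] += lr * error * x2
        bias += lr * error

predictions = [step(weights[0] * x1 + weights[1] * x2 + bias)
               for (x1, x2), _ in data]
print(predictions)  # → [0, 0, 0, 1]
```

A single perceptron can only learn linearly separable functions like AND; the failure on XOR is exactly what motivates the multi-layer networks covered in the workshop.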

Apache Spark Tuning and Best Practices (SOLD OUT)

This 1-day course is for data engineers, analysts, architects, dev-ops, and team-leads interested in troubleshooting and optimizing Apache Spark applications. It covers troubleshooting, tuning, best practices, anti-patterns to avoid, and other measures to help tune and troubleshoot Spark applications and queries.

Each topic includes lecture content along with hands-on use of Spark through an elegant web-based notebook environment. Inspired by tools like IPython/Jupyter, notebooks allow attendees to code jobs, data analysis queries, and visualizations using their own Spark cluster, accessed through a web browser. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering; all examples are guaranteed to run in that environment. Alternatively, each notebook can be exported as source code and run within any Spark environment.


  • Project experience with Apache Spark
  • Apache Spark™ Programming – DB 105 or equivalent
  • Basic programming experience in an object-oriented or functional language is required. The class will be taught in a mixture of Python and Scala.


  • The role of memory in Spark applications
  • How to use broadcast variables and, in particular, broadcast joins to increase the performance of DataFrame operations
  • The Catalyst query optimizer
  • How to tune Spark’s partitioning and shuffling behavior
  • How best to size a Spark cluster for different kinds of workflows

Apache Spark Essentials (SOLD OUT)

This 1-day course is for data engineers, analysts, architects, data scientists, software engineers, IT operations, and technical managers interested in a brief hands-on overview of Apache Spark.

The course provides an introduction to the Spark architecture, some of the core APIs for using Spark, SQL and other high-level data access tools, as well as Spark’s streaming capabilities and machine learning APIs. The class is a mixture of lecture and hands-on labs.

Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering after the class ends; all examples are guaranteed to run in that environment.


  • Using a subset of the core Spark APIs to operate on data.
  • Articulating and implementing simple use cases for Spark
  • Building data pipelines and querying large data sets using Spark SQL and DataFrames
  • Creating Structured Streaming jobs
  • Understanding how a Machine Learning pipeline works
  • Understanding the basics of Spark’s internals

1/2-Day Prep Course + Databricks Developer Certification: Apache Spark 2.x

This half-day lecture is for anyone seeking to become a Databricks Certified Apache Spark Developer or Databricks Certified Apache Spark Systems Architect. It includes test-taking strategies, sample questions, preparation guidelines, and exam requirements. The primary goal of this course is to help potential applicants understand the breadth and depth to which they will be tested, and to provide guidelines on how to prepare for the exam.

Each topic includes lecture content and reference material presented in the Databricks notebook environment. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering after the class ends.

Attendees who select the prep course will take the exam after the course is completed.


  • Introduction to the Databricks environment
  • Test-taking strategies specific to the certification exam
  • Sample exam questions
  • Suggested study materials consisting of books, MOOCs, self-paced courses, instructor-led training, etc.
  • Exam-specific prerequisites
  • One attempt at the certification exam

Please Note: attending the certification prep course should NOT, by itself, be considered sufficient preparation for successfully taking the Databricks Apache Spark certification exam.

Databricks Developer Certification exam on Apache Spark

The Databricks Certified Developer for Apache Spark 2.x exam validates your overall knowledge of Apache Spark and assures employers that you are up to date with this fast-moving Apache project, whose significant features and enhancements are rolled out rapidly. The exam takes about 90 minutes and consists of a series of randomly generated questions.

A testing room will be available from 11:45 am to 5:00 pm on Tuesday and Wednesday during the Summit. When registering, you will select the day on which you would like to take your exam. Entrance to the room will be on a rolling basis: as a seat becomes available, we will let the next person in.

No outside phones or computers will be allowed in the testing room. We will provide a computer for the exam.


  • Apache Spark is the gold standard of big data tools and technologies, and a professional holding the Databricks Certification for Apache Spark 2.x can expect a strong pay package.
  • Databricks Certification for Apache Spark 2.x validates to employers your expertise in developing Spark applications for production environments.
  • Databricks Certification for Apache Spark 2.x will help you keep up to date with the latest enhancements in the Spark ecosystem.
  • As a Databricks Certified Developer, you can become an integral part of the growing Spark developer community.
  • Certification will also enable you to meet the global standards required to ensure compatibility between Spark applications and distributions.