Training


Spark + AI Summit 2020 features a number of pre-conference training workshops that combine instruction with hands-on exercises to help you sharpen your Apache Spark™ and data engineering skills.

Introduction to Unified Data Analytics for Managers
(contact your Databricks account rep for a code to register)

Role: Business Leader
Duration: Half Day

Discover Databricks and how it helps your data teams stop working in silos, simplify data preparation, enable an agile AI ecosystem, and keep infrastructure from getting in the way. In this course, we’ll review foundational big data concepts, explore why many organizations struggle to achieve true artificial intelligence, and dive into how the components of the Unified Data Analytics platform can be used to overcome those challenges.

Prerequisites:

  • No programming experience is required

Introduction to Delta Lake

Role: Business Leader, Platform Administrator, SQL Analyst, Data Engineer, Data Scientist
Duration: Half Day

Learn what Delta Lake is, how it simplifies and optimizes data architectures, and how it streamlines the engineering of data pipelines. This course dives into the core features of Delta Lake and how they bring reliability, performance, and lifecycle management to data lakes.

Prerequisites:

  • No programming experience is required

Databricks Administration

Role: Platform Administrator
Duration: Half Day

Learn administration and security best practices for managing your Databricks workspace. In this course, we’ll guide you through using the Admin Console to manage users and workspace storage, configure access control for your workspace, clusters, pools, and jobs, and apply cluster provisioning strategies and usage management features to maximize usability and cost effectiveness in different scenarios. Then, we’ll cover data protection features and configure data access control with Databricks best practices. Lastly, we’ll describe the Databricks platform architecture and deployment models, as well as the network security and compliance features for each.

Prerequisites:

  • No programming experience is required

Introduction to Apache Spark™ Programming

Role: Data Engineer, Data Scientist
Duration: Half Day

Learn the fundamentals of Spark programming in a case-study-driven course that explores the core components of the DataFrame API. You’ll read and write data to various sources, preprocess data by correcting schemas and parsing different data types, and apply a variety of DataFrame transformations and actions to answer business questions. This course is designed to provide the essential concepts and skills you’ll need to navigate the Spark documentation and start programming immediately. This class is taught in Python/Scala.

Prerequisites:

  • No experience with Apache Spark is required
  • Basic familiarity with programming in Python or Scala

SQL on Databricks

Role: SQL Analyst
Duration: Half Day

Learn how to leverage SQL on Databricks to easily discover insights on big data. The Databricks workspace provides a powerful data processing environment where data professionals can follow traditional data analysis workflows including exploring, visualizing, and preparing data for sharing with stakeholders. This course is designed to get you started using Databricks functionality to gain shareable insights on data. This class is taught in SQL only.

Prerequisites:

  • No experience with Apache Spark is required
  • Basic familiarity with ANSI SQL

Apache Spark Tuning and Best Practices

Role: Data Engineer
Duration: Half Day

Learn best practices for Spark tuning and apply them while diagnosing and fixing common performance problems. You’ll complete guided coding challenges and refactor existing code to improve overall performance by applying the best practices you’ve learned. This class is taught in Python/Scala.

Prerequisites:

  • 6+ months experience working with the Spark DataFrame API is recommended
  • Intermediate programming experience in Python or Scala

Building Better Data Pipelines for Apache Spark with Delta Lake

Role: Data Engineer
Duration: Half Day

Learn to build robust data pipelines using Apache Spark and Delta Lake on Databricks, performing ETL, data cleansing, and data aggregation. Delta Lake is designed to overcome many problems associated with traditional data lake pipelines.

Prerequisites:

  • 6+ months experience working with the Spark DataFrame API is recommended
  • Intermediate programming experience

Structured Streaming with Databricks

Role: Data Engineer
Duration: Half Day

Learn how to use Structured Streaming to ingest data from files and publish-subscribe systems. You’ll learn the fundamentals of streaming systems, how to read, write, and display streaming data, and how Structured Streaming is used with Databricks Delta. You’ll then use a publish-subscribe system to stream data and visualize meaningful insights. This class is taught concurrently in Python and Scala.

Prerequisites:

  • Beginner experience with the DataFrame API
  • Intermediate programming experience in Python or Scala

Apache Spark for Machine Learning and Data Science

Role: Data Scientist
Duration: Half Day

This course focuses on teaching distributed machine learning with Spark. Students will build and evaluate pipelines with MLlib, understand the differences between single node and distributed ML, and optimize hyperparameter tuning at scale. This class is taught concurrently in Python and Scala.

Prerequisites:

  • Intermediate programming experience in Python or Scala
  • Beginner experience with the DataFrame API
  • Basic understanding of Machine Learning concepts

Scaling Deep Learning with TensorFlow and Apache Spark

Role: Data Scientist
Duration: Half Day

This course offers a thorough overview of how to scale training and deployment of neural networks with Apache Spark. We guide students through building deep learning models with TensorFlow, performing distributed inference with Spark UDFs via MLflow, and training a distributed model across a cluster using Horovod. This course is taught entirely in Python.

Prerequisites:

  • Experience programming in Python and PySpark
  • Basic understanding of Machine Learning concepts
  • Prior experience with Keras/TensorFlow highly encouraged

Introduction to Reinforcement Learning

Role: Data Scientist
Duration: Half Day

In this course you will learn Reinforcement Learning theory and get hands-on practice. Upon completion of this course, you will understand the differences between supervised, unsupervised, and reinforcement learning, as well as Markov Decision Processes (MDPs) and Dynamic Programming. You will be able to formulate a reinforcement learning problem and implement policy evaluation, policy iteration, and value iteration algorithms in Python (using Dynamic Programming). This course is taught entirely in Python.

Prerequisites:

  • Experience with advanced programming constructs of Python (e.g., writing classes, extending a class)
  • Practical experience with Supervised and Unsupervised learning
  • Understanding of Probability Theory and Linear Algebra

Model-Free Reinforcement Learning

Role: Data Scientist
Duration: Half Day

In this course you will learn model-free Reinforcement Learning theory and get hands-on practice. You will be able to formulate a reinforcement learning problem and implement model-free Reinforcement Learning algorithms. In particular, you will implement Monte Carlo, TD, and SARSA algorithms for prediction and control tasks. This course is taught entirely in Python.

Prerequisites:

  • Experience with advanced programming constructs of Python (e.g., writing classes, extending a class)
  • Practical experience with Supervised and Unsupervised learning
  • Understanding of Probability Theory and Linear Algebra
  • Familiarity with Dynamic Programming and Markov Decision Processes
  • Experience with OpenAI Gym
  • Gentle Introduction to Reinforcement Learning or equivalent experience

MLflow: Managing the Machine Learning Lifecycle (SOLD OUT)

Role: Data Scientist, Data Engineer
Duration: Half Day

In this hands-on course, data scientists and data engineers learn best practices for managing experiments, projects, models, and a production model registry using MLflow. By the end of this course, you will have built a pipeline to train, register, and deploy machine learning models using the same environment in which they were trained. This course is taught entirely in Python and pairs well with the Machine Learning Deployment course.

Prerequisites:

  • Experience programming in Python
  • Working knowledge of ML concepts

Machine Learning Deployment: 3 Model Deployment Paradigms, Monitoring, and Alerting (SOLD OUT)

Role: Data Scientist, Data Engineer
Duration: Half Day

In this hands-on course, data scientists and data engineers learn best practices for deploying machine learning models in three paradigms: batch, streaming, and real time via REST. The course explores common production issues faced when deploying machine learning solutions and how to monitor models once they have been deployed to production. By the end of this course, you will have built the infrastructure to deploy and monitor machine learning models in various deployment scenarios. This course is taught entirely in Python and pairs well with the MLflow course.

Prerequisites:

  • Experience programming in Python
  • Working knowledge of ML concepts

Distributed Machine Learning in Apache SparkR/sparklyr

Role: Data Scientist
Duration: Half Day

In this course students will learn how to apply machine learning techniques in a distributed environment using SparkR and sparklyr. Students will learn about the Spark architecture and the Spark DataFrame API, build ML models, and perform hyperparameter tuning and pipeline optimization. The class is a combination of lectures, demos, and hands-on labs. This course is taught entirely in R.

Prerequisite:

  • Experience programming in R

Natural Language Processing at Scale

Role: Data Scientist
Duration: Half Day

This course will teach you the fundamentals of natural language processing (NLP) and how to do it at scale. You will solve classification, sentiment analysis, and text wrangling tasks by applying pre-trained word embeddings, generating term frequency-inverse document frequency (TF-IDF) vectors for your dataset, applying dimensionality reduction techniques, and more. This course is taught entirely in Python.

Prerequisite:

  • Experience programming in Python

Practical Problem-solving in Finance: Real-time Data Analytics with Apache Spark

Role: Data Engineer, Data Scientist
Duration: Half Day

In this half-day course, you will learn how Databricks and Spark can help solve the real-world problems you face when working with financial data. You’ll learn how to deal with dirty data and how to get started with Structured Streaming and real-time analytics. Students will also receive a longer take-home capstone exercise as bonus content to the class, where you can apply all the concepts presented. This class is taught concurrently in Python and Scala.

Prerequisites:

  • Beginner to intermediate experience with the DataFrame API
  • Intermediate to advanced programming experience in Python or Scala

Practical Problem-solving in Retail: Real-time Data Analytics with Apache Spark

Role: Data Engineer, Data Scientist
Duration: Half Day

In this half-day course, you will learn how Databricks and Spark can help solve the real-world problems you face when working with retail data. You’ll learn how to deal with dirty data and get started with Structured Streaming and real-time analytics. Students will also receive a longer take-home capstone exercise as bonus content to the class, where you can apply all the concepts presented. This class is taught concurrently in Python and Scala.

Prerequisites:

  • Beginner to intermediate experience with the DataFrame API
  • Intermediate to advanced programming experience in Python or Scala

Practical Problem-solving in Healthcare: Real-time Data Analytics with Apache Spark

Role: Data Engineer, Data Scientist
Duration: Half Day

In this half-day course, you will learn how Databricks and Spark can help solve the real-world problems you face when working with healthcare data. You’ll learn how to deal with dirty data and get started with Structured Streaming and real-time analytics. Students will also receive a longer take-home capstone exercise as bonus content to the class, where you can test all the concepts presented. This class is taught concurrently in Python and Scala.

Prerequisites:

  • Beginner to intermediate experience with the DataFrame API
  • Intermediate to advanced programming experience in Python or Scala

Practical Problem-solving in Manufacturing: Real-time Data Analytics with Apache Spark

Role: Data Engineer, Data Scientist
Duration: Half Day

In this half-day course, you will learn how Databricks and Spark can help solve the real-world problems you face when working with manufacturing data. You’ll learn how to deal with dirty data and how to get started with Structured Streaming and real-time analytics. Students will also receive a longer take-home capstone exercise as bonus content to the class, where you can test all the concepts presented. This class is taught concurrently in Python and Scala.

Prerequisites:

  • Beginner to intermediate experience with the DataFrame API
  • Intermediate to advanced programming experience in Python or Scala

Certification Prep: Databricks Certified Associate Developer for Apache Spark 2.4 (SOLD OUT)

Role: Data Engineer, Data Scientist
Duration: Half Day

In this half-day course, students will familiarize themselves with the format of the Databricks Certified Associate Developer for Apache Spark 2.4 exam and get tips for preparation. We will review which parts of the DataFrame API and Spark architecture are covered in the exam and the skills you’ll need to prepare for it.

Prerequisite:

  • Intermediate experience with the DataFrame API in Python or Scala

What’s New in Apache Spark 3.0? (SOLD OUT)

Role: SQL Analyst, Data Engineer, Data Scientist
Duration: 90 minutes, repeated 4x

This course covers the new features in Spark 3.0. It focuses on updates to performance, monitoring, usability, stability, extensibility, PySpark, and SparkR. Students will also learn about backwards compatibility with Spark 2.x and the considerations required for upgrading to Spark 3.0.

Prerequisite:

  • Familiarity with Apache Spark 2.x