Apache Spark Training - Databricks

TRAINING

Spark + AI Summit 2020 features a number of pre-conference training workshops that include a mix of instruction and hands-on exercises to help you improve your Apache Spark™ and Data Engineering skills. Learn how to leverage Apache Spark as your unified analytics engine for building data pipelines and machine learning. Expand your data science skills by better understanding the machine learning lifecycle with MLflow or diving into a deep learning tutorial with Keras and TensorFlow.

The training workshops are offered as add-ons to the Conference Pass.

Students will need to bring their own laptops with the Chrome or Firefox browser and access to *.databricks.com.

Introduction to Unified Data Analytics

Role: Business Leader, Platform Administrator, SQL Analyst, Data Engineer, Data Scientist
Duration: Half Day

Discover how Unified Data Analytics solves some of the common business problems associated with Big Data. You’ll learn how to apply organizational best practices that will help your data teams work better together when they have a single source of truth.

Prerequisites:

  • No programming experience is required
  • A working knowledge of Apache Spark is helpful

Introduction to Delta Lake

Role: Business Leader, Platform Administrator, SQL Analyst, Data Engineer, Data Scientist
Duration: Half Day

This course describes the core features of Delta Lake. It covers how Delta Lake simplifies and optimizes data architecture and the engineering of data pipelines. Upon completion, participants will understand how Delta Lake brings reliability, performance, and lifecycle management to data lakes.

Prerequisites:

  • No programming experience is required
  • A working knowledge of Apache Spark is helpful

Databricks Platform Administration

Role: Platform Administrator
Duration: Half Day

Learn how to manage a Databricks account at an organizational level. Participants will come away knowing how to manage users and groups, including provisioning, access control and workspace storage.

Prerequisites:

  • No programming experience is required
  • A working knowledge of Apache Spark is helpful

Introduction to Apache Spark Programming

Role: Data Engineer, Data Scientist
Duration: Full Day

This hands-on course covers the fundamentals of Apache Spark™ programming, providing the essential concepts and skills you’ll need to navigate the Spark documentation and immediately start programming. Using case studies, you’ll explore the core components of the DataFrame API. Students will read and write data to various sources, preprocess data by correcting schemas and parsing different data types, and apply a variety of DataFrame transformations and actions to answer business questions.
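
To give a flavor of the DataFrame API covered in class, here is a minimal sketch that reads a CSV file, corrects a column type, parses a timestamp, and answers a simple aggregate question. The file path and column names are invented for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("intro-example").getOrCreate()

    # Read a CSV file, letting Spark infer an initial schema
    df = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("/data/sales.csv"))

    # Correct a mis-typed column and parse a timestamp string
    df = (df
          .withColumn("amount", F.col("amount").cast("double"))
          .withColumn("order_ts", F.to_timestamp("order_ts", "yyyy-MM-dd HH:mm:ss")))

    # Transformation plus action: total revenue per region
    (df.groupBy("region")
       .agg(F.sum("amount").alias("revenue"))
       .orderBy(F.desc("revenue"))
       .show())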

Prerequisites:

  • No experience with Apache Spark is required
  • Intermediate to advanced programming experience in Python or Scala
  • This class is taught concurrently in Python and Scala

Spark SQL with Databricks

Role: SQL Analyst
Duration: Full Day

This hands-on course shows learners how to use Spark SQL in the Databricks environment. You’ll learn how to read, transform, and write data using the SQL extensions provided by Apache Spark, and get a brief introduction to key topics unique to working with a distributed system like Apache Spark rather than a traditional RDBMS.
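
Although the class itself is taught entirely in SQL, the rough sketch below shows the same read-transform-write pattern submitted through PySpark’s spark.sql entry point. The orders table, its columns, and the paths are invented for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-example").getOrCreate()

    # Register a file-backed DataFrame as a temporary view so it can be queried with SQL
    spark.read.parquet("/data/orders").createOrReplaceTempView("orders")

    # Read and transform with Spark SQL
    result = spark.sql("""
        SELECT region,
               date_trunc('month', order_ts) AS month,
               SUM(amount)                   AS revenue
        FROM orders
        GROUP BY region, date_trunc('month', order_ts)
    """)

    # Write the result back out
    result.write.mode("overwrite").parquet("/data/monthly_revenue")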

Prerequisites:

  • No experience with Apache Spark is required
  • Intermediate to advanced experience with ANSI SQL
  • This class is taught in SQL only

Apache Spark Tuning and Best Practices

Role: Data Engineer
Duration: Full Day

Take a deep dive into the process of tuning Spark applications, develop best practices, and learn to avoid many of the common pitfalls associated with developing Spark applications.
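
As an illustration of the kinds of levers the course explores, the hedged sketch below broadcasts a small lookup table to avoid a shuffle join, lowers the shuffle partition count for a modest cluster, and caches a DataFrame that is reused downstream. The tables, paths, and settings are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("tuning-example").getOrCreate()

    facts = spark.read.parquet("/data/clickstream")   # large fact table
    dims = spark.read.parquet("/data/countries")      # small lookup table

    # Hint Spark to broadcast the small table so the join avoids a full shuffle
    joined = facts.join(F.broadcast(dims), on="country_code")

    # Reduce the shuffle partition count before a wide aggregation on a small cluster
    spark.conf.set("spark.sql.shuffle.partitions", "64")

    # Cache a DataFrame that several downstream queries reuse
    joined.cache()
    daily = joined.groupBy("country", "date").count()
    daily.write.mode("overwrite").parquet("/data/daily_counts")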

Prerequisites:

  • 6 months or more experience with Apache Spark is recommended
  • Intermediate to advanced experience with the DataFrames API
  • Intermediate to advanced programming experience in Python or Scala
  • This class is taught concurrently in Python and Scala

Building Better Data Pipelines for Apache Spark with Delta Lake

Role: Data Engineer
Duration: Full Day

Delta Lake is designed to overcome many problems associated with traditional data lake pipelines and enable ACID transactions on data lakes. This course explores tools and tricks you can use to transform your current data lake pipeline into a highly performant Delta Lake pipeline.
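
As a rough sketch of that conversion, the snippet below lands data in Delta format instead of plain Parquet and then applies an ACID upsert with MERGE. It assumes the Delta Lake library is attached to the cluster; the paths and key column are invented for illustration.

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-example").getOrCreate()

    # Land raw data as a Delta table rather than plain Parquet
    (spark.read.parquet("/raw/customers")
          .write.format("delta").mode("overwrite").save("/delta/customers"))

    # ACID upsert: merge a batch of changes into the Delta table
    updates = spark.read.parquet("/raw/customer_updates")
    (DeltaTable.forPath(spark, "/delta/customers").alias("t")
        .merge(updates.alias("u"), "t.customer_id = u.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())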

Prerequisites:

  • 6 months or more experience with Apache Spark is recommended
  • Intermediate to advanced experience with the DataFrames API
  • Intermediate to advanced programming experience in Python or Scala
  • This class is taught concurrently in Python and Scala

Structured Streaming with Databricks

Role: Data Engineer
Duration: Full Day

Structured Streaming is a highly efficient way to ingest data from a variety of sources. This hands-on course targets Data Engineers who want to process big data using Apache Spark™ Structured Streaming.
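
The hedged sketch below shows the shape of a Structured Streaming job: newly arriving JSON files are treated as an unbounded table, aggregated over event-time windows, and written out continuously with checkpointing. The schema, paths, and window sizes are invented for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("streaming-example").getOrCreate()

    schema = StructType([
        StructField("device", StringType()),
        StructField("temp", DoubleType()),
        StructField("ts", TimestampType()),
    ])

    # Treat newly arriving JSON files in a directory as an unbounded stream
    events = spark.readStream.schema(schema).json("/incoming/events")

    # Windowed aggregation over event time, with a watermark to bound state
    per_device = (events
                  .withWatermark("ts", "10 minutes")
                  .groupBy(F.window("ts", "5 minutes"), "device")
                  .agg(F.avg("temp").alias("avg_temp")))

    # Continuously write results; the checkpoint makes the query fault tolerant
    query = (per_device.writeStream
             .outputMode("append")
             .format("parquet")
             .option("path", "/output/device_temps")
             .option("checkpointLocation", "/chk/device_temps")
             .start())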

Prerequisites:

  • Beginner to intermediate experience with the DataFrames API
  • Intermediate to advanced programming experience in Python or Scala
  • This class is taught concurrently in Python and Scala

Apache Spark for Machine Learning and Data Science

Role: Data Scientist
Duration: Full Day

This course focuses on Apache Spark’s machine learning APIs. Students will learn the core APIs for using Spark SQL and other high-level data access tools, as well as Spark’s streaming capabilities. It is delivered as a mixture of lecture and hands-on labs.
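
For a sense of the ML API, here is a minimal sketch of a Spark ML pipeline that assembles feature columns and fits a logistic regression classifier. The dataset path, column names, and split are hypothetical.

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ml-example").getOrCreate()
    df = spark.read.parquet("/data/churn")   # hypothetical labeled dataset

    # Assemble feature columns into a single vector, then fit a classifier
    assembler = VectorAssembler(inputCols=["tenure", "monthly_charges"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    pipeline = Pipeline(stages=[assembler, lr])

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = pipeline.fit(train)
    model.transform(test).select("label", "prediction", "probability").show(5)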

Prerequisites:

  • Intermediate to advanced programming experience in Python or Scala
  • Beginner to intermediate experience with the DataFrames API

Applying Deep Learning with Keras, TensorFlow and Apache Spark

Role: Data Scientist
Duration: Full Day

Taught entirely in Python, this course offers a thorough overview of deep learning and how to scale it with Apache Spark. Students will learn the fundamentals of neural networks and how to build distributed deep learning models on top of Spark. The course includes hands-on training with Keras, TensorFlow, MLflow, and Horovod to build, tune, and apply models.
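
The class covers Keras, TensorFlow, MLflow, and Horovod; the minimal sketch below shows only the single-node Keras portion on synthetic data, with the distributed and tracking pieces omitted. The network shape and data are invented for illustration.

    import numpy as np
    import tensorflow as tf

    # Synthetic data standing in for a prepared training set
    x_train = np.random.rand(1000, 10).astype("float32")
    y_train = (x_train.sum(axis=1) + np.random.normal(0, 0.1, 1000)).astype("float32")

    # A small fully connected network built with the Keras Sequential API
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.2)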

Prerequisites:

  • Experience programming in Python and PySpark
  • Working knowledge of ML concepts (e.g. regression, classification, evaluation metrics, etc.)

Introduction to Reinforcement Learning

Role: Data Scientist
Duration: Full Day

Have you ever wondered how computers beat humans in Atari games or the ancient game of Go? Are you tired of the shortcomings of supervised and unsupervised learning? If you answered yes to any of these questions, this course is for you. This course combines theoretical and hands-on aspects of Reinforcement Learning. Upon completion of this course, students will be able to:

  • Formulate a Reinforcement Learning problem and its associated vocabulary
  • Understand the difference between Supervised, Unsupervised, and Reinforcement Learning
  • Understand Markov Decision Processes (MDPs)
  • Implement Model-based RL, Policy Iteration, and Value Iteration (see the sketch after this list)
  • Understand and implement Monte-Carlo Model-Free Prediction and Control
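
As a taste of the last two outcomes, here is a minimal value-iteration sketch over a small, invented MDP; the transition table, discount factor, and convergence threshold are all illustrative.

    import numpy as np

    # A tiny hypothetical MDP with 3 states and 2 actions.
    # P[s][a] is a list of (probability, next_state, reward) transitions.
    P = {
        0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
        1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 2.0)]},
        2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},   # absorbing state
    }
    gamma, theta = 0.9, 1e-6

    # Value iteration: back up V(s) = max_a sum_s' p(s'|s,a) * (r + gamma * V(s'))
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in P:
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break

    # Extract the greedy policy from the converged values
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in P}
    print(V, policy)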

Prerequisites:

  • Experience with advanced programming constructs of Python (i.e. one should be able to write classes, extend a class, etc.)
  • Practical experience with Supervised and Unsupervised learning
  • Understanding of Probability Theory and Linear Algebra

Machine Learning in Production: MLflow and Model Deployment

Role: Data Scientist
Duration: Full Day

In this hands-on course, data scientists and data engineers learn best practices for managing experiments, projects and models using MLflow. Students build a pipeline to log and deploy machine learning models.
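
For a flavor of the workflow, the hedged sketch below logs parameters, a metric, and a fitted scikit-learn model as a single MLflow run; the model choice, dataset, and parameter values are invented for illustration.

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic data standing in for a real feature table
    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Track the parameter, metric, and fitted model together as one MLflow run
    with mlflow.start_run():
        C = 0.5
        model = LogisticRegression(C=C, max_iter=200).fit(X_train, y_train)
        mlflow.log_param("C", C)
        mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
        mlflow.sklearn.log_model(model, "model")   # the logged artifact can later be deployed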

Prerequisites:

  • Experience programming in Python
  • Working knowledge of ML concepts (e.g. regression, classification, evaluation metrics, etc.)

Distributed Machine Learning in SparkR/sparklyr

Role: Data Scientist
Duration: Full Day

In this course students will learn how to apply machine learning techniques in a distributed environment using SparkR and sparklyr. Students will learn about the Spark architecture and Spark DataFrame APIs, build linear and tree-based models, and perform hyperparameter tuning and pipeline optimization. The class is a combination of lectures, demos and hands-on labs.

Prerequisite:

  • Experience programming in R

Natural Language Processing at Scale

Role: Data Scientist
Duration: Half Day

This course will teach you how to do natural language processing at scale. You will apply libraries such as NLTK and Gensim in a distributed setting as well as Spark ML/MLlib to solve classification, sentiment analysis, and text wrangling tasks. You will apply pre-trained word embeddings, identify when to lemmatize versus stem your tokens, and generate term frequency–inverse document frequency (TF-IDF) vectors for your dataset. You will also use dimensionality reduction techniques to visualize word embeddings with TensorBoard and apply basic vector arithmetic to embeddings. This course is intended for people who are new to NLP.
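
As one concrete example of the text-wrangling tasks above, the sketch below builds TF-IDF vectors with Spark ML’s Tokenizer, HashingTF, and IDF stages; the two sample documents and column names are made up.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import HashingTF, IDF, Tokenizer
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("nlp-example").getOrCreate()

    docs = spark.createDataFrame(
        [(0, "spark makes nlp at scale practical"),
         (1, "word embeddings capture meaning")],
        ["id", "text"])

    # Tokenize, hash terms into a fixed-size feature space, then weight by inverse document frequency
    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="tokens"),
        HashingTF(inputCol="tokens", outputCol="tf", numFeatures=1 << 12),
        IDF(inputCol="tf", outputCol="tfidf"),
    ])
    tfidf = pipeline.fit(docs).transform(docs)
    tfidf.select("id", "tfidf").show(truncate=False)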

Prerequisite:

  • Experience programming in Python

Practical Problem-solving in Finance: Real-time Fraud Detection with Spark

Role: SQL Analyst, Data Engineer, Data Scientist
Duration: Half Day

In this half-day course, you will learn how Databricks and Spark can help solve real-world problems you face when working with financial data. You’ll learn how to deal with dirty data and how to get started with Structured Streaming and Real-Time Fraud Detection. Students will also receive a longer take-home capstone exercise as bonus content to the class where they can apply all the concepts presented.

Prerequisites:

  • Beginner to intermediate experience with the DataFrames API
  • Intermediate to advanced programming experience in Python or Scala
  • This class is taught concurrently in Python and Scala

Practical Problem-solving in Retail: Real-time Fraud Detection with Spark

Role: SQL Analyst, Data Engineer, Data Scientist
Duration: Half Day

In this half-day course, you will learn how Databricks and Spark can help solve real-world problems you face when working with retail data. You’ll learn how to deal with dirty data, and get started with Structured Streaming and Dashboards. Students will also receive a longer take-home capstone exercise as bonus content to the class where they can apply all the concepts presented.

Prerequisites:

  • Beginner to intermediate experience with the DataFrames API
  • Intermediate to advanced programming experience in Python or Scala
  • This class is taught concurrently in Python and Scala

Practical Problem-solving in Healthcare: Machine Learning and BI with Spark

Role: SQL Analyst, Data Engineer, Data Scientist
Duration: Half Day

In this half-day course, you will learn how Databricks and Spark can help solve real-world problems you face when working with life sciences data. You’ll learn how to deal with dirty data, create dashboards and get started with MLlib. Students will also receive a longer take-home capstone exercise as bonus content to the class where they can apply all the concepts presented.

Prerequisites:

  • Beginner to intermediate experience with the DataFrames API
  • Intermediate to advanced programming experience in Python or Scala
  • This class is taught concurrently in Python and Scala

Practical Problem-solving in Manufacturing: Real-Time Data Processing and Optimizations with Spark

Role: SQL Analyst, Data Engineer, Data Scientist
Duration: Half Day

In this half-day course, you will learn how Databricks and Spark can help solve real-world problems you face when working with manufacturing data. You’ll learn how to deal with dirty data, optimize data sources and transformations, and get started with Structured Streaming. Students will also receive a longer take-home capstone exercise as bonus content to the class where they can apply all the concepts presented.

Prerequisites:

  • Beginner to intermediate experience with the DataFrames API
  • Intermediate to advanced programming experience in Python or Scala
  • This class is taught concurrently in Python and Scala

Certification Prep: Databricks Certified Associate Developer for Apache Spark 2.4

Role: Data Engineer, Data Scientist
Duration: Half Day

In this half-day course, learners will review the fundamentals of Spark architecture components and concepts, the core components of the DataFrame API, and how to access and use documentation during the exam. Students will prepare to complete a series of multiple-choice questions and coding challenges that demonstrate an understanding of Spark developer basics.

Prerequisite: None