Apache Spark™ Tutorial: Getting Started with Apache Spark on Databricks

Welcome

This self-paced guide is the “Hello World” tutorial for Apache Spark using Databricks. In the following tutorial modules, you will learn the basics of creating Spark jobs, loading data, and working with data. You’ll also get an introduction to running machine learning algorithms and working with streaming data. Databricks lets you start writing Spark queries instantly so you can focus on your data problems.

Hover over the navigation bar above to see the six stages of getting started with Apache Spark on Databricks. This guide first provides a quick start on using open source Apache Spark, then builds on that knowledge to show how to use Spark DataFrames with Spark SQL. We will also discuss how to use Datasets and how DataFrames and Datasets are now unified. The guide also includes quick starts for Machine Learning and Streaming so you can easily apply them to your data problems. Each module covers a standalone usage scenario (including IoT and home sales) with its own notebooks and datasets, so you can jump ahead if you feel comfortable.

Introduction to Apache Spark

Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics.

“At Databricks, we’re working hard to make Spark easier to use and run than ever, through our efforts on both the Spark codebase and support materials around it. All of our work on Spark is open source and goes directly to Apache.”

 —Matei Zaharia, VP, Apache Spark, Co-founder & Chief Technologist, Databricks

For more information about Spark, see:

  • What is Spark
  • Latest Spark Overview

Get Databricks

Databricks is a Unified Analytics Platform on top of Apache Spark that accelerates innovation by unifying data science, engineering, and business. With our fully managed Spark clusters in the cloud, you can provision clusters with just a few clicks. Databricks incorporates an integrated workspace for exploration and visualization so users can learn, work, and collaborate in a single, easy-to-use environment. You can schedule any existing notebook or locally developed Spark code to go from prototype to production without re-engineering.

In addition, Databricks includes:

  • Our award-winning Massive Open Online Course, “Introduction to Big Data with Apache Spark,” which has enrolled over 76,000 participants to date!
  • Massive Open Online Courses (MOOCs), including Machine Learning with Apache Spark
  • Analysis Pipelines Samples in R and Scala

Find all of our available courses at https://www.databricks.com/learn/training/home
