This self-paced guide is the “Hello World” tutorial of Apache Spark™ using Databricks. In the following chapters, you will familiarize yourself with the Spark UI, learn how to create Spark jobs, load data and work with Datasets, get familiar with Spark’s DataFrames API, run machine learning algorithms, and understand the basic concepts behind Spark Streaming. Instead of worrying about spinning up and maintaining clusters, managing code history, or tracking Spark versions, you can start writing Spark queries instantly and focus on your data problems.
This guide is organized into six stages for getting started with Apache Spark on Databricks. It first provides a quick start on using open source Apache Spark, then builds on that knowledge to show how to use Spark DataFrames with Spark SQL. With Spark 2.0, we also discuss how to use Datasets and how DataFrames and Datasets are now unified. The guide also includes quick starts for Machine Learning and Streaming so you can easily apply them to your data problems. Each of these modules covers standalone usage scenarios, including IoT and home sales, with notebooks and datasets, so you can jump ahead if you feel comfortable.
Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics.
“At Databricks, we’re working hard to make Spark easier to use and run than ever, through our efforts on both the Spark codebase and support materials around it. All of our work on Spark is open source and goes directly to Apache.”
Matei Zaharia, VP, Apache Spark,
Co-founder & Chief Technologist, Databricks
For more information about Spark, you can also reference:
Databricks is a Unified Analytics Platform on top of Apache Spark that accelerates innovation by unifying data science, engineering, and business. With our fully managed Spark clusters in the cloud, you can provision clusters with just a few clicks. Databricks incorporates an integrated workspace for exploration and visualization, so users can learn, work, and collaborate in a single, easy-to-use environment. You can easily schedule any existing notebook or locally developed Spark code to go from prototype to production without re-engineering.
In addition, Databricks includes: