This module allows you to quickly start using Apache Spark. We will be using Databricks so you can focus on the programming examples instead of spinning up and maintaining clusters and notebook infrastructure. As this is a quick start, we will be discussing the various concepts briefly so you can complete your end-to-end examples. In the “Additional Resources” section and other modules of this guide, you will have an opportunity to go deeper with the topic of your choice.
To write your first Apache Spark job using Databricks, you will write your code in the cells of your Databricks notebook. In this example, we will be using Python. For more information, you can also reference the Apache Spark Quick Start Guide and the Databricks Guide. The purpose of this quick start is to showcase operations on RDDs (Resilient Distributed Datasets) so that you will be able to understand the Spark UI when debugging or trying to understand the tasks being undertaken.
When running this first command, we are listing a folder within the Databricks File System (an optimized layer on top of S3) which contains your files.
# Take a look at the file system
%fs ls /databricks-datasets/samples/docs/
In the next command, you will use the Spark Context to read the README.md text file:
# Setup the textFile RDD to read the README.md file
# Note this is lazy
textFile = sc.textFile("/databricks-datasets/samples/docs/README.md")
And then you can count the lines of this text file by running the command:
# Perform a count against the README.md file
textFile.count()
One thing you may have noticed is that the first command, reading the textFile via the Spark Context (sc), did not generate any output while the second command (performing the count) did. The reason is that RDDs have actions (which return values) as well as transformations (which return pointers to new RDDs). The first command was a transformation while the second one was an action. This is important because when Spark performs its calculations, it will not execute any of the transformations until an action occurs. This allows Spark to optimize (e.g. run a filter prior to a join) for performance instead of following the commands serially.
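The lazy-versus-eager distinction can be sketched in plain Python with generators; this is only an analogy to the RDD API, not Spark code, and the sample lines are made up:

```python
# Analogy only: Python generator pipelines, like RDD transformations,
# do no work until something consumes them (the "action").
log = []

def read_lines():
    # Stands in for sc.textFile(...); the sample lines are made up.
    for line in ["# Sample README", "", "Some content"]:
        log.append("read")
        yield line

def non_empty(lines):
    # A "transformation": returns a new lazy pipeline; nothing runs yet.
    return (line for line in lines if line)

pipeline = non_empty(read_lines())
print(log)                           # [] -- nothing has executed so far

count = sum(1 for _ in pipeline)     # the "action" forces evaluation
print(count)                         # 2 non-empty lines
print(len(log))                      # 3 -- every line was read only once the action ran
```

Just as in Spark, building the pipeline is free; all of the reading and filtering work happens in one pass when the result is actually demanded.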
To see what is happening when you run the count() command, you can view the jobs and stages within the Spark Web UI. You can access this directly from the Databricks notebook so you do not need to change contexts while debugging your Spark job.
For a deeper dive into the Spark DAG, please watch the video Tuning and Debugging in Apache Spark with Databricks co-founder and founding Committer and PMC member of the Apache® Spark™ project, Patrick Wendell.
As you can see from the Jobs view below, performing the count() action also includes the previous transformation that accesses the text file.
What is happening under the covers becomes more apparent when reviewing the Stages view from the Spark UI (also directly accessible within your Databricks notebook). As you can see from the DAG visualization below, prior to the PythonRDD count() step, Spark performs the tasks of accessing the file (textFile) and running the MapPartitionsRDD transformation produced by textFile.
As noted in the previous section, RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Transformations are lazy and are executed only when an action is run. Some examples of transformations include map, filter, and flatMap; some examples of actions include count, collect, and reduce.
For a complete list, please refer to the Apache Spark Programming Guide: Transformations and Actions.
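As a rough analogy (plain Python here, not the Spark API), Python's built-in map and filter mirror Spark's transformations of the same names in that they are lazy, while list plays the role of an action that forces a result; the sample data is made up:

```python
lines = ["Apache Spark", "", "Databricks", ""]   # made-up sample data

# "Transformations": lazy; each returns a new iterable without computing anything
upper = map(str.upper, lines)        # analogous to rdd.map(...)
non_blank = filter(None, upper)      # analogous to rdd.filter(...)

# "Action": forces evaluation and returns a value to the caller
result = list(non_blank)             # analogous to rdd.collect()
print(result)                        # ['APACHE SPARK', 'DATABRICKS']
```

In real Spark code the same chaining style applies: transformations compose into a plan, and the final action triggers the distributed computation.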
In many scenarios, especially with the performance optimizations embedded in DataFrames and Datasets, it will not be necessary to work with RDDs. But it is important to bring this up because:
That said, when developing Spark applications, you will typically use DataFrames and Datasets. As of Apache Spark 2.0, the DataFrame and Dataset APIs have been merged: a DataFrame is the untyped Dataset API, while what was known as a Dataset is the typed Dataset API.
For more information, please reference: