Overview

To access all the code examples in this stage, please import the Quick Start using Python or Quick Start using Scala notebooks.

This module allows you to quickly start using Apache Spark. We will be using Databricks so you can focus on the programming examples instead of spinning up and maintaining clusters and notebook infrastructure. As this is a quick start, we will discuss the various concepts briefly so you can complete the end-to-end examples. In the “Additional Resources” section and other modules of this guide, you will have an opportunity to go deeper into the topic of your choice.

Writing your first Apache Spark Job

To write your first Apache Spark Job using Databricks, you will enter your code in the cells of your Databricks notebook. In this example, we will be using Python. For more information, you can also reference the Apache Spark Quick Start Guide and the Databricks Guide. The purpose of this quick start is to showcase RDD (Resilient Distributed Dataset) operations so that you will be able to understand the Spark UI when debugging or trying to understand the tasks being executed.

When running this first command, we are listing the contents of a folder within the Databricks File System (DBFS, an optimized storage layer on top of S3; see the Databricks documentation for more information) that contains the sample files.

# Take a look at the file system
%fs ls /databricks-datasets/samples/docs/

Output of %fs ls showing the contents of /databricks-datasets/samples/docs/
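If you prefer to do the same listing from Python rather than with the %fs magic, Databricks notebooks expose a pre-defined dbutils object. A minimal sketch, using the same sample folder as above:

# List the same folder from Python using the pre-defined dbutils object
files = dbutils.fs.ls("/databricks-datasets/samples/docs/")
for f in files:
    print(f.path, f.size)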

In the next command, you will use the Spark Context (sc) to read the README.md text file.

# Setup the textFile RDD to read the README.md file
# Note this is lazy
textFile = sc.textFile("/databricks-datasets/samples/docs/README.md")

You can then count the lines of this text file by running the following command:

# Perform a count against the README.md file
textFile.count()

Output of textFile.count()

One thing you may have noticed is that the first command, reading the textFile via the Spark Context (sc), did not generate any output, while the second command (performing the count) did. The reason is that RDDs have actions (which return values) as well as transformations (which return pointers to new RDDs). The first command was a transformation while the second one was an action. This is important because when Spark performs its calculations, it will not execute any of the transformations until an action occurs. This allows Spark to optimize for performance (e.g., run a filter prior to a join) instead of executing the commands serially.
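To make the lazy evaluation concrete, here is a minimal sketch that chains two transformations before triggering them with an action. It reuses the textFile RDD defined above; filtering on the word "Spark" is just an illustrative choice.

# Transformations: nothing is executed yet
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
words = linesWithSpark.flatMap(lambda line: line.split(" "))

# Action: triggers execution of the whole chain above
words.count()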

Apache Spark DAG

To see what is happening when you run the count() command, you can see the jobs and stages within the Spark Web UI. You can access this directly from the Databricks notebook so you do not need to change your context as you are debugging your Spark job.

For a deeper dive into the Spark DAG, please watch the video Tuning and Debugging in Apache Spark with Databricks co-founder and founding Committer and PMC member of the Apache® Spark™ project, Patrick Wendell.

As you can see from the Jobs view below, performing the count() action also includes the preceding transformation that accessed the text file.

Screenshot of a Spark DAG visualization of a Job in Databricks

What is happening under the covers becomes more apparent when reviewing the Stages view from the Spark UI (also directly accessible within your Databricks notebook). As you can see from the DAG visualization below, prior to the PythonRDD [1333] count() step, Spark will perform the task of accessing the file ([1330] textFile) and running MapPartitionsRDD [1331] textFile.

Screenshot of a Spark DAG visualization of a Stage in Databricks
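If you would like a textual view of the same lineage without leaving the notebook, the RDD API also exposes toDebugString(). A minimal sketch against the textFile RDD defined above:

# Print the RDD lineage (the chain of transformations) as text
print(textFile.toDebugString())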

RDDs, Datasets, and DataFrames

As noted in the previous section, RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Transformations are lazy and are only executed when an action is run. Some examples are listed below (a few of them are combined in a short word-count sketch after the list):

  • Transformations: map(), flatMap(), filter(), mapPartitions(), mapPartitionsWithIndex(), sample(), union(), distinct(), groupByKey(), reduceByKey(), sortByKey(), join(), cogroup(), pipe(), coalesce(), repartition(), partitionBy(), …
  • Actions: reduce(), collect(), count(), first(), take(), takeSample(), takeOrdered(), saveAsTextFile(), saveAsSequenceFile(), saveAsObjectFile(), countByKey(), foreach(), …

For a complete list, please refer to the Apache Spark Programming Guide: Transformations and Actions.
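As referenced above, here is a short word-count sketch that combines several of these transformations and actions. It assumes the textFile RDD from the earlier example; take(5) is used instead of collect() to keep the output small.

# Transformations: split lines into words, pair each word with 1, and sum the pairs
wordCounts = (textFile
              .flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

# Action: return the first five (word, count) pairs to the driver
wordCounts.take(5)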

In many scenarios, especially with the performance optimizations embedded in DataFrames and Datasets, it will not be necessary to work with RDDs. But it is important to bring this up because:

  • RDDs are the underlying infrastructure that allows Spark to run so fast (in-memory distribution) and provide data lineage.
  • If you are diving into more advanced components of Spark, it may be necessary to utilize RDDs.
  • All the DAG visualizations within the Spark UI reference RDDs.

That said, when developing Spark applications, you will typically use DataFrames and Datasets. As of Apache Spark 2.0, the DataFrame and Dataset APIs have been unified: a DataFrame is the untyped Dataset API (i.e., a Dataset of Row objects), while what was previously known as a Dataset is the typed Dataset API.
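As a point of comparison, the same file can be read and counted through the DataFrame API. A minimal sketch, assuming the SparkSession that Databricks notebooks pre-create as spark:

# Read the README.md file as a DataFrame with a single string column named "value"
df = spark.read.text("/databricks-datasets/samples/docs/README.md")

# Count the number of lines, this time going through the DataFrame API
df.count()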

For more information, please reference the Spark SQL, DataFrames and Datasets Guide in the Apache Spark documentation.

To access all the code examples in this stage, please import the Quick Start using Python or Quick Start using Scala notebooks.