Overview

To access all the code examples in this stage, please import the Population vs. Price DataFrames notebook.

Apache Spark DataFrames were created to make Spark programs faster to write and faster to run. With less code to write and less data to read, the Catalyst optimizer executes common operations (e.g. selecting columns, filtering, joining different data sources, and aggregating) efficiently through DataFrame functions. DataFrames also allow you to seamlessly intermix these operations with custom SQL, Python, Java, R, or Scala code.

Accessing the sample data

The easiest way to work with DataFrames is to access an example dataset. We have made a number of datasets available in the /databricks-datasets folder which is accessible within the Databricks platform. For example, to access the file that compares city population vs. median sale prices of homes, you can access the file /databricks-datasets/samples/population-vs-price/data_geo.csv.

We will use the spark-csv package from Spark Packages (a community index of packages for Apache Spark) to quickly import the data, specify that a header exists, and infer the schema.

Note that the spark-csv functionality is built into Spark 2.0 and later, so no separate package is needed there.

# Use the Spark CSV datasource with options specifying:
# - First line of file is a header
# - Automatically infer the schema of the data
data = (sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/databricks-datasets/samples/population-vs-price/data_geo.csv"))

data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values

# Register the DataFrame as a temporary view so it is accessible via SQL
# For Apache Spark 2.0 and later
data.createOrReplaceTempView("data_geo")

Viewing the DataFrame

Now that you have created the data DataFrame, you can quickly access the data using standard Spark commands such as take(). For example, you can use the command data.take(10) to view the first ten rows of the data DataFrame.

Output of the DataFrame take() command

To view this data in a tabular format, instead of exporting this data out to a third party tool, you can use the display() command within Databricks.

Displaying a DataFrame in Databricks

Visualizing your DataFrame

An additional benefit of using the Databricks display() command is that you can quickly view this data with a number of embedded visualizations. For example, in a new cell, you can specify the following SQL query and click on the map.

%sql select `State Code`, `2015 median sales price` from data_geo

Visualize a DataFrame in Databricks on a map

Below is an animated gif showing how quickly you can go from table to map using DataFrames and the Databricks display() command.

Visualizing a DataFrame in Databricks
