Overview

To access all the code examples in this stage, please import the Population vs. Price Linear Regression notebook.

As organizations create more diverse and more user-focused data products and services, there is a growing need for machine learning, which can be used to develop personalizations, recommendations, and predictive insights. Apache Spark’s Machine Learning Library (MLlib) allows data scientists to focus on their data problems and models instead of solving the complexities surrounding distributed data (such as infrastructure, configurations, and so on).

Accessing the sample data

The easiest way to work with DataFrames is to access an example dataset. We have made a number of datasets available in the /databricks-datasets folder which is accessible from Databricks. For example, to access the file that compares city population vs. median sale prices of homes, you can access the file /databricks-datasets/samples/population-vs-price/data_geo.csv.

We will use the spark-csv package from Spark Packages (a community index of packages for Apache Spark) to quickly import the data, specify that a header exists, and infer the schema.

Note that the CSV datasource is built into Spark 2.0 and later, so the separate spark-csv package is only needed on earlier versions.

# Use the Spark CSV datasource with options specifying:
# - First line of file is a header
# - Automatically infer the schema of the data
data = sqlContext.read.format("csv") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .load("/databricks-datasets/samples/population-vs-price/data_geo.csv")

data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values

# Register table so it is accessible via the SQL context
# For Apache Spark 2.0 and later
data.createOrReplaceTempView("data_geo")
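createOrReplaceTempView makes the DataFrame queryable by name for the lifetime of the Spark session, conceptually similar to registering a table in an in-memory database. A rough standard-library analogy (illustrative names and values, not Spark code):

```python
import sqlite3

# In-memory database standing in for the SQL context
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_geo (city TEXT, pop INTEGER, price REAL)")
conn.executemany("INSERT INTO data_geo VALUES (?, ?, ?)",
                 [("A", 100000, 250.0), ("B", 500000, 400.0)])

# Analogous to running SQL against the registered temp view
rows = conn.execute("SELECT city FROM data_geo WHERE pop > 200000").fetchall()
print(rows)  # [('B',)]
```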

To view this data in a tabular format, you can use the display() command within Databricks instead of exporting the data to a third-party tool.

Displaying a DataFrame in Databricks

Prepare and visualize data for ML algorithms

In supervised learning—such as a regression algorithm—you typically will define a label and a set of features. In our linear regression example, the label is the 2015 median sales price while the feature is the 2014 Population Estimate. That is, we are trying to use the feature (population) to predict the label (sales price). To simplify the creation of features within Python Spark MLlib, we use LabeledPoint to convert the feature (population) to a Vector type.

# convenience for specifying schema
from pyspark.mllib.regression import LabeledPoint

data = data.select("2014 Population estimate", "2015 median sales price") \
  .rdd.map(lambda r: LabeledPoint(r[1], [r[0]])) \
  .toDF()
display(data)

Our machine learning schema
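The mapping step above simply reshapes each row into (label, feature-vector) form. The same transformation on plain Python tuples makes the structure clear (toy values, not the real dataset):

```python
# Toy rows mimicking (population, price) pairs from the select() above
rows = [(100000, 250.0), (500000, 400.0)]

# LabeledPoint(r[1], [r[0]]): label = price, features = [population]
labeled = [(r[1], [r[0]]) for r in rows]
print(labeled)  # [(250.0, [100000]), (400.0, [500000])]
```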

Executing Linear Regression Model

In this section, we will fit two linear regression models using different regularization parameters and determine their efficacy; that is, how well each model predicts the sales price (label) based on the population (feature).

Building the model

# Import LinearRegression class
from pyspark.ml.regression import LinearRegression

# Define LinearRegression algorithm
lr = LinearRegression()

# Fit 2 models, using different regularization parameters
modelA = lr.fit(data, {lr.regParam:0.0})
modelB = lr.fit(data, {lr.regParam:100.0})
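The regParam value controls L2 regularization: larger values shrink the fitted coefficients toward zero. A minimal NumPy sketch of closed-form ridge regression illustrates the effect (toy data; this is not MLlib's actual solver):

```python
import numpy as np

def ridge_fit(X, y, reg_param):
    """Closed-form ridge regression: w = (X^T X + lambda*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + reg_param * np.eye(n_features), X.T @ y)

# Toy single-feature data, with a column of ones for the intercept
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

w_unreg = ridge_fit(X, y, 0.0)    # recovers intercept ~0, slope ~2 exactly
w_reg   = ridge_fit(X, y, 100.0)  # coefficients shrunk toward zero
```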

With a fitted model, we can make predictions using the transform() function, which adds a new column of predictions. For example, the code below takes the first model (modelA) and shows you both the label (original sales price) and the prediction (predicted sales price) based on the feature (population).

# Make predictions
predictionsA = modelA.transform(data)
display(predictionsA)

List of predictions

Evaluating the Model

To evaluate the regression analysis, we will calculate the root mean squared error using RegressionEvaluator. Below is the PySpark code for evaluating the two models, along with their output.

from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="rmse")
RMSE = evaluator.evaluate(predictionsA)
print("ModelA: Root Mean Squared Error = " + str(RMSE))

# ModelA: Root Mean Squared Error = 128.602026843
predictionsB = modelB.transform(data)
RMSE = evaluator.evaluate(predictionsB)
print("ModelB: Root Mean Squared Error = " + str(RMSE))

# ModelB: Root Mean Squared Error = 129.496300193
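For reference, RMSE is just the square root of the mean squared difference between predictions and labels; this is the quantity RegressionEvaluator computes when metricName="rmse". A minimal NumPy sketch on toy values (not taken from the dataset):

```python
import numpy as np

# Toy label/prediction pairs (illustrative values only)
labels = np.array([100.0, 150.0, 200.0, 250.0])
preds  = np.array([110.0, 140.0, 210.0, 240.0])

# RMSE = sqrt(mean((prediction - label)^2))
rmse = np.sqrt(np.mean((preds - labels) ** 2))
print(rmse)  # 10.0
```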

As is typical for many machine learning workflows, you will want to visualize the results with a scatterplot. Because Databricks supports Python pandas and ggplot, the code below builds a pandas DataFrame (pydf) and uses ggplot to display the scatterplot together with the two regression models.

# Import numpy, pandas, and ggplot
import numpy as np
from pandas import *
from ggplot import *

# Create Python DataFrame
pop = data.rdd.map(lambda p: (p.features[0])).collect()
price = data.rdd.map(lambda p: (p.label)).collect()
predA = predictionsA.select("prediction").rdd.map(lambda r: r[0]).collect()
predB = predictionsB.select("prediction").rdd.map(lambda r: r[0]).collect()

pydf = DataFrame({'pop':pop,'price':price,'predA':predA, 'predB':predB})
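If you want to sanity-check the frame construction locally, the same pattern works on plain Python lists; a minimal pandas sketch with made-up values standing in for the collected Spark columns:

```python
from pandas import DataFrame

# Illustrative values in place of the collected pop/price/prediction columns
pop   = [10000.0, 50000.0, 250000.0]
price = [120.0, 180.0, 310.0]
predA = [118.0, 185.0, 305.0]
predB = [125.0, 175.0, 300.0]

pydf = DataFrame({'pop': pop, 'price': price, 'predA': predA, 'predB': predB})
print(pydf.shape)  # (3, 4)
```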

Visualizing the Model

# Create scatter plot and the two regression models (log-log scaling) using ggplot
p = ggplot(pydf, aes('pop', 'price')) + \
    geom_point(color='blue') + \
    geom_line(pydf, aes('pop', 'predA'), color='red') + \
    geom_line(pydf, aes('pop', 'predB'), color='green') + \
    scale_x_log10() + scale_y_log10()
display(p)

Scatterplot visualization of the regression model
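The ggplot package for Python is no longer actively maintained; if it is unavailable in your environment, a similar figure can be drawn with matplotlib. A sketch using toy values in place of the collected columns:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for headless rendering
import matplotlib.pyplot as plt
from pandas import DataFrame

# Illustrative values in place of the pydf frame built above
pydf = DataFrame({'pop': [1e4, 5e4, 2.5e5],
                  'price': [120.0, 180.0, 310.0],
                  'predA': [118.0, 185.0, 305.0],
                  'predB': [125.0, 175.0, 300.0]})

fig, ax = plt.subplots()
ax.scatter(pydf['pop'], pydf['price'], color='blue')        # observed points
ax.plot(pydf['pop'], pydf['predA'], color='red')            # modelA fit
ax.plot(pydf['pop'], pydf['predB'], color='green')          # modelB fit
ax.set_xscale('log')
ax.set_yscale('log')
fig.savefig('regression.png')
```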
