TensorFlow™ on Databricks
Clustering and k-means
We now venture into our first application, which is clustering with the k-means algorithm. Clustering is a data mining exercise where we take a bunch of data and find groups of points that are similar to each other. K-means is an algorithm that is great for finding clusters in many types of datasets.
For more about clustering and k-means, see the scikit-learn documentation on its k-means algorithm.
Generating Samples
First up, we are going to need to generate some samples. We could generate the samples randomly, but that is likely to either give us very sparse points, or just one big group - not very exciting for clustering.
Instead, we are going to start by generating three centroids, and then randomly generate points (with a normal distribution) around each of them. First up, here is a function for doing this:
import tensorflow as tf
import numpy as np

def create_samples(n_clusters, n_samples_per_cluster, n_features, embiggen_factor, seed):
    np.random.seed(seed)
    slices = []
    centroids = []
    # Create samples for each cluster
    for i in range(n_clusters):
        samples = tf.random_normal((n_samples_per_cluster, n_features),
                                   mean=0.0, stddev=5.0, dtype=tf.float32, seed=seed, name="cluster_{}".format(i))
        current_centroid = (np.random.random((1, n_features)) * embiggen_factor) - (embiggen_factor/2)
        centroids.append(current_centroid)
        samples += current_centroid
        slices.append(samples)
    # Create a big "samples" dataset
    samples = tf.concat(slices, 0, name='samples')
    centroids = tf.concat(centroids, 0, name='centroids')
    return centroids, samples
Put this code in functions.py
The way this works is to create n_clusters different centroids at random (using np.random.random((1, n_features))) and use those as the centre points for tf.random_normal. The tf.random_normal function generates normally distributed random values, which we then add to the current centre point. This creates a blob of points around that centre. We then record the centroids (centroids.append) and the generated samples (slices.append(samples)). Finally, we create "one big list of samples" using tf.concat, and concatenate the centroids into a single TensorFlow tensor as well, again using tf.concat.
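Before moving on, a quick sanity check can help. Assuming you have saved the code above into functions.py, a throwaway snippet like the following (not part of the lesson's files, and with parameters chosen purely for illustration) shows the shapes you should expect back: samples stacks all of the clusters into one matrix of n_clusters * n_samples_per_cluster rows, while centroids has one row per cluster.

import tensorflow as tf
from functions import create_samples

# Throwaway parameters, purely to inspect the shapes of the returned tensors
centroids, samples = create_samples(n_clusters=3, n_samples_per_cluster=10,
                                    n_features=2, embiggen_factor=70, seed=700)

print(centroids.shape)  # (3, 2)  - one row per generated centroid
print(samples.shape)    # (30, 2) - the three clusters concatenated together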
Saving this create_samples method in a file called functions.py allows us to import these methods into our scripts for this (and the next!) lesson. Create a new file called generate_samples.py, which has the following code:
import tensorflow as tf
import numpy as np
from functions import create_samples

n_features = 2
n_clusters = 3
n_samples_per_cluster = 500
seed = 700
embiggen_factor = 70

np.random.seed(seed)

centroids, samples = create_samples(n_clusters, n_samples_per_cluster, n_features, embiggen_factor, seed)

model = tf.global_variables_initializer()
with tf.Session() as session:
    sample_values = session.run(samples)
    centroid_values = session.run(centroids)
This just sets up the number of clusters and features (I recommend keeping the number of features at 2, so that we can visualise the clusters later), and the number of samples to generate. Increasing the embiggen_factor increases the "spread", or size, of the clusters. I chose a value here that provides a good learning opportunity, as it generates visually identifiable clusters.
To visualise the results, let's create a plotting function using matplotlib. Add this code to functions.py:
def plot_clusters(all_samples, centroids, n_samples_per_cluster):
    import matplotlib.pyplot as plt
    # Plot out the different clusters
    # Choose a different colour for each cluster
    colour = plt.cm.rainbow(np.linspace(0, 1, len(centroids)))
    for i, centroid in enumerate(centroids):
        # Grab just the samples for the given cluster and plot them out with a new colour
        samples = all_samples[i*n_samples_per_cluster:(i+1)*n_samples_per_cluster]
        plt.scatter(samples[:,0], samples[:,1], c=colour[i])
        # Also plot the centroid
        plt.plot(centroid[0], centroid[1], markersize=35, marker="x", color='k', mew=10)
        plt.plot(centroid[0], centroid[1], markersize=30, marker="x", color='m', mew=5)
    plt.show()
Put this code in functions.py
All this code does is plot the samples from each cluster in a different colour, and draw a big magenta X where each centroid is. The centroids are given as an argument, which will be handy later on.
Update generate_samples.py to import this function by adding from functions import plot_clusters to the top of the file. Then, add this line of code to the bottom:
plot_clusters(sample_values, centroid_values, n_samples_per_cluster)
Running generate_samples.py should now give you the following plot:
Initialisation
The k-means algorithm starts with the choice of the initial centroids, which are just random guesses of the actual centroids in the data. The following function will randomly choose a number of samples from the dataset to act as this initial guess:
def choose_random_centroids(samples, n_clusters):
    # Step 0: Initialisation: Select `n_clusters` number of random points
    n_samples = tf.shape(samples)[0]
    random_indices = tf.random_shuffle(tf.range(0, n_samples))
    begin = [0,]
    size = [n_clusters,]
    centroid_indices = tf.slice(random_indices, begin, size)
    initial_centroids = tf.gather(samples, centroid_indices)
    return initial_centroids
Put this code in functions.py
This code first creates an index for each sample (using tf.range(0, n_samples)), and then randomly shuffles it. From there, we choose a fixed number (n_clusters) of indices using tf.slice. These indices correspond to our initial centroids; the samples at those indices are then gathered together using tf.gather to form our array of initial centroids.
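To make the indexing concrete, here is a small, throwaway example (the tiny matrix and the value 3 are made up purely for illustration) that applies the same shuffle, slice and gather pattern to six hand-written samples:

import tensorflow as tf
import numpy as np

tiny_samples = tf.constant(np.arange(12).reshape(6, 2), dtype=tf.float32)  # six 2D "samples"

shuffled = tf.random_shuffle(tf.range(0, tf.shape(tiny_samples)[0]))  # e.g. [4, 0, 5, 2, 1, 3]
chosen_indices = tf.slice(shuffled, [0], [3])                         # keep the first three indices
chosen_points = tf.gather(tiny_samples, chosen_indices)               # the rows at those indices

with tf.Session() as session:
    print(session.run(chosen_points))  # three rows of tiny_samples, chosen at random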
Add this new choose_random_centroids function to functions.py, and create a new script (or update your previous one) with the following:
import tensorflow as tf
import numpy as np
from functions import create_samples, choose_random_centroids, plot_clusters

n_features = 2
n_clusters = 3
n_samples_per_cluster = 500
seed = 700
embiggen_factor = 70

centroids, samples = create_samples(n_clusters, n_samples_per_cluster, n_features, embiggen_factor, seed)
initial_centroids = choose_random_centroids(samples, n_clusters)

model = tf.global_variables_initializer()
with tf.Session() as session:
    # Evaluate both tensors in a single run, so the chosen centroids come from the same
    # draw of samples that we plot below
    sample_values, updated_centroid_value = session.run([samples, initial_centroids])

plot_clusters(sample_values, updated_centroid_value, n_samples_per_cluster)
The major change here is that we create a tensor for these initial centroids and compute its value in the session. We then pass those first guesses to plot_clusters, rather than the actual centroids that were used to generate the data.
Running this will produce an image similar to the one above, but the centroids will be in random positions. Try running the script a few times and note that the centroids move around quite a bit.
Updating Centroids
After starting with some guess for the centroid locations, the k-means algorithm then updates those guesses based on the data. The process is to assign each sample a cluster number, representing the centroid it is closest to. After that, the centroids are updated to be the means of all samples assigned to that cluster. The following code handles the "assign to nearest cluster" step:
def assign_to_nearest(samples, centroids):
    # Finds the nearest centroid for each sample
    # START from https://esciencegroup.com/2016/01/05/an-encounter-with-googles-tensorflow/
    expanded_vectors = tf.expand_dims(samples, 0)
    expanded_centroids = tf.expand_dims(centroids, 1)
    distances = tf.reduce_sum(tf.square(
        tf.subtract(expanded_vectors, expanded_centroids)), 2)
    mins = tf.argmin(distances, 0)
    # END from https://esciencegroup.com/2016/01/05/an-encounter-with-googles-tensorflow/
    nearest_indices = mins
    return nearest_indices
Note that I’ve borrowed some code from this page which has a different type of k-means algorithm, and lots of other useful information.
The way this works is to compute the distance between each sample and each centroid, which happens on the distances = line. The distance computed here is the squared Euclidean distance (taking the square root is unnecessary, since we only need to find the minimum). An important point is that tf.subtract will automatically broadcast the two arguments to a common shape. This means that having our samples as a matrix, and the centroids as a column vector, produces the pairwise subtraction between them. To force this behaviour of tf.subtract, we use tf.expand_dims to create an extra dimension for both the samples and the centroids.
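To see the shape bookkeeping explicitly, here is a standalone sketch using the shapes from this lesson (1500 samples, 3 centroids, 2 features). The zero-filled tensors are placeholders purely for illustrating shapes, not real data:

import tensorflow as tf

samples = tf.zeros((1500, 2))    # stand-in data, just to illustrate the shapes
centroids = tf.zeros((3, 2))

expanded_vectors = tf.expand_dims(samples, 0)      # shape (1, 1500, 2)
expanded_centroids = tf.expand_dims(centroids, 1)  # shape (3, 1, 2)

# Broadcasting expands both to (3, 1500, 2): one difference per (centroid, sample) pair
differences = tf.subtract(expanded_vectors, expanded_centroids)
distances = tf.reduce_sum(tf.square(differences), 2)  # shape (3, 1500)
nearest = tf.argmin(distances, 0)                     # shape (1500,): a centroid index per sample

print(differences.shape, distances.shape, nearest.shape)  # (3, 1500, 2) (3, 1500) (1500,)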
The next bit of code handles the "update centroids" step:
def update_centroids(samples, nearest_indices, n_clusters):
    # Updates the centroid to be the mean of all samples associated with it.
    nearest_indices = tf.to_int32(nearest_indices)
    partitions = tf.dynamic_partition(samples, nearest_indices, n_clusters)
    new_centroids = tf.concat([tf.expand_dims(tf.reduce_mean(partition, 0), 0) for partition in partitions], 0)
    return new_centroids
This code takes the nearest indices for each sample, and uses them to pull the samples out as separate groups with tf.dynamic_partition. We then use tf.reduce_mean on each group to find its average, forming that group's new centroid. Finally, we tf.concat the results together to form our array of new centroids.
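As a minimal, hypothetical illustration of the partition-then-average idea (the numbers below are made up and are not from the lesson's dataset):

import tensorflow as tf

# Six samples already assigned to clusters 0, 1 and 2
points = tf.constant([[0., 0.], [1., 1.], [10., 10.], [12., 12.], [5., 0.], [7., 0.]])
assignments = tf.constant([0, 0, 1, 1, 2, 2], dtype=tf.int32)

groups = tf.dynamic_partition(points, assignments, 3)  # a list of three tensors, one per cluster
means = tf.concat([tf.expand_dims(tf.reduce_mean(group, 0), 0) for group in groups], 0)

with tf.Session() as session:
    print(session.run(means))  # [[ 0.5  0.5], [11.  11. ], [ 6.   0. ]]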
Now that we have the pieces in place, we can add these calls to our script (or create a new one):
import tensorflow as tf
import numpy as np
from functions import *

n_features = 2
n_clusters = 3
n_samples_per_cluster = 500
seed = 700
embiggen_factor = 70

data_centroids, samples = create_samples(n_clusters, n_samples_per_cluster, n_features, embiggen_factor, seed)
initial_centroids = choose_random_centroids(samples, n_clusters)
nearest_indices = assign_to_nearest(samples, initial_centroids)
updated_centroids = update_centroids(samples, nearest_indices, n_clusters)

model = tf.global_variables_initializer()
with tf.Session() as session:
    # Evaluate both tensors in a single run, so the updated centroids correspond to the
    # same draw of samples that we plot below
    sample_values, updated_centroid_value = session.run([samples, updated_centroids])
    print(updated_centroid_value)

plot_clusters(sample_values, updated_centroid_value, n_samples_per_cluster)
This code will:
- Generate samples around the data-generating centroids
- Randomly choose initial centroids from the samples
- Assign each sample to its nearest centroid
- Update each centroid to be the mean of the samples assigned to it
This is a single iteration of k-means! I encourage you to take a shot at the exercises, which turn this into an iterative version.
1) The seed option passed to create_samples ensures that the samples that are "randomly" generated are consistent every time you run the script. We didn't pass a seed to the choose_random_centroids function, which means those initial centroids are different each time the script is run. Update the script to include a new seed for the random centroids.
2) The k-means algorithm is performed iteratively, where the updated centroids from the previous iteration are used to assign clusters, which are then used to update the centroids, and so on. In other words, the algorithm alternates between calling assign_to_nearest and update_centroids. Update the code to perform this iteration 10 times before stopping. You'll find that the resulting centroids are much closer on average with more iterations of k-means. (For those who have experience with k-means, a future tutorial will look at convergence functions and other stopping criteria.)