Training and Convergence

A key component of most artificial intelligence and machine learning is looping, i.e. the system improving over many iterations of training. A very simple way to train like this is to perform the updates in a for loop. We saw an example of this way back in lesson 2:

import tensorflow as tf

x = tf.Variable(0, name='x')

model = tf.global_variables_initializer()

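with tf.Session() as session:
    session.run(model)
    for i in range(5):
        # x = x + 1 builds a new add op; x now refers to that tensor
        x = x + 1
        print(session.run(x))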

We can alter this workflow to instead use a variable in the loop's convergence test, such as in the following:

import tensorflow as tf

x = tf.Variable(0., name='x')
threshold = tf.constant(5.)

model = tf.global_variables_initializer()

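with tf.Session() as session:
    session.run(model)
    # Keep looping while x < threshold evaluates to True
    while session.run(tf.less(x, threshold)):
        x = x + 1
        x_value = session.run(x)
        print(x_value)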

The major change here is that the loop is now a while loop, continuing to loop while the test (using tf.less for a less-than-test) is true. Here, we test if x is less than a given threshold (stored in a constant), and if so, we continue looping.

Gradient Descent

Any machine learning library must have a gradient descent algorithm. I think it is a law. Regardless, TensorFlow has a few variations on the theme, and they are quite straightforward to use.

Gradient Descent is a learning algorithm that attempts to minimise some error. Which error, you ask? Well, that is up to us, although there are a few commonly used choices.

Let’s start with a basic example:

import tensorflow as tf
import numpy as np

# x and y are placeholders for our training data
x = tf.placeholder("float")
y = tf.placeholder("float")
# w is the variable storing our values. It is initialised with starting "guesses"
# w[0] is the "a" in our equation, w[1] is the "b"
w = tf.Variable([1.0, 2.0], name="w")
# Our model of y = a*x + b
y_model = tf.multiply(x, w[0]) + w[1]

# Our error is defined as the square of the differences
error = tf.square(y - y_model)
# The Gradient Descent Optimizer does the heavy lifting
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(error)

# Normal TensorFlow - initialize values, create a session and run the model
model = tf.global_variables_initializer()

with tf.Session() as session:
    session.run(model)
    for i in range(1000):
        # One random training sample from the true line y = 2x + 6
        x_value = np.random.rand()
        y_value = x_value * 2 + 6
        session.run(train_op, feed_dict={x: x_value, y: y_value})

    w_value = session.run(w)
    print("Predicted model: {a:.3f}x + {b:.3f}".format(a=w_value[0], b=w_value[1]))

The major line of interest here is train_op = tf.train.GradientDescentOptimizer(0.01).minimize(error), where the training step is defined. It aims to minimise the value of error, which is defined earlier as the square of the differences (a common error function). The 0.01 is the learning rate: the size of the step taken on each update as it tries to learn a better value.
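To make the role of that 0.01 concrete, here is a rough sketch of the kind of update a plain gradient descent step performs for this model, written in ordinary Python with the gradients worked out by hand. It is not how TensorFlow implements the optimizer internally. The function name sgd_step and the fixed random seed are assumptions for illustration.

```python
import random


def sgd_step(a, b, x, y, learning_rate=0.01):
    """One gradient descent update for y_model = a*x + b with squared error."""
    residual = y - (a * x + b)
    # d(error)/da = -2 * residual * x and d(error)/db = -2 * residual
    grad_a = -2.0 * residual * x
    grad_b = -2.0 * residual
    # Step against the gradient, scaled by the learning rate
    return a - learning_rate * grad_a, b - learning_rate * grad_b


random.seed(0)  # fixed seed only so the run is repeatable
a, b = 1.0, 2.0  # the same starting guesses as w above
for _ in range(1000):
    x_value = random.random()
    y_value = x_value * 2 + 6
    a, b = sgd_step(a, b, x_value, y_value)

print("Predicted model: {a:.3f}x + {b:.3f}".format(a=a, b=b))
```

With only 1000 small steps, the values should move towards a = 2 and b = 6 without necessarily matching them exactly. More iterations or a somewhat larger learning rate would typically get closer.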

An important note here is that we are optimising just a single Variable, but that Variable can hold an array. This is why we used w as the Variable, and not two separate Variables a and b.

Other Optimisation

TensorFlow has a whole set of types of optimisation, and lets you define your own as well (if you are into that sort of thing). For the API and how to use them, see this page. The listed ones are:

  • GradientDescentOptimizer
  • AdagradOptimizer
  • MomentumOptimizer
  • AdamOptimizer
  • FtrlOptimizer
  • RMSPropOptimizer

Other optimisation methods are likely to appear in future releases of TensorFlow, or in third-party code. That said, the above optimisations are going to be sufficient for most deep learning techniques. If you aren’t sure which one to use, use GradientDescentOptimizer unless that is failing.

Plotting the error

We can plot the errors after each iteration to get the following output:

[Figure: training error at each iteration, smoothed with a windowed average]
The code for this is a small change to the above. First, we create a list to store the errors in. Then, inside the loop, we explicitly compute both the train_op and the error. We do this in a single line, so that the error is computed only once. If we did this in separate lines, it would compute the error, and then the training step, and in doing that it would need to recompute the error.

Below is just the code that comes after the tf.global_variables_initializer() line from the previous program; everything above that line is the same.

errors = []
with tf.Session() as session:
    session.run(model)
    for i in range(1000):
        x_train = tf.random_normal((1,), mean=5, stddev=2.0)
        y_train = x_train * 2 + 6
        x_value, y_value = session.run([x_train, y_train])
        # Run the training step and fetch the error in a single call
        _, error_value = session.run([train_op, error], feed_dict={x: x_value, y: y_value})
        errors.append(error_value)
    w_value = session.run(w)
    print("Predicted model: {a:.3f}x + {b:.3f}".format(a=w_value[0], b=w_value[1]))

import matplotlib.pyplot as plt
# Plot a trailing 50-step average of the error
plt.plot([np.mean(errors[i-50:i]) for i in range(len(errors))])
plt.show()

You may have noticed that I take a windowed average here, using np.mean(errors[i-50:i]) instead of just using errors[i]. The reason for this is that we train on only a single sample in each pass of the loop, so while the error tends to decrease, it bounces around quite a bit. Taking this windowed average smooths this out a bit, but as you can see above, it still jumps around.
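One detail of that slice is worth knowing: errors[i-50:i] is empty for the first 50 values of i, because a negative start index counts from the end of the list, so those points come out as NaN and are missing from the plot. If that matters, a trailing average that clamps the start of the window at zero is one option. This is a plain Python sketch, and the name windowed_mean is an assumption rather than something from the lesson's code.

```python
def windowed_mean(values, window=50):
    """Trailing mean over up to `window` values ending at each index."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)  # clamp so early windows are just shorter
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


print(windowed_mean([4.0, 2.0, 0.0, 2.0], window=2))  # [4.0, 3.0, 1.0, 1.0]
```

The smoothed list has the same length as the input, so it can be plotted in place of the list comprehension above.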

1) Create a convergence function for the k-means example from Lesson 6, which stops the training if the distance between the old centroids and the new centroids is less than a given epsilon value.

2) Try separating the a and b values from the Gradient Descent example (where w is used).

3) Our example trains on just a single sample at a time, which is inefficient. Extend it to learn using a number (say, 50) of training samples at a time.