TensorFlow™ on Databricks
Variables
TensorFlow is a way of representing computation without actually performing it until asked. In this sense, it is a form of lazy computing, and it allows for some great improvements to the running of code:
- Faster computation of complex variables
- Distributed computation across multiple systems, including GPUs
- Reduced redundancy in some computations
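To make "representing computation without performing it" concrete, here is a plain-Python analogy. It is only an illustration of the lazy idea, not how TensorFlow works internally, and the names `y_eager` and `y_lazy` are made up for this sketch.

```python
# Plain-Python analogy only; this is not TensorFlow's mechanism.
x = 35

# Eager: the addition happens immediately and y_eager holds a number.
y_eager = x + 5

# Lazy: y_lazy is a recipe ("take x and add 5"); nothing is added yet.
def y_lazy():
    return x + 5

# Change x after both definitions.
x = 100

print(y_eager)   # the value computed earlier, from the old x
print(y_lazy())  # computed only now, from the current x
```

The eager value is fixed at the moment of assignment, while the lazy version does no work until it is called.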
Let’s have a look at this in action. First, a very basic Python script:
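A minimal version of such a script, written to match the description that follows (the names `x` and `y` and the values 35 and 5 come from the text), might look like this:

```python
# basic_script.py
x = 35      # create a variable x with value 35
y = x + 5   # y is computed immediately from the current value of x
print(y)
```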
This script says “create a variable x with value 35, set the value of a new variable y to that plus 5 (which is currently 40), and print it out”. The value 40 will print out when you run this program. If you aren’t familiar with Python, create a new text file called basic_script.py, and copy that code in. Save it on your computer and run it with:
python basic_script.py
Note that the path (i.e. basic_script.py) must reference the file, so if it is in the Code folder, you use:
python Code/basic_script.py
Also, make sure you have activated the Anaconda virtual environment. On Linux, this will make your prompt look something like:
(tensorenv)username@computername:~$
If that is working, let’s convert it to a TensorFlow equivalent.
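A sketch of what the TensorFlow version could look like, following the steps listed below. One assumption to flag: the calls used in this lesson are TensorFlow 1.x graph-mode APIs, so this sketch reaches them through `tensorflow.compat.v1`, which is what a TensorFlow 2.x install needs. On a 1.x install a plain `import tensorflow as tf` would be enough.

```python
# Assumption: TensorFlow 2.x is installed, so the 1.x-style API is
# imported via compat.v1 and graph mode is switched back on.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

x = tf.constant(35, name='x')      # a constant with the value 35
y = tf.Variable(x + 5, name='y')   # a Variable defined as x + 5

print(y)  # prints a description of the Variable, not a number
```

The exact text printed depends on the TensorFlow version, but in graph mode it describes the Variable object rather than showing 40.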
After running this, you’ll get quite a funny output, something like <tensorflow.python.ops.variables.Variable object at 0x7f074bfd9ef0>. This is clearly not the value 40.
The reason is that our program actually does something quite different from the previous one. The code here does the following:
- Import the tensorflow module and call it tf
- Create a constant value called x, and give it the numerical value 35
- Create a Variable called y, and define it as being the equation x + 5
- Print out the equation object for y
The subtle difference is that y isn’t given “the current value of x + 5” as in our previous program. Instead, it is effectively an equation that means “when this variable is computed, take the value of x (as it is then) and add 5 to it”. The computation of the value of y is never actually performed in the above program.
Let’s fix that:
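A sketch of the corrected program, consistent with the step-by-step description that follows. As before, it assumes a TensorFlow 2.x install and therefore imports the 1.x-style API through `tensorflow.compat.v1`. The name `result` is introduced here only so the value can be inspected.

```python
# Assumption: TensorFlow 2.x is installed; compat.v1 provides the
# graph-mode API (tf.Session, tf.global_variables_initializer).
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

x = tf.constant(35, name='x')
y = tf.Variable(x + 5, name='y')

# An operation that, when run, initializes every Variable in the graph.
model = tf.global_variables_initializer()

with tf.Session() as session:
    session.run(model)       # run the initializer
    result = session.run(y)  # now actually compute y
    print(result)
```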
We have removed the print(y) statement, and instead we have code that creates a session, and actually computes the value of y. This is quite a bit of boilerplate, but it works like this:
1. Import the tensorflow module and call it tf
2. Create a constant value called x, and give it the numerical value 35
3. Create a Variable called y, and define it as being the equation x + 5
4. Initialize the variables with tf.global_variables_initializer() (we will go into more detail on this)
5. Create a session for computing the values
6. Run the model created in 4
7. Run just the variable y and print out its current value
Step 4 above is where some magic happens. In this step, a graph is created of the dependencies between the variables. In this case, the variable y depends on the variable x, and that value is transformed by adding 5 to it. Keep in mind that this value isn’t computed until step 7; up until then, only equations and relations are set up.
1) Constants can also be arrays. Predict what this code will do, then run it to confirm:
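A sketch for this exercise, reusing the session boilerplate from earlier. The three array values are illustrative assumptions, and the 1.x-style API is again imported through `tensorflow.compat.v1` on the assumption of a TensorFlow 2.x install.

```python
# Assumption: TensorFlow 2.x is installed; compat.v1 gives graph mode.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Illustrative values; any list of numbers works here.
x = tf.constant([35, 40, 45], name='x')
y = tf.Variable(x + 5, name='y')

model = tf.global_variables_initializer()

with tf.Session() as session:
    session.run(model)
    result = session.run(y)  # a NumPy array with one entry per element of x
    print(result)
```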
2) Generate a NumPy array of 10,000 random numbers (called x) and create a Variable storing the equation
You can generate the NumPy array using the following code:
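One way to do this, as a minimal sketch. The upper bound of 1000 is borrowed from the `np.random.randint(1000)` call used in exercise 4 and is otherwise an assumption. The name `data` matches the sentence that follows.

```python
import numpy as np

# 10,000 random integers drawn from 0 (inclusive) to 1000 (exclusive).
data = np.random.randint(1000, size=10000)

print(data.shape)
print(data.mean())  # NumPy arrays have helpers such as mean() built in
```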
This data variable can then be used in place of the list from question 1 above. As a general rule, NumPy should be used for larger lists/arrays of numbers, as it is significantly more memory efficient and faster to compute on than lists. It also provides a significant number of functions (such as computing the mean) that aren’t normally available to lists.
3) You can also update variables in loops, which we will use later for machine learning. Take a look at this code, and predict what it will do (then run it to check):
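A sketch of a loop that updates a Variable, to predict and then run. The loop count of 5, the use of `tf.assign`, and the names `update` and `values` are assumptions made for this illustration. It again assumes a TensorFlow 2.x install with the 1.x-style API imported through `tensorflow.compat.v1`.

```python
# Assumption: TensorFlow 2.x is installed; compat.v1 gives graph mode.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

x = tf.Variable(0, name='x')

# An operation that, each time it is run, stores x + 1 back into x.
update = tf.assign(x, x + 1)

model = tf.global_variables_initializer()

values = []  # collected here only so the results can be inspected later
with tf.Session() as session:
    session.run(model)
    for i in range(5):
        current = session.run(update)  # performs one update, returns new x
        values.append(current)
        print(current)
```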
4) Using the code from (2) and (3) above, create a program that computes the “rolling” average of the following line of code: np.random.randint(1000). In other words, keep looping, and in each loop, call np.random.randint(1000) once, and store the current average in a Variable that keeps updating each loop.
5) Use TensorBoard to visualise the graph for some of these examples. To run TensorBoard, use the command: tensorboard --logdir=path/to/log-directory
To find out more about TensorBoard, head to our visualisation lesson.
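TensorBoard needs a log directory containing a saved graph before it has anything to show. A minimal sketch of writing one with the 1.x-style `tf.summary.FileWriter` follows. The temporary directory is a placeholder chosen for this example, and the import through `tensorflow.compat.v1` again assumes a TensorFlow 2.x install.

```python
# Assumption: TensorFlow 2.x is installed; compat.v1 gives graph mode.
import tempfile
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

x = tf.constant(35, name='x')
y = tf.Variable(x + 5, name='y')
model = tf.global_variables_initializer()

# Placeholder location; use any directory you like for the logs.
log_dir = tempfile.mkdtemp()

with tf.Session() as session:
    # Write the session's graph into log_dir as an event file.
    writer = tf.summary.FileWriter(log_dir, session.graph)
    session.run(model)
    print(session.run(y))
    writer.close()

print(log_dir)  # pass this path to: tensorboard --logdir=...
```

Pointing the `tensorboard --logdir=` command at the printed directory should then show the graph.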