Training your Neural Network: On-Demand Webinar and FAQ Now Available!

Published: October 22, 2018

On October 9th, we hosted a live webinar, Training your Neural Network, on Data Science Central with Denny Lee, Technical Product Marketing Manager at Databricks. This is the second webinar in a free deep learning fundamentals series from Databricks.

In this webinar, we covered the principles for training your neural network including activation and loss functions, batch sizes, data normalization, and validation datasets.

In particular, we talked about:

Hyperparameter tuning, learning rate, backpropagation and the risk of overfitting
Optimization algorithms, including Adam
Convolutional Neural Networks and why they are so effective for image classification and object recognition

We demonstrated some of these concepts using Keras (TensorFlow backend) on Databricks, and here is a link to our notebook to get started today:

You can still watch Part 1 below and now register for Part 3, where we will dive deeper into Convolutional Neural Networks and how to use them:

If you’d like free access to the Databricks Unified Analytics Platform to try our notebooks on it, you can sign up for a free trial here.

Toward the end, we held a Q&A. Below are all the questions and their answers, grouped by topic.

Fundamentals

Q: What is the difference between a perceptron and an artificial neural network, neuron, or node?

A perceptron is a single-layer artificial neural network (ANN) used as a binary classifier. Because the term neuron is typically associated with the biological neuron, the term node is often used to refer to an artificial neuron within an ANN.
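
To make the idea of a perceptron as a binary classifier concrete, here is a minimal sketch in plain Python. It was not part of the webinar notebook; the function names, the learning rate, and the choice of the logical AND problem are all assumptions made for illustration.

```python
# Minimal perceptron sketch: one layer of weights plus a step activation.
# All names and hyperparameters are illustrative assumptions.

def predict(weights, bias, x):
    """Return 1 if the weighted sum is positive, else 0."""
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if total > 0 else 0

def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Apply the classic perceptron update rule to every sample, epoch by epoch."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)
            # Weights only change when the prediction is wrong (error != 0).
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

# Logical AND is linearly separable, so a single perceptron can learn it.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]
weights, bias = train_perceptron(inputs, targets)
predictions = [predict(weights, bias, x) for x in inputs]
```

A single perceptron can only separate classes with a straight line (or hyperplane), which is one reason multi-layer networks are used for harder problems.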

Q: How do you decide the number of hidden layers for your Artificial Neural Network (ANN)? Is it common that the ANN needs to be four layers by four neurons?

As noted in Introduction to Neural Networks On Demand Webinar and FAQ Now Available, while there are general rules of thumb for your starting point (e.g. start with one hidden layer and expand accordingly, set the number of input nodes equal to the dimension of the features, etc.), the key thing is that you will need to test. That is, train your model and then evaluate it against the test and/or validation data to understand the accuracy (higher is better) and loss (lower is better).
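
As a rough sketch of what "accuracy (higher is better) and loss (lower is better)" means when comparing candidate architectures, the snippet below computes both metrics on a validation set. The labels and predicted probabilities are hypothetical values invented for this example, not results from the webinar, and the helper names are assumptions.

```python
import math

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    correct = sum(1 for t, p in zip(y_true, y_prob) if (p >= threshold) == bool(t))
    return correct / len(y_true)

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical validation labels and predicted probabilities
# from two candidate models (illustrative numbers only).
val_labels = [1, 0, 1, 1, 0]
model_a_probs = [0.9, 0.2, 0.8, 0.6, 0.1]
model_b_probs = [0.6, 0.4, 0.4, 0.7, 0.3]

for name, probs in [("A", model_a_probs), ("B", model_b_probs)]:
    print(name, accuracy(val_labels, probs), binary_cross_entropy(val_labels, probs))
```

In practice a framework such as Keras reports these metrics for you when you pass validation data during training; the point here is only to show what is being compared.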

Q: What is the difference between Recurrent Neural Networks and the Convolutional Neural Networks discussed in the webinar?

Convolutional Neural Networks are well suited to image data but have the limitation that they are designed for fixed-size input and output vectors (e.g. an MNIST digit image as the input and one of 10 digits as the possible output). Recurrent Neural Networks (RNNs) overcome this limitation because they operate on sequences of vectors. A great blog on the topic of RNNs is The Unreasonable Effectiveness of Recurrent Neural Networks.
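
The sequence-handling property can be sketched with a deliberately tiny recurrent cell. This is a simplified illustration with scalar inputs and a single hidden value; the weights are arbitrary assumptions rather than trained parameters, and real RNN layers use vectors and matrices.

```python
import math

def rnn_final_state(sequence, w_input=0.5, w_hidden=0.8, bias=0.0):
    """Run one recurrent cell over a sequence and return the last hidden state.

    The same weights are reused at every time step, which is why the
    sequence can have any length.
    """
    hidden = 0.0
    for x in sequence:
        hidden = math.tanh(w_input * x + w_hidden * hidden + bias)
    return hidden

# The same function handles sequences of different lengths.
short_state = rnn_final_state([1.0, 0.5])
long_state = rnn_final_state([1.0, 0.5, -0.3, 0.8, 0.1])
```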

Activation Functions

Q: In the webinar, did you suggest that you should never use the Sigmoid activation function?

Specifically, the quote within the webinar (slide 15) is from the CS231n Convolutional Neural Networks for Visual Recognition course at Stanford by Andrej Karpathy (currently Director of AI at Tesla):

“What neuron type should I use?” Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of “dead” units in a network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU/Maxout.

Generally, this is a good rule of thumb for choosing your starting activation functions. The focus should be on using the ReLU, Leaky ReLU, or Maxout activation functions, as they often result in higher accuracy and lower loss. For more information, please refer to Introduction to Neural Networks, where we dive deeper into activation functions.
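
For reference, the activation functions named above can be written in a few lines of plain Python. This is a scalar sketch for illustration only; the `alpha` default for Leaky ReLU is an assumed, commonly used value, and the Maxout helper simply takes the maximum over pre-activations that would normally come from several learned linear units.

```python
import math

def sigmoid(x):
    """Squashes input into (0, 1); saturates for large |x|."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squashes input into (-1, 1) and is zero-centered."""
    return math.tanh(x)

def relu(x):
    """Passes positive values through and outputs zero otherwise."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but keeps a small slope for negative inputs."""
    return x if x > 0 else alpha * x

def maxout(pre_activations):
    """Maximum over a group of linear pre-activations."""
    return max(pre_activations)
```

A ReLU unit whose input is always negative outputs zero and receives no gradient, which is the "dead" unit mentioned in the quote; Leaky ReLU avoids this by keeping a small negative slope.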

Optimization

Q: Why use Stochastic Gradient Descent (SGD) as your optimizer if there may be better optimizers such as Adadelta?

In this webinar, we focused on the specific area of image classification, and for this scenario the Adadelta optimizer converged much faster than the other optimizers. As also alluded to in the webinar, there are other variables in play: activation functions, neural network architecture, usage scenarios, etc. As there is still more research to come on this topic, there will be new strategies and optimization techniques to try out.
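
To show how the two update rules differ mechanically, here is a minimal sketch of plain gradient descent and Adadelta on a one-dimensional toy loss, f(x) = x^2. The toy loss, the function names, and the hyperparameter defaults (`rho=0.95`, `eps=1e-6`) are assumptions for illustration. This sketch does not reproduce the webinar's image classification comparison, and behavior on a toy problem says nothing about which optimizer suits a real network.

```python
import math

def gradient(x):
    """Gradient of the toy loss f(x) = x**2."""
    return 2.0 * x

def sgd(x, steps, learning_rate=0.1):
    """Plain gradient descent with a fixed, hand-chosen learning rate."""
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

def adadelta(x, steps, rho=0.95, eps=1e-6):
    """Adadelta: the step size adapts from running averages, with no learning rate."""
    avg_sq_grad = 0.0   # running average of squared gradients
    avg_sq_delta = 0.0  # running average of squared updates
    for _ in range(steps):
        g = gradient(x)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        x += delta
    return x

x_sgd = sgd(1.0, steps=20)
x_adadelta = adadelta(1.0, steps=20)
```

The design difference to notice is that SGD needs a learning rate chosen up front, while Adadelta derives its step size from the history of gradients and updates.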

Q: When should we use gradient boosting (e.g. Adaptive Boosting or AdaBoost) instead of gradient descent with Artificial Neural Networks? When can we use AdaBoost instead of Adadelta?

Adaptive Boosting (AdaBoost) is the adaptive boosting technique described in the paper A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. In general, the idea is that you would boost weak learners (learners that do only slightly better than random chance) by filtering out data so that the weak learners can more easily handle the dataset. While the current research points toward ANNs being simpler and faster to converge, it is also important to note that this is far from a definitive statement. For example, while the paper Comparison of machine learning methods for classifying mediastinal lymph node metastasis of non-small cell lung cancer from 18F-FDG PET/CT images notes that CNNs are more convenient, the paper Learning Deep ResNet Blocks Sequentially using Boosting Theory notes that their BoostResNet algorithm is more computationally efficient than end-to-end backpropagation in deep ResNet. While these are two very disparate papers, the important callout is that there is still more important work being done in this exciting field.
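
As a rough illustration of how boosting shifts attention toward the examples a weak learner gets wrong, the sketch below performs a single round of the standard discrete AdaBoost weight update. Note that it re-weights samples rather than literally filtering them out, and that the function name and the four-sample example are assumptions made for this sketch.

```python
import math

def adaboost_round(weights, labels, predictions):
    """One round of discrete AdaBoost re-weighting.

    labels and predictions use the values -1 and +1. Assumes the weighted
    error is strictly between 0 and 1 (the weak learner is neither perfect
    nor always wrong).
    """
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    alpha = 0.5 * math.log((1.0 - error) / error)  # the weak learner's vote weight
    # Misclassified samples (y * h == -1) get heavier; correct ones get lighter.
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Four equally weighted samples; the weak learner misclassifies the last one.
sample_weights = [0.25, 0.25, 0.25, 0.25]
true_labels = [1, 1, -1, -1]
weak_predictions = [1, 1, -1, 1]
alpha, new_weights = adaboost_round(sample_weights, true_labels, weak_predictions)
```

After the update, the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.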

Q: How can you adapt your model when working with skewed data?

Within the context of convolutional neural networks, the issue being described here is a class imbalance problem. A great paper on this topic is A systematic study of the class imbalance problem in convolutional neural networks. In general, it finds that class imbalance has a substantial negative impact on classification performance and that oversampling is currently the dominant method for addressing it.
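
A minimal sketch of random oversampling, assuming a simple list-based dataset: samples from the minority classes are duplicated at random until every class has as many examples as the largest one. The function name, the fixed seed, and the toy file names are hypothetical. Oversampling is normally applied to the training split only, so that validation and test data still reflect the real class distribution.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes are the same size."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels

# Hypothetical imbalanced training set: four cats, one dog.
images = ["cat_1", "cat_2", "cat_3", "cat_4", "dog_1"]
classes = ["cat", "cat", "cat", "cat", "dog"]
balanced_images, balanced_classes = random_oversample(images, classes)
```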