When training a deep neural network, many parameters must be initialized and then learned through forward and backward propagation. We often spend a lot of time trying different activation functions, tuning the depth of the network, the number of units, and other hyperparameters, but forget the importance of initializing the weights and biases. In this article, I'll share three initialization methods (1. zero initialization, 2. random initialization, 3. He initialization) and show their corresponding impact.
In this example, we use a three-layer neural network with the following setup: Linear –> ReLU –> Linear –> ReLU –> Linear –> Sigmoid. So the first two hidden layers are (linear + ReLU) and the last layer (L) is (linear + sigmoid), as illustrated in the figure below.
The dataset is created using the following code.
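The original code block is not reproduced here; as a hedged sketch (the generator below is my own assumption, not necessarily the article's code), a comparable two-class toy dataset can be built with NumPy:

```python
import numpy as np

# Hypothetical reconstruction of a two-class "circles"-style toy dataset;
# the sample counts, radii, and noise level are all assumptions.
np.random.seed(1)
m = 300                                    # examples per class
r_inner = np.random.randn(m) * 0.1 + 0.5   # inner ring radii
r_outer = np.random.randn(m) * 0.1 + 1.0   # outer ring radii
theta = np.random.rand(2 * m) * 2 * np.pi  # random angles
r = np.concatenate([r_inner, r_outer])
X = np.vstack([r * np.cos(theta), r * np.sin(theta)])       # (2, 600): features x examples
Y = np.concatenate([np.zeros(m), np.ones(m)]).reshape(1, -1)  # (1, 600): labels
print(X.shape, Y.shape)  # (2, 600) (1, 600)
```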
The data looks like this:
1. Zero initialization
In this case, all weights and biases are simply assigned zero using np.zeros().
The plots below show that none of the points are correctly separated and the log-loss cost function stays stagnant, since all the neurons in a layer compute the same thing.
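A minimal sketch of zero initialization (the function name and example layer sizes are my own, not necessarily the article's code):

```python
import numpy as np

def initialize_zeros(layer_dims):
    """All weights and biases set to zero: every neuron in a layer computes
    the same output, gradients are identical, and symmetry never breaks."""
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = np.zeros((layer_dims[l], layer_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_zeros([2, 10, 5, 1])  # 2 inputs, two hidden layers, 1 output
print(params["W1"].shape)  # (10, 2)
```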
2. Random initialization
In this case, the weights are randomly initialized and then scaled by a large factor of 10, with the biases set to zero. Symmetry is broken, and we can see the neural network starts to learn correctly.
Cost with Random Initialization
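The random initialization above can be sketched as follows (the function name and seed are assumptions; the scale factor of 10 is the one the text describes):

```python
import numpy as np

def initialize_random_large(layer_dims, scale=10.0):
    """Weights drawn from N(0, 1) and multiplied by a large factor (10 here);
    biases zero. Symmetry is broken, but such large weights tend to saturate
    the activations, which slows the early iterations."""
    np.random.seed(3)  # fixed seed for reproducibility (an assumption)
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * scale
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params
```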
3. He initialization
Last, we'll see how 'He' initialization works. He et al., 2015 proposed scaling each layer's randomly initialized weights by sqrt(2./layer_dims[l-1]), and we can see that this separates the two classes very well.
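In code, He initialization differs from the previous method only in the scale factor (again, the function name and seed are my own):

```python
import numpy as np

def initialize_he(layer_dims):
    """He et al., 2015: scale each weight matrix by sqrt(2 / fan_in),
    i.e. sqrt(2. / layer_dims[l-1]), which keeps the variance of ReLU
    activations roughly constant from layer to layer."""
    np.random.seed(3)
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                * np.sqrt(2.0 / layer_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params
```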
As we can see, initialization is very important in training deep neural networks. Proper initialization breaks the symmetry of the neurons in the same layer and can make your training much faster.
The forward propagation and backward propagation are shown below:
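As a stand-in for the original figure, here is a minimal NumPy sketch (my own, with assumed variable names) of the forward and backward passes for the Linear–ReLU–Linear–ReLU–Linear–Sigmoid network:

```python
import numpy as np

relu = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(X, params, L):
    """Forward pass: L-1 (linear + ReLU) layers, then (linear + sigmoid)."""
    cache = {"A0": X}
    A = X
    for l in range(1, L):
        Z = params["W%d" % l] @ A + params["b%d" % l]
        A = relu(Z)
        cache["Z%d" % l], cache["A%d" % l] = Z, A
    ZL = params["W%d" % L] @ A + params["b%d" % L]
    AL = sigmoid(ZL)
    cache["Z%d" % L], cache["A%d" % L] = ZL, AL
    return AL, cache

def backward(Y, params, cache, L):
    """Backward pass for sigmoid output with log-loss: dZ_L = A_L - Y."""
    m = Y.shape[1]
    grads = {}
    dZ = cache["A%d" % L] - Y
    for l in range(L, 0, -1):
        grads["dW%d" % l] = dZ @ cache["A%d" % (l - 1)].T / m
        grads["db%d" % l] = dZ.sum(axis=1, keepdims=True) / m
        if l > 1:
            dA = params["W%d" % l].T @ dZ
            dZ = dA * (cache["Z%d" % (l - 1)] > 0)  # ReLU derivative
    return grads
```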
One of the most frequently used activation functions in the output layer of a multi-class classification network is softmax. Softmax is defined as f(x)_i = exp(x_i) / sum_j exp(x_j); it returns a probability for each individual class, with all the probabilities summing to one. For a two-class problem, sigmoid returns the same probability as softmax.
When translating softmax into program code, there are some details to watch out for due to numerical instability associated with exploding or vanishing weights. Let's look at two examples:
When the weights increase 1000×, the probabilities become useless: either 0 or 1.
The same thing happens when the weights vanish: as they approach zero, every class ends up with the same probability.
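Both failure modes can be reproduced with a literal translation of the formula (a sketch, not the article's exact code):

```python
import numpy as np

def softmax_naive(x):
    """Literal translation of f(x)_i = exp(x_i) / sum_j exp(x_j)."""
    e = np.exp(x)
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])

# Exploding: scale the weights up 1000x and exp() overflows,
# so the probabilities degenerate into 0/1 or NaN.
print(softmax_naive(x * 1000))  # overflow warning, useless output

# Vanishing: scale the weights toward zero and every class
# collapses to the same probability (about 1/3 each here).
print(softmax_naive(x * 0.001))
```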
However, there is an easy fix that addresses both the exploding and the vanishing weights: modify the softmax function to softmax(x + c). The most common choice is c = -max(x), which leaves the shifted weight vector all non-positive. This rules out overflow, and the denominator cannot vanish because at least one element is zero, contributing exp(0) = 1. Underflow in some, but not all, of the weights is harmless.
Let’s see the impact.
As seen from the two examples, the stable implementation saves the weights from vanishing or exploding.
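The stable version described above can be written as follows (a sketch; the function name is my own):

```python
import numpy as np

def softmax_stable(x):
    """softmax(x + c) with c = -max(x): the shifted values are all <= 0,
    so exp() cannot overflow, and the denominator keeps at least one
    term equal to exp(0) = 1, so it cannot vanish."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

# The case that broke the naive version now behaves sensibly.
print(softmax_stable(np.array([1000.0, 2000.0, 3000.0])))  # [0. 0. 1.]
```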
We all know that deep learning, especially in computer vision, is resource intensive. You can see why even more directly by looking at the settings of a ConvNet configuration: the amount of memory used and the number of parameters to be stored and computed.
Source: Fei-Fei Li & Andrej Karpathy, Stanford University.
There are 13 convolution layers with filter size 3 by 3, interleaved with several pooling layers with a stride of two, followed by three fully-connected layers with on the order of thousands of nodes each. As the numbers show, the majority of the memory is in the early CONV layers, while the majority of the parameters are in the late FC layers of this ConvNet.
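A back-of-the-envelope check makes this split concrete (the shapes below are assumed from the standard VGG-16 configuration with 224×224 inputs):

```python
# First 3x3 conv layer: tiny parameter count, huge activation map.
conv1_params = 3 * 3 * 3 * 64    # 3x3 kernels, 3 input channels, 64 filters
conv1_memory = 224 * 224 * 64    # activation map: ~3.2M floats

# First fully-connected layer: tiny activation, huge parameter count.
fc1_params = 7 * 7 * 512 * 4096  # flattened 7x7x512 volume into 4096 units
fc1_memory = 4096                # activation: just 4096 floats

print(conv1_params, conv1_memory)  # 1728 3211264
print(fc1_params, fc1_memory)      # 102760448 4096
```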
Now you know why GPUs and parallel computing are so helpful for deep networks.
Deep learning is one of the hottest buzzwords in tech and is impacting everything from health care to transportation to manufacturing, and more. Companies are turning to deep learning to solve hard problems, like speech recognition, object recognition, and machine translation.
Every new breakthrough comes with challenges. The biggest challenge for deep learning is that training a model requires massive amounts of matrix multiplications and other operations. A single CPU usually has no more than a dozen cores, and that becomes a bottleneck for deep network development. The good news is that all the matrix computations can be parallelized, and that's where the GPU comes to the rescue. A single GPU may have thousands of cores, making it a perfect fit for deep learning's massive matrix operations. GPUs are much faster than CPUs for deep learning because they dedicate orders of magnitude more resources to floating-point operations, running specialized algorithms that keep their deep pipelines filled.
Now we know why a GPU is necessary for deep learning. You're probably interested in deep learning and can't wait to try it, but you don't have a big GPU in your computer. The good news is that there are public GPU servers for you to start with: Google, Amazon, and OVH all offer GPU servers for rent at very reasonable cost.
In this article, I'll show you how to set up a deep learning server on Amazon EC2, a p2 GPU instance in this case. To set up the Amazon instance, here is the prerequisite software you'll need:
- Python 2.7 (recommend anaconda)
- Cygwin with wget and vim (if on Windows)
- Install Amazon AWS Command Line Interface (AWS CLI), for Mac
Here is the fun part:
- Register an Amazon ec2 account at: https://aws.amazon.com/console/
- Go to Support –> Support Center –> Create case (only for new EC2 users). Fill in the form, click 'Submit' at the end, and wait up to 24-48 hours for the account to be activated. If you are already an EC2 user, you can skip this step.
- Create new user group. From console, Services –> Security, Identity & Compliance –> IAM –> Users –> Add user
- After creating the new user, add permissions to it by clicking the user just created.
- Obtain Access keys: Users –> Access Keys –> Create access key. Save the information.
- Now that we're done with the Amazon EC2 account, go to the Mac Terminal or Cygwin on Windows
- Download the setup files setup_p2.sh and setup_instance.sh from fast.ai. Change the extension back to .sh, since WordPress doesn't support bash file uploads
- Save the two shell scripts to your current working directory
- In the terminal, type: aws configure. Then type in the Access key ID and Secret access key saved in step 5.
- Run: bash setup_p2.sh
- Save the generated text (on terminal) for connecting to the server
- Connect to your instance: ssh -i /Users/lxxxx/.ssh/aws-key-fast-ai.pem firstname.lastname@example.org
- Check your instance by typing: nvidia-smi
- Open a Chrome browser with the URL: ec2-34-231-172-2xx.compute-1.amazonaws.com:8888 and the password: dl_course
- Now you can start to write your deep learning code in the Python Notebook.
- Shut down your instance in the console when you're done, or you'll pay a lot of money.
For a complete tutorial video, please check Jeremy Howard’s video here.
The settings and passwords are all saved under ~/.aws and ~/.ipython.