In this article, I am writing down some notes on weight initialization, collected from the various sources I have been reading. Links and references are at the end of the post; consider reading them to get a clearer understanding!

NOTE: This is a post for my future self to look back on and review the material, so it'll be fairly unpolished!

Introduction

The first step to consider when building a neural network is the initialization of its parameters - the weights and biases. If this is not done correctly, layer activations can explode or vanish during forward propagation, which in turn makes the loss gradients either too large or too small.
Optimization then takes longer, and sometimes converging to a minimum with gradient descent becomes impossible.
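A quick way to see this is to push a random input through a deep stack of linear + tanh layers and watch the activation scale. The depth, widths and weight scales below are arbitrary choices, just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 256))        # a batch of 512 random inputs with 256 features

for scale in (0.01, 1.0):                  # weight scale: too small vs. too large
    a = x
    for _ in range(50):                    # 50-layer toy net with tanh activations
        W = rng.standard_normal((256, 256)) * scale
        a = np.tanh(a @ W)
    print(f"scale={scale}: std of final activations = {a.std():.2e}")

# scale=0.01 -> activations shrink towards zero layer by layer (they vanish)
# scale=1.0  -> pre-activations are huge, tanh saturates at +/-1, so gradients vanish instead
```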

Some key points to remember

  • If the weights are initialized too large or too small, the network won’t learn well, because it leads to the exploding or vanishing gradients problem.
  • All weights should not be initialized to zero.
    • If neurons start with the same weights, they all compute the same output and receive the same gradient update, so every neuron learns the same features and does the same thing as the others.
    • Neural networks are optimized towards a (local) minimum of the loss; if all the weights start at zero, gradient descent cannot break this symmetry, so it is better to give the weights different starting values (see the sketch below).
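The symmetry problem is easy to verify with a toy two-layer network. With an all-zero init the gradients in this toy net would be exactly zero, so I start every weight at the same constant value instead to show the effect; the data, sizes and learning rate are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))               # 8 samples, 4 features
y = rng.standard_normal((8, 1))               # dummy regression targets

W1 = np.full((4, 3), 0.5)                     # every hidden weight starts at the same value
W2 = np.full((3, 1), 0.5)

for _ in range(100):                          # plain gradient descent on MSE
    h = np.tanh(x @ W1)                       # hidden layer
    pred = h @ W2
    g_pred = 2 * (pred - y) / len(x)          # dMSE/dpred
    gW2 = h.T @ g_pred
    gW1 = x.T @ (g_pred @ W2.T * (1 - h**2))  # backprop through tanh
    W1 -= 0.1 * gW1
    W2 -= 0.1 * gW2

print(W1)  # all three columns are still identical: the hidden units never differentiate
```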

Weight Initialization methods

Normal Initialization

The authors of the famous AlexNet paper initialized the weights using a zero-mean Gaussian (normal) distribution with a standard deviation of 0.01. The biases were initialized to 1 for some layers and 0 for the rest.

Uniform initialization: bounded uniformly between ~ [-1/sqrt(fan_in), +1/sqrt(fan_in)]
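As a rough NumPy sketch, these two simple schemes look like this (the layer sizes are placeholders):

```python
import numpy as np

fan_in, fan_out = 256, 128                    # placeholder layer sizes

# zero-mean Gaussian with std = 0.01 (AlexNet-style)
W_normal = np.random.randn(fan_out, fan_in) * 0.01
b = np.zeros(fan_out)                         # AlexNet used 1 for some layers, 0 for the rest

# plain uniform initialization, bounded by 1/sqrt(fan_in)
limit = 1.0 / np.sqrt(fan_in)
W_uniform = np.random.uniform(-limit, limit, size=(fan_out, fan_in))
```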

But this plain random initialization of weights does not work well for training deep neural networks, because of the vanishing and exploding gradient problems.

Xavier Initialization / Glorot initialization [paper]

  • Proposed by Xavier Glorot and Yoshua Bengio
  • considers the number of input and output units while initializing the weights
  • keeps the weights within a reasonable range by making them inversely proportional to the square root of the number of units in the previous layer

Uniform: bounded uniformly between ~ [-sqrt(6 / (fan_in + fan_out)), +sqrt(6 / (fan_in + fan_out))]

Normal: multiply a standard normal distribution by sqrt(1 / fan_in)

  • np.random.randn(shape) * np.sqrt(1 / fan_in)
  • or create a normal distribution with mean = 0 and std = sqrt(1 / fan_in)
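A minimal NumPy sketch of both Xavier variants, with fan_in / fan_out as placeholder layer sizes:

```python
import numpy as np

fan_in, fan_out = 256, 128                    # placeholder layer sizes

# Xavier / Glorot uniform: U(-limit, +limit), limit = sqrt(6 / (fan_in + fan_out))
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_uniform = np.random.uniform(-limit, limit, size=(fan_out, fan_in))

# Xavier normal (simplified form): scale a standard normal by sqrt(1 / fan_in)
W_normal = np.random.randn(fan_out, fan_in) * np.sqrt(1.0 / fan_in)
```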

He initialization / Kaiming initialization [paper]

  • ReLU activations are widely used because they are more robust to the vanishing/exploding gradient problem.
  • A more robust initialization technique for activation functions like ReLU was introduced by Kaiming He et al.
  • both Xavier and He use similar theory →
    • find a good variance for the distribution from which the initial parameters are drawn

    • This variance is adapted to the activation function used

    • derived without explicitly considering the type of the distribution

    • [Figure: the two initialization distributions - red → He, blue → Xavier]

Uniform: bounded uniformly between ~ [-sqrt(6 / fan_in), +sqrt(6 / fan_in)]

Normal: multiply a standard normal distribution by sqrt(2 / fan_in)

  • or create a normal distribution with mean = 0 and std = sqrt(2 / fan_in)
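And the matching NumPy sketch for He initialization (same placeholder layer sizes as above):

```python
import numpy as np

fan_in, fan_out = 256, 128                    # placeholder layer sizes

# He / Kaiming uniform: U(-limit, +limit), limit = sqrt(6 / fan_in)
limit = np.sqrt(6.0 / fan_in)
W_uniform = np.random.uniform(-limit, limit, size=(fan_out, fan_in))

# He normal: scale a standard normal by sqrt(2 / fan_in)
W_normal = np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)
```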

Remember

  • use Xavier initialization for sigmoid, tanh and softmax activations
  • use He initialization for ReLU and Leaky ReLU activations
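A tiny helper (hypothetical, just to encode the rule above) that picks the scale from the activation name:

```python
import numpy as np

def init_weights(fan_in, fan_out, activation):
    """Hypothetical helper: choose the init scheme from the activation function."""
    if activation in ("sigmoid", "tanh", "softmax"):
        std = np.sqrt(1.0 / fan_in)           # Xavier (simplified form)
    elif activation in ("relu", "leaky_relu"):
        std = np.sqrt(2.0 / fan_in)           # He
    else:
        raise ValueError(f"no initialization rule for '{activation}'")
    return np.random.randn(fan_out, fan_in) * std

W = init_weights(256, 128, "relu")            # placeholder layer sizes
```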

Resources