
Weight Initialization


In this article, I am collecting some notes on weight initialization, taken from the various sources I have been reading on the topic. Links and references are at the end of the post! Consider reading them to get a clearer understanding!

NOTE: This is a post for my future self to look back on and review the material. So, this'll be very unpolished!

Introduction

The first step to consider while building a neural network is the initialization of its parameters - the weights and biases. If this is not done correctly, the layer activations might explode or vanish during forward propagation, which in turn makes the loss gradients either too large or too small.
Optimization then takes longer, and sometimes converging to a minimum using gradient descent becomes impossible.
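To see this concretely, here is a rough sketch (my own, not taken from any of the sources below): a stack of plain linear layers in NumPy, with the weight standard deviation chosen too small and then too large. The first case shrinks the activations toward zero, the second blows them up.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 256
x = rng.normal(size=(512, fan_in))        # a batch of 512 random inputs

for scale in (0.01, 1.0):                 # weight std: too small vs. too large
    h = x
    for _ in range(10):                   # 10 stacked linear layers, no nonlinearity
        W = rng.normal(0.0, scale, size=(fan_in, fan_in))
        h = h @ W
    print(f"weight std {scale}: activation std after 10 layers = {h.std():.2e}")
```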

Some key points to remember

Weight Initialization methods

Normal Initialization

The authors of the famous AlexNet paper initialized weights using a zero-mean Gaussian (normal) distribution with a standard deviation of 0.01. The biases were initialized to 1 for some layers and 0 for the rest.

Uniform initialization: weights drawn uniformly from [\(\frac{-1}{\sqrt{f_{in}}}, \frac{1}{\sqrt{f_{in}}}\)]
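For a single weight matrix, these two naive schemes look roughly like this (a sketch, with made-up layer sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128                 # example layer sizes

# AlexNet-style: zero-mean Gaussian with std 0.01; biases 0 (or 1 for some layers)
W_normal = rng.normal(loc=0.0, scale=0.01, size=(fan_in, fan_out))
b = np.zeros(fan_out)

# Uniform variant: bounded by +/- 1/sqrt(fan_in)
bound = 1.0 / np.sqrt(fan_in)
W_uniform = rng.uniform(-bound, bound, size=(fan_in, fan_out))
```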

But such naive random initialization of weights does not work well for training deep neural networks, because of the vanishing and exploding gradient problem.

Xavier Initialization / Glorot initialization [paper]

Uniform: weights drawn uniformly from [\(\pm \sqrt { \frac {6} {f_{in} + f_{out}}}\)]

Normal: zero-mean normal distribution with standard deviation \(\sqrt { \frac {2} {f_{in} + f_{out}}}\)
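A quick sketch of both variants with NumPy (PyTorch's `torch.nn.init.xavier_uniform_` / `xavier_normal_` implement the same formulas; the layer sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128                 # example layer sizes

# Xavier / Glorot uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
a = np.sqrt(6.0 / (fan_in + fan_out))
W_xavier_uniform = rng.uniform(-a, a, size=(fan_in, fan_out))

# Xavier / Glorot normal: N(0, std^2) with std = sqrt(2 / (fan_in + fan_out))
std = np.sqrt(2.0 / (fan_in + fan_out))
W_xavier_normal = rng.normal(0.0, std, size=(fan_in, fan_out))
```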

He initialization / Kaiming initialization [paper]

Uniform: [\(\pm \sqrt {\frac {6} {f_{in}} }\)]

Normal: zero-mean normal distribution with standard deviation \(\sqrt {\frac {2} {f_{in}}}\)
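And the same sketch for He initialization (PyTorch's `torch.nn.init.kaiming_uniform_` / `kaiming_normal_` correspond to these, up to the gain/mode arguments):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128                 # example layer sizes

# He / Kaiming uniform: U(-a, a) with a = sqrt(6 / fan_in)
a = np.sqrt(6.0 / fan_in)
W_he_uniform = rng.uniform(-a, a, size=(fan_in, fan_out))

# He / Kaiming normal: N(0, std^2) with std = sqrt(2 / fan_in)
std = np.sqrt(2.0 / fan_in)
W_he_normal = rng.normal(0.0, std, size=(fan_in, fan_out))
```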

Remember

Resources
