Vanishing Gradient

  • Vanishing gradient means the gradient becomes so small that, to the computer, it is effectively 0
  • Why does it arise?
    • Because of limited floating-point (memory) precision
    • Because the gradient is a product of per-layer derivatives; in a very deep network many factors smaller than 1 are multiplied together and the product shrinks toward 0 (see the sketch after this list)
  • How to identify it?
    • Parameters of the top (later) layers keep changing, whereas parameters of the bottom (earlier) layers barely change (the gradient-norm check below the list makes this visible)
    • The model learns at a very slow pace
    • Training can stagnate very early, after only a few iterations
  • What to do?
    • The remedy depends on the architecture and on the cause of the vanishing gradient
    • A few common remedies are
      1. LSTM
      2. ReLU - which can in turn introduce exploding gradients
      3. Batch Normalization
      4. Weight Initialization (e.g. Xavier/He)
      5. Skip Connection
      6. GRU
      7. Reduce network depth
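
A minimal sketch of the second cause (the depth and the weight value are made-up illustrative numbers): with a sigmoid activation the local derivative is at most 0.25, so backpropagating through many layers multiplies many factors smaller than 1 and the gradient shrinks toward 0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy backprop through a deep stack of 1-unit sigmoid layers.
# Each layer contributes a factor w * sigmoid'(z); sigmoid'(z) <= 0.25,
# so the overall gradient shrinks geometrically with depth.
depth = 50   # illustrative depth
w = 0.5      # illustrative weight shared by every layer
z = 0.0      # pre-activation at which the derivative is evaluated

grad = 1.0
for _ in range(depth):
    grad *= w * sigmoid(z) * (1.0 - sigmoid(z))  # chain rule, one layer

print(f"gradient after {depth} layers: {grad:.3e}")  # ~7e-46, underflows to 0 in float32
```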

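To spot the first symptom in practice, a sketch assuming PyTorch (the model, layer sizes, and dummy batch are placeholders): print the gradient norm of each layer's weights after one backward pass; near-zero norms in the early layers while the later layers stay healthy is the tell-tale pattern.

```python
import torch
import torch.nn as nn

# Placeholder model: a deep stack of sigmoid layers, prone to vanishing gradients.
blocks = [nn.Sequential(nn.Linear(64, 64), nn.Sigmoid()) for _ in range(20)]
model = nn.Sequential(*blocks, nn.Linear(64, 1))

x, y = torch.randn(32, 64), torch.randn(32, 1)   # dummy batch
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Gradient norms, listed from the input-side layers to the output-side layers.
for name, p in model.named_parameters():
    if name.endswith("weight"):
        print(f"{name:12s} grad norm = {p.grad.norm().item():.3e}")
```
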
TODO:

  1. Read how each of these remedies solves the problem
  2. Create flash cards