• CNN = Convolutional Neural Network
  • Reduces the number of input nodes passed to the dense layers, since convolution and pooling shrink the image
  • Tolerates small shifts in images (because pooling is used, a small shift often results in the same pooled output)
  • Takes advantage of local context or relations, as it uses filters to gather local information
  • The matrix obtained as a result of the convolution operation is called an activation map
  • Typically the ReLU activation function is used
  • Padding is usually used in CNNs, so the filter can cover the image borders
  • Stride is the step size by which the filter moves as it scans through the image
  • Typical case:
    • Filter Size of 2 or 3
    • Stride size of 2
    • Max pooling
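
How filter size, stride and padding interact can be checked with the standard output-size formula, floor((n + 2p - k) / s) + 1 along each dimension. The sketch below is only an illustration: the function name conv_output_size and the 28-pixel input are assumptions, not something these notes define.

```python
def conv_output_size(n, filter_size, stride=1, padding=0):
    """Output length along one dimension for an input of length n."""
    return (n + 2 * padding - filter_size) // stride + 1


# 3x3 filter, stride 1, padding 1: the size is preserved
print(conv_output_size(28, filter_size=3, stride=1, padding=1))  # 28

# 2x2 window, stride 2, no padding: the size is halved
print(conv_output_size(28, filter_size=2, stride=2))             # 14

# 3x3 filter, stride 2, no padding
print(conv_output_size(28, filter_size=3, stride=2))             # 13
```

The same formula applies to pooling windows, which is why a 2x2 max pool with stride 2 halves the height and width.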


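  1. The filter scans the image from left to right, top to bottom
  2. At each position, take the dot product of the filter weights and the image pixel values under the filter (element-wise multiplication, then sum)
  3. Use pooling to summarise each small region of the activation map (e.g. keep the maximum) and reduce its size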
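
The three steps above can be sketched in plain Python. This is a minimal illustration, not an efficient implementation: the helper names (conv2d, relu, max_pool), the 5x5 toy image and the 2x2 vertical-edge filter are all made up for the example, and no padding is used.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image (no padding) and return the activation map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    out = []
    for i in range(out_h):              # step 1: top to bottom
        row = []
        for j in range(out_w):          # step 1: left to right
            total = 0
            for di in range(k_h):       # step 2: element-wise multiply and sum
                for dj in range(k_w):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        out.append(row)
    return out


def relu(fmap):
    """Replace negative activations with zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2, stride=2):
    """Step 3: keep the maximum of each size x size window."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    return [
        [
            max(fmap[i * stride + di][j * stride + dj]
                for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy image with a vertical edge between columns 1 and 2
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Same image with the edge shifted one pixel to the left
shifted = [[0, 1, 1, 1, 1] for _ in range(5)]
# Hypothetical filter that responds to dark-to-bright vertical edges
kernel = [[-1, 1],
          [-1, 1]]

pooled = max_pool(relu(conv2d(image, kernel)))
pooled_shifted = max_pool(relu(conv2d(shifted, kernel)))
print(pooled)          # [[2, 0], [2, 0]]
print(pooled_shifted)  # [[2, 0], [2, 0]]
```

In this example the one-pixel shift changes the activation map but not the pooled output, because the edge response stays inside the same 2x2 pooling window. A larger shift that crosses a window boundary would change the result.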

Common Structure of Vision Models

(Filter -> Pooling) x N -> (Dense Network) x M -> Output Layer
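
To see what this structure does to the data, the sketch below traces tensor shapes through it. It only tracks shapes and trains nothing. Everything specific here is an assumption for illustration: the function name trace_shapes, the 3x3 filter with stride 1 and padding 1 (which keeps height and width), the 2x2 max pooling with stride 2, the flatten step before the dense layers, and the layer sizes in the usage example.

```python
def trace_shapes(height, width, channels, conv_filters, dense_units, num_classes):
    """Return (layer name, shape) pairs for (Filter -> Pooling) x N -> Dense x M -> Output."""
    shapes = [("input", (height, width, channels))]

    for n, filters in enumerate(conv_filters, start=1):
        # Assumed 3x3 filter, stride 1, padding 1: height and width unchanged,
        # the channel count becomes the number of filters
        channels = filters
        shapes.append((f"filter_{n}", (height, width, channels)))
        # Assumed 2x2 max pooling, stride 2: height and width are halved
        height, width = height // 2, width // 2
        shapes.append((f"pooling_{n}", (height, width, channels)))

    # Flatten the final activation maps into one vector for the dense layers
    shapes.append(("flatten", (height * width * channels,)))

    for m, units in enumerate(dense_units, start=1):
        shapes.append((f"dense_{m}", (units,)))

    shapes.append(("output", (num_classes,)))
    return shapes


# N = 2 filter/pooling blocks, M = 1 dense layer, 10 output classes
for name, shape in trace_shapes(28, 28, 1, conv_filters=[8, 16],
                                dense_units=[64], num_classes=10):
    print(name, shape)
```

With these assumed sizes, a 28x28x1 input reaches the dense layers as a vector of 7 * 7 * 16 = 784 values, which shows how the filter and pooling blocks trade spatial resolution for more channels before classification.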