The Universal Approximation Theorem says that a Feed-Forward Neural Network (also known as a Multi-layered Network of Neurons) can act as a powerful approximator for learning the non-linear relationship between input and output. But the problem with the Feed-Forward Neural Network is that it is prone to over-fitting because of the large number of parameters the network has to learn.
Can we have another type of neural network that can learn complex non-linear relationships, but with fewer parameters, and hence is less prone to over-fitting? The Convolution Neural Network (CNN) is such a network: it enables machines to visualize things, and image classification, image recognition, object detection, and instance segmentation are some of the most common areas where CNNs are used.
Citation Note: The content and the structure of this article are based on the deep-learning lectures from One-Fourth Labs.
In the convolution operation, we are given a set of inputs and we calculate the value of the current input based on all its previous inputs and their weights. In this example, I haven’t talked about how we obtain these weights or whether these weights are right or wrong. For now, just focus on how the convolution operation works.
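As a minimal sketch of this idea, the NumPy snippet below re-estimates each value as a weighted average of the current and previous inputs. The weights here are made up for illustration; in a CNN they would be learned.

```python
import numpy as np

def conv1d(x, w):
    """Re-estimate each point as a weighted average of the current
    input and the previous len(w) - 1 inputs (no padding)."""
    k = len(w)
    return np.array([np.dot(w, x[i - k + 1:i + 1]) for i in range(k - 1, len(x))])

x = np.array([1.0, 3.0, 5.0, 7.0, 9.0, 11.0])  # a toy 1D signal
w = np.array([0.25, 0.5, 0.25])                # illustrative, not learned, weights
print(conv1d(x, w))                            # [3. 5. 7. 9.]
```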
Convolution with 2D filter
$$S_{ij} = \sum_{a} \sum_{b} K_{a,b} \, I_{i+a,\, j+b}$$

where,
K — matrix that represents the weights assigned to pixel values. It has two indices a, b: a denotes rows and b denotes columns.
I — matrix containing the input pixel values.
Sᵢⱼ — the re-estimated value of the pixel at location (i, j).

Let’s take an example to understand how the formula works. Imagine that we have an image of the Taj Mahal and a 3x3 weight matrix (also known as a kernel). In the convolution operation, we impose the kernel on the image such that the pixel of interest is aligned with the center of the kernel, and then we compute the weighted average of all its neighborhood pixels. We then slide the kernel from left to right until it passes the entire width, and then from top to bottom, to compute the weighted average of all the pixels present in the image.

Even though the input is 3D and the kernel is 3D, the convolution operation we are performing is 2D. That’s because the depth of the filter is the same as the depth of the input, so the kernel only moves along the width and the height.
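Here is a rough NumPy sketch of that sliding-window computation on a single-channel image. The 5x5 input and 3x3 averaging kernel are made up for illustration; strictly speaking the loop computes cross-correlation, which is what deep learning frameworks implement under the name convolution.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image and take the weighted
    sum of each neighborhood (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # stand-in for a grayscale image
kernel = np.full((3, 3), 1 / 9)                   # 3x3 averaging (blur) kernel
print(conv2d(image, kernel).shape)                # (3, 3): smaller than the input
```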
Application of Multiple Filters
All of these outputs can be stacked on top of each other to form a volume. If we apply three filters on the input, we get an output of depth equal to 3. In general, the depth of the output from the convolution operation is equal to the number of filters being applied on the input.

Why is the output smaller? We can’t place the kernel at the corners, as it would cross the input boundary. The values of the pixels outside the image are undefined, so we have no way to compute the weighted average of pixels in that area.
So we are not computing the weighted average and re-estimating the pixel value for every pixel in the input. This is true for all the shaded pixels present in the image (at least with a 3x3 kernel), hence the size of the output is reduced. This operation is known as Valid Padding.
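Both points, that the output depth equals the number of filters and that the output shrinks under Valid Padding, can be checked with a quick PyTorch sketch (the 7x7 input and 3 filters are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)  # (batch, channels, height, width)
conv = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3, padding=0)
print(conv(x).shape)         # torch.Size([1, 3, 5, 5])
# Depth 3 = number of filters; 7x7 shrinks to 5x5 because the 3x3
# kernel cannot be centred on the border pixels.
```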
What if we want the output to be the same size as the input?
The size of the original input was 7x7 and we also want the output size to be 7x7. In that case, we can add an artificial pad of zeros evenly around the input, such that we are able to place the kernel K (3x3) on the corner pixels and compute the weighted average of their neighbors.
By adding this artificial padding around the input, we are able to keep the shape of the output the same as the shape of the input. If we have a bigger kernel (K 5x5), the amount of padding we need to apply also increases in order to maintain the same output size. Since the size of the output is the same as the size of the input, this is called Same Padding (P).
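A small PyTorch sketch of Same Padding, with kernel sizes chosen to mirror the examples above:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)                        # 7x7 single-channel input
same3 = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # one ring of zeros
same5 = nn.Conv2d(1, 1, kernel_size=5, padding=2)  # bigger kernel, bigger pad
print(same3(x).shape)                              # torch.Size([1, 1, 7, 7])
print(same5(x).shape)                              # torch.Size([1, 1, 7, 7])
```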
So far, we have seen in the images that we slide the kernel (filter) from left to right with a certain interval until it passes the width of the image, and then from top to bottom until the entire image is traversed. Stride (S) defines the interval at which the filter is applied. By choosing a stride (interval) of more than 1, we skip a few pixels when computing the weighted averages of the neighbors. The higher the stride, the smaller the size of the output image.
If we combine the things we learned in this section into a mathematical formula, it can help us find the width and height of the output image:

$$W_{out} = \frac{W - F + 2P}{S} + 1, \qquad H_{out} = \frac{H - F + 2P}{S} + 1$$

where W and H are the input width and height, F is the filter size, P is the padding, and S is the stride. Finally, coming to the depth of the output: if we apply K filters on the input, we get K such 2D outputs. Hence the depth of the output is the same as the number of filters.
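The formula translates directly into a small helper function (the function and parameter names are mine, not from the article):

```python
def conv_output_size(w, h, f, p, s, k):
    """Output (width, height, depth) of a convolution layer, using
    W_out = (W - F + 2P) / S + 1 and the same formula for the height."""
    w_out = (w - f + 2 * p) // s + 1
    h_out = (h - f + 2 * p) // s + 1
    return w_out, h_out, k  # depth equals the number of filters K

print(conv_output_size(w=7, h=7, f=3, p=0, s=1, k=3))  # (5, 5, 3) valid padding
print(conv_output_size(w=7, h=7, f=3, p=1, s=1, k=3))  # (7, 7, 3) same padding
print(conv_output_size(w=7, h=7, f=3, p=1, s=2, k=3))  # (4, 4, 3) stride 2
```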
How did we arrive at Convolution Neural Networks?

Before we discuss Convolution Neural Networks, let’s travel back in time and understand how image classification was done in the pre-deep-learning era. That also serves as motivation for why we prefer Convolution Neural Networks for computer vision.
Let’s take the task of image classification, where we need to classify a given image into one of the classes. The earlier method of achieving this was to flatten the image, i.e., an image of 30x30x3 is flattened into a vector of 2700 values, and feed this vector into a machine learning classifier like SVM, Naive Bayes, etc. The key takeaway in this method is that we feed the raw pixels as input to the machine learning algorithm and learn the parameters of the classifier for image classification.

Instead of manually generating a feature representation of the image, why not flatten the image into a vector of 2700x1 and pass it into a Feed-Forward Neural Network or Multi-layered Network of Neurons (MLN), so that the network can learn the feature representation as well?
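A sketch of that classical pipeline, with random arrays standing in for a real dataset:

```python
import numpy as np
from sklearn.svm import SVC

images = np.random.rand(100, 30, 30, 3)     # pretend dataset of 100 RGB 30x30 images
labels = np.random.randint(0, 2, size=100)  # made-up binary labels
flat = images.reshape(100, -1)              # each image becomes a 2700-d vector
clf = SVC().fit(flat, labels)               # the classifier learns from raw pixels
print(clf.predict(flat[:5]))
```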
Unlike static methods like SIFT/HOG or edge detectors, we are not fixing the weights; we are allowing the network to learn them through back-propagation, such that the overall loss of the network reduces. A Feed-Forward Neural Network can learn a single feature representation of the image, but for complex images it will fail to give better predictions because it can’t learn the pixel dependencies present in the images.

Remember that when computing the output h₁₁ we considered only 4 inputs, and similarly for the output h₁₂. One important point to note is that we use the same 2x2 kernel to calculate h₁₁ and h₁₂, i.e., the same weights are used to compute both outputs. This is unlike the Feed-Forward Neural Network, where each neuron present in the hidden layer has separate weights for itself. This phenomenon of utilizing the same weights across the input to compute the weighted average is called Weight Sharing.
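Weight sharing is also why a convolution layer needs far fewer parameters than a fully connected one. A quick comparison in PyTorch (the layer sizes are arbitrary, picked to match the 30x30x3 example above):

```python
import torch.nn as nn

fc = nn.Linear(2700, 100)    # every hidden neuron gets its own 2700 weights
conv = nn.Conv2d(3, 100, 3)  # 100 small 3x3x3 kernels, shared across positions

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(fc))    # 270100 parameters (2700*100 weights + 100 biases)
print(count(conv))  # 2800 parameters (3*3*3*100 weights + 100 biases)
```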
Consider that we have an input volume with some width and height and a depth of 3 channels. When we apply a filter of the same depth to the input, we get a 2D output, also known as a feature map of the input. Once we have the feature map, we typically perform a Pooling operation: since the number of hidden layers required to learn the complex relations present in the image would be huge, we apply the pooling operation to reduce the input feature representation, thereby reducing the computational power required by the network.
Once we obtain the feature map of the input, we apply a filter of a determined shape across the feature map and take the maximum value from that portion of the feature map. This is known as Max Pooling. It is also known as Sub-Sampling because, from the entire portion of the feature map covered by the kernel, we are sampling one single maximum value.
Similar to Max Pooling, Average Pooling computes the average value of the portion of the feature map covered by the kernel.
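A short PyTorch illustration of both pooling operations on a made-up 4x4 feature map:

```python
import torch
import torch.nn as nn

fmap = torch.tensor([[1., 3., 2., 4.],
                     [5., 6., 7., 8.],
                     [3., 2., 1., 0.],
                     [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

print(nn.MaxPool2d(2)(fmap))  # [[6., 8.], [3., 4.]]: max of each 2x2 window
print(nn.AvgPool2d(2)(fmap))  # [[3.75, 5.25], [2.0, 2.0]]: average of each window
```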
LeNet-5
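Below is a minimal PyTorch sketch of a LeNet-5-style network. The layer sizes follow the classic LeNet-5 design, but the choice of activation and pooling here is a simplification, not the article’s exact figure:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(      # convolution + pooling stages
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
        )
        self.classifier = nn.Sequential(    # flatten + fully connected stages
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),     # softmax is applied by the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
print(model(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```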
Once we have done a series of convolution and pooling operations (either max pooling or average pooling) on the feature representation of the image, we flatten the output of the final pooling layer into a vector and pass it through Fully Connected layers (a Feed-Forward Neural Network) with a varying number of hidden layers, to learn the non-linear complexities present in the feature representation.

Finally, the output of the Fully Connected layers is passed through a Softmax layer of the desired size. The Softmax layer outputs a vector of probability distributions, which helps perform the task of image classification. In the digit recognizer problem (shown above), the output softmax layer has 10 neurons to classify the input into one of the 10 classes (digits 0–9).

In my next post, we will discuss how to visualize the workings of a Convolution Neural Network using PyTorch. Until then, Peace :) NK.
Originally published on Sep 30, 2019.