Sunday, December 16, 2018

Overview of Forward and Backward Propagation in Convolutional Neural Networks

In this post, I will derive the backpropagation equations of a CNN and explain them with some code snippets. The recipe is very similar to the one I followed when deriving the backprop equations for a simple feed-forward network in this post. If you have not read that earlier post, I would highly recommend reading through it first.

What is convolution and why is this needed?
Imagine you want to train a DNN on an image, say of size 100 x 100, and you want to connect it to a fully-connected layer with 100 neurons. The weight matrix we would need to learn has size (100 x 10000), or a million parameters. Extending this to larger images and more complex networks makes the parameter space even larger. Instead of going down this path, we try to learn features of smaller dimensions from the image and use them to build our network. This is where convolution comes in.

A convolution operation is depicted below. We refer to the matrix W below as a filter.


where
$z_{11} = W_{11}X_{11} + W_{12}X_{12} + W_{21}X_{21}+W_{22}X_{22}$
$z_{12} = W_{11}X_{12} + W_{12}X_{13} + W_{21}X_{22}+W_{22}X_{23}$
$z_{21} = W_{11}X_{21} + W_{12}X_{22} + W_{21}X_{31}+W_{22}X_{32}$
$z_{22} = W_{11}X_{22} + W_{12}X_{23} + W_{21}X_{32}+W_{22}X_{33}$
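To make the operation concrete, here is a minimal NumPy sketch (my own illustration, not taken from the Coursera code) that slides a 2 x 2 filter over a 3 x 3 input with stride 1 and no padding, producing exactly the four $z_{ij}$ values above:

import numpy as np

def conv2d_naive(X, W):
    # Slide the filter W over X (stride 1, no padding) and return the output Z.
    f = W.shape[0]
    n_h, n_w = X.shape[0] - f + 1, X.shape[1] - f + 1
    Z = np.zeros((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            # element-wise product of the filter and the current slice, then sum
            Z[i, j] = np.sum(W * X[i:i + f, j:j + f])
    return Z

X = np.arange(1, 10).reshape(3, 3)   # a 3 x 3 input
W = np.array([[1, 0], [0, -1]])      # an arbitrary 2 x 2 filter
print(conv2d_naive(X, W))            # 2 x 2 output matching the z_ij equations above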

As a concrete example, let's pass an image through the two filters given below:
  • # The first filter converts the image to grayscale.
    • $w[0, 0, :, :] = [[0, 0, 0], [0, 0.3, 0], [0, 0, 0]]$
    • $w[0, 1, :, :] = [[0, 0, 0], [0, 0.6, 0], [0, 0, 0]]$
    • $w[0, 2, :, :] = [[0, 0, 0], [0, 0.1, 0], [0, 0, 0]]$
  • # Second filter detects horizontal edges in the blue channel.
    • $w[1, 2, :, :] = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]$


This is fascinating. You can imagine coming up with filters to detect eyes, faces, or edges. In convolutional neural networks, we try to learn these filters.

Max-pool layers

A CNN usually has a conv layer as shown above coupled with a max-pool (or an average-pool) layer. Max pooling is done by applying a max filter to (usually) non-overlapping sub-regions of the initial representation.


The intuition behind this operation is that it reduces the dimensionality even further, and thereby the number of parameters to learn, while retaining the dominant feature in each region. Since the parameter space is smaller, it also helps reduce over-fitting.
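As a quick illustration (my own sketch, not from the assignment), here is 2 x 2 max pooling with stride 2 on a small input:

import numpy as np

def max_pool(X, f=2, stride=2):
    # Apply an f x f max filter over non-overlapping regions of a 2-D input
    # (assumes the input dimensions are divisible by the stride).
    n_h, n_w = X.shape[0] // stride, X.shape[1] // stride
    out = np.zeros((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            out[i, j] = np.max(X[i*stride:i*stride + f, j*stride:j*stride + f])
    return out

X = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 8, 1, 0],
              [3, 2, 4, 9]])
print(max_pool(X))   # [[6. 5.]
                     #  [8. 9.]]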

In this post, I am not going into all the details of a CNN; please go through the references to learn more. Instead, I am going to walk through code snippets for both forward propagation and backward propagation and explain the corresponding equations. The code snippets are taken from the Deep Learning specialization on Coursera.

Let's start with a couple of helper functions (both are sketched below):
  • a) zero_pad$(X, pad)$ takes in a dataset $X$ and pads it with zeros. In this function, it is important to take into account which dimensions you are padding: we want to pad an image along its height and width, so check which dimensions of the input capture these and pad accordingly.
  • b) conv_single_step$(a\_slice\_prev, W, b)$ takes in a single slice of the previous activation and a single filter and performs the convolution operation shown in the figure above. Additionally, since each filter also has a bias term associated with it, it adds that as well.
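Roughly, the two helpers look like this (a sketch paraphrasing the Coursera assignment; the (m, n_H, n_W, n_C) tensor layout is an assumption carried through the rest of the code):

import numpy as np

def zero_pad(X, pad):
    # Pad only the height and width dimensions of X, which has shape (m, n_H, n_W, n_C).
    return np.pad(X, ((0, 0), (pad, pad), (pad, pad), (0, 0)),
                  mode='constant', constant_values=0)

def conv_single_step(a_slice_prev, W, b):
    # Convolve one filter W of shape (f, f, n_C_prev) with one slice of the previous
    # activation and add the (scalar) bias associated with that filter.
    return np.sum(a_slice_prev * W) + float(b)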


Forward propagation
Now that we have the helper functions ready, let's do a forward pass through a convolutional layer. The naïve implementation is not complex, apart from keeping track of the different indices. The code below takes different slices of the input and applies the convolution operation.
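Here is a sketch of that forward pass in the spirit of the Coursera assignment (the hparameters dictionary with 'stride' and 'pad' keys, and the tensor shapes in the comments, are my assumptions), reusing the two helpers above:

def conv_forward(A_prev, W, b, hparameters):
    # A_prev: (m, n_H_prev, n_W_prev, n_C_prev), W: (f, f, n_C_prev, n_C), b: (1, 1, 1, n_C)
    m, n_H_prev, n_W_prev, n_C_prev = A_prev.shape
    f, _, _, n_C = W.shape
    stride, pad = hparameters['stride'], hparameters['pad']

    # output dimensions of the conv layer
    n_H = (n_H_prev - f + 2 * pad) // stride + 1
    n_W = (n_W_prev - f + 2 * pad) // stride + 1

    Z = np.zeros((m, n_H, n_W, n_C))
    A_prev_pad = zero_pad(A_prev, pad)

    for i in range(m):                              # loop over training examples
        a_prev_pad = A_prev_pad[i]
        for h in range(n_H):                        # vertical position in the output
            for w in range(n_W):                    # horizontal position in the output
                for c in range(n_C):                # output channel (= filter index)
                    vert_start, horiz_start = h * stride, w * stride
                    a_slice_prev = a_prev_pad[vert_start:vert_start + f,
                                              horiz_start:horiz_start + f, :]
                    Z[i, h, w, c] = conv_single_step(a_slice_prev, W[..., c], b[..., c])

    cache = (A_prev, W, b, hparameters)             # saved for the backward pass
    return Z, cache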

Backward propagation (Conv layer)
Let's delve into the backprop equations for a convolution layer. We will start with the equations mentioned above, this time including the bias term:
\begin{align} z_{11} = W_{11}X_{11} + W_{12}X_{12} + W_{21}X_{21} + W_{22}X_{22} + b^{[1]} \\ z_{12} = W_{11}X_{12} + W_{12}X_{13} + W_{21}X_{22} + W_{22}X_{23} + b^{[1]} \\ z_{21} = W_{11}X_{21} + W_{12}X_{22} + W_{21}X_{31} + W_{22}X_{32} + b^{[1]} \\ z_{22} = W_{11}X_{22} + W_{12}X_{23} + W_{21}X_{32} + W_{22}X_{33} + b^{[1]} \end{align}
Let's try to compute $\frac{\partial{L}}{\partial{W_{11}}}$. Note from the equations above that $W_{11}$ affects all the $z$ values. Using the chain rule, we can write
\begin{align} \frac{\partial{L}}{\partial{W_{11}}} = \frac{\partial{L}}{\partial{z_{11}}} * \frac{\partial{z_{11}}}{\partial{W_{11}}} + \frac{\partial{L}}{\partial{z_{12}}} * \frac{\partial{z_{12}}}{\partial{W_{11}}} + \frac{\partial{L}}{\partial{z_{21}}} * \frac{\partial{z_{21}}}{\partial{W_{11}}} + \frac{\partial{L}}{\partial{z_{22}}} * \frac{\partial{z_{22}}}{\partial{W_{11}}} \end{align}
This evaluates to
\begin{align} \frac{\partial{L}}{\partial{W_{11}}} = \frac{\partial{L}}{\partial{z_{11}}} * X_{11} + \frac{\partial{L}}{\partial{z_{12}}} * X_{12} + \frac{\partial{L}}{\partial{z_{21}}} * X_{21} + \frac{\partial{L}}{\partial{z_{22}}} * X_{22} \end{align}
This is nothing but a convolution operation, as depicted below:


In general, we multiply each gradient $\partial{Z_{hw}}$ with the corresponding input slice (the earlier layer's activation), and since the filter is shared across all positions of the input, we simply add up the contributions. Generalizing, we can express this as
\begin{align} \partial{W_c} += \sum_{h = 0}^{n_H}\sum_{w = 0}^{n_W}a_{slice} * \partial{Z_{hw}} \end{align}
where $\partial{W_c}$ represents the derivative of the loss with respect to one filter.
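As a quick sanity check (my own illustration), we can verify on random numbers that the chain-rule expression for $\frac{\partial{L}}{\partial{W_{11}}}$ matches the convolution of the input with $\partial{Z}$:

import numpy as np

np.random.seed(0)
X = np.random.rand(3, 3)     # the 3 x 3 input slice
dZ = np.random.rand(2, 2)    # arbitrary upstream gradients dL/dz_ij

# chain rule for dL/dW11, written out term by term as above
dW11 = (dZ[0, 0] * X[0, 0] + dZ[0, 1] * X[0, 1] +
        dZ[1, 0] * X[1, 0] + dZ[1, 1] * X[1, 1])

# the same quantity as a 2 x 2 convolution of X with dZ
dW = np.array([[np.sum(X[i:i + 2, j:j + 2] * dZ) for j in range(2)] for i in range(2)])
print(np.isclose(dW11, dW[0, 0]))   # True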

Let's proceed and compute $\frac{\partial{L}}{\partial{b^{[1]}}}$.
\begin{align} \frac{\partial{L}}{\partial{b^{[1]}}} = \frac{\partial{L}}{\partial{z_{11}}} * \frac{\partial{z_{11}}}{\partial{b^{[1]}}} + \frac{\partial{L}}{\partial{z_{12}}} * \frac{\partial{z_{12}}}{\partial{b^{[1]}}} + \frac{\partial{L}}{\partial{z_{21}}} * \frac{\partial{z_{21}}}{\partial{b^{[1]}}} + \frac{\partial{L}}{\partial{z_{22}}} * \frac{\partial{z_{22}}}{\partial{b^{[1]}}} \end{align}
This is nothing but
\begin{align} \frac{\partial{L}}{\partial{b^{[1]}}} = \frac{\partial{L}}{\partial{z_{11}}} + \frac{\partial{L}}{\partial{z_{12}}} + \frac{\partial{L}}{\partial{z_{21}}} + \frac{\partial{L}}{\partial{z_{22}}} \end{align}
In general,
\begin{align} \frac{\partial{L}}{\partial{b}} = \sum_h\sum_w\partial{Z_{hw}} \end{align}
Let's proceed to our last derivation: differentiating with respect to the input of the conv layer, $\frac{\partial{L}}{\partial{A}}$ (the input $X$ in the equations above is just the previous layer's activation $A$). Let's look at one such expression:
\begin{align} \frac{\partial{L}}{\partial{A_{11}}} = \frac{\partial{L}}{\partial{z_{11}}} * \frac{\partial{z_{11}}}{\partial{A_{11}}} + \frac{\partial{L}}{\partial{z_{12}}} * \frac{\partial{z_{12}}}{\partial{A_{11}}} + \frac{\partial{L}}{\partial{z_{21}}} * \frac{\partial{z_{21}}}{\partial{A_{11}}} + \frac{\partial{L}}{\partial{z_{22}}} * \frac{\partial{z_{22}}}{\partial{A_{11}}} \end{align}
This evaluates to:
\begin{align} \frac{\partial{L}}{\partial{A_{11}}} = \frac{\partial{L}}{\partial{z_{11}}} * W_{11} + \frac{\partial{L}}{\partial{z_{12}}} * 0 + \frac{\partial{L}}{\partial{z_{21}}} * 0 + \frac{\partial{L}}{\partial{z_{22}}} * 0 \end{align}
As we can see, this is a somewhat strange convolution (it is actually called a full convolution): only select terms of the filter take part in each derivative. We can either compute it this way, or realize that it is equivalent to a simple convolution involving a zero-padded matrix, which is what is done in the code below.
In general,
\begin{align} \partial{A} += \sum_{h = 0}^{n_H}\sum_{w = 0}^{n_W}W_c * \partial{Z_{hw}} \end{align}
where $W_c$ is a filter and $\partial{Z_{hw}}$ is a scalar corresponding to the gradient of the cost with respect to the output of the conv layer $Z$ at the $h$th row and $w$th column. Let's look at these equations in the code below.
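Putting the three gradients together, the backward pass looks roughly like this (a sketch in the spirit of the Coursera assignment, reusing the cache and the helpers from the forward pass above):

def conv_backward(dZ, cache):
    # dZ: gradient of the cost w.r.t. the conv layer output, shape (m, n_H, n_W, n_C)
    A_prev, W, b, hparameters = cache
    f, _, _, n_C = W.shape
    stride, pad = hparameters['stride'], hparameters['pad']
    m, n_H, n_W, n_C = dZ.shape

    dA_prev = np.zeros_like(A_prev)
    dW = np.zeros_like(W)
    db = np.zeros_like(b)

    A_prev_pad = zero_pad(A_prev, pad)
    dA_prev_pad = zero_pad(dA_prev, pad)

    for i in range(m):
        a_prev_pad, da_prev_pad = A_prev_pad[i], dA_prev_pad[i]
        for h in range(n_H):
            for w in range(n_W):
                for c in range(n_C):
                    vert_start, horiz_start = h * stride, w * stride
                    a_slice = a_prev_pad[vert_start:vert_start + f,
                                         horiz_start:horiz_start + f, :]
                    # dA: spread each output gradient back through the filter weights
                    da_prev_pad[vert_start:vert_start + f,
                                horiz_start:horiz_start + f, :] += W[:, :, :, c] * dZ[i, h, w, c]
                    # dW: input slice times the output gradient, summed over all positions
                    dW[:, :, :, c] += a_slice * dZ[i, h, w, c]
                    # db: just the sum of the output gradients
                    db[:, :, :, c] += dZ[i, h, w, c]
        # strip the padding to recover the gradient w.r.t. the unpadded input
        dA_prev[i] = da_prev_pad[pad:-pad, pad:-pad, :] if pad > 0 else da_prev_pad

    return dA_prev, dW, db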


Backward propagation (MAX pool layer)
Since there is no parameter to learn in this layer, there is no explicit parameter gradient to compute. Still, we have to pass the gradient from this layer back to the earlier layer. If you look at this operator, it takes the max of all the input values in a slice; in effect, only that value (the maximum in the slice) affects the output and hence the gradient. So we create a mask where 1 marks the position of the maximum value and 0 marks the other values. Then we just multiply the gradient with this mask to route it to the corresponding input during the backward propagation.
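A sketch of this idea (following the structure of the Coursera assignment; passing the cached input A_prev and the f/stride arguments directly is an assumption on my part):

def create_mask_from_window(x):
    # 1 (True) at the position of the maximum of x, 0 (False) everywhere else
    return x == np.max(x)

def pool_backward(dA, A_prev, f=2, stride=2):
    # Route each incoming gradient to the input element that was the max in its window.
    m, n_H, n_W, n_C = dA.shape
    dA_prev = np.zeros_like(A_prev)

    for i in range(m):
        for h in range(n_H):
            for w in range(n_W):
                for c in range(n_C):
                    vert_start, horiz_start = h * stride, w * stride
                    a_slice = A_prev[i, vert_start:vert_start + f,
                                        horiz_start:horiz_start + f, c]
                    mask = create_mask_from_window(a_slice)
                    # only the max entry of the window receives the gradient
                    dA_prev[i, vert_start:vert_start + f,
                               horiz_start:horiz_start + f, c] += mask * dA[i, h, w, c]
    return dA_prev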

Please post your comments below; I will be happy to answer them.

References:
https://www.coursera.org/learn/convolutional-neural-networks
http://cs231n.github.io/assignments2018/assignment2/
https://becominghuman.ai/back-propagation-in-convolutional-neural-networks-intuition-and-code-714ef1c38199
https://medium.com/@2017csm1006/forward-and-backpropagation-in-convolutional-neural-network-4dfa96d7b37e
