In this blog post we will derive the back propagation equations of a simple feed forward network with a single hidden layer. I have noticed that most of the courses just give out equations without any explanations. Personally, it is difficult for me to study anything without understanding the details, and I guess many of you will be in the same boat. Let's try to delve into the details and derive these equations.
Consider the network below:
Notations:
$n_x$ represents the number of input features
$n_h$ represents the number of neurons in a hidden layer
$n_y$ represents the number of neurons in the output layer
The output layer uses a sigmoid activation function. Each layer performs a linear operation on its input followed by an activation function.
(X, Y) represents m training instances, each with two features and one output label.
X : 2 * m and Y : m * 1, where m is the number of training examples.
Let me quickly jot down the forward propagation equations:
\begin{align}
&z^{[1]} = W^{[1]}X + b^{[1]}& \\
&a^{[1]} = g(z^{[1]})& \\
&z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}& \\
&a^{[2]} = sigmoid(z^{[2]})& \\
\end{align}
where dimensions are as follows: $W^{[1]}$ : ($n^{[h]}$, $n^{[x]}$), $W^{[2]}$ : ($n^{[y]}$, $n^{[h]}$),
$b^{[1]}$ : ($n^{[h]}$, $1$) and $b^{[2]}$ : ($n^{[y]}$, $1$).
It follows that the dimension of $z^{[1]}$ as well as $a^{[1]}$ is ($n^{[h]}$, $1$) and $z^{[2]}$ as well as $a^{[2]}$ is ($n^{[y]}$, $1$).
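To make the shapes concrete, here is a minimal NumPy sketch of this forward pass. It is only an illustration with my own naming (`W1`, `b1`, `W2`, `b2`, and tanh assumed as the hidden activation g); the post itself does not prescribe an implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2, g=np.tanh):
    # X: (n_x, m), W1: (n_h, n_x), b1: (n_h, 1), W2: (n_y, n_h), b2: (n_y, 1)
    z1 = W1 @ X + b1       # (n_h, m)
    a1 = g(z1)             # (n_h, m)
    z2 = W2 @ a1 + b2      # (n_y, m)
    a2 = sigmoid(z2)       # (n_y, m)
    return z1, a1, z2, a2
```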
After an iteration of forward propagation, we want to gauge how well our network is doing. For this purpose, we use a loss function. In our case we will use the logistic loss:
\begin{gather}
L = 1/m \sum_{i=1}^{m} \left \{
-y_{i}\log({a^{[2]}_i}) - (1 - y_{i})\log({1 - a^{[2]}_i})
\right \}
\end{gather}
Equation 1
Here $y_i$ represents the actual output label for instance i and $a_i^{[2]}$ represents the predicted output from the neural network.
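As a quick sanity check, here is how this loss could be computed in NumPy (a sketch assuming `a2` and `Y` both have shape (1, m); the small `eps` guarding against log(0) is my own addition):

```python
import numpy as np

def logistic_loss(a2, Y, eps=1e-12):
    # a2, Y: shape (1, m); average the per-example log losses
    m = Y.shape[1]
    return np.sum(-Y * np.log(a2 + eps) - (1 - Y) * np.log(1 - a2 + eps)) / m
```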
Now that the setup is done, let's proceed to the back propagation equations. Back propagation is a way to perform gradient descent on this neural network to adjust its weight/bias values in order to minimize the loss above. The way we do this is to start from the back, determine how each parameter affects the loss value, and then update it.
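Concretely, once each gradient is known, the parameters are nudged in the opposite direction of the gradient. The learning rate $\alpha$ below is not introduced in this post, so treat it as an assumed hyperparameter:
\begin{align}
W^{[k]} := W^{[k]} - \alpha \frac{\partial{L}}{\partial{W^{[k]}}}, \qquad
b^{[k]} := b^{[k]} - \alpha \frac{\partial{L}}{\partial{b^{[k]}}}
\nonumber
\end{align}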
Let's start with $\frac{\partial{L}}{\partial{a^{[2]}}}$, as we know the $a^{[2]}$ values (these are the predicted values).
\begin{align}
\frac{\partial{L}}{\partial{a^{[2]}}} &=
\frac{\partial{\ }}{\partial{a^{[2]}}}
\left \{
1/m \sum_{i=1}^{m} \left \{
-y_{i}\log({a^{[2]}_i}) - (1 - y_{i})\log({1 - a^{[2]}_i})
\right \}
\right \} \nonumber \\
&= -1/m\left \{
\frac{Y}{a^{[2]}} - \frac{1 - Y}{1 - a^{[2]}}
\right \} \nonumber \\
&= 1/m \left \{
\frac{a^{[2]} - Y}{a^{[2]}(1 - a^{[2]})}
\right \}
\label{eqn:lossDerivativeA2}
\end{align}
Equation 2
Using this, let's go one more step backwards by applying the chain rule:
\begin{align}
\frac{\partial{L}}{\partial{z^{[2]}}} &=
\frac{\partial{L}}{\partial{a^{[2]}}} \frac{\partial{a^{[2]}}}{\partial{z^{[2]}}}
\end{align}
Equation 3
This equation tells us how a change in $z^{[2]}$ will affect $a^{[2]}$, which in turn affects the overall loss. Note that $z^{[2]}$ directly affects only the single output neuron here.
Since the output activation is a sigmoid, $\frac{\partial{a^{[2]}}}{\partial{z^{[2]}}}$ works out to $a^{[2]}(1 - a^{[2]})$ (a quick derivation of this is included after Equation 4 below). Combining Equations 2 and 3 above, we have
\begin{align}
\frac{\partial{L}}{\partial{z^{[2]}}} &= 1/m \left \{ a^{[2]} - Y \right \}
\end{align}
Equation 4
Let's do a quick dimension check to see if things are in order: $a^{[2]}$ : m * 1 and Y : m * 1.
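As promised, here is the quick derivation of the sigmoid derivative used above, writing $a^{[2]} = \frac{1}{1 + e^{-z^{[2]}}}$:
\begin{align}
\frac{\partial{a^{[2]}}}{\partial{z^{[2]}}} &=
\frac{\partial{\ }}{\partial{z^{[2]}}}
\left \{ \frac{1}{1 + e^{-z^{[2]}}} \right \} \nonumber \\
&= \frac{e^{-z^{[2]}}}{(1 + e^{-z^{[2]}})^2} \nonumber \\
&= \frac{1}{1 + e^{-z^{[2]}}} \cdot \left \{ 1 - \frac{1}{1 + e^{-z^{[2]}}} \right \} \nonumber \\
&= a^{[2]}(1 - a^{[2]}) \nonumber
\end{align}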
So far so good. Let's try to derive $\frac{\partial{L}}{\partial{z^{[1]}}}$ using this. Here is where things start getting complex.
Before we delve further, note that $z^{[1]}$ is $n_h$ * 1, so $\frac{\partial{L}}{\partial{z^{[1]}}}$ will be of the same dimension. Let's derive a single entry of this vector, $\frac{\partial{L}}{\partial{z_{i}^{[1]}}}$, or $\partial{z_{i}^{[1]}}$ for short. Consider the general case where $z_i^{[1]}$ is connected to k neurons in the next layer.
Therefore any change in $z_i^{[1]}$ will affect all the neurons in $z^{[2]}$, which in turn will affect the loss.
\begin{align}
\frac{\partial{L}}{\partial{z_{i}^{[1]}}} &= \sum_{k}
\frac{\partial{L}}{\partial{z_{k}^{[2]}}}
\frac{\partial{z_{k}^{[2]}}}{\partial{z_{i}^{[1]}}}
\end{align}
Equation 5
Notice that the first term in this product is the expression we derived in Equation 4 above. Let's delve deeper into the second term, $\frac{\partial{z_{k}^{[2]}}}{\partial{z_{i}^{[1]}}}$:
\begin{align}
z_{k}^{[2]} &= \sum_{j} W_{kj}^{[2]}a_{j}^{[1]} + b_{k}^{[2]} \nonumber \\
&= \sum_{j} W_{kj}^{[2]}g(z_{j}^{[1]}) + b_{k}^{[2]}
\end{align}
Here, j loops across all the neurons in the hidden layer. The bias term above does not play a role while differentiating with respect to $z^{[1]}$.
Now, since we are differentiating w.r.t. $z_i^{[1]}$, the only term inside the sum above that contributes is the one where j = i.
In other words,
\begin{align}
\frac{\partial{z_{k}^{[2]}}}{\partial{z_{i}^{[1]}}} &=
\frac{\partial{\ }}{\partial{z_{i}^{[1]}}}
\left \{ W_{ki}^{[2]}g(z_{i}^{[1]}) + constant \right \}
\nonumber \\
\frac{\partial{z_{k}^{[2]}}}{\partial{z_{i}^{[1]}}} &=
W_{ki}^{[2]}g'(z_{i}^{[1]})
\end{align}
Equation 6
Depending on the activation function used in layer 1, we can compute g'($z_{i}^{[1]}$). Combining Equations 5 and 6 above, we get
\begin{align}
\frac{\partial{L}}{\partial{z_{i}^{[1]}}} &= g'(z_{i}^{[1]}) \otimes \sum_{k}W_{ki}^{[2]} * \partial{z_{k}^{[2]}}
\end{align}
Since g'($z_i^{[1]}$) doesn't depend on k, it comes out of the summation; essentially, think of it as a multiplier for the summation term. Let's express this in matrix form. We can write out a couple of expanded terms as below:
\begin{align}
\frac{\partial{L}}{\partial{z_{1}^{[1]}}} &= g'(z_{1}^{[1]}) \otimes
\left \{
W_{11}^{[2]} * \partial{z_{1}^{[2]}} +
W_{21}^{[2]} * \partial{z_{2}^{[2]}} +
W_{31}^{[2]} * \partial{z_{3}^{[2]}}
\right \} \nonumber \\
\frac{\partial{L}}{\partial{z_{2}^{[1]}}} &= g'(z_{2}^{[1]}) \otimes
\left \{
W_{12}^{[2]} * \partial{z_{1}^{[2]}} +
W_{22}^{[2]} * \partial{z_{2}^{[2]}} +
W_{32}^{[2]} * \partial{z_{3}^{[2]}}
\right \} \nonumber
\end{align}
Therefore,
\begin{align}
\frac{\partial{L}}{\partial{z^{[k]}}} &= g'(z^{[k]}) \otimes
\left \{
(W^{[k+1]})^T\frac{\partial{L}}{\partial{z^{[k+1]}}}
\right \}
\end{align}
Equation 7
Let's do a quick dimension check. Since $z^{[1]}$ : $n_h$ * 1, $\frac{\partial{L}}{\partial{z^{[1]}}}$ should be $n_h$ * 1.
$W^{[2]}$ : $n_y$ * $n_h$ and $\frac{\partial{L}}{\partial{z^{[2]}}}$ : $n_y$ * 1, so $(W^{[2]})^T\frac{\partial{L}}{\partial{z^{[2]}}}$ is $n_h$ * 1, giving us the desired result.
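As a small illustration, Equations 4 and 7 specialise to this network roughly as follows in NumPy (my own sketch, assuming g = tanh so that g'($z^{[1]}$) = 1 - ($a^{[1]}$)^2, and that the columns of each array hold the m examples):

```python
import numpy as np

def backprop_dz(a1, a2, Y, W2):
    m = Y.shape[1]
    dz2 = (a2 - Y) / m                 # Equation 4: dL/dz2, shape (n_y, m)
    dz1 = (1 - a1**2) * (W2.T @ dz2)   # Equation 7: dL/dz1, shape (n_h, m)
    return dz1, dz2
```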
Using the above equations, let's go ahead and derive the equations for the actual parameters $W^{[1]}$, $W^{[2]}$, $b^{[1]}$ and $b^{[2]}$. Note that in the equation above, g'($z^{[1]}$) can be computed from values cached during the forward pass, and $W^{[2]}$ is already available, so both can be re-used. Next, we will derive $\frac{\partial{L}}{\partial{W^{[k]}}}$, that is, how a change in the weight matrix affects the loss value. Using the chain rule again, we have
\begin{align}
\frac{\partial{L}}{\partial{W^{[k]}}} &= {\frac{\partial{L}}{\partial{z^{[k]}}}}\frac{\partial{z^{[k]}}}{\partial{W^{[k]}}}
\nonumber
\end{align}
Equation 8
Let's zoom into one of the gradient values.
\begin{align}
\frac{\partial{L}}{\partial{W_{ji}^{[k]}}} &= {\frac{\partial{L}}{\partial{z_j^{[k]}}}}\frac{\partial{z_j^{[k]}}}{\partial{W_{ji}^{[k]}}}
\nonumber
\end{align}
More formally, we want to compute how a change in $W_{ji}$ affects $z_j$, and thereby the loss L.
The first term in the above equation is nothing but the j-th component of $\frac{\partial{L}}{\partial{z^{[k]}}}$ computed in Equation 7. Let's focus on the second term, $\frac{\partial{z_j^{[k]}}}{\partial{W_{ji}^{[k]}}}$.
From the above figure, we can see that with respect to $W_{ji}$, $z_j$ is affected only through the term $a_i^{[k - 1]} W_{ji}^{[k]}$. Hence, we can write
\begin{align}
\frac{\partial{z_j^{[k]}}}{\partial{W_{ji}^{[k]}}} &= \frac{\partial{\ }}{\partial{W_{ji}^{[k]}}} \left \{ a_i^{[k - 1]} W_{ji}^{[k]} + constant \right \} \\
&= a_i^{[k - 1]}
\end{align}
Plugging this into Equation 8, we get
\begin{align}
\frac{\partial{L}}{\partial{W_{ji}^{[k]}}} &= \partial{z_j^{[k]}} a_i^{[k - 1]}
\end{align}
If we expand this derivation to all terms, we will have
\begin{align}
\frac{\partial{L}}{\partial{W^{[k]}}} &=
\begin{bmatrix}
\partial{z_1^{[k]}} a_1^{[k - 1]} & \partial{z_1^{[k]}} a_2^{[k - 1]} \\
\partial{z_2^{[k]}} a_1^{[k - 1]} & \partial{z_2^{[k]}} a_2^{[k - 1]} \\
\partial{z_3^{[k]}} a_1^{[k - 1]} & \partial{z_3^{[k]}} a_2^{[k - 1]}
\end{bmatrix}
\nonumber
\end{align}
which can be expressed as
\begin{align}
\frac{\partial{L}}{\partial{W^{[k]}}} &= \partial{z^{[k]}} (a^{[k - 1]})^T
\end{align}
Equation 9
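In vectorized form, when the columns of $\partial{z^{[k]}}$ and $a^{[k-1]}$ hold the m training examples, the matrix product below computes Equation 9 and sums the per-example outer products at the same time (a sketch with my own variable names):

```python
import numpy as np

def grad_W(dz, a_prev):
    # dz: (n_k, m), a_prev: (n_{k-1}, m) -> dW: (n_k, n_{k-1})
    return dz @ a_prev.T
```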
Let's proceed to our last derivation and compute $\frac{\partial{L}}{\partial{b^{[k]}}}$.
\begin{align}
\frac{\partial{L}}{\partial{b^{[k]}}} &= {\frac{\partial{L}}{\partial{z^{[k]}}}}\frac{\partial{z^{[k]}}}{\partial{b^{[k]}}}
\nonumber
\end{align}
Equation 10
The first term in the above equation is nothing but $\frac{\partial{L}}{\partial{z^{[k]}}}$ computed in Equation 7. Let's focus on the second term, $\frac{\partial{z^{[k]}}}{\partial{b^{[k]}}}$. Now,
$z^{[k]} = W^{[k]}a^{[k-1]} + b^{[k]}$. When differentiating with respect to $b^{[k]}$, the first term here is a constant. Therefore,
\begin{align}
\frac{\partial{z^{[k]}}}{\partial{b^{[k]}}} = \frac{\partial{\ }}{\partial{b^{[k]}}} (constant + b^{[k]}) = 1
\nonumber
\end{align}
Plugging this back into Equation 10, we get
\begin{align}
\frac{\partial{L}}{\partial{b^{[k]}}} = \partial{z^{[k]}}
\nonumber
\end{align}
Summing across all examples,
\begin{align}
\frac{\partial{L}}{\partial{b^{[k]}}} = \sum_{i = 1}^{m}\partial{z^{{[k]}^i}}
\nonumber
\end{align}
Equation 11
which can be expressed as np.sum($\partial{z^{{[k]}^i}}$, axis=1, keepdims=True), i.e. a row-wise sum over all the $\partial{z^{{[k]}^i}}$ values. Let's do a dimension check here. The row-wise sum above will result in $n^{[k]}$ * 1, which is the same as $\partial{b^{[k]}}$.
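Spelling that one-liner out with its shapes (a sketch with my own variable names):

```python
import numpy as np

def grad_b(dz):
    # dz: (n_k, m) -> db: (n_k, 1), summing the per-example gradients column-wise
    return np.sum(dz, axis=1, keepdims=True)
```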
Conclusion
Thus, we have derived all the back-propagation equations:
\begin{align}
\frac{\partial{L}}{\partial{z^{[L]}}} &= 1/m \left \{ a^{[L]} - Y \right \} \\
\frac{\partial{L}}{\partial{z^{[k]}}} &= g'(z^{[k]}) \otimes
\left \{
(W^{[k+1]})^T\frac{\partial{L}}{\partial{z^{[k+1]}}}
\right \} \\
\frac{\partial{L}}{\partial{W^{[k]}}} &= \partial{z^{[k]}} (a^{[k - 1]})^T \\
\frac{\partial{L}}{\partial{b^{[k]}}} &= \sum_{i = 1}^{m}\partial{z^{{[k]}^i}}
\nonumber
\end{align}
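To tie the four equations together, here is a minimal NumPy sketch of one full backward pass for the network above. It assumes g = tanh for the hidden layer and uses my own names for the cached forward-pass values and parameters; it is an illustration of the equations, not code from this post:

```python
import numpy as np

def backward(X, Y, cache, params):
    z1, a1, z2, a2 = cache        # values saved during the forward pass
    W1, b1, W2, b2 = params
    m = Y.shape[1]

    dz2 = (a2 - Y) / m                          # dL/dz2 (output layer)
    dW2 = dz2 @ a1.T                            # dL/dW2
    db2 = np.sum(dz2, axis=1, keepdims=True)    # dL/db2

    dz1 = (1 - a1**2) * (W2.T @ dz2)            # dL/dz1 (g' of tanh = 1 - a1^2)
    dW1 = dz1 @ X.T                             # dL/dW1
    db1 = np.sum(dz1, axis=1, keepdims=True)    # dL/db1

    return dW1, db1, dW2, db2
```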
As you may have noticed, once you understand the basic recipe, everything follows naturally. In the next post, we will see how to put these equations into practice on a real example. We will also see how TensorFlow can make your life much easier by computing the derivatives automatically.
Please add your comments below; I will be happy to answer them.