Showing posts with label backprop. Show all posts

Sunday, December 16, 2018

Overview of Forward and Backward Propagation in Convolutional Neural Networks

In this post, I will derive the backpropagation equations of a CNN and explain them with some code snippets. The recipe is very similar to the one I used to derive the backprop equations for a simple feed-forward network in an earlier post. If you have not read that post, I highly recommend reading it first.

What is convolution and why is this needed?
Imagine you want to train a DNN on an image, say of size 100 * 100, and you want to connect it to a fully-connected layer with 100 neurons. The weight matrix we need to learn will be of size (100 * 10000), or a million parameters. Larger images and more complex networks make the parameter space larger still. Instead of going down this path, we try to learn some features of smaller dimensions from the image and use these to build our network. This is where convolution comes in.
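To make the arithmetic concrete, here is a small sketch of the parameter counts. The fully-connected numbers come from the paragraph above; the 3 * 3 filter size and the choice of 100 filters are my own assumptions, picked only for comparison.

```python
# Fully-connected layer: every pixel connects to every neuron.
image_h, image_w = 100, 100
n_neurons = 100
fc_weights = image_h * image_w * n_neurons

# Convolutional layer (assumed for illustration): 100 filters of size 3 * 3
# on a single-channel image. The filter weights are shared across positions.
f, n_filters = 3, 100
conv_weights = f * f * n_filters

print("fully-connected weights:", fc_weights)
print("convolutional weights:  ", conv_weights)
```

Biases are left out of both counts to keep the comparison simple.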

A convolution operation is depicted below. We refer to W matrix below as a filter.


where
$z_{11} = W_{11}X_{11} + W_{12}X_{12} + W_{21}X_{21}+W_{22}X_{22}$
$z_{12} = W_{11}X_{12} + W_{12}X_{13} + W_{21}X_{22}+W_{22}X_{23}$
$z_{21} = W_{11}X_{21} + W_{12}X_{22} + W_{21}X_{31}+W_{22}X_{32}$
$z_{22} = W_{11}X_{22} + W_{12}X_{23} + W_{21}X_{32}+W_{22}X_{33}$
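The four equations above can be checked numerically. The following is a minimal NumPy sketch; the function name `conv2d_valid` and the sample values of X and W are made up for illustration.

```python
import numpy as np

def conv2d_valid(X, W):
    """Slide filter W over X (stride 1, no padding) and sum the elementwise products."""
    f_h, f_w = W.shape
    out_h = X.shape[0] - f_h + 1
    out_w = X.shape[1] - f_w + 1
    Z = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # z_ij = sum over the window of W * X, exactly as in the equations above
            Z[i, j] = np.sum(W * X[i:i + f_h, j:j + f_w])
    return Z

X = np.arange(1.0, 10.0).reshape(3, 3)   # X_11 .. X_33 = 1 .. 9 (made-up values)
W = np.array([[1.0, 0.0],
              [0.0, -1.0]])              # made-up 2 * 2 filter
Z = conv2d_valid(X, W)
print(Z)   # every entry is -4, e.g. z_11 = 1*1 + 0*2 + 0*4 + (-1)*5
```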

As a concrete example, let's pass an image through the two filters given below:
  • # The first filter converts the image to grayscale.
    • $w[0, 0, :, :] = [[0, 0, 0], [0, 0.3, 0], [0, 0, 0]]$
    • $w[0, 1, :, :] = [[0, 0, 0], [0, 0.6, 0], [0, 0, 0]]$
    • $w[0, 2, :, :] = [[0, 0, 0], [0, 0.1, 0], [0, 0, 0]]$
  • # Second filter detects horizontal edges in the blue channel.
    • $w[1, 2, :, :] = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]$
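Here is a rough sketch of how these two filters could be applied. The filter values are the ones listed above; the function `conv_layer`, the channels-first layout (C, H, W), the padding of 1 and the synthetic test image are my own assumptions.

```python
import numpy as np

def conv_layer(x, w, pad=1):
    """x: (C, H, W) image, w: (F, C, f, f) filters; stride 1 with zero padding."""
    F, C, f, _ = w.shape
    _, H, W_ = x.shape
    x_pad = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="constant")
    out = np.zeros((F, H + 2 * pad - f + 1, W_ + 2 * pad - f + 1))
    for k in range(F):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # Sum over all channels and the f * f window
                out[k, i, j] = np.sum(w[k] * x_pad[:, i:i + f, j:j + f])
    return out

w = np.zeros((2, 3, 3, 3))
# The first filter converts the image to grayscale.
w[0, 0, :, :] = [[0, 0, 0], [0, 0.3, 0], [0, 0, 0]]
w[0, 1, :, :] = [[0, 0, 0], [0, 0.6, 0], [0, 0, 0]]
w[0, 2, :, :] = [[0, 0, 0], [0, 0.1, 0], [0, 0, 0]]
# The second filter detects horizontal edges in the blue channel.
w[1, 2, :, :] = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]

# Synthetic 6 * 6 RGB image: random red/green, blue has a horizontal edge.
rng = np.random.default_rng(0)
img = rng.random((3, 6, 6))
img[2, :3, :] = 0.0   # blue channel: dark top half
img[2, 3:, :] = 1.0   # blue channel: bright bottom half

out = conv_layer(img, w)
gray = 0.3 * img[0] + 0.6 * img[1] + 0.1 * img[2]
print(np.allclose(out[0], gray))   # first output channel is the weighted channel sum
print(out[1, 1:5, 1:5])            # non-zero only in the rows next to the edge
```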


This is fascinating. You can imagine coming up with filters to detect eyes or face or edges. In convolutional neural networks, we try to learn these filters.

Max-pool layers

A CNN usually pairs a conv layer, as shown above, with a max-pool (or an average-pool) layer. Max pooling is done by applying a max filter to (usually) non-overlapping sub-regions of the initial representation.


The intuition behind this operation is that it reduces the dimensionality even further, and thereby the number of parameters to learn in the layers that follow. It tries to retain the dominant feature in each region. As the parameter space is reduced, over-fitting is reduced as well.
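As a small illustration of max pooling, here is a sketch for the common 2 * 2 window with stride 2. The function name `max_pool_2x2` and the sample matrix are made up; the reshape trick assumes the height and width are even.

```python
import numpy as np

def max_pool_2x2(A):
    """Max pooling with a 2 * 2 window and stride 2 on a 2-D array with even sides."""
    H, W = A.shape
    # View A as (H/2, 2, W/2, 2) blocks and take the max inside each 2 * 2 block.
    return A.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

A = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
P = max_pool_2x2(A)
print(P)   # [[6. 8.]
           #  [3. 4.]]
```

The 4 * 4 input shrinks to 2 * 2, and each output keeps only the largest value of its region.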

In this post, I am not going into all the details of a CNN. Please go through the references to learn more. Instead, I am going to explain code snippets for both forward propagation and backward propagation, as well as explain the equations. The code snippets are taken from the Deep Learning specialization in Coursera.

Let’s start with a couple of helper functions:
  • a) zero_pad$(X, pad)$ takes in a dataset $X$ and pads it with zeros. In this function, it’s important to take into account which dimensions you are padding. We want to pad an image along its height and width only. Check which dimensions of the input capture these and pad accordingly.
  • b) conv_single_step$(a\_slice\_prev, W, b)$ takes in a single slice of the previous activation and a single filter and performs the convolution operation as shown in the figure above. Additionally, since each filter also has a bias term associated with it, it adds that as well.
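The two helpers could look roughly like this. This is my own minimal sketch, not the course code; it assumes the input layout (m, n_H, n_W, n_C), a filter of shape (f, f, n_C_prev) and a bias of shape (1, 1, 1).

```python
import numpy as np

def zero_pad(X, pad):
    """Pad only the height and width of X, assumed to have shape (m, n_H, n_W, n_C)."""
    return np.pad(X, ((0, 0), (pad, pad), (pad, pad), (0, 0)),
                  mode="constant", constant_values=0)

def conv_single_step(a_slice_prev, W, b):
    """Apply one filter W and its bias b to one input slice of the same shape as W."""
    # Elementwise product, summed over the whole window, plus the scalar bias
    return float(np.sum(a_slice_prev * W) + np.squeeze(b))

X = np.ones((2, 3, 3, 1))
X_pad = zero_pad(X, 1)
print(X_pad.shape)   # (2, 5, 5, 1): batch and channel dimensions are untouched

z = conv_single_step(np.ones((2, 2, 1)), np.full((2, 2, 1), 0.5), np.array([[[1.0]]]))
print(z)             # 4 * 0.5 + 1.0 = 3.0
```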


Forward propagation
Now that we have the helper functions ready, let's do a forward pass through a convolutional layer. The naïve implementation is not complex, despite the many indices involved. The code below takes different slices of the input and applies the convolution operation to each.
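A naïve forward pass can be sketched as follows. This is a simplified version written for this post, not the original assignment code; the shapes (m, n_H, n_W, n_C) for activations, (f, f, n_C_prev, n_C) for filters and (1, 1, 1, n_C) for biases are assumptions.

```python
import numpy as np

def conv_forward(A_prev, W, b, stride=1, pad=0):
    """Naive convolution. A_prev: (m, n_H_prev, n_W_prev, n_C_prev),
    W: (f, f, n_C_prev, n_C), b: (1, 1, 1, n_C)."""
    m, n_H_prev, n_W_prev, _ = A_prev.shape
    f, _, _, n_C = W.shape
    n_H = (n_H_prev - f + 2 * pad) // stride + 1
    n_W = (n_W_prev - f + 2 * pad) // stride + 1
    A_pad = np.pad(A_prev, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode="constant")
    Z = np.zeros((m, n_H, n_W, n_C))
    for i in range(m):                      # loop over examples
        for h in range(n_H):                # loop over output rows
            for w in range(n_W):            # loop over output columns
                r0, c0 = h * stride, w * stride
                a_slice = A_pad[i, r0:r0 + f, c0:c0 + f, :]
                for c in range(n_C):        # loop over filters
                    Z[i, h, w, c] = np.sum(a_slice * W[:, :, :, c]) + b[0, 0, 0, c]
    cache = (A_prev, W, b, stride, pad)     # kept for the backward pass
    return Z, cache

A_prev = np.arange(1.0, 10.0).reshape(1, 3, 3, 1)
W = np.array([[1.0, 0.0], [0.0, -1.0]]).reshape(2, 2, 1, 1)
b = np.zeros((1, 1, 1, 1))
Z, cache = conv_forward(A_prev, W, b)
print(Z[0, :, :, 0])   # every entry is -4
```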

Backward propagation (Conv layer)
Let's delve into the backprop equations for a convolution layer. We will start with the equations mentioned above:
\begin{align}
z_{11} = W_{11}X_{11} + W_{12}X_{12} + W_{21}X_{21} + W_{22}X_{22} + b^{[1]} \\
z_{12} = W_{11}X_{12} + W_{12}X_{13} + W_{21}X_{22} + W_{22}X_{23} + b^{[1]} \\
z_{21} = W_{11}X_{21} + W_{12}X_{22} + W_{21}X_{31} + W_{22}X_{32} + b^{[1]} \\
z_{22} = W_{11}X_{22} + W_{12}X_{23} + W_{21}X_{32} + W_{22}X_{33} + b^{[1]}
\end{align}
Let's try to compute $\frac{\partial{L}}{\partial{W_{11}}}$. If we note the above equations, $W_{11}$ affects all the $z$ values. Using the chain rule, we can express
\begin{align} \frac{\partial{L}}{\partial{W_{11}}} = \frac{\partial{L}}{\partial{z_{11}}} * \frac{\partial{z_{11}}}{\partial{W_{11}}} + \frac{\partial{L}}{\partial{z_{12}}} * \frac{\partial{z_{12}}}{\partial{W_{11}}} + \frac{\partial{L}}{\partial{z_{21}}} * \frac{\partial{z_{21}}}{\partial{W_{11}}} + \frac{\partial{L}}{\partial{z_{22}}} * \frac{\partial{z_{22}}}{\partial{W_{11}}} \end{align}
This will evaluate to
\begin{align} \frac{\partial{L}}{\partial{W_{11}}} = \frac{\partial{L}}{\partial{z_{11}}} * X_{11} + \frac{\partial{L}}{\partial{z_{12}}} * X_{12} + \frac{\partial{L}}{\partial{z_{21}}} * X_{21} + \frac{\partial{L}}{\partial{z_{22}}} * X_{22} \end{align}
This is nothing but a convolution operation as depicted below:


In general, we multiply each gradient $\partial{Z_{hw}}$ with the corresponding input slice (earlier activation). And since the filter is shared across all positions of the input, we just add up the gradients. Generalizing this equation, we can express this as
\begin{align} \partial{W_C} += \sum_{h = 0}^{n_H}\sum_{w = 0}^{n_W}a_{slice} * \partial{Z_{hw}} \end{align}
$\partial{W_C}$ represents the derivative of the loss with respect to one filter, and $a_{slice}$ is the input slice that was used to compute $Z_{hw}$.

Let's proceed and compute $\frac{\partial{L}}{\partial{b^{[1]}}}$.
\begin{align} \frac{\partial{L}}{\partial{b^{[1]}}} = \frac{\partial{L}}{\partial{z_{11}}} * \frac{\partial{z_{11}}}{\partial{b^{[1]}}} + \frac{\partial{L}}{\partial{z_{12}}} * \frac{\partial{z_{12}}}{\partial{b^{[1]}}} + \frac{\partial{L}}{\partial{z_{21}}} * \frac{\partial{z_{21}}}{\partial{b^{[1]}}} + \frac{\partial{L}}{\partial{z_{22}}} * \frac{\partial{z_{22}}}{\partial{b^{[1]}}} \end{align}
This is nothing but
\begin{align} \frac{\partial{L}}{\partial{b^{[1]}}} = \frac{\partial{L}}{\partial{z_{11}}} + \frac{\partial{L}}{\partial{z_{12}}} + \frac{\partial{L}}{\partial{z_{21}}} + \frac{\partial{L}}{\partial{z_{22}}} \end{align}
In general,
\begin{align} \frac{\partial{L}}{\partial{b}} = \sum_h\sum_w\partial{Z_{hw}} \end{align}
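The filter and bias gradients derived so far can be sanity-checked against finite differences. The sketch below is my own illustration for a single-channel 3 * 3 input and a 2 * 2 filter; the loss $L = \sum_{hw} Z_{hw} * \partial{Z_{hw}}$ is chosen artificially so that the upstream gradient is exactly the random matrix `dZ`.

```python
import numpy as np

def conv2d_valid(X, W, b=0.0):
    """Stride-1 convolution without padding, for square X and W, plus a scalar bias."""
    f = W.shape[0]
    n = X.shape[0] - f + 1
    Z = np.zeros((n, n))
    for h in range(n):
        for w in range(n):
            Z[h, w] = np.sum(W * X[h:h + f, w:w + f]) + b
    return Z

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))
W = rng.standard_normal((2, 2))
b = 0.1
dZ = rng.standard_normal((2, 2))   # stand-in for the upstream gradient dL/dZ

# Analytic gradients: accumulate (input slice * dZ_hw) over all output positions
dW = np.zeros_like(W)
for h in range(2):
    for w in range(2):
        dW += X[h:h + 2, w:w + 2] * dZ[h, w]
db = dZ.sum()

# Numerical gradients by central differences on L = sum(Z * dZ)
def loss(W_, b_):
    return np.sum(conv2d_valid(X, W_, b_) * dZ)

eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(2):
    for j in range(2):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW_num[i, j] = (loss(Wp, b) - loss(Wm, b)) / (2 * eps)
db_num = (loss(W, b + eps) - loss(W, b - eps)) / (2 * eps)

print(np.allclose(dW, dW_num, atol=1e-6), np.isclose(db, db_num, atol=1e-6))
```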
Let's proceed to our last derivation: differentiating with respect to the input to the conv layer, $\frac{\partial{L}}{\partial{A}}$, where $A$ is the activation of the previous layer (the matrix $X$ in the equations above). Let's look at one such expression:
\begin{align} \frac{\partial{L}}{\partial{X_{11}}} = \frac{\partial{L}}{\partial{z_{11}}} * \frac{\partial{z_{11}}}{\partial{X_{11}}} + \frac{\partial{L}}{\partial{z_{12}}} * \frac{\partial{z_{12}}}{\partial{X_{11}}} + \frac{\partial{L}}{\partial{z_{21}}} * \frac{\partial{z_{21}}}{\partial{X_{11}}} + \frac{\partial{L}}{\partial{z_{22}}} * \frac{\partial{z_{22}}}{\partial{X_{11}}} \end{align}
This evaluates to:
\begin{align} \frac{\partial{L}}{\partial{X_{11}}} = \frac{\partial{L}}{\partial{z_{11}}} * W_{11} + \frac{\partial{L}}{\partial{z_{12}}} * 0 + \frac{\partial{L}}{\partial{z_{21}}} * 0 + \frac{\partial{L}}{\partial{z_{22}}} * 0 \end{align}
As we see, this is a somewhat unusual convolution (it is called a full convolution). Only select terms from the filter take part in each derivative: $X_{11}$ appears only in $z_{11}$, so only $W_{11}$ survives. We can either compute it this way, or realize that it amounts to a simple convolution with a zero-padded matrix, which is what is done in the code below.
In general,
\begin{align} \partial{A} += \sum_{h = 0}^{n_H}\sum_{w = 0}^{n_W}W_c * \partial{Z_{hw}} \end{align}
Where $W_c$ is a filter and $\partial{Z_{hw}}$ is a scalar corresponding to the gradient of the cost with respect to the output of the conv layer $Z$ at the $h$th row and $w$th column. Let's look at how these equations are used in the code below.
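The input gradient can be computed in two equivalent ways, and the sketch below compares them for a 3 * 3 input and a 2 * 2 filter. This is my own illustration with random values. Form (b) relies on the standard identity that, when the forward pass slides the filter without flipping it, the input gradient is the full convolution of the zero-padded $\partial{Z}$ with the filter rotated by 180 degrees.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((2, 2))    # one 2 * 2 filter
dZ = rng.standard_normal((2, 2))   # stand-in for the upstream gradient dL/dZ

# (a) Accumulation form: each dZ_hw adds W * dZ_hw to the input window
#     that produced Z_hw, matching the general formula above.
dX = np.zeros((3, 3))
for h in range(2):
    for w in range(2):
        dX[h:h + 2, w:w + 2] += W * dZ[h, w]

# (b) Full-convolution form: zero-pad dZ by f - 1 = 1 on every side and
#     slide the filter rotated by 180 degrees over it.
dZ_pad = np.pad(dZ, 1, mode="constant")
W_rot = np.rot90(W, 2)
dX_full = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        dX_full[i, j] = np.sum(W_rot * dZ_pad[i:i + 2, j:j + 2])

print(np.allclose(dX, dX_full))                   # both forms agree
print(np.isclose(dX[0, 0], W[0, 0] * dZ[0, 0]))   # dL/dX_11 = dL/dz_11 * W_11
```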


Backward propagation (MAX pool layer)
Since there is no parameter to learn in this layer, there are no parameter gradients to compute. Still, we have to pass the gradient from this layer on to the earlier layer. The max-pool operator takes the maximum of all the input values in a slice, so only that value (the maximum in the slice) affects the output, and only it receives a gradient. So, during the forward pass we create a mask where 1 denotes the max value and 0 denotes the other values. During backward propagation, we multiply the incoming gradient with this mask to get the gradient for the corresponding inputs.
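The mask idea can be sketched in a few lines. The helper name `create_mask_from_window` and the sample values are illustrative. Note that with this simple comparison, tied maxima would all be marked in the mask.

```python
import numpy as np

def create_mask_from_window(x):
    """Boolean mask that is True at the position(s) of the maximum of x."""
    return x == np.max(x)

a_slice = np.array([[1.0, 3.0],
                    [4.0, 2.0]])   # one pooling window from the forward pass
dP = 5.0                           # gradient arriving at the pooled output

mask = create_mask_from_window(a_slice)
dA_slice = mask * dP               # only the max position receives the gradient
print(dA_slice)   # [[0. 0.]
                  #  [5. 0.]]
```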

Please post your comments below, I will be happy to answer them.

References:
https://www.coursera.org/learn/convolutional-neural-networks
http://cs231n.github.io/assignments2018/assignment2/
https://becominghuman.ai/back-propagation-in-convolutional-neural-networks-intuition-and-code-714ef1c38199
https://medium.com/@2017csm1006/forward-and-backpropagation-in-convolutional-neural-network-4dfa96d7b37e

Thursday, December 13, 2018

DNN code from scratch

In this post, I will use the equations derived in my earlier post and build a simple neural network from scratch. I will also give the corresponding TensorFlow implementation, just to give a feel for how much easier it is to implement with TensorFlow. The code snippets are taken from the references.

Setup

In [381]:
import tensorflow as tf
In [383]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import sklearn.linear_model
import pandas as pd
from sklearn import preprocessing

%matplotlib inline

np.random.seed(1) # set a seed so that the results are consistent

Data

We use a binary-valued occupancy dataset for this exercise. You can obtain this dataset from https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+ and load it into a dataframe with pandas.

In [384]:
df_train =  pd.read_csv('occupancy_data\\datatraining.txt')
In [385]:
df_train = df_train.drop(['date'], axis = 1)
df_train.head()
Out[385]:
   Temperature  Humidity  Light     CO2  HumidityRatio  Occupancy
1        23.18   27.2720  426.0  721.25       0.004793          1
2        23.15   27.2675  429.5  714.00       0.004783          1
3        23.15   27.2450  426.0  713.50       0.004779          1
4        23.15   27.2000  426.0  708.25       0.004772          1
5        23.10   27.2000  426.0  704.50       0.004757          1
In [386]:
nTrain = df_train.shape[0]
In [387]:
Ytrain = np.transpose(df_train['Occupancy'].values)
Ytrain = Ytrain.reshape(1, nTrain)
In [388]:
df_train = df_train.drop(['Occupancy'], axis = 1)
df_train.head()
Out[388]:
   Temperature  Humidity  Light     CO2  HumidityRatio
1        23.18   27.2720  426.0  721.25       0.004793
2        23.15   27.2675  429.5  714.00       0.004783
3        23.15   27.2450  426.0  713.50       0.004779
4        23.15   27.2000  426.0  708.25       0.004772
5        23.10   27.2000  426.0  704.50       0.004757
In [389]:
Xtrain = np.transpose(df_train.values)
In [390]:
Xtrain = preprocessing.normalize(Xtrain, axis = 1)
In [391]:
print(Xtrain.shape) ## each column now represents a data instance
print(Ytrain.shape)
(5, 8143)
(1, 8143)
In [392]:
df_test =  pd.read_csv('occupancy_data\\datatest.txt')
In [393]:
df_test = df_test.drop(['date'], axis = 1)
In [394]:
nTest = df_test.shape[0]
In [395]:
Ytest = np.transpose(df_test['Occupancy'].values)
Ytest = Ytest.reshape(1, nTest)
In [396]:
df_test = df_test.drop(['Occupancy'], axis = 1)
df_test.head()
Out[396]:
     Temperature  Humidity       Light         CO2  HumidityRatio
140      23.7000    26.272  585.200000  749.200000       0.004764
141      23.7180    26.290  578.400000  760.400000       0.004773
142      23.7300    26.230  572.666667  769.666667       0.004765
143      23.7225    26.125  493.750000  774.750000       0.004744
144      23.7540    26.200  488.600000  779.000000       0.004767
In [397]:
Xtest = np.transpose(df_test.values)
In [398]:
Xtest = preprocessing.normalize(Xtest, axis = 1)
In [399]:
print(Xtest.shape) ## each column now represents a data instance
print(Ytest.shape)
(5, 2665)
(1, 2665)

DNN code

In [400]:
def layer_sizes(X, Y, numberHiddenLayers):
    """
    Arguments:
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)
    numberHiddenLayers -- number of units in the single hidden layer (despite the name)
    
    Returns:
    n_x -- the size of the input layer
    n_h -- the size of the hidden layer
    n_y -- the size of the output layer
    """
    
    n_x = X.shape[0] # size of input layer
    n_h = numberHiddenLayers
    n_y = Y.shape[0] # size of output layer
    
    return (n_x, n_h, n_y)
In [401]:
def initialize_parameters(n_x, n_h, n_y):
    """
    Argument:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer
    
    Returns:
    params -- python dictionary containing your parameters:
                    W1 -- weight matrix of shape (n_h, n_x)
                    b1 -- bias vector of shape (n_h, 1)
                    W2 -- weight matrix of shape (n_y, n_h)
                    b2 -- bias vector of shape (n_y, 1)
    """
    
    np.random.seed(2) 

    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.rand(n_y, n_h) * 0.01  # note: uniform rand() here, whereas W1 uses randn()
    b2 = np.zeros((n_y, 1))
        
    assert (W1.shape == (n_h, n_x))
    assert (b1.shape == (n_h, 1))
    assert (W2.shape == (n_y, n_h))
    assert (b2.shape == (n_y, 1))
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters
In [402]:
def sigmoid(x):
    return 1/(1 + np.exp(-x))
In [403]:
def forward_propagation(X, parameters):
    """
    Argument:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing your parameters (output of initialization function)
    
    Returns:
    A2 -- The sigmoid output of the second activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    """
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    
    # Implement Forward Propagation to calculate A2 (probabilities)
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    
    assert(A2.shape == (1, X.shape[1]))
    
    cache = {"Z1": Z1,
             "A1": A1,
             "Z2": Z2,
             "A2": A2}
    
    return A2, cache
In [404]:
def compute_cost(A2, Y, parameters):
    """
    Computes the binary cross-entropy cost, averaged over the examples
    
    Arguments:
    A2 -- The sigmoid output of the second activation, of shape (1, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    parameters -- python dictionary containing your parameters W1, b1, W2 and b2
    
    Returns:
    cost -- binary cross-entropy cost
    """
    
    m = Y.shape[1] # number of example
    
    # Retrieve W1 and W2 from parameters
    W1 = parameters['W1']
    W2 = parameters['W2']
    
    # Compute the cross-entropy cost
    logprobs = np.multiply(np.log(A2), Y) + np.multiply((1 - Y), np.log(1 - A2))
    cost = - np.sum(logprobs) / m
    
    cost = float(np.squeeze(cost))  # makes sure cost is a plain Python float,
                                    # e.g., turns [[17]] into 17.0
    assert(isinstance(cost, float))
    
    return cost
In [405]:
def backward_propagation(parameters, cache, X, Y):
    """
    Implement backward propagation for the two-layer network (tanh hidden layer, sigmoid output).
    
    Arguments:
    parameters -- python dictionary containing our parameters 
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
    X -- input data of shape (n_x, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    
    Returns:
    grads -- python dictionary containing your gradients with respect to different parameters
    """
    m = X.shape[1]
    
    # First, retrieve W1 and W2 from the dictionary "parameters".
    W1 = parameters['W1']
    W2 = parameters['W2']
        
    # Retrieve also A1 and A2 from dictionary "cache".
    A1 = cache['A1']
    A2 = cache['A2']
    
    # Backward propagation: calculate dW1, db1, dW2, db2. 
    dZ2= A2 - Y
    dW2 = (1 / m) * np.dot(dZ2, A1.T)
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.multiply(np.dot(W2.T, dZ2), 1 - np.power(A1, 2))
    dW1 = (1 / m) * np.dot(dZ1, X.T)
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)
    
    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}
    
    return grads
In [406]:
def update_parameters(parameters, grads, learning_rate=1.2):
    """
    Updates parameters using the gradient descent update rule: theta = theta - learning_rate * d(theta)
    
    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients 
    learning_rate -- step size of the gradient descent update
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
    """
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
   
    # Retrieve each gradient from the dictionary "grads"
    dW1 = grads['dW1']
    db1 = grads['db1']
    dW2 = grads['dW2']
    db2 = grads['db2']
    
    # Update rule for each parameter
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters
In [407]:
def nn_model(X, Y, n_h, num_iterations=10000, print_cost=False):
    """
    Arguments:
    X -- dataset of shape (n_x, number of examples)
    Y -- labels of shape (1, number of examples)
    n_h -- size of the hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations
    
    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    
    np.random.seed(3)
    n_x = layer_sizes(X, Y, n_h)[0]
    n_y = layer_sizes(X, Y, n_h)[2]
    
    # Initialize parameters, then retrieve W1, b1, W2, b2. Inputs: "n_x, n_h, n_y". Outputs = "W1, b1, W2, b2, parameters".
    parameters = initialize_parameters(n_x, n_h, n_y)
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
   
    # Loop (gradient descent)

    for i in range(0, num_iterations):
        # Forward propagation. Inputs: "X, parameters". Outputs: "A2, cache".
        A2, cache = forward_propagation(X, parameters)
        
        # Cost function. Inputs: "A2, Y, parameters". Outputs: "cost".
        cost = compute_cost(A2, Y, parameters)
 
        # Backpropagation. Inputs: "parameters, cache, X, Y". Outputs: "grads".
        grads = backward_propagation(parameters, cache, X, Y)
 
        # Gradient descent parameter update. Inputs: "parameters, grads". Outputs: "parameters".
        parameters = update_parameters(parameters, grads)
        
        # Print the cost every 1000 iterations
        if print_cost and i % 1000 == 0:
            print ("Cost after iteration %i: %f" % (i, cost))

    return parameters
In [408]:
parameters = nn_model(Xtrain, Ytrain, 5, num_iterations=10000, print_cost=True)
Cost after iteration 0: 0.693147
Cost after iteration 1000: 0.512973
Cost after iteration 2000: 0.420606
Cost after iteration 3000: 0.326401
Cost after iteration 4000: 0.264507
Cost after iteration 5000: 0.226376
Cost after iteration 6000: 0.065537
Cost after iteration 7000: 0.064785
Cost after iteration 8000: 0.064379
Cost after iteration 9000: 0.063994
In [648]:
def predict(parameters, X):
    """
    Using the learned parameters, predicts a class for each example in X
    
    Arguments:
    parameters -- python dictionary containing your parameters 
    X -- input data of size (n_x, m)
    
    Returns
    predictions -- vector of predictions of our model (0: not occupied / 1: occupied)
    """
    
    # Computes probabilities using forward propagation, and classifies to 0/1 using 0.5 as the threshold.
    A2, cache = forward_propagation(X, parameters)
    predictions = np.round(A2)
    
    return predictions
In [649]:
predictions = predict(parameters, Xtest)
In [650]:
predictions
Out[650]:
array([[ 1.,  1.,  1., ...,  1.,  1.,  1.]])
In [654]:
accuracy = np.mean(predictions == Ytest) * 100  # percentage of correctly classified test examples
print('Accuracy: %d%%' % accuracy)
Accuracy: 97%

Below is a TensorFlow implementation of the same network as above. Observe that most of the heavy lifting is done by TensorFlow and the amount of code reduces significantly. Also note that we only define the layers and the forward prop equations; TF takes care of backprop on its own. The code uses the TensorFlow 1.x graph API (placeholders and sessions).


Setup

In [134]:
import tensorflow as tf
In [136]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import sklearn.linear_model
import pandas as pd
from sklearn import preprocessing

%matplotlib inline

np.random.seed(1) # set a seed so that the results are consistent

Network using Tensorflow

Define the variables and placeholders

In [153]:
tf.reset_default_graph()
In [176]:
X = tf.placeholder(tf.float32,shape=[5,None])

W1 = tf.Variable(tf.random_normal([5,5])) 

b1 = tf.Variable(tf.zeros([5, 1]))

W2 = tf.Variable(tf.random_normal([1,5]))

b2 = tf.Variable(tf.zeros([1, 1]))

y_true = tf.placeholder(tf.float32,[1,None])

Create the Graph

In [177]:
h = tf.matmul(W1, X) + b1 
h = tf.nn.tanh(h)
y = tf.matmul(W2, h) + b2
y = tf.sigmoid(y)

Define the loss and optimizer

In [192]:
loss = tf.reduce_mean(-y_true * tf.log(y) - (1 - y_true) * tf.log(1 - y))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5)

train = optimizer.minimize(loss)

Train the model

In [196]:
# Train the model

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    
    # Train the model for 10000 full-batch gradient descent steps on the training set
    
    for step in range(10000):
        _,cost = sess.run([train, loss] , feed_dict={X:Xtrain, y_true:Ytrain})
        if (step % 1000 == 0):
            print("Cost after iteration %i: %f" % (step, cost))

    ypred = y.eval({X: Xtest})
    pred = np.round(ypred)
    correct_predictions = np.equal(pred, Ytest)
    print("\nAccuracy:", np.sum(correct_predictions)/Xtest.shape[1])
Cost after iteration 0: 0.690228
Cost after iteration 1000: 0.430140
Cost after iteration 2000: 0.450875
Cost after iteration 3000: 0.398449
Cost after iteration 4000: 0.301684
Cost after iteration 5000: 0.075488
Cost after iteration 6000: 0.068829
Cost after iteration 7000: 0.066808
Cost after iteration 8000: 0.065850
Cost after iteration 9000: 0.065270

Accuracy: 0.971857410882

References
https://www.coursera.org/learn/neural-networks-deep-learning?specialization=deep-learning
https://github.com/cs231n/cs231n.github.io