Wednesday, January 2, 2019

Paxos Basics

Paxos is a family of distributed algorithms used to reach consensus in distributed systems. In this post, we will briefly discuss single-decree Paxos. The algorithm was first proposed by Leslie Lamport in his paper "The Part-Time Parliament" and later described more directly in "Paxos Made Simple". Paxos is commonplace in almost all distributed system environments.


Why do we need consensus in the first place?
Most of these distributed systems comprise tens of thousands of machines working in unison. One employs replication for redundancy as well as efficiency. But once you have replicas, you have the challenge of keeping them consistent, and failure is the norm in such environments. There are multiple scenarios where consensus plays an important role: keeping data consistent across replicas, leader election, or even resource sharing/locking.
Let's keep all these aside and start with a simple example.

Say there are 4 friends. They want to do something over the weekend together. They don't care about what they will do specifically, just that they want to do something together. Here is a timeline of various events.

Once a majority agrees on a proposal, that is consensus.

In the Paxos world, the involved parties want to agree on a result, not on their own proposal. Communication channels may be faulty, leading to message loss.

Paxos basics
  • Proposers: Propose a new value to reach consensus
  • Acceptors: Participate in reaching consensus
  • Learners: Learn about the new value and can be queried
  • Paxos nodes can take multiple roles, even all of them
  • Paxos nodes must know how many acceptors a majority is (quorum)
  • Paxos nodes must be persistent, they can't forget what they accepted
  • A Paxos run aims at reaching a single consensus: it can agree on only one value, and this cannot change. If you want to mutate a value, a different Paxos run is needed.
  • Paxos (there are other, more complex variations) assumes fail-stop failures, not Byzantine failures.
With this introduction, let's get into a Paxos run.


Let's say we have one proposer and 5 acceptors. Majority in this case is 3.

The proposer wants to propose a value. It sends a PREPARE(ID) to all the acceptors (or a majority of them). The ID must be unique across rounds. If the proposer doesn't hear back from the acceptors, it times out and retries with a new (higher) ID. This ID should not be confused with the actual value on which consensus must be reached; it can be seen as a round ID.



An acceptor gets a PREPARE message with ID 10. It checks whether it has promised to ignore requests with this ID.
If Yes, it ignores the message.
If No, it promises to ignore any request with an ID lower than 10, subject to some conditions which we will see later. If a majority of acceptors promise, no request with ID < 10 will make it through.


The proposer gets a majority of PROMISE messages for ID = 10. It sends an ACCEPT-REQUEST(10, "VAL1") to a majority (or all) of the acceptors, subject to a condition we will see below.


An acceptor receives an ACCEPT-REQUEST message for an (ID, value) pair. It checks whether it has promised to ignore this ID.
If Yes, it ignores the message.
If No, it replies with ACCEPT(10, "VAL1") and also sends it to all learners. If a majority of acceptors accept an (ID, value) pair, consensus is reached. Consensus will always be on this value, called the "chosen" value; it will never change for this round.
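To make the acceptor's bookkeeping concrete, here is a minimal Python sketch of the two handlers described above (class and message names are mine, not from the paper):

class Acceptor:
    def __init__(self):
        self.promised_id = None    # highest ID this acceptor has promised
        self.accepted_id = None    # ID of the last accepted proposal, if any
        self.accepted_value = None

    def on_prepare(self, proposal_id):
        # Ignore the PREPARE if we already promised an equal or higher ID.
        if self.promised_id is not None and proposal_id <= self.promised_id:
            return None
        self.promised_id = proposal_id
        # Piggyback any previously accepted (ID, value) pair on the promise.
        return ("PROMISE", proposal_id, self.accepted_id, self.accepted_value)

    def on_accept_request(self, proposal_id, value):
        # Honor the promise: ignore requests below the promised ID.
        if self.promised_id is not None and proposal_id < self.promised_id:
            return None
        self.promised_id = proposal_id
        self.accepted_id, self.accepted_value = proposal_id, value
        return ("ACCEPT", proposal_id, value)  # also sent to the learners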

Proposers and learners get ACCEPT messages for the (ID, value) pair. If a proposer/learner gets a majority of accepts for this (ID, value) pair, it knows that consensus has been reached on that value. The acceptors themselves don't know that consensus has been reached.

Let's say a new proposer comes along and sends a PREPARE 9 message. Since a majority of acceptors have promised to ignore IDs below 10, this request will time out. Now, say the proposer retries with a higher ID, 11.



The acceptors check the ID, which is greater than 10, so they reply with a PROMISE message subject to the following:

  • Have they ever accepted anything?
  • If Yes, reply with PROMISE(ID, accepted ID1, value).
  • If No, reply with PROMISE(ID).
In this case, since the acceptors have already accepted a value, they will send PROMISE(11, 10, "VAL1").

The proposer gets a majority of PROMISE messages. Here it checks:

  • Have I received any already-accepted value in the promises?
  • If Yes, it picks the value with the highest accepted ID that it got (as sketched below).
  • If No, it picks any value it wants.
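As a sketch, the proposer's value-selection rule might look like this in Python (a toy helper, not from the paper):

def choose_value(promises, my_value):
    # promises: list of (promised_id, accepted_id, accepted_value) tuples
    accepted = [(aid, aval) for (_, aid, aval) in promises if aid is not None]
    if accepted:
        # Must adopt the value carried by the highest accepted ID.
        return max(accepted, key=lambda p: p[0])[1]
    return my_value  # free to propose our own value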

So, the run proceeds as follows.


If a majority of acceptors accept a request with ID and value, consensus has been reached and is that value.
No accept requests with lower ID will be accepted by a majority.
No accept request with a higher ID and a different value will be accepted by a majority: at least one acceptor in any majority holds the accepted (ID, value) pair and piggybacks it on its promise, so the chosen value propagates.

What could go wrong?

  • One or more acceptors fail: Paxos still works as long as a majority are up.
  • The proposer fails in the prepare phase: effectively a no-op; another proposer can make progress.
  • The proposer fails in the accept phase: another proposer finishes the job.

Basic Paxos guarantees
  • Safety: Only a single value may be chosen
  • Liveness: As long as majority of servers are up and communicating with reasonable timeliness, some proposed value is eventually chosen. If a value is chosen, servers eventually learn about it.

Friday, December 28, 2018

Distributed systems

Web-scale distributed systems have become an integral part of today's society, be it searching the internet, shopping on an e-commerce platform, or using social networks. These systems comprise tens of thousands of machines per data center, which work collaboratively in unison to give an uninterrupted and performant user experience.

At this scale, system failures are the norm. In addition to being sharded, the data in these systems is replicated across the globe for availability and efficiency. These systems handle petabytes or exabytes of data and billions of user requests each day, and one has to answer many hard and complex questions to keep them running:

  1. Partitioning: How do you partition the data (Row ranges or user-specified or consistent hashing, etc..?)
  2. Availability: How do you handle failures and recover from failures?
  3. Replication and Consistency: If the data is replicated, how do you keep the instances consistent? Do you guarantee synchronized consistency or be eventually consistent?
  4. Load balancing: How do you balance the load across these systems? How do you route traffic?
  5. Membership: As machines join and leave the system, how do you keep track of this?
  6. Monitoring: How do you keep tabs on this entire ecosystem? How do you weed out unhealthy machines? How and when do you update them?

In this post, I will briefly summarize some well-known distributed systems and how they seek to answer the questions posed above. Later in this blog, I will describe ideas proposed in the literature on specific problems like caching and routing.


  • Bigtable
    This is a scalable, distributed key-value store used by Google as a data store for many applications. Bigtable formed the basis of other Google systems like Megastore and Spanner.
  • Dynamo
    This is also a distributed hash table, built by Amazon for storing small objects. A peculiar thing about this system is its primary-key interface. Dynamo is a symmetric, decentralized, eventually consistent, zero-hop distributed hash table.


Coming up next, Google File System (GFS), EarlyBird index and Kafka.

Bigtable: A Distributed Storage System for Structured Data. (paper link)

  • Bigtable is a sparse, distributed, persistent, multi-dimensional, sorted map.
  • The map is indexed by a row key, column key and a time stamp; each value in the map is an uninterpreted array of bytes.
  • Designed to scale to petabytes, used extensively within Google.
  • Can be used for anything from throughput-oriented, batch-processing jobs to latency-sensitive serving of data.
  • Supports atomic read-modify-write on a row, but transactions do not span multiple rows.
  • Stronger consistency than Dynamo: GFS ensures that the logs are flushed to all the replicas for a successful write.
  • Data is row-partitioned, and the on-disk files are append-only and immutable. The only mutable data structure is the memtable, an in-memory structure. The data is reconciled during compaction runs.

As an example, in a Webtable one would use URLs as row keys and various aspects of the pages, like anchors and language, as column names.

Data Model:
Rows:
  • The row keys are arbitrary strings up to 64 KB in size. The table is stored in sorted order on the row key.
Columns:
  1. Within each row, column keys are grouped into sets called column families
    1. Basic unit of access control and data access
  2. For instance, in the Webtable shown above, there can be multiple columns within the "anchor" family
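To make the data model concrete, here is a toy sketch of the Webtable example as a nested Python map (the values are illustrative):

# (row key) -> (column family:qualifier) -> (timestamp) -> uninterpreted bytes
webtable = {
    "com.cnn.www": {
        "contents:":         {6: b"<html>...", 5: b"<html>..."},  # newest first
        "anchor:cnnsi.com":  {9: b"CNN"},
        "anchor:my.look.ca": {8: b"CNN.com"},
    },
}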

Partitioning:
  1. Table is partitioned by row ranges dynamically. Each row range is called a tablet.
  2. A machine (a tablet server) serves tablets of around 100 to 200 MB each.
    • Principle of more partitions than machines
    • Easier to load-balance and also facilitates faster recovery
  3. Every tablet represents a single row range sorted on the row key; this provides good locality for data access.
For example, pages in the same domain are grouped together into contiguous rows by reversing the host-name components of the URLs, like:
  com.a.b/index.html
  com.a.b/sitemap.html
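A tiny sketch of how such a row key could be built (row_key is a hypothetical helper):

def row_key(url):
    # Reverse the host-name components so pages in one domain cluster together.
    host, _, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + "/" + path

row_key("b.a.com/index.html")  # -> "com.a.b/index.html"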

Timestamps
  1. Each cell in the Bigtable can contain multiple versions of the same data (64-bit timestamps).
  2. The number of versions is configurable; versions are stored in descending order of time, latest first.

Storage

Bigtable uses the Google File System (GFS) for storage. It depends on the underlying cluster-management system for dealing with machine failures, replication, scheduling jobs, etc.


Storage Format

The underlying Bigtable data is stored in the SSTable file format (SSTable stands for Sorted String Table). It is a persistent, ordered, immutable map from keys to values.

An SSTable contains a sequence of blocks. A block index, stored at the end of the SSTable, is used to locate blocks; the index is loaded into memory when the SSTable is opened. Optionally, an SSTable can be completely mapped into memory.
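A minimal sketch of how a block index can serve point lookups (a toy in-memory stand-in, not the real file format):

import bisect

class SSTable:
    def __init__(self, blocks):
        # blocks: list of sorted (key, value) lists; the index holds each block's first key
        self.blocks = blocks
        self.index = [block[0][0] for block in blocks]  # loaded into memory on open

    def get(self, key):
        # One in-memory index probe locates the candidate block;
        # one block scan then finds the key (or reports its absence).
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None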


Implementation:

Chubby: Bigtable uses Chubby (a highly available, persistent, distributed lock service) for:
  1. Electing one active master at a time
  2. Bootstrapping the location of Bigtable data
  3. Discovering tablet servers and detecting tablet-server deaths
  4. Storing schema information

If Chubby goes down, Bigtable becomes unavailable.

One master and many tablet servers
  • Tablet servers can be dynamically added or removed to accommodate cluster changes.
  • The master is responsible for assigning tablets to tablet servers, detecting the addition/expiration of tablet servers, balancing tablet-server load, garbage collection of files in GFS, and handling schema changes.
  • A tablet server handles read and write requests for the tablets it has loaded. The master is lightly loaded in practice.

Tablet Assignments

Bigtable uses Chubby extensively to keep track of servers. The master periodically asks each tablet server for the status of its lock. If a tablet server reports that its lock is lost, or there is no response after some attempts, the master tries to acquire an exclusive lock on that server's file. If the master is able to acquire it, then the tablet server is either dead or disconnected; the master then deletes the server file, so that this tablet server never serves again, and moves the tablets it served back to the set of unassigned tablets.


Tablet Serving

The persistent state of a tablet is stored in GFS. Updates are committed to a commit log that stores the redo records. Of these updates, the recently committed ones are held in memory in a sorted buffer called the memtable; the older updates are stored on disk in a sequence of SSTables.


Compaction

As write operations execute, the size of the memtable increases. When it reaches some threshold, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS.
A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction; the resulting SSTable contains no deleted entries.
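A toy sketch of a major compaction, treating the memtable and SSTables as plain Python dicts (TOMBSTONE is an invented deletion marker):

TOMBSTONE = object()  # stand-in marker written when a key is deleted

def major_compaction(memtable, sstables):
    # Rewrite the memtable plus all existing SSTables into exactly one SSTable.
    merged = {}
    for table in sstables:   # apply oldest first, so newer entries overwrite older ones
        merged.update(table)
    merged.update(memtable)  # the memtable holds the most recent updates
    # A major compaction keeps no deleted entries: tombstones are dropped here.
    return sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)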


Refinements
  1. Locality groups: groups of columns, each with its own SSTable.
    • Columns that are mostly not read together can be segregated.
    • For faster reads, a locality group can be declared in-memory.
  2. Compression
    • Clients can choose the compression scheme applied to SSTable blocks.
    • Encodes at 100–200 MB/s and decodes at 400–1000 MB/s.
  3. Read performance
    • Scan cache: caches key-value pairs.
    • Block cache: caches SSTable blocks read from GFS.
  4. Bloom filters
    • To check if an SSTable might contain any data for a row/column pair (a toy sketch follows this list).
  5. A single commit log per tablet server, rather than one log per tablet.
  6. The memtable is the only mutable data structure.
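Here is a toy Bloom filter sketch showing how a read can skip an SSTable (the parameters m and k are illustrative):

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means the SSTable definitely holds no data for this key,
        # so the read can skip the disk access entirely.
        return all(self.bits >> p & 1 for p in self._positions(key))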

Performance
  • Random and sequential writes perform better than random reads, since writes are append-only.
  • Sequential reads perform better than random reads.

Tuesday, December 25, 2018

Sampling methods

Inverse transform sampling


Inverse transform sampling, or inversion sampling, is a method to sample numbers at random from any probability distribution, given its cumulative distribution function (CDF).
Let's understand this with an example: you are given a random variable X which is exponentially distributed. More formally,
\begin{align} f(x;\lambda) &= \lambda e^{-\lambda x} && x \ge 0 \\ &= 0 && x < 0 \end{align} And your goal is to sample from X.
Given this PDF, we can easily come up with a CDF (probability that X will take a value less than or equal to x).
Because it is a probability density, it must integrate to $1$ over all possible values of $x$. The CDF gives us the area under the PDF curve at and to the left of a given $x$ value.
What if we invert this mapping from $x$ to $CDF(x)$? It is very easy to sample a point uniformly in $(0, 1)$, and if we have the inverse CDF, we can map that point to a sample of X. For the above example, we have

\begin{align} CDF(x) &= { 1-e^{-\lambda x}}\\ x &= \frac{-1}{\lambda}\log(1 - CDF(x))\\ \end{align}

We can now use this inverse CDF function for sampling points in X as shown in the code below.
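A minimal NumPy sketch (the rate $\lambda$ and the sample size are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
lam = 1.5                                  # rate parameter lambda

u = rng.uniform(0.0, 1.0, size=100000)     # step 1: uniform samples in (0, 1)
x = -np.log(1.0 - u) / lam                 # step 2: apply the inverse CDF

# Sanity check: the empirical mean should approach 1/lambda.
print(x.mean(), 1.0 / lam)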
  

Consider another problem: sample points uniformly from a unit circle.
This problem is tricky: a naïve approach gives a non-uniform distribution of points within the circle. Let's start with a simple solution and check if it works.
  1. Sample two points uniformly at random: $p_1, p_2 \sim U(0, 1)$
  2. Set $r = p_1$ and $\theta = 2\pi p_2$
  3. Then $(x, y) = (r\cos\theta, r\sin\theta)$
Here is a short Python/NumPy sketch which performs this.
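(The sample size is illustrative.)

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 5000
r = rng.uniform(0.0, 1.0, n)               # radius sampled uniformly -- the flaw
theta = rng.uniform(0.0, 2.0 * np.pi, n)   # angle sampled uniformly

x, y = r * np.cos(theta), r * np.sin(theta)
plt.scatter(x, y, s=1)
plt.gca().set_aspect("equal")
plt.show()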
  
As is evident from the plot, there are more points towards the center. Why is this the case?
This is because as we move further from the center, we need more points to maintain the same density. In other words, points closer to the center have smaller distances between them.
How do we rectify this?

We need to sample points in proportion to area. Let's start with $CDF(x) = \frac{\pi x^2}{\pi \cdot 1^2}$: given a radius $x$, this is the ratio of the area of the circle of radius $x$ to the area of the unit circle.
Therefore, $CDF(x) = x^2$.
In other words, $x = \sqrt{CDF(x)}$.
With this correction, our steps become:
  1. Sample two points uniformly at random: $p_1, p_2 \sim U(0, 1)$
  2. Set $r = \sqrt{p_1}$ and $\theta = 2\pi p_2$
  3. Then $(x, y) = (r\cos\theta, r\sin\theta)$
The code, with this small change, now looks like:
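(Only the radius line changes in the sketch.)

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 5000
r = np.sqrt(rng.uniform(0.0, 1.0, n))      # r = sqrt(p): the inverse-CDF correction
theta = rng.uniform(0.0, 2.0 * np.pi, n)

x, y = r * np.cos(theta), r * np.sin(theta)
plt.scatter(x, y, s=1)
plt.gca().set_aspect("equal")
plt.show()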
  
And now the plot shows a uniform sample of points. Below, we will solve this very problem using an entirely different technique called rejection sampling.

Essentially, inverse transform sampling consists of following steps:
  1. Given a CDF of random variable X, compute $CDF^{-1}(X)$
  2. Sample a point in $[0, 1]$
  3. Using $CDF^{-1}(X)$, compute $x_i$ and include that in the sample.
Pros
Every point is accepted into the sample, unlike in rejection sampling, discussed below.

Cons
The method assumes the existence of an inverse CDF function.

Rejection Sampling


As we saw above, inverse transform sampling requires a CDF. What if we do not have a CDF, or the underlying distribution is very complex? Let's start with the same example as above.

Problem: Sample points uniformly from a unit circle.
  1. Imagine a square encompassing the unit circle.
  2. Pick a point P in this square uniformly at random. This is easy: pick $x \sim U(-1, 1)$ and, independently, $y \sim U(-1, 1)$.
  3. Now check whether P lies within the circle or outside it. This is not hard: simply check if $x_i^2 + y_i^2 \le 1$.
  4. If the point lies outside the circle, reject it. Else, include it in our sample.
The code below shows this in action.
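(Again a NumPy sketch; roughly $\pi/4$ of the proposals survive.)

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(-1.0, 1.0, n)              # propose points in the bounding square
y = rng.uniform(-1.0, 1.0, n)

inside = x**2 + y**2 <= 1.0                # keep only points within the circle
plt.scatter(x[inside], y[inside], s=1)
plt.gca().set_aspect("equal")
plt.show()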
  

Rejection sampling can be described as follows:
  1. Say you have a complex distribution $f$ which is difficult to sample from.
  2. Propose a distribution $q$ that is easy to sample from, scaled by a constant $M$ so that $M q(x) \ge f(x)$ everywhere, i.e. the envelope encompasses $f$ completely.
  3. Sample a candidate $x$ from $q$, and a height $u$ uniformly at random from $(0, M q(x))$.
  4. Retain the point if it falls under the curve of $f$ (that is, if $u \le f(x)$); otherwise reject it.
Cons of rejection sampling:
  1. It might be difficult to come up with a good proposal distribution.
  2. Unlike inversion sampling, lots of points get rejected, so it might take a long time to gather enough samples.

I will be updating this blog with other sampling methods like importance sampling and MCMC in a short while. Stay tuned!

Friday, December 21, 2018

Reservoir Sampling


Goal : Choose a sample of $k$ items from a list $S$ containing $n$ items, where $n$ is either unknown or very large. For instance, a stream of numbers coming in.

Let's start with the procedure and then prove why it is correct.
  1. Keep the first k items in an array (we call this a reservoir)
  2. When the $i^{th}$ item arrives, $i > k$
    1. With the probability $k/i$, keep the new item, discarding an old one selected uniformly at random from the reservoir.
    2. With probability $1-k/i$, reject this new item.

Now, let's take an example and work out the math. Let's take $k = 10$.
  1. For the first $10$ items, add them to the reservoir with probability $1$
  2. When the $11^{th}$ item comes, it is
    • added to the reservoir with probability $10/11$
    • Now, what is the probability that an item $I$, already present in the reservoir, stays when the $11^{th}$ item comes?
$p_{required} = $ (the 11th element is chosen and it replaces an element other than $I$ ) or (the 11th element is rejected) \begin{align} &= 10/11 * 9/10 +(1-10/11) \\ &= 9/11+1/11 \\ &= 10/11 \end{align}





So, all the elements in the reservoir stay with probability $10/11$. Let's work out what happens when the $12^{th}$ element comes.
  1. It is added to reservoir with probability $10/12$
  2. What is the probability that an element $I$ (which was present in the reservoir) will stay?
    • $= 10/11 * ( 10/12 * 9/10 + (1-10/12) )$

    • $= 10/11 ( 9/12 + 2/12) = 10/11 * 11/12 = 10/12 $

The $10/11$ factor in the above equation comes from the previous round. Thus, all elements in the reservoir have a probability of $10/12$.

How do we code this up?
Input: elements S[1..n]; we want to sample k elements where $k \ll n$. A Python version of the procedure:

import random

def reservoir_sample(S, k):
    R = S[:k]                         # steps 1-2: the first k items fill the reservoir
    for i in range(k + 1, len(S) + 1):
        j = random.randint(1, i)      # step 3: uniform in [1, i], inclusive
        if j <= k:
            R[j - 1] = S[i - 1]       # the new item replaces a random slot
    return R

Let's see how this code is equivalent to what we computed earlier, in an inductive way.
  1. Let's assume that after round $i - 1$, each element in the reservoir stays with probability $k/(i-1)$
  2. For round i, we have
    1. New element stays with probability $k/i$
    2. An old element stays with probability:
      $= \frac{k}{i - 1} (\frac{k}{i} * \frac{k - 1}{k} + (1 - \frac{k}{i}))$
      $= \frac{k}{i-1}\frac{i-1}{i} $
      $= \frac{k}{i} $ as required

Reservoir sampling with random sort
  1. Assign a random number as a key to each item.
  2. Sort the items by key and maintain the k items with the minimum key values.
This can be easily implemented using a heap, as sketched below.
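A Python sketch of this variant (keys are almost surely distinct, so ties are ignored):

import heapq, random

def reservoir_by_random_sort(stream, k):
    heap = []  # max-heap via negated keys: holds the k smallest keys seen so far
    for item in stream:
        key = random.random()              # random sort key for this item
        if len(heap) < k:
            heapq.heappush(heap, (-key, item))
        elif key < -heap[0][0]:            # smaller than the largest retained key
            heapq.heapreplace(heap, (-key, item))
    return [item for _, item in heap]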

Distributed reservoir sampling
  • What if we have huge amounts of data? It is desirable to distribute the sampling task among many machines in parallel.
  • A simple approach: do a distributed sort with a random key, as above, and pick the top k elements.
But this is too costly.

Instead, we

  1. Distribute data among m machines
  2. Each machine $c$ does its own sampling using a random key and produces a sample of at most $k$ items
  3. Collect all the samples; $n' \le mk$ items in total
  4. Sample $k$ items from these $n'$ items, using the same random keys

If we want to associate a weight $w_i$ with each item, then the random key can be scaled as $r^{1/w_i}$: the higher the weight, the bigger the key.
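A sketch of this weighted variant (this key scheme is due to Efraimidis and Spirakis; here we keep the k largest keys):

import heapq, random

def weighted_reservoir(stream, k):
    # stream yields (item, weight) pairs; the key r**(1/w) grows with the weight
    heap = []  # min-heap on the key: the k largest keys survive
    for item, w in stream:
        key = random.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]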

Sunday, December 16, 2018

Overview of Forward and Backward Propagation in Convolutional Neural Networks

In this post, I will derive the backpropagation equations of a CNN and explain them with some code snippets. The recipe is very similar to deriving the backprop equations for a simple feed-forward network, which I covered in this post. If you have not read that earlier post, I would highly recommend reading it first.

What is convolution and why is this needed?
Imagine you want to train a DNN on an image, say of size 100 × 100, and you want to connect it to a fully-connected layer with 100 neurons. The weight matrix we would need to learn is of size (100 × 10000), or a million parameters. Larger images and more complex networks would make the parameter space larger still. Instead of going down this path, we try to learn features of smaller dimensions from the image and use these to build our network. This is where convolution comes in.

A convolution operation is depicted below. We refer to W matrix below as a filter.


where
$z_{11} = W_{11}X_{11} + W_{12}X_{12} + W_{21}X_{21}+W_{22}X_{22}$
$z_{12} = W_{11}X_{12} + W_{12}X_{13} + W_{21}X_{22}+W_{22}X_{23}$
$z_{21} = W_{11}X_{21} + W_{12}X_{22} + W_{21}X_{31}+W_{22}X_{32}$
$z_{22} = W_{11}X_{22} + W_{12}X_{23} + W_{21}X_{32}+W_{22}X_{33}$

As a concrete example, let's pass an image through the two filters given below:
  • The first filter converts the image to grayscale:
    • $w[0, 0, :, :] = [[0, 0, 0], [0, 0.3, 0], [0, 0, 0]]$
    • $w[0, 1, :, :] = [[0, 0, 0], [0, 0.6, 0], [0, 0, 0]]$
    • $w[0, 2, :, :] = [[0, 0, 0], [0, 0.1, 0], [0, 0, 0]]$
  • The second filter detects horizontal edges in the blue channel:
    • $w[1, 2, :, :] = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]$


This is fascinating. You can imagine coming up with filters to detect eyes, faces, or edges. In convolutional neural networks, we try to learn these filters.

Max-pool layers

A CNN network usually has a conv layer as shown above coupled with a max-pool (or an average pool) layer. Max pooling is done by applying a max filter to (usually) non-overlapping sub-regions of the initial representation.


The intuition behind this operation is that it reduces the dimensionality even further, thereby reducing the number of parameters to learn, while retaining the dominant feature in each region. Since the parameter space shrinks, it also reduces over-fitting.

In this post, I am not going into all the details of a CNN. Please go through the references to learn more. Instead, I am going to explain code snippets for both forward propagation and backward propagation, as well as explain the equations. The code snippets are taken from the Deep Learning specialization in Coursera.

Let's start with a couple of helper functions:
  • a) zero_pad$(X, pad)$ takes in a dataset $X$ and pads it with zeros. In this function, it's important to take into account which dimensions you are padding: we want to pad an image along its height and width, so check which dimensions of the input capture these and pad accordingly.
  • b) conv_single_step$(a\_slice\_prev, W, b)$ takes a single slice of the previous activation and a single filter, and performs the convolution operation shown in the figure above. Additionally, since each filter has a bias term associated with it, it adds that as well.
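Here is a sketch of what these two helpers can look like, assuming the (m, n_H, n_W, n_C) data layout used in the assignment:

import numpy as np

def zero_pad(X, pad):
    # X has shape (m, n_H, n_W, n_C); pad only the height and width dimensions
    return np.pad(X, ((0, 0), (pad, pad), (pad, pad), (0, 0)),
                  mode="constant", constant_values=0)

def conv_single_step(a_slice_prev, W, b):
    # Element-wise product of one (f, f, n_C_prev) slice with the filter, summed, plus bias
    return np.sum(a_slice_prev * W) + float(b)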


Forward propagation
Now that we have the helper functions ready, let's do a forward pass through a convolutional layer. Aside from the many indices, the naïve implementation is not complex: the code below takes different slices of the input and applies the convolution operation.
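A sketch of the naïve forward pass, using the helpers above (same shape conventions):

def conv_forward(A_prev, W, b, stride, pad):
    # A_prev: (m, n_H_prev, n_W_prev, n_C_prev); W: (f, f, n_C_prev, n_C); b: (1, 1, 1, n_C)
    m, n_H_prev, n_W_prev, n_C_prev = A_prev.shape
    f, _, _, n_C = W.shape
    n_H = (n_H_prev - f + 2 * pad) // stride + 1
    n_W = (n_W_prev - f + 2 * pad) // stride + 1
    Z = np.zeros((m, n_H, n_W, n_C))
    A_prev_pad = zero_pad(A_prev, pad)
    for i in range(m):                  # loop over the batch
        for h in range(n_H):            # loop over output height
            for w in range(n_W):        # loop over output width
                for c in range(n_C):    # loop over output channels (filters)
                    vs, hs = h * stride, w * stride
                    a_slice = A_prev_pad[i, vs:vs + f, hs:hs + f, :]
                    Z[i, h, w, c] = conv_single_step(a_slice, W[:, :, :, c], b[:, :, :, c])
    return Z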

Backward propagation (Conv layer)
Let's delve into the backprop equations for a convolution layer. We will start with the equations mentioned above: \begin{align} z_{11} = W_{11}X_{11} + W_{12}X_{12} + W_{21}X_{21} + W_{22}X_{22} + b^{[1]} \\ z_{12} = W_{11}X_{12} + W_{12}X_{13} + W_{21}X_{22} + W_{22}X_{23} + b^{[1]} \\ z_{21} = W_{11}X_{21} + W_{12}X_{22} + W_{21}X_{31} + W_{22}X_{32} + b^{[1]} \\ z_{22} = W_{11}X_{22} + W_{12}X_{23} + W_{21}X_{32} + W_{22}X_{33} + b^{[1]} \end{align} Let's try to compute $\frac{\partial{L}}{\partial{W_{11}}}$. If we note the above equations, $W_{11}$ affects all the $z$ values. Using chain rule, we can express \begin{align} \frac{\partial{L}}{\partial{W_{11}}} = \frac{\partial{L}}{\partial{z_{11}}} * \frac{\partial{z_{11}}}{\partial{W_{11}}} + \frac{\partial{L}}{\partial{z_{12}}} * \frac{\partial{z_{12}}}{\partial{W_{11}}} + \frac{\partial{L}}{\partial{z_{21}}} * \frac{\partial{z_{21}}}{\partial{W_{11}}} + \frac{\partial{L}}{\partial{z_{22}}} * \frac{\partial{z_{22}}}{\partial{W_{11}}} \end{align} This will evaluate to \begin{align} \frac{\partial{L}}{\partial{W_{11}}} = \frac{\partial{L}}{\partial{z_{11}}} * X_{11} + \frac{\partial{L}}{\partial{z_{12}}} * X_{12} + \frac{\partial{L}}{\partial{z_{21}}} * X_{21} + \frac{\partial{L}}{\partial{z_{22}}} * X_{22} \end{align} This is nothing but a convolution operation as depicted below:


In general, we multiply the gradients $Z$ with the corresponding input slice (earlier activation). And since the filter is shared across inputs, we just add up the gradients. Generalizing this equation, we can express this as
\begin{align} \partial{W_C} += \sum_{h = 0}^{n_H}\sum_{w = 0}^{n_W}a_{slice} * \partial{Z_{hw}} \end{align}
$\partial{W_C}$ represents the derivative of one filter with respect to the loss.

Let's proceed and compute $\frac{\partial{L}}{\partial{b^{[1]}}}$. \begin{align} \frac{\partial{L}}{\partial{b^{[1]}}} = \frac{\partial{L}}{\partial{z_{11}}} * \frac{\partial{z_{11}}}{\partial{b^{[1]}}} + \frac{\partial{L}}{\partial{z_{12}}} * \frac{\partial{z_{12}}}{\partial{b^{[1]}}} + \frac{\partial{L}}{\partial{z_{21}}} * \frac{\partial{z_{21}}}{\partial{b^{[1]}}} + \frac{\partial{L}}{\partial{z_{22}}} * \frac{\partial{z_{22}}}{\partial{b^{[1]}}} \end{align} This is nothing but \begin{align} \frac{\partial{L}}{\partial{b^{[1]}}} = \frac{\partial{L}}{\partial{z_{11}}} + \frac{\partial{L}}{\partial{z_{12}}} + \frac{\partial{L}}{\partial{z_{21}}} + \frac{\partial{L}}{\partial{z_{22}}} \end{align}
\begin{align} \frac{\partial{L}}{\partial{b}} = \sum_h\sum_w\partial{Z_{hw}} \end{align}
Let's proceed to our last derivation: differentiating with respect to the input of the conv layer, $\frac{\partial{L}}{\partial{A}}$ (here the input $A$ is the $X$ in the equations above). Let's look at one such expression: \begin{align} \frac{\partial{L}}{\partial{X_{11}}} = \frac{\partial{L}}{\partial{z_{11}}} * \frac{\partial{z_{11}}}{\partial{X_{11}}} + \frac{\partial{L}}{\partial{z_{12}}} * \frac{\partial{z_{12}}}{\partial{X_{11}}} + \frac{\partial{L}}{\partial{z_{21}}} * \frac{\partial{z_{21}}}{\partial{X_{11}}} + \frac{\partial{L}}{\partial{z_{22}}} * \frac{\partial{z_{22}}}{\partial{X_{11}}} \end{align} This evaluates to: \begin{align} \frac{\partial{L}}{\partial{X_{11}}} = \frac{\partial{L}}{\partial{z_{11}}} * W_{11} + \frac{\partial{L}}{\partial{z_{12}}} * 0 + \frac{\partial{L}}{\partial{z_{21}}} * 0 + \frac{\partial{L}}{\partial{z_{22}}} * 0 \end{align} As we can see, this is some sort of strange convolution (it is in fact called a full convolution): only select terms of the filter take part in each derivative. We can either compute it this way, or realize that it is equivalent to a simple convolution over a zero-padded gradient matrix, which is what is done in the code below.
In general,
\begin{align} \partial{A} += \sum_{h = 0}^{n_H}\sum_{w = 0}^{n_W}W_c * \partial{Z_{hw}} \end{align}
Where $W_c$ is a filter and $\partial{Z_{hw}}$ is a scalar corresponding to the gradient of the cost with respect to the output of the conv layer Z at the $h^{th}$ row and $w^{th}$ column. Let's look at these equations used in the code below.
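A sketch of the backward pass implementing the three equations above (same shape conventions as before):

def conv_backward(dZ, A_prev, W, stride, pad):
    # dZ: (m, n_H, n_W, n_C), gradient of the cost w.r.t. the conv output Z
    m, n_H, n_W, n_C = dZ.shape
    f = W.shape[0]
    A_prev_pad = zero_pad(A_prev, pad)
    dA_prev_pad = np.zeros_like(A_prev_pad)
    dW = np.zeros_like(W)
    db = np.zeros((1, 1, 1, n_C))
    for i in range(m):
        for h in range(n_H):
            for w in range(n_W):
                for c in range(n_C):
                    vs, hs = h * stride, w * stride
                    a_slice = A_prev_pad[i, vs:vs + f, hs:hs + f, :]
                    # dA += W * dZ ; dW += a_slice * dZ ; db += dZ
                    dA_prev_pad[i, vs:vs + f, hs:hs + f, :] += W[:, :, :, c] * dZ[i, h, w, c]
                    dW[:, :, :, c] += a_slice * dZ[i, h, w, c]
                    db[:, :, :, c] += dZ[i, h, w, c]
    # Strip the padding to recover the gradient w.r.t. the original input
    dA_prev = dA_prev_pad[:, pad:-pad, pad:-pad, :] if pad > 0 else dA_prev_pad
    return dA_prev, dW, db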


Backward propagation (max-pool layer)
Since there is no parameter to learn in this layer, there is no explicit gradient value to compute. Still, we have to pass the gradient from this layer back to the earlier layer. If you look at this operator, it takes the max of all the input values in a slice; in effect, only that value (the maximum in the slice) affects the output, so only it should receive gradient. So, during the forward pass we create a mask where 1 denotes the max value and 0 denotes the other values, and during the backward pass we multiply the incoming gradient by this mask to route it to the corresponding input, as sketched below.
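A sketch of the mask trick (create_mask_from_window follows the assignment's helper naming; the window values are illustrative):

def create_mask_from_window(x):
    # 1 at the position of the max, 0 elsewhere; this routes the incoming gradient
    return (x == np.max(x)).astype(float)

window = np.array([[1.0, 3.0],
                   [2.0, 0.5]])
print(create_mask_from_window(window))  # only the entry holding 3.0 is 1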

Please post your comments below, I will be happy to answer them.

References:
https://www.coursera.org/learn/convolutional-neural-networks
http://cs231n.github.io/assignments2018/assignment2/
https://becominghuman.ai/back-propagation-in-convolutional-neural-networks-intuition-and-code-714ef1c38199
https://medium.com/@2017csm1006/forward-and-backpropagation-in-convolutional-neural-network-4dfa96d7b37e

Thursday, December 13, 2018

DNN code from scratch

In this post, I will use the equations derived in my earlier post and build a simple neural network from scratch. I will also give the corresponding TensorFlow implementation, just to give a feel for how much easier it is to implement with TensorFlow. The code snippets are taken from the references.

Setup

In [381]:
import tensorflow as tf
In [382]:
from tensorflow.examples.tutorials.mnist import input_data
In [383]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import sklearn.linear_model
import pandas as pd
from sklearn import preprocessing

%matplotlib inline

np.random.seed(1) # set a seed so that the results are consistent

Data

We use a binary-valued occupancy dataset for this exercise. You can obtain it from https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+ and we will use pandas to load it into a dataframe.

In [384]:
df_train =  pd.read_csv('occupancy_data\\datatraining.txt')
In [385]:
df_train = df_train.drop(['date'], axis = 1)
df_train.head()
Out[385]:
Temperature Humidity Light CO2 HumidityRatio Occupancy
1 23.18 27.2720 426.0 721.25 0.004793 1
2 23.15 27.2675 429.5 714.00 0.004783 1
3 23.15 27.2450 426.0 713.50 0.004779 1
4 23.15 27.2000 426.0 708.25 0.004772 1
5 23.10 27.2000 426.0 704.50 0.004757 1
In [386]:
nTrain = df_train.shape[0]
In [387]:
Ytrain = np.transpose(df_train['Occupancy'].values)
Ytrain = Ytrain.reshape(1, nTrain)
In [388]:
df_train = df_train.drop(['Occupancy'], axis = 1)
df_train.head()
Out[388]:
Temperature Humidity Light CO2 HumidityRatio
1 23.18 27.2720 426.0 721.25 0.004793
2 23.15 27.2675 429.5 714.00 0.004783
3 23.15 27.2450 426.0 713.50 0.004779
4 23.15 27.2000 426.0 708.25 0.004772
5 23.10 27.2000 426.0 704.50 0.004757
In [389]:
Xtrain = np.transpose(df_train.values)
In [390]:
Xtrain = preprocessing.normalize(Xtrain, axis = 1)
In [391]:
print(Xtrain.shape) ## each column now represents a data instance
print(Ytrain.shape)
(5, 8143)
(1, 8143)
In [392]:
df_test =  pd.read_csv('occupancy_data\\datatest.txt')
In [393]:
df_test = df_test.drop(['date'], axis = 1)
In [394]:
nTest = df_test.shape[0]
In [395]:
Ytest = np.transpose(df_test['Occupancy'].values)
Ytest = Ytest.reshape(1, nTest)
In [396]:
df_test = df_test.drop(['Occupancy'], axis = 1)
df_test.head()
Out[396]:
Temperature Humidity Light CO2 HumidityRatio
140 23.7000 26.272 585.200000 749.200000 0.004764
141 23.7180 26.290 578.400000 760.400000 0.004773
142 23.7300 26.230 572.666667 769.666667 0.004765
143 23.7225 26.125 493.750000 774.750000 0.004744
144 23.7540 26.200 488.600000 779.000000 0.004767
In [397]:
Xtest = np.transpose(df_test.values)
In [398]:
Xtest = preprocessing.normalize(Xtest, axis = 1)
In [399]:
print(Xtest.shape) ## each column now represents a data instance
print(Ytest.shape)
(5, 2665)
(1, 2665)

DNN code

In [400]:
def layer_sizes(X, Y, numberHiddenLayers):
    """
    Arguments:
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)
    
    Returns:
    n_x -- the size of the input layer
    n_h -- the size of the hidden layer
    n_y -- the size of the output layer
    """
    
    n_x = X.shape[0] # size of input layer
    n_h = numberHiddenLayers
    n_y = Y.shape[0] # size of output layer
    
    return (n_x, n_h, n_y)
In [401]:
def initialize_parameters(n_x, n_h, n_y):
    """
    Argument:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer
    
    Returns:
    params -- python dictionary containing your parameters:
                    W1 -- weight matrix of shape (n_h, n_x)
                    b1 -- bias vector of shape (n_h, 1)
                    W2 -- weight matrix of shape (n_y, n_h)
                    b2 -- bias vector of shape (n_y, 1)
    """
    
    np.random.seed(2) 

    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.rand(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
        
    assert (W1.shape == (n_h, n_x))
    assert (b1.shape == (n_h, 1))
    assert (W2.shape == (n_y, n_h))
    assert (b2.shape == (n_y, 1))
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters
In [402]:
def sigmoid(x):
    return 1/(1 + np.exp(-x))
In [403]:
def forward_propagation(X, parameters):
    """
    Argument:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing your parameters (output of initialization function)
    
    Returns:
    A2 -- The sigmoid output of the second activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    """
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    
    # Implement Forward Propagation to calculate A2 (probabilities)
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    
    assert(A2.shape == (1, X.shape[1]))
    
    cache = {"Z1": Z1,
             "A1": A1,
             "Z2": Z2,
             "A2": A2}
    
    return A2, cache
In [404]:
def compute_cost(A2, Y, parameters):
    """
    Computes the cross-entropy cost given in equation (13)
    
    Arguments:
    A2 -- The sigmoid output of the second activation, of shape (1, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    parameters -- python dictionary containing your parameters W1, b1, W2 and b2
    
    Returns:
    cost -- cross-entropy cost given equation (13)
    """
    
    m = Y.shape[1] # number of example
    
    # Retrieve W1 and W2 from parameters
    W1 = parameters['W1']
    W2 = parameters['W2']
    
    # Compute the cross-entropy cost
    logprobs = np.multiply(np.log(A2), Y) + np.multiply((1 - Y), np.log(1 - A2))
    cost = - np.sum(logprobs) / m
    
    cost = np.squeeze(cost)     # makes sure cost is the dimension we expect. 
                                # E.g., turns [[17]] into 17 
    assert(isinstance(cost, float))
    
    return cost
In [405]:
def backward_propagation(parameters, cache, X, Y):
    """
    Implement the backward propagation using the instructions above.
    
    Arguments:
    parameters -- python dictionary containing our parameters 
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
    X -- input data of shape (n_x, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    
    Returns:
    grads -- python dictionary containing your gradients with respect to different parameters
    """
    m = X.shape[1]
    
    # First, retrieve W1 and W2 from the dictionary "parameters".
    W1 = parameters['W1']
    W2 = parameters['W2']
        
    # Retrieve also A1 and A2 from dictionary "cache".
    A1 = cache['A1']
    A2 = cache['A2']
    
    # Backward propagation: calculate dW1, db1, dW2, db2. 
    dZ2= A2 - Y
    dW2 = (1 / m) * np.dot(dZ2, A1.T)
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.multiply(np.dot(W2.T, dZ2), 1 - np.power(A1, 2))
    dW1 = (1 / m) * np.dot(dZ1, X.T)
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)
    
    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}
    
    return grads
In [406]:
def update_parameters(parameters, grads, learning_rate=1.2):
    """
    Updates parameters using the gradient descent update rule given above
    
    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients 
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
    """
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
   
    # Retrieve each gradient from the dictionary "grads"
    dW1 = grads['dW1']
    db1 = grads['db1']
    dW2 = grads['dW2']
    db2 = grads['db2']
    
    # Update rule for each parameter
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters
In [407]:
def nn_model(X, Y, n_h, num_iterations=10000, print_cost=False):
    """
    Arguments:
    X -- dataset of shape (n_x, number of examples)
    Y -- labels of shape (1, number of examples)
    n_h -- size of the hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations
    
    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    
    np.random.seed(3)
    n_x = layer_sizes(X, Y, n_h)[0]
    n_y = layer_sizes(X, Y, n_h)[2]
    
    # Initialize parameters, then retrieve W1, b1, W2, b2. Inputs: "n_x, n_h, n_y". Outputs = "W1, b1, W2, b2, parameters".
    parameters = initialize_parameters(n_x, n_h, n_y)
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
   
    # Loop (gradient descent)

    for i in range(0, num_iterations):
        # Forward propagation. Inputs: "X, parameters". Outputs: "A2, cache".
        A2, cache = forward_propagation(X, parameters)
        
        # Cost function. Inputs: "A2, Y, parameters". Outputs: "cost".
        cost = compute_cost(A2, Y, parameters)
 
        # Backpropagation. Inputs: "parameters, cache, X, Y". Outputs: "grads".
        grads = backward_propagation(parameters, cache, X, Y)
 
        # Gradient descent parameter update. Inputs: "parameters, grads". Outputs: "parameters".
        parameters = update_parameters(parameters, grads)
        
        # Print the cost every 1000 iterations
        if print_cost and i % 1000 == 0:
            print ("Cost after iteration %i: %f" % (i, cost))

    return parameters
In [408]:
parameters = nn_model(Xtrain, Ytrain, 5, num_iterations=10000, print_cost=True)
Cost after iteration 0: 0.693147
Cost after iteration 1000: 0.512973
Cost after iteration 2000: 0.420606
Cost after iteration 3000: 0.326401
Cost after iteration 4000: 0.264507
Cost after iteration 5000: 0.226376
Cost after iteration 6000: 0.065537
Cost after iteration 7000: 0.064785
Cost after iteration 8000: 0.064379
Cost after iteration 9000: 0.063994
In [648]:
def predict(parameters, X):
    """
    Using the learned parameters, predicts a class for each example in X
    
    Arguments:
    parameters -- python dictionary containing your parameters 
    X -- input data of size (n_x, m)
    
    Returns
    predictions -- vector of predictions of our model (0 / 1)
    """
    
    # Computes probabilities using forward propagation, and classifies to 0/1 using 0.5 as the threshold.
    A2, cache = forward_propagation(X, parameters)
    predictions = np.round(A2)
    
    return predictions
In [649]:
predictions = predict(parameters, Xtest)
In [650]:
predictions
Out[650]:
array([[ 1.,  1.,  1., ...,  1.,  1.,  1.]])
In [654]:
print ('Accuracy: %d' % float((np.dot(Ytest,predictions.T) + np.dot(1-Ytest,1-predictions.T))/float(Ytest.size)*100) + '%')
Accuracy: 97%
In [ ]:
 

Below is a TensorFlow implementation of the same network as above. Observe that most of the heavy lifting is done by TensorFlow and the amount of code shrinks significantly. Also note that we only define the layers and the forward-prop equations; TF takes care of backprop on its own.


Setup

In [134]:
import tensorflow as tf
In [135]:
from tensorflow.examples.tutorials.mnist import input_data
In [136]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import sklearn.linear_model
import pandas as pd
from sklearn import preprocessing

%matplotlib inline

np.random.seed(1) # set a seed so that the results are consistent

Network using Tensorflow

Define the variables and placeholders

In [153]:
tf.reset_default_graph()
In [176]:
X = tf.placeholder(tf.float32,shape=[5,None])

W1 = tf.Variable(tf.random_normal([5,5])) 

b1 = tf.Variable(tf.zeros([5, 1]))

W2 = tf.Variable(tf.random_normal([1,5]))

b2 = tf.Variable(tf.zeros([1, 1]))

y_true = tf.placeholder(tf.float32,[1,None])

Create the Graph

In [177]:
h = tf.matmul(W1, X) + b1 
h = tf.nn.tanh(h)
y = tf.matmul(W2, h) + b2
y = tf.sigmoid(y)

Define the loss and optimizer

In [192]:
loss = tf.reduce_mean(-y_true * tf.log(y) - (1 - y_true) * tf.log(1 - y))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5)

train = optimizer.minimize(loss)

Train the model

In [196]:
# Train the model

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    
    # Train the model for 10000 steps of full-batch gradient descent
    # on the training set (no mini-batching here)
    
    for step in range(10000):
        _,cost = sess.run([train, loss] , feed_dict={X:Xtrain, y_true:Ytrain})
        if (step % 1000 == 0):
            print("Cost after iteration %i: %f" % (step, cost))

    ypred = y.eval({X: Xtest})
    pred = np.round(ypred)
    correct_predictions = np.equal(pred, Ytest)
    print("\nAccuracy:", np.sum(correct_predictions)/Xtest.shape[1])
Cost after iteration 0: 0.690228
Cost after iteration 1000: 0.430140
Cost after iteration 2000: 0.450875
Cost after iteration 3000: 0.398449
Cost after iteration 4000: 0.301684
Cost after iteration 5000: 0.075488
Cost after iteration 6000: 0.068829
Cost after iteration 7000: 0.066808
Cost after iteration 8000: 0.065850
Cost after iteration 9000: 0.065270

Accuracy: 0.971857410882
In [ ]:
 

References
https://www.coursera.org/learn/neural-networks-deep-learning?specialization=deep-learning
https://github.com/cs231n/cs231n.github.io