
Wednesday, January 2, 2019

Paxos Basics

Paxos is a family of distributed algorithms used to reach consensus in distributed systems. In this post we will briefly discuss single-decree Paxos. The algorithm was first proposed by Leslie Lamport in his paper "The Part-Time Parliament" and later described more directly in "Paxos Made Simple". Paxos, in some form, shows up in almost every large distributed system.


Why do we need consensus in the first place?
Most of these distributed systems comprise tens of thousands of machines working in unison. Replication is used for redundancy as well as efficiency. But once you have replicas, you face the challenge of keeping them consistent, and failure is the norm in such environments. There are multiple scenarios where consensus plays an important role: keeping data consistent across replicas, leader election, or even resource sharing/locking.
Let's keep all of these aside and start with a simple example.

Say there are 4 friends. They want to do something together over the weekend. They don't care what they will do specifically, just that they want to do it together. Here is a timeline of the various events.

Once a majority agrees on a proposal, that is consensus.

In the Paxos world, the involved parties want to agree on a result, not on their own proposal. Communication channels may be faulty, leading to message loss.

Paxos basics
  • Proposers: propose a new value to reach consensus on
  • Acceptors: participate in reaching consensus
  • Learners: learn the chosen value and can be queried for it
  • Paxos nodes can take multiple roles, even all of them
  • Paxos nodes must know how many acceptors form a majority (a quorum)
  • Paxos nodes must be persistent: they can't forget what they have accepted
  • A Paxos run aims at reaching a single consensus. It can agree on only one value, and that value cannot change; if you want to mutate the value, a separate Paxos run is needed.
  • Basic Paxos (there are more complex variations) assumes fail-stop failures, not Byzantine failures.
With this introduction, let's get into a Paxos run.


Let's say we have one proposer and 5 acceptors. Majority in this case is 3.

The proposer wants to propose a value. It sends PREPARE(ID) to all the acceptors (or a majority of them). The ID must be unique across rounds. If the proposer doesn't hear back from the acceptors, it times out and retries with a new (higher) ID. This ID should not be confused with the actual value on which consensus must be reached; it can be seen as a round ID.



An acceptor gets a PREPARE message with ID 10. It checks whether it has already promised to ignore requests with this ID.
If yes, it ignores the message.
If no, it promises to ignore any request with an ID lower than 10, subject to some conditions we will see later. If a majority of acceptors make this promise, no request with ID < 10 will make it through.
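As a rough, illustrative sketch (not the author's or any library's actual code), an acceptor's prepare-phase logic could look like the following Python; the class and field names are invented for this example.

  class Acceptor:
      """Sketch of an acceptor's persistent state (field names are assumptions)."""

      def __init__(self):
          self.promised_id = None     # highest ID this acceptor has promised to honor
          self.accepted_id = None     # ID of the last accepted proposal, if any
          self.accepted_value = None  # value of the last accepted proposal, if any

      def on_prepare(self, proposal_id):
          # Ignore the PREPARE if we already promised an equal or higher ID.
          if self.promised_id is not None and proposal_id <= self.promised_id:
              return None  # no reply; the proposer will eventually time out
          # Otherwise, promise to ignore anything with a lower ID and report
          # whatever (if anything) we have already accepted.
          self.promised_id = proposal_id
          return ("PROMISE", proposal_id, self.accepted_id, self.accepted_value)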


The proposer gets a majority of PROMISE messages for ID = 10. It sends ACCEPT-REQUEST(10, "VAL1") to a majority (or all) of the acceptors, subject to a condition we will see below.


An acceptor receives an ACCEPT-REQUEST message for an (ID, value) pair. It checks whether it has promised to ignore requests with this ID.
If yes, it ignores the message.
If no, it replies with ACCEPT(ID, value) and also sends it to all learners. If a majority of acceptors accept an (ID, value) pair, consensus is reached. Consensus will always be on this value, called the "chosen" value; it will never change for this run.
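Continuing the same illustrative sketch, the accept-phase check could be a method added to the Acceptor class above (again, all names are assumptions):

      def on_accept_request(self, proposal_id, value):
          # Part of the Acceptor sketch above.
          # Ignore the request if we have since promised a higher ID.
          if self.promised_id is not None and proposal_id < self.promised_id:
              return None
          # Otherwise accept the proposal and remember it, so that any later
          # PROMISE we send carries this (ID, value) pair forward.
          self.promised_id = proposal_id
          self.accepted_id = proposal_id
          self.accepted_value = value
          return ("ACCEPT", proposal_id, value)  # sent back and to all learners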

Proposers and learners get ACCEPT messages for the (ID, value) pair. If a proposer/learner gets a majority of accepts for this (ID, value) pair, it knows that consensus has been reached on that value. The acceptors themselves don't know that consensus has been reached.

Now let's say a new proposer comes along and sends a PREPARE 9 message. Since a majority of acceptors have promised ID 10, they ignore it and the request times out. So the proposer retries with a higher ID, 11.



The acceptors check the ID, which is greater than 10, so they reply with a PROMISE message, subject to:

  • Has the acceptor ever accepted anything?
  • If yes, it replies with PROMISE(ID, acceptedID, acceptedValue)
  • If no, it replies with PROMISE(ID)
In this case, since the acceptors have already accepted a value, they will send PROMISE(11, 10, "VAL1").

The proposer gets a majority of PROMISE messages. Here it checks:

  • Did I get any already-accepted value in the promises?
  • If yes, it picks the value with the highest accepted ID that it received (see the sketch below)
  • If no, it picks any value it wants
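
A minimal sketch of that rule, assuming each PROMISE carries the optional (accepted ID, accepted value) pair as in the acceptor sketch earlier:

  def choose_value_to_propose(promises, my_value):
      """promises: list of (accepted_id, accepted_value) pairs; both fields
      are None if that acceptor has never accepted anything."""
      accepted = [(pid, val) for pid, val in promises if pid is not None]
      if accepted:
          # Must re-propose the value carried by the highest accepted ID.
          _, value = max(accepted, key=lambda pair: pair[0])
          return value
      # Free to propose our own value otherwise.
      return my_value

  # In the run above: promises = [(10, "VAL1"), (10, "VAL1"), (None, None)]
  # choose_value_to_propose(promises, "VAL2") returns "VAL1".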

So the run proceeds as follows:


If a majority of acceptors accept a request with an ID and value, consensus has been reached on that value.
No accept request with a lower ID will be accepted by a majority.
No accept request with a higher ID and a different value will be accepted by a majority: at least one acceptor in any majority will report the already-accepted (ID, value) pair in its promise, and that value will propagate.

What could go wrong?

  • One or more acceptors fail: Paxos still works as long as a majority is up
  • Proposer fails in the prepare phase: no-op, another proposer can make progress
  • Proposer fails in the accept phase: another proposer finishes the job.

Basic Paxos guarantees
  • Safety: Only a single value may be chosen
  • Liveness: As long as majority of servers are up and communicating with reasonable timeliness, some proposed value is eventually chosen. If a value is chosen, servers eventually learn about it.
References:
  • Leslie Lamport, "The Part-Time Parliament", ACM Transactions on Computer Systems, 1998.
  • Leslie Lamport, "Paxos Made Simple", 2001.

Friday, December 28, 2018

Distributed systems

Web-scale distributed systems have become an integral part of today's society, be it searching the internet, shopping on an e-commerce platform, or using a social network. These systems comprise tens of thousands of machines per data center, working collaboratively in unison to deliver an uninterrupted and performant user experience.

At this scale, system failures are the norm. In addition to being sharded, the data is replicated across the globe for availability and efficiency. These systems handle petabytes or exabytes of data and billions of user requests each day, and one has to answer many hard and complex questions to keep them running:

  1. Partitioning: How do you partition the data (row ranges, user-specified partitions, consistent hashing, etc.)?
  2. Availability: How do you handle failures and recover from them?
  3. Replication and Consistency: If the data is replicated, how do you keep the copies consistent? Do you guarantee synchronous consistency or settle for eventual consistency?
  4. Load balancing: How do you balance the load across these systems? How do you route traffic?
  5. Membership: As machines join and leave the system, how do you keep track of this?
  6. Monitoring: How do you keep an overview of the entire ecosystem? How do you weed out unhealthy machines? How and when do you update them?

In this post, I will briefly summarize some well-known distributed systems and how they answer the questions posed above. Later in this blog, I will describe ideas proposed in the literature for specific problems like caching and routing.


  • Bigtable
    This is a scalable, distributed key-value store used by Google as a data store for many applications. Bigtable formed the basis of other Google systems like Megastore and Spanner.
  • Dynamo
    This is a distributed hash table built by Amazon for storing small objects. A peculiar thing about this system is its primary-key-only interface. Dynamo is a symmetric, decentralized, eventually consistent, zero-hop distributed hash table.


Coming up next: Google File System (GFS), the EarlyBird index, and Kafka.

Bigtable: A Distributed Storage System for Structured Data. (paper link)

  • Bigtable is a sparse, distributed, persistent, multi-dimensional, sorted map.
  • The map is indexed by a row key, column key and a time stamp; each value in the map is an uninterpreted array of bytes.
  • Designed to scale to petabytes, used extensively within Google.
  • Can be used for everything from throughput-oriented, batch-processing jobs to latency-sensitive serving of data
  • Supports atomic read-modify-write on a row, but transactions do not span multiple rows.
  • Stronger consistency than Dynamo. GFS ensures that the logs are flushed to all replicas for a successful write.
  • Data is row-partitioned; the on-disk data is append-only and immutable. The only mutable data structure is the memtable, an in-memory structure. Data is reconciled during compaction runs.

As an example, in a Webtable one would use URLs as row keys and various aspects of the pages, such as anchors and language, as column names.

Data Model:
Rows:
  • Row keys are arbitrary strings up to 64 KB. Data is stored in sorted order by row key.
Columns:
  1. Within each row, column keys are grouped into sets called column families
    1. A column family is the basic unit of access control
  2. For instance, in the Webtable shown above, there can be multiple columns within the "anchor" family

Partitioning:
  1. The table is partitioned dynamically by row range. Each row range is called a tablet.
  2. A machine, or tablet server, serves tablets of roughly 100 to 200 MB each.
    • Principle of many more partitions than machines
    • Easier to load-balance, and it also facilitates faster recovery
  3. Every tablet is a single contiguous row range sorted on row key, which provides good locality for data access
For example, pages in the same domain are grouped into contiguous rows by reversing the hostname components of the URLs, like:
  com.a.b/index.html
  com.a.b/sitemap.html
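
A tiny helper that produces such row keys might look like this (purely illustrative; Bigtable itself leaves key design to the application):

  def url_to_row_key(url):
      """Reverse the hostname components so pages from the same domain
      sort next to each other (e.g. b.a.com -> com.a.b)."""
      host, _, path = url.partition("/")
      reversed_host = ".".join(reversed(host.split(".")))
      return reversed_host + "/" + path

  print(url_to_row_key("b.a.com/index.html"))    # com.a.b/index.html
  print(url_to_row_key("b.a.com/sitemap.html"))  # com.a.b/sitemap.html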

Timestamps
  1. Each cell in Bigtable can contain multiple versions of the same data, indexed by 64-bit timestamps.
  2. The number of versions is configurable; versions are stored in descending timestamp order, latest first.
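
To make the data model concrete, here is a small in-memory approximation of the (row key, column key, timestamp) -> value map described above; the real system is distributed and persistent, and this API is invented for illustration.

  import time
  from collections import defaultdict

  class ToyBigtable:
      """Sparse map: (row key, "family:qualifier", timestamp) -> value."""

      def __init__(self):
          # row -> column -> list of (timestamp, value), newest first
          self.rows = defaultdict(lambda: defaultdict(list))

      def put(self, row, column, value, timestamp=None):
          ts = timestamp if timestamp is not None else int(time.time() * 1e6)
          cell = self.rows[row][column]
          cell.append((ts, value))
          cell.sort(key=lambda tv: tv[0], reverse=True)  # latest version first

      def get(self, row, column):
          cell = self.rows[row][column]
          return cell[0][1] if cell else None  # newest version wins

  table = ToyBigtable()
  table.put("com.a.b/index.html", "anchor:example.com", b"click here")
  print(table.get("com.a.b/index.html", "anchor:example.com"))  # b'click here'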

Storage

Bigtable uses the Google File System (GFS) for storage. It depends on the underlying cluster management system for dealing with machine failures, replication, scheduling jobs, etc.


Storage Format

The underlying Bigtable data is stored in the SSTable file format (sorted string table). An SSTable is a persistent, ordered, immutable map from keys to values.

An SSTable contains a sequence of blocks. A block index stored at the end of the SSTable is used to locate blocks; the index is loaded into memory when the SSTable is opened. Optionally, an SSTable can be completely mapped into memory.
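
A simplified view of how a lookup can use that in-memory block index (illustrative only, not the real on-disk format):

  import bisect

  class ToySSTable:
      """Immutable, sorted key -> value map split into blocks plus an index."""

      def __init__(self, sorted_items, block_size=2):
          # Split the already-sorted (key, value) pairs into fixed-size blocks.
          self.blocks = [sorted_items[i:i + block_size]
                         for i in range(0, len(sorted_items), block_size)]
          # Block index: first key of each block, kept in memory once opened.
          self.index = [block[0][0] for block in self.blocks]

      def get(self, key):
          # Binary-search the index for the block that may hold the key,
          # then read only that block (a single disk seek in the real system).
          i = bisect.bisect_right(self.index, key) - 1
          if i < 0:
              return None
          for k, v in self.blocks[i]:
              if k == key:
                  return v
          return None

  sst = ToySSTable([("a", 1), ("b", 2), ("c", 3), ("d", 4)])
  print(sst.get("c"))  # 3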


Implementation:

Chubby: Bigtable uses Chubby (a highly available, persistent distributed lock service) for:
  1. Ensuring there is at most one active master at a time
  2. Storing the bootstrap location of Bigtable data
  3. Discovering tablet servers and finalizing tablet server deaths
  4. Storing schema information

If Chubby becomes unavailable, Bigtable becomes unavailable.

One master and many tablet servers
  • Tablet servers can be dynamically added or removed to accommodate cluster changes.
  • The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, garbage-collecting files in GFS, and handling schema changes.
  • A tablet server handles read and write requests to the tablets it has loaded. The master is lightly loaded in practice.

Tablet Assignments

Bigtable uses Chubby extensively to keep track of servers. The master periodically asks each tablet server for the status of its Chubby lock. If a tablet server reports that it has lost its lock, or there is no response after several attempts, the master tries to acquire an exclusive lock on the server's file itself. If the master succeeds, the tablet server is either dead or disconnected from Chubby, so the master deletes the server file, ensuring this tablet server never serves again, and moves the tablets it was serving to the set of unassigned tablets.
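
Chubby's API is not public, so the following is only a hypothetical sketch of the liveness check described above; lock_service, its methods, and master_state are invented stand-ins.

  def check_tablet_server(lock_service, server_file, master_state):
      """Hypothetical sketch of the master's periodic liveness check."""
      status = lock_service.ask_lock_status(server_file)  # may time out
      if status == "HELD":
          return  # the tablet server is alive and still owns its lock
      # The server reported its lock lost, or never answered: try to take
      # the lock ourselves to rule out a problem with the lock service.
      if lock_service.try_acquire(server_file):
          # The server is dead or partitioned. Retire it permanently and
          # hand the tablets it was serving back to the unassigned pool.
          lock_service.delete(server_file)
          master_state.unassigned_tablets.update(
              master_state.tablets_served_by(server_file))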


Tablet Serving

The persistent state of a tablet is stored in GFS. Updates are committed to a commit log that stores the redo records. Of these updates, the recently committed ones are held in memory in a sorted buffer called the memtable; older updates are stored on disk in a sequence of SSTables.


Compaction

As write operations execute, the size of the memtable increases. When it reaches a threshold, a new memtable is created, and the frozen memtable is converted into an SSTable and written to GFS (a minor compaction).
A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction. The resulting SSTable contains no deleted entries.
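
A very rough sketch of the two compactions, with all names invented: a minor compaction freezes the memtable and writes it out as an SSTable, and a major compaction merges everything into one SSTable with no deleted entries.

  DELETED = object()  # sentinel marking a deletion entry

  def minor_compaction(memtable, sstables):
      """Freeze the current memtable and persist it as a new SSTable."""
      sstables.append(dict(memtable))  # in reality: written to GFS as an SSTable
      memtable.clear()                 # a fresh memtable takes over new writes

  def major_compaction(sstables):
      """Merge all SSTables into exactly one, discarding deleted entries."""
      merged = {}
      for table in sstables:           # oldest first, so newer values win
          merged.update(table)
      return {k: v for k, v in merged.items() if v is not DELETED}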


Refinements
  1. Locality groups: groups of column families, each with their own SSTables.
    • Column families that are rarely read together can be segregated
    • A locality group can be declared in-memory for faster reads
  2. Compression
    • Clients can specify the compression scheme used for a locality group's SSTables.
    • Encodes at 100-200 MB/s and decodes at 400-1000 MB/s
  3. Read performance
    • Scan cache: caches key-value pairs returned by the SSTable interface
    • Block cache: caches SSTable blocks read from GFS
  4. Bloom filters
    • To check whether an SSTable might contain any data for a given row/column pair (see the sketch after this list)
  5. One commit log per tablet server, rather than one per tablet
  6. The only mutable data structure is the memtable
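
For the Bloom-filter refinement above, here is a minimal sketch of a filter an SSTable could keep so that most reads for absent row/column pairs never touch GFS (a toy implementation, not Bigtable's):

  import hashlib

  class TinyBloomFilter:
      """May return false positives, but never false negatives."""

      def __init__(self, size_bits=1024, num_hashes=3):
          self.size = size_bits
          self.num_hashes = num_hashes
          self.bits = 0

      def _positions(self, key):
          for i in range(self.num_hashes):
              digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
              yield int(digest, 16) % self.size

      def add(self, key):
          for pos in self._positions(key):
              self.bits |= 1 << pos

      def might_contain(self, key):
          return all(self.bits & (1 << pos) for pos in self._positions(key))

  # On a read, consult the SSTable's filter before issuing any GFS reads.
  bloom = TinyBloomFilter()
  bloom.add("com.a.b/index.html:anchor")
  print(bloom.might_contain("com.a.b/index.html:anchor"))   # True
  print(bloom.might_contain("com.z.q/other.html:anchor"))   # almost surely False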

Performance
  • Random and sequential writes perform better than random reads, since writes are append-only.
  • Sequential reads perform better than random reads.