
Wednesday, January 2, 2019

Paxos Basics

Paxos is a family of distributed algorithms used to reach consensus in distributed systems. In this post we will briefly discuss single-decree Paxos. The algorithm was first proposed by Leslie Lamport in his paper "The Part-Time Parliament" and later described more directly in "Paxos Made Simple". Paxos, in some form, shows up in almost every large distributed system.


Why do we need consensus in the first place?
Most of these distributed systems comprise tens of thousands of machines working in unison. Replication is used for redundancy as well as efficiency. But once you have replicas, you face the challenge of keeping them consistent, and failure is the norm in such environments. There are multiple scenarios where consensus plays an important role: keeping data consistent across replicas, leader election, or even resource sharing/locking.
Let's keep all of these aside and start with a simple example.

Say there are 4 friends. They want to do something together over the weekend. They don't care what they will do specifically, just that they want to do it together. Here is a timeline of the various events.

Once a majority agrees on a proposal, that is consensus.

In the Paxos world, the involved parties want to agree on a result, not on their own proposal. Communication channels may be faulty, leading to message loss.

Paxos basics
  • Proposers: propose a new value to reach consensus on
  • Acceptors: participate in reaching consensus
  • Learners: learn the chosen value and can be queried for it
  • Paxos nodes can take multiple roles, even all of them
  • Paxos nodes must know how many acceptors form a majority (a quorum)
  • Paxos nodes must be persistent: they can't forget what they have accepted
  • A Paxos run aims at reaching a single consensus. It can agree on only one value, and that value cannot change; if you want to mutate the value, a separate Paxos run is needed.
  • Basic Paxos (there are more complex variations) assumes fail-stop failures, not Byzantine failures.
With this introduction, let's get into a Paxos run.


Let's say we have one proposer and 5 acceptors. Majority in this case is 3.

The proposer wants to propose a value. It sends PREPARE(ID) to all the acceptors (or a majority of them). The ID must be unique across rounds. If the proposer doesn't hear back from the acceptors, it times out and retries with a new (higher) ID. This ID should not be confused with the actual value on which consensus must be reached; it can be seen as a round ID.



An acceptor gets a PREPARE message with ID 10. It checks whether it has already promised to ignore requests with this ID.
If yes, it ignores the message.
If no, it promises to ignore any request with an ID lower than 10, subject to some conditions we will see later. If a majority of acceptors make this promise, no request with ID < 10 will make it through.
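As a rough, illustrative sketch (not the author's or any library's actual code), an acceptor's prepare-phase logic could look like the following Python; the class and field names are invented for this example.

  class Acceptor:
      """Sketch of an acceptor's persistent state (field names are assumptions)."""

      def __init__(self):
          self.promised_id = None     # highest ID this acceptor has promised to honor
          self.accepted_id = None     # ID of the last accepted proposal, if any
          self.accepted_value = None  # value of the last accepted proposal, if any

      def on_prepare(self, proposal_id):
          # Ignore the PREPARE if we already promised an equal or higher ID.
          if self.promised_id is not None and proposal_id <= self.promised_id:
              return None  # no reply; the proposer will eventually time out
          # Otherwise, promise to ignore anything with a lower ID and report
          # whatever (if anything) we have already accepted.
          self.promised_id = proposal_id
          return ("PROMISE", proposal_id, self.accepted_id, self.accepted_value)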


The proposer gets a majority of PROMISE messages for ID = 10. It sends ACCEPT-REQUEST(10, "VAL1") to a majority (or all) of the acceptors, subject to a condition we will see below.


An acceptor receives an ACCEPT-REQUEST message for an (ID, value) pair. It checks whether it has promised to ignore requests with this ID.
If yes, it ignores the message.
If no, it replies with ACCEPT(ID, value) and also sends it to all learners. If a majority of acceptors accept an (ID, value) pair, consensus is reached. Consensus will always be on this value, called the "chosen" value; it will never change for this run.
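Continuing the same illustrative sketch, the accept-phase check could be a method added to the Acceptor class above (again, all names are assumptions):

      def on_accept_request(self, proposal_id, value):
          # Part of the Acceptor sketch above.
          # Ignore the request if we have since promised a higher ID.
          if self.promised_id is not None and proposal_id < self.promised_id:
              return None
          # Otherwise accept the proposal and remember it, so that any later
          # PROMISE we send carries this (ID, value) pair forward.
          self.promised_id = proposal_id
          self.accepted_id = proposal_id
          self.accepted_value = value
          return ("ACCEPT", proposal_id, value)  # sent back and to all learners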

Proposers and learners get ACCEPT messages for the (ID, value) pair. If a proposer/learner gets a majority of accepts for this (ID, value) pair, it knows that consensus has been reached on that value. The acceptors themselves don't know that consensus has been reached.

Now let's say a new proposer comes along and sends a PREPARE 9 message. Since a majority of acceptors have promised ID 10, they ignore it and the request times out. So the proposer retries with a higher ID, 11.



The acceptors check the ID, which is greater than 10, so they reply with a PROMISE message, subject to:

  • Has the acceptor ever accepted anything?
  • If yes, it replies with PROMISE(ID, acceptedID, acceptedValue)
  • If no, it replies with PROMISE(ID)
In this case, since the acceptors have already accepted a value, they will send PROMISE(11, 10, "VAL1").

The proposer gets a majority of PROMISE messages. Here it checks:

  • Did I get any already-accepted value in the promises?
  • If yes, it picks the value with the highest accepted ID that it received (see the sketch below)
  • If no, it picks any value it wants
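
A minimal sketch of that rule, assuming each PROMISE carries the optional (accepted ID, accepted value) pair as in the acceptor sketch earlier:

  def choose_value_to_propose(promises, my_value):
      """promises: list of (accepted_id, accepted_value) pairs; both fields
      are None if that acceptor has never accepted anything."""
      accepted = [(pid, val) for pid, val in promises if pid is not None]
      if accepted:
          # Must re-propose the value carried by the highest accepted ID.
          _, value = max(accepted, key=lambda pair: pair[0])
          return value
      # Free to propose our own value otherwise.
      return my_value

  # In the run above: promises = [(10, "VAL1"), (10, "VAL1"), (None, None)]
  # choose_value_to_propose(promises, "VAL2") returns "VAL1".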

So the run proceeds as follows:


If a majority of acceptors accept a request with an ID and value, consensus has been reached on that value.
No accept request with a lower ID will be accepted by a majority.
No accept request with a higher ID and a different value will be accepted by a majority: at least one acceptor in any majority will report the already-accepted (ID, value) pair in its promise, and that value will propagate.

What could go wrong?

  • One or more acceptors fail: Paxos still works as long as a majority is up
  • Proposer fails in the prepare phase: no-op, another proposer can make progress
  • Proposer fails in the accept phase: another proposer finishes the job.

Basic Paxos guarantees
  • Safety: Only a single value may be chosen
  • Liveness: As long as majority of servers are up and communicating with reasonable timeliness, some proposed value is eventually chosen. If a value is chosen, servers eventually learn about it.
References:
  • Leslie Lamport, "The Part-Time Parliament", ACM Transactions on Computer Systems, 1998.
  • Leslie Lamport, "Paxos Made Simple", 2001.

Friday, December 28, 2018

Distributed systems

Web-scale distributed systems have become an integral part of today's society, be it searching the internet, shopping on an e-commerce platform, or using a social network. These systems comprise tens of thousands of machines per data center, working collaboratively in unison to deliver an uninterrupted and performant user experience.

At this scale, system failures are the norm. In addition to being sharded, the data is replicated across the globe for availability and efficiency. These systems handle petabytes or exabytes of data and billions of user requests each day, and one has to answer many hard and complex questions to keep them running:

  1. Partitioning: How do you partition the data (row ranges, user-specified partitions, consistent hashing, etc.)?
  2. Availability: How do you handle failures and recover from them?
  3. Replication and Consistency: If the data is replicated, how do you keep the copies consistent? Do you guarantee synchronous consistency or settle for eventual consistency?
  4. Load balancing: How do you balance the load across these systems? How do you route traffic?
  5. Membership: As machines join and leave the system, how do you keep track of this?
  6. Monitoring: How do you keep an overview of the entire ecosystem? How do you weed out unhealthy machines? How and when do you update them?

In this post, I will briefly summarize some well-known distributed systems and how they answer the questions posed above. Later in this blog, I will describe ideas proposed in the literature for specific problems like caching and routing.


  • Bigtable
    This is a scalable, distributed key-value store used by Google as a data store for many applications. Bigtable formed the basis of other Google systems like Megastore and Spanner.
  • Dynamo
    This is a distributed hash table built by Amazon for storing small objects. A peculiar thing about this system is its primary-key-only interface. Dynamo is a symmetric, decentralized, eventually consistent, zero-hop distributed hash table.


Coming up next: Google File System (GFS), the EarlyBird index, and Kafka.

Bigtable: A Distributed Storage System for Structured Data. (paper link)

  • Bigtable is a sparse, distributed, persistent, multi-dimensional, sorted map.
  • The map is indexed by a row key, column key and a time stamp; each value in the map is an uninterpreted array of bytes.
  • Designed to scale to petabytes, used extensively within Google.
  • Can be used for everything from throughput-oriented, batch-processing jobs to latency-sensitive serving of data
  • Supports atomic read-modify-write on a row, but transactions do not span multiple rows.
  • Stronger consistency than Dynamo. GFS ensures that the logs are flushed to all replicas for a successful write.
  • Data is row-partitioned; the on-disk data is append-only and immutable. The only mutable data structure is the memtable, an in-memory structure. Data is reconciled during compaction runs.

As an example, in a Webtable one would use URLs as row keys and various aspects of the pages, such as anchors and language, as column names.

Data Model:
Rows:
  • Row keys are arbitrary strings up to 64 KB. Data is stored in sorted order by row key.
Columns:
  1. Within each row, column keys are grouped into sets called column families
    1. A column family is the basic unit of access control
  2. For instance, in the Webtable shown above, there can be multiple columns within the "anchor" family

Partitioning:
  1. The table is partitioned dynamically by row range. Each row range is called a tablet.
  2. A machine, or tablet server, serves tablets of roughly 100 to 200 MB each.
    • Principle of many more partitions than machines
    • Easier to load-balance, and it also facilitates faster recovery
  3. Every tablet is a single contiguous row range sorted on row key, which provides good locality for data access
For example, pages in the same domain are grouped into contiguous rows by reversing the hostname components of the URLs, like:
  com.a.b/index.html
  com.a.b/sitemap.html
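
A tiny helper that produces such row keys might look like this (purely illustrative; Bigtable itself leaves key design to the application):

  def url_to_row_key(url):
      """Reverse the hostname components so pages from the same domain
      sort next to each other (e.g. b.a.com -> com.a.b)."""
      host, _, path = url.partition("/")
      reversed_host = ".".join(reversed(host.split(".")))
      return reversed_host + "/" + path

  print(url_to_row_key("b.a.com/index.html"))    # com.a.b/index.html
  print(url_to_row_key("b.a.com/sitemap.html"))  # com.a.b/sitemap.html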

Timestamps
  1. Each cell in Bigtable can contain multiple versions of the same data, indexed by 64-bit timestamps.
  2. The number of versions is configurable; versions are stored in descending timestamp order, latest first.
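
To make the data model concrete, here is a small in-memory approximation of the (row key, column key, timestamp) -> value map described above; the real system is distributed and persistent, and this API is invented for illustration.

  import time
  from collections import defaultdict

  class ToyBigtable:
      """Sparse map: (row key, "family:qualifier", timestamp) -> value."""

      def __init__(self):
          # row -> column -> list of (timestamp, value), newest first
          self.rows = defaultdict(lambda: defaultdict(list))

      def put(self, row, column, value, timestamp=None):
          ts = timestamp if timestamp is not None else int(time.time() * 1e6)
          cell = self.rows[row][column]
          cell.append((ts, value))
          cell.sort(key=lambda tv: tv[0], reverse=True)  # latest version first

      def get(self, row, column):
          cell = self.rows[row][column]
          return cell[0][1] if cell else None  # newest version wins

  table = ToyBigtable()
  table.put("com.a.b/index.html", "anchor:example.com", b"click here")
  print(table.get("com.a.b/index.html", "anchor:example.com"))  # b'click here'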

Storage

Bigtable uses the Google File System (GFS) for storage. It depends on the underlying cluster management system for dealing with machine failures, replication, scheduling jobs, etc.


Storage Format

The underlying Bigtable data is stored in the SSTable file format (sorted string table). An SSTable is a persistent, ordered, immutable map from keys to values.

An SSTable contains a sequence of blocks. A block index stored at the end of the SSTable is used to locate blocks; the index is loaded into memory when the SSTable is opened. Optionally, an SSTable can be completely mapped into memory.
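
A simplified view of how a lookup can use that in-memory block index (illustrative only, not the real on-disk format):

  import bisect

  class ToySSTable:
      """Immutable, sorted key -> value map split into blocks plus an index."""

      def __init__(self, sorted_items, block_size=2):
          # Split the already-sorted (key, value) pairs into fixed-size blocks.
          self.blocks = [sorted_items[i:i + block_size]
                         for i in range(0, len(sorted_items), block_size)]
          # Block index: first key of each block, kept in memory once opened.
          self.index = [block[0][0] for block in self.blocks]

      def get(self, key):
          # Binary-search the index for the block that may hold the key,
          # then read only that block (a single disk seek in the real system).
          i = bisect.bisect_right(self.index, key) - 1
          if i < 0:
              return None
          for k, v in self.blocks[i]:
              if k == key:
                  return v
          return None

  sst = ToySSTable([("a", 1), ("b", 2), ("c", 3), ("d", 4)])
  print(sst.get("c"))  # 3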


Implementation:

Chubby: Bigtable uses Chubby (a highly available, persistent distributed lock service) for:
  1. Ensuring there is at most one active master at a time
  2. Storing the bootstrap location of Bigtable data
  3. Discovering tablet servers and finalizing tablet server deaths
  4. Storing schema information

If Chubby becomes unavailable, Bigtable becomes unavailable.

One master and many tablet servers
  • Tablet servers can be dynamically added or removed to accommodate cluster changes.
  • The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, garbage-collecting files in GFS, and handling schema changes.
  • A tablet server handles read and write requests to the tablets it has loaded. The master is lightly loaded in practice.

Tablet Assignments

Bigtable uses Chubby extensively to keep track of servers. The master periodically asks each tablet server for the status of its Chubby lock. If a tablet server reports that it has lost its lock, or there is no response after several attempts, the master tries to acquire an exclusive lock on the server's file itself. If the master succeeds, the tablet server is either dead or disconnected from Chubby, so the master deletes the server file, ensuring this tablet server never serves again, and moves the tablets it was serving to the set of unassigned tablets.
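
Chubby's API is not public, so the following is only a hypothetical sketch of the liveness check described above; lock_service, its methods, and master_state are invented stand-ins.

  def check_tablet_server(lock_service, server_file, master_state):
      """Hypothetical sketch of the master's periodic liveness check."""
      status = lock_service.ask_lock_status(server_file)  # may time out
      if status == "HELD":
          return  # the tablet server is alive and still owns its lock
      # The server reported its lock lost, or never answered: try to take
      # the lock ourselves to rule out a problem with the lock service.
      if lock_service.try_acquire(server_file):
          # The server is dead or partitioned. Retire it permanently and
          # hand the tablets it was serving back to the unassigned pool.
          lock_service.delete(server_file)
          master_state.unassigned_tablets.update(
              master_state.tablets_served_by(server_file))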


Tablet Serving

The persistent state of a tablet is stored in GFS. Updates are committed to a commit log that stores the redo records. Of these updates, the recently committed ones are held in memory in a sorted buffer called the memtable; older updates are stored on disk in a sequence of SSTables.


Compaction

As write operations execute, the size of the memtable increases. When it reaches a threshold, a new memtable is created, and the frozen memtable is converted into an SSTable and written to GFS (a minor compaction).
A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction. The resulting SSTable contains no deleted entries.
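
A very rough sketch of the two compactions, with all names invented: a minor compaction freezes the memtable and writes it out as an SSTable, and a major compaction merges everything into one SSTable with no deleted entries.

  DELETED = object()  # sentinel marking a deletion entry

  def minor_compaction(memtable, sstables):
      """Freeze the current memtable and persist it as a new SSTable."""
      sstables.append(dict(memtable))  # in reality: written to GFS as an SSTable
      memtable.clear()                 # a fresh memtable takes over new writes

  def major_compaction(sstables):
      """Merge all SSTables into exactly one, discarding deleted entries."""
      merged = {}
      for table in sstables:           # oldest first, so newer values win
          merged.update(table)
      return {k: v for k, v in merged.items() if v is not DELETED}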


Refinements
  1. Locality groups: groups of column families, each with their own SSTables.
    • Column families that are rarely read together can be segregated
    • A locality group can be declared in-memory for faster reads
  2. Compression
    • Clients can specify the compression scheme used for a locality group's SSTables.
    • Encodes at 100-200 MB/s and decodes at 400-1000 MB/s
  3. Read performance
    • Scan cache: caches key-value pairs returned by the SSTable interface
    • Block cache: caches SSTable blocks read from GFS
  4. Bloom filters
    • To check whether an SSTable might contain any data for a given row/column pair (see the sketch after this list)
  5. One commit log per tablet server, rather than one per tablet
  6. The only mutable data structure is the memtable
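
For the Bloom-filter refinement above, here is a minimal sketch of a filter an SSTable could keep so that most reads for absent row/column pairs never touch GFS (a toy implementation, not Bigtable's):

  import hashlib

  class TinyBloomFilter:
      """May return false positives, but never false negatives."""

      def __init__(self, size_bits=1024, num_hashes=3):
          self.size = size_bits
          self.num_hashes = num_hashes
          self.bits = 0

      def _positions(self, key):
          for i in range(self.num_hashes):
              digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
              yield int(digest, 16) % self.size

      def add(self, key):
          for pos in self._positions(key):
              self.bits |= 1 << pos

      def might_contain(self, key):
          return all(self.bits & (1 << pos) for pos in self._positions(key))

  # On a read, consult the SSTable's filter before issuing any GFS reads.
  bloom = TinyBloomFilter()
  bloom.add("com.a.b/index.html:anchor")
  print(bloom.might_contain("com.a.b/index.html:anchor"))   # True
  print(bloom.might_contain("com.z.q/other.html:anchor"))   # almost surely False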

Performance
  • Random and sequential writes perform better than random reads, since writes are append-only.
  • Sequential reads perform better than random reads.