Web-scale Distributed systems have become an integral part of today's society, be it searching on the internet or
an online e-commerce platform or even social networks. These systems comprise of tens of thousands of machines per
data center, which work collaboratively in unison to give an interrupted and performant user experience.
At this scale, system failures are a norm. In addition to shards of data, these systems are replicated across the globe for availability and for efficiency. One has to answer many hard and complex questions to keep such systems running. These systems handle petabytes or exabytes of data and billions of user requests each day.
- Partitioning: How do you partition the data (Row ranges or user-specified or consistent hashing, etc..?)
- Availability: How do you handle failures and recover from failures?
- Replication and Consistency: If the data is replicated, how do you keep the instances consistent? Do you guarantee synchronized consistency or be eventually consistent?
- Load balancing: How do you balance the load across these systems? How do you route traffic?
- Membership: As machines join and leave the system, how do keep track of this?
- Monitoring: How do you keep a know-how of this entire ecosystem? How do you weed out the unhealthy machines? How and when do you update them?
In this post, I will briefly summarize some of the well-known distributed systems and how they seek to answer the questions posed above. Later in this blog, I will describe the proposed ideas in literature on specific problems like caching and routing.
- Big Table
This is a scalable, distributed key-value store used by Google as a data-store for many applications. Bigtable formed the basis of other Google systems like Mega-store and Spanner. - Dynamo
This is also a distributed hash table by Amazon for storing small objects. A peculiar thing about this system is its primary-key interface. Dynamo is symmetric, de-centralized, eventually consistent and zero-hop distributed hash table.
Coming up next, Google File System (GFS), EarlyBird index and Kafka.
Bigtable: A Distributed Storage System for Structured Data. (paper link)
- Bigtable is a sparse, distributed, persistent, multi-dimensional, sorted map.
- The map is indexed by a row key, column key and a time stamp; each value in the map is an uninterpreted array of bytes.
- Designed to scale to petabytes, used extensively within Google.
- Can be used for throughput-oriented, batch-processing jobs to latency-sensitive ones
- Supports atomic read-modify-write on a row, but transactions do not span multiple rows.
- Stronger consistency unlike Dynamo. GFS ensures that the logs are flushed to all the replicas for a successful write.
- Data is row-partitioned, append-only, immutable. Only mutable data structure is memtable, an in-memory data structure. The data is reconciled during compaction runs.
Data Model:
Rows:
- The row keys are arbitrary strings up to 64 KB. Stored as a sorted-order on row-key
- Within each row, each column key is grouped into sets called column families
- Basic unit of access control and data access
- For instance, in the Webtable shown above, there can be multiple columns within the "anchor" family
Partitioning:
- Table is partitioned by row ranges dynamically. Each row range is called a tablet.
- A machine or a tablet-server serves tablet sizes around 100 to 200 MB.
- Principle of more partitions than machines
- Easier to load-balance and also facilitates faster recovery
- Every tablet represents a single row-range sorted on row key, this provides good locality for data access
com.a.b/index.html
com.a.b/sitemap.html
Timestamps
- Each cell in the Bigtable can contain multiple versions of the same data (64-bit timestamps).
- Number of versions configurable, stored in descending order of time, latest first.
Storage
Bigtable uses Google file System(GFS) for the storage, It depends on the underlying cluster management system for dealing with machine failures, replication, scheduling jobs etc.
Storage Format
The underlying Bigtable data is stored in a SSTable file format(stands for sorted string table). Its a persistent, ordered immutable map from keys to values.
SSTable contains a sequence of blocks . A block index stored at the end of the SSTable is used to locate blocks; the index is loaded in-memory when SSTable is opened. Optionally, an SSTable can be completely mapped into memory.
Implementation:
Chubby: Bigtable uses Chubby (a highly available persistent distributed lock service) for:
- Electing one active master at a time
- Bootstrap location of Bigtable data
- Discover tablet servers, tablet deaths.
- Schema information
If Chubby goes down , Bigtable goes unavailable.
One master and many tablet servers- Tablet servers can be dynamically added or removed to accommodate cluster changes.
- Master is responsible for assigning a tablet severs, detecting addition/expiration of tablet servers, balancing tablet-server load, garbage collection of files in GFS, handling schema changes
- Tablet server handles read and write requests to the tablets it has loaded. Master is lightly loaded in practice.
Tablet Assignments
Bigtable uses Chubby extensively to keep track of servers. Master periodically asks each tablet server for the status of its lock, If a tablet server reports that the lock is lost or there is no response after some attempts, master tries to get an exclusive lock on server file. If master is able to acquire back, then the tablet server is either dead or disconnected. So the master deletes the server file, so that this tablet never serves again. And reallocates the tablet data to the set of unassigned tablets.
Tablet Serving
Persistent state of a tablet is stored in GFS. Updates are committed to a commit log that shares the redo records. Of these updates, the recently committed ones are stored in memory in a sorted buffer called memtable. Updates are stored in a disk in a sequence of SSTables
Compaction
As write operations execute, size of memtable increases. when it reaches some threshold, new memtable is created. The frozen memtable is converted to SSTable and writen to GFS.
A merging compaction that rewrites all SSTables into exactly on SSTable is called major compaction. This will have no deleted entries.
Refinements
- Locality groups: Groups of columns, each with own SSTable.
- Columns that are mostly not read together can be segregated
- Faster reads can be in-memory.
- Compression
- Client specific SSTable compression.
- encode at 100-200 MB/s and decode at 400-1000 MB/s
- Read performance
- Scan Cache: caches key value pairs
- Block cache: caches SSTables block read from GFS
- Bloom Filters
- To check if an SSTable might contain any data for a row/column pair.
- One log file for a tablet server
- Only mutable data structure is memtable
Performance
- Random and sequential writes perform better than random reads as it is append only.
- Sequential reads better than random.