System Design Guide

CAP Theorem: Understanding Distributed System Tradeoffs

The CAP theorem, proposed by Eric Brewer, states that a distributed system can provide at most two of three guarantees simultaneously: Consistency, Availability, and Partition tolerance. Because network partitions cannot be prevented in practice, the real decision is how the system behaves when a partition occurs: give up consistency or give up availability. This fundamental theorem shapes how we design and reason about distributed systems, highlighting the tradeoffs inherent in distributed computing.
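To make the tradeoff concrete, the sketch below (in Python, using an invented Replica class rather than any real datastore) shows a node that has lost contact with its peers: it must either refuse the request to stay consistent or answer from a possibly stale local copy to stay available.

    # Minimal sketch: a replica that, when it cannot reach a quorum of peers,
    # either refuses the request (favoring consistency) or serves its possibly
    # stale local copy (favoring availability).

    class Replica:
        def __init__(self, peers_reachable, prefer_consistency):
            self.local_value = "stale-value"         # last value this node saw
            self.peers_reachable = peers_reachable   # peers reachable right now
            self.prefer_consistency = prefer_consistency

        def read(self, cluster_size=3):
            quorum = cluster_size // 2 + 1
            if self.peers_reachable + 1 >= quorum:   # +1 counts this node itself
                return self.local_value              # quorum reachable: safe to answer
            if self.prefer_consistency:
                raise RuntimeError("partitioned: refusing a possibly stale read")
            return self.local_value                  # availability: answer anyway, may be stale

    # During a partition (no peers reachable), the CP-style node errors out,
    # while the AP-style node keeps answering with whatever it has locally.
    cp_node = Replica(peers_reachable=0, prefer_consistency=True)
    ap_node = Replica(peers_reachable=0, prefer_consistency=False)
    print(ap_node.read())          # returns possibly stale data
    try:
        cp_node.read()
    except RuntimeError as e:
        print(e)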

The Three Guarantees

Consistency means all nodes see the same data at the same time. When a write completes, all subsequent reads return that written value, regardless of which node services the read. This is equivalent to having a single, up-to-date copy of the data.
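As an illustration, here is a minimal Python sketch (hypothetical Node and ReplicatedStore classes, not a real system) in which a write is acknowledged only after every replica has applied it, so a read served by any node returns the written value.

    # Illustrative sketch: the write "completes" only once all nodes have
    # applied it, so read-after-write holds regardless of which node answers.

    class Node:
        def __init__(self, name):
            self.name = name
            self.data = {}

        def apply(self, key, value):
            self.data[key] = value

        def read(self, key):
            return self.data.get(key)

    class ReplicatedStore:
        def __init__(self, nodes):
            self.nodes = nodes

        def write(self, key, value):
            # Apply to every replica before acknowledging the write.
            for node in self.nodes:
                node.apply(key, value)

    nodes = [Node("a"), Node("b"), Node("c")]
    store = ReplicatedStore(nodes)
    store.write("balance", 100)
    # Regardless of which node services the read, the result is the written value.
    assert all(node.read("balance") == 100 for node in nodes)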

Consensus Algorithms: Reaching Agreement in Distributed Systems

Consensus algorithms enable multiple nodes in a distributed system to agree on a single value or sequence of operations despite failures and network issues. This fundamental capability underpins leader election, distributed transactions, and any scenario requiring coordination across unreliable components.

The Consensus Problem

Distributed systems face a deceptively simple challenge: how do multiple independent nodes agree on something? For example, which node should be the primary database? What’s the next transaction to commit? Without consensus, different nodes might make conflicting decisions, leading to split-brain scenarios and data inconsistency.
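The sketch below illustrates the core idea with a toy majority vote (this is not Raft or Paxos, and the node and candidate names are made up): a node can become primary only if a strict majority of the cluster votes for it, so at most one primary can be chosen in a given round.

    # Toy sketch of majority agreement: a candidate becomes leader only with a
    # strict majority, which rules out two simultaneous leaders (split brain)
    # for a fixed, non-faulty membership.

    from collections import Counter

    def elect_leader(votes, cluster_size):
        """votes: mapping of node_id -> candidate that node voted for."""
        quorum = cluster_size // 2 + 1
        tally = Counter(votes.values())
        for candidate, count in tally.items():
            if count >= quorum:
                return candidate       # only one candidate can reach a quorum
        return None                    # no majority: no leader this round

    # Five-node cluster; three votes for "node-2" form a majority.
    votes = {"n1": "node-2", "n2": "node-2", "n3": "node-2",
             "n4": "node-5", "n5": "node-5"}
    print(elect_leader(votes, cluster_size=5))   # -> "node-2"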

Consistency Models in Distributed Systems

Consistency models define the rules about how and when updates to distributed data become visible to readers. Understanding these models is crucial for designing distributed systems, as they represent fundamental tradeoffs between performance, availability, and the guarantees provided to applications and users.

Strong Consistency

Strong consistency, also called linearizability, provides the illusion that operations happen atomically at a single point in time. Once a write completes, all subsequent reads return that value, regardless of which replica services the request. This matches our intuitive understanding of how data should behave.
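A minimal sketch of the guarantee, assuming a single authoritative copy guarded by one lock rather than a real replicated store: because every operation takes effect at a single serialization point, any read that starts after a write completes observes that write.

    # Minimal sketch of linearizable behavior: one lock around one copy gives a
    # single point in time at which each operation takes effect.

    import threading

    class LinearizableRegister:
        def __init__(self):
            self._value = None
            self._lock = threading.Lock()   # single serialization point

        def write(self, value):
            with self._lock:
                self._value = value         # visible the instant write() returns

        def read(self):
            with self._lock:
                return self._value

    reg = LinearizableRegister()
    reg.write(42)
    assert reg.read() == 42                 # any later read observes the write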

Distributed Transactions: Maintaining Consistency Across Services

Distributed transactions coordinate operations across multiple databases or services, ensuring atomicity: either all operations succeed or all fail, preventing partial updates that leave the system in an inconsistent state. While conceptually straightforward, distributed transactions introduce significant complexity and performance challenges in practice.

The Challenge

In a monolithic application with a single database, transactions are straightforward. The database’s ACID properties ensure atomicity, consistency, isolation, and durability. Wrapping multiple operations in a transaction guarantees they either all commit or all roll back.
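For contrast with the distributed case, the following sketch uses sqlite3 from the Python standard library: two updates run inside one transaction, and a simulated failure rolls both back, so the database never holds a partial result.

    # Single-database case: both updates are wrapped in one transaction, so a
    # failure before commit rolls both back.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
    conn.commit()

    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
            conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
            raise RuntimeError("simulated failure before commit")
    except RuntimeError:
        pass

    # Neither update took effect: the transaction was atomic.
    print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'alice': 100, 'bob': 0}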

Two-Phase Commit: Coordinating Distributed Transactions

Two-Phase Commit (2PC) is a distributed algorithm that coordinates transactions across multiple databases or services to ensure atomicity. Although it has been the standard approach to distributed transactions for decades, 2PC has significant limitations that make it unsuitable for many modern distributed systems. Understanding both its mechanics and its limitations is essential for architectural decisions.

Protocol Overview

2PC involves a coordinator and multiple participants (databases, services, or resource managers). The coordinator orchestrates the transaction, ensuring all participants either commit or abort together. The protocol proceeds in two distinct phases, hence the name.
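A simplified sketch of the two phases, assuming an in-process Participant interface invented for illustration (no networking, timeouts, or recovery logging): the coordinator collects prepare votes from every participant and tells them all to commit only if every vote is yes.

    # Simplified 2PC sketch: Phase 1 gathers votes, Phase 2 commits or aborts
    # everyone based on those votes.

    class Participant:
        def __init__(self, name, can_commit=True):
            self.name = name
            self.can_commit = can_commit
            self.state = "idle"

        def prepare(self):
            # Phase 1: stage the work durably and vote yes/no.
            self.state = "prepared" if self.can_commit else "aborted"
            return self.can_commit

        def commit(self):
            self.state = "committed"    # Phase 2: make the staged work permanent

        def abort(self):
            self.state = "aborted"      # Phase 2: undo the staged work

    def two_phase_commit(participants):
        votes = [p.prepare() for p in participants]   # Phase 1 (voting)
        if all(votes):
            for p in participants:                    # Phase 2 (completion)
                p.commit()
            return "committed"
        for p in participants:                        # any "no" vote aborts everyone
            p.abort()
        return "aborted"

    orders_db = Participant("orders")
    payments_db = Participant("payments", can_commit=False)   # votes no
    print(two_phase_commit([orders_db, payments_db]))          # -> "aborted"

In the run above, the payments participant votes no during the voting phase, so the coordinator instructs both participants to abort and no partial update survives.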