System Design Guide

Consensus Algorithms: Agreeing in Distributed Systems

Consensus algorithms enable multiple nodes in a distributed system to agree on a single value or sequence of operations despite failures and network issues. This fundamental capability underpins leader election, distributed transactions, and any scenario requiring coordination across unreliable components.

The Consensus Problem

Distributed systems face a deceptively simple challenge: how do multiple independent nodes agree on something? For example, which node should be the primary database? What’s the next transaction to commit? Without consensus, different nodes might make conflicting decisions, leading to split-brain scenarios and data inconsistency.

The challenge is that nodes and networks fail. Messages get lost, delayed, or duplicated. Nodes crash and recover. Byzantine failures cause nodes to behave arbitrarily or maliciously. Consensus algorithms must produce agreement despite these difficulties.

A consensus algorithm must satisfy three properties: agreement (all correct nodes decide on the same value), validity (the decided value was proposed by some node), and termination (all correct nodes eventually decide). Achieving all three in the face of failures is the core challenge; the FLP impossibility result shows that no deterministic algorithm can guarantee all three in a fully asynchronous system with even one faulty node, which is why practical algorithms rely on timeouts or randomization for liveness.
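
To make the first two properties concrete, the sketch below (hypothetical types and names, not from any real library) checks agreement and validity over a snapshot of what each node has decided; termination is a liveness property and cannot be judged from a single snapshot:

```go
package main

import "fmt"

// Decision records what one node has decided, if anything.
// These types and names are illustrative, not from any real library.
type Decision struct {
	NodeID  string
	Decided bool
	Value   string
}

// checkAgreementAndValidity verifies two of the three properties over a
// snapshot of node states: every decided node chose the same value
// (agreement), and that value was actually proposed (validity).
func checkAgreementAndValidity(decisions []Decision, proposed map[string]bool) error {
	var chosen *string
	for _, d := range decisions {
		if !d.Decided {
			continue
		}
		if !proposed[d.Value] {
			return fmt.Errorf("validity violated: %q was never proposed", d.Value)
		}
		if chosen == nil {
			v := d.Value
			chosen = &v
		} else if *chosen != d.Value {
			return fmt.Errorf("agreement violated: %q vs %q", *chosen, d.Value)
		}
	}
	return nil
}

func main() {
	proposed := map[string]bool{"value-a": true, "value-b": true}
	decisions := []Decision{
		{NodeID: "n1", Decided: true, Value: "value-a"},
		{NodeID: "n2", Decided: true, Value: "value-a"},
		{NodeID: "n3", Decided: false},
	}
	fmt.Println(checkAgreementAndValidity(decisions, proposed)) // <nil>
}
```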

Paxos

Paxos, introduced by Leslie Lamport, was the first practical consensus algorithm proven correct. It tolerates crash failures (but not Byzantine failures) and preserves safety even in fully asynchronous networks, though it needs some timing assumptions to guarantee progress. Despite its theoretical elegance, Paxos is notoriously difficult to understand and implement correctly.

Paxos uses three roles: proposers suggest values, acceptors vote on proposals, and learners receive the decided value. A value is chosen when a majority of acceptors accept it. The protocol ensures that once a value is chosen, subsequent rounds propose the same value, achieving consistency.
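
To make the acceptor’s role concrete, here is a minimal single-decree sketch of an acceptor’s state machine (names and structure are illustrative; a real acceptor must also persist this state durably before replying):

```go
package paxos

// Proposal pairs a proposal number with a value.
type Proposal struct {
	N     int
	Value string
}

// Acceptor is a single-decree Paxos acceptor. This is a minimal sketch:
// a real acceptor must write its state to stable storage before answering.
type Acceptor struct {
	promisedN int       // highest proposal number promised so far
	accepted  *Proposal // highest-numbered proposal accepted, if any
}

// Prepare handles phase 1. If n is higher than any promise made so far,
// the acceptor promises to ignore lower-numbered proposals and returns the
// highest-numbered proposal it has already accepted (if any), forcing the
// proposer to re-propose that value.
func (a *Acceptor) Prepare(n int) (ok bool, prev *Proposal) {
	if n <= a.promisedN {
		return false, nil
	}
	a.promisedN = n
	return true, a.accepted
}

// Accept handles phase 2. The acceptor accepts the proposal unless it has
// already promised a higher proposal number. A value is chosen once a
// majority of acceptors accept it.
func (a *Acceptor) Accept(p Proposal) bool {
	if p.N < a.promisedN {
		return false
	}
	a.promisedN = p.N
	a.accepted = &p
	return true
}
```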

Multi-Paxos extends basic Paxos to reach consensus on a sequence of values efficiently. After electing a leader in one Paxos round, the leader can propose values directly until leadership is challenged, avoiding the full Paxos protocol overhead for each value.

Paxos-family implementations include Google’s Chubby lock service, which is built on Multi-Paxos, and Apache ZooKeeper, which uses ZAB (ZooKeeper Atomic Broadcast), a closely related but distinct protocol. Despite its complexity, Paxos proved that practical consensus was possible, inspiring subsequent algorithms.

Raft

Raft was designed to be understandable while providing the same guarantees as Paxos. It explicitly optimizes for clarity, making it easier to implement correctly. Raft has become popular due to its comprehensibility and practical performance.

Raft explicitly separates concerns: leader election, log replication, and safety. In steady state, one node is the leader and the others are followers. The leader handles all client requests, appending them to its log and replicating them to followers.

Leader Election uses randomized timeouts. Followers wait for heartbeats from the leader. If the timeout expires without a heartbeat, a follower becomes a candidate and requests votes. If it receives votes from a majority, it becomes the leader. Randomized timeouts make it unlikely that several followers become candidates at once, so split votes are rare and quickly resolved rather than recurring indefinitely.
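
The sketch below (illustrative names, not any particular implementation’s API) shows the core of the follower loop: a randomized election timeout that is reset on every heartbeat and triggers a candidacy when it fires:

```go
package raft

import (
	"math/rand"
	"time"
)

// electionTimeout returns a randomized timeout, e.g. 150-300ms as suggested
// in the Raft paper. Randomization makes it unlikely that two followers
// become candidates at the same moment and repeatedly split the vote.
func electionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}

// runFollower waits for heartbeats; if none arrives before the randomized
// timeout fires, the node becomes a candidate and requests votes.
// heartbeats and becomeCandidate are assumed to be provided by the caller.
func runFollower(heartbeats <-chan struct{}, becomeCandidate func()) {
	timer := time.NewTimer(electionTimeout())
	defer timer.Stop()
	for {
		select {
		case <-heartbeats:
			// Leader is alive: reset the election timer.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(electionTimeout())
		case <-timer.C:
			// No heartbeat within the timeout: start an election.
			becomeCandidate()
			return
		}
	}
}
```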

Log Replication has the leader append new entries to its log and replicate them to followers. Once a majority of followers confirm replication, the entry is committed, and the leader applies it to its state machine. This majority requirement ensures durability even if some nodes fail.
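
The commit rule boils down to counting replicas. The sketch below (illustrative field names, not a specific implementation) shows one way a leader could compute its commit index from the followers’ acknowledged match indexes; real Raft additionally restricts this rule to entries from the leader’s current term:

```go
package raft

import "sort"

// advanceCommitIndex returns the highest log index that is stored on a
// majority of the cluster. matchIndex[i] is the highest entry known to be
// replicated on peer i; the leader's own last index counts as well.
// Real Raft only commits entries from the leader's current term this way;
// that check is omitted here for brevity.
func advanceCommitIndex(leaderLastIndex int, matchIndex []int) int {
	indexes := append([]int{leaderLastIndex}, matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(indexes)))
	// With n nodes, the (n/2)-th largest index (0-based) is stored on at
	// least n/2 + 1 nodes, i.e. a majority, so it is safe to commit.
	return indexes[len(indexes)/2]
}
```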

Safety properties ensure that once a log entry is committed, all future leaders include that entry. Raft’s election rules only allow a candidate to win if its log is at least as up-to-date as the log of every node in the majority that voted for it, preventing a new leader from overwriting committed entries.

Raft is used in etcd (Kubernetes’ coordination service), HashiCorp Consul, and many distributed databases. Its clarity makes it a popular choice for systems requiring consensus.

Two-Phase Commit (2PC)

Two-phase commit coordinates transactions across multiple databases or services. It ensures atomicity: either all participants commit or all abort, preventing partial failures from leaving the system in an inconsistent state.

Phase 1: Prepare has the coordinator ask all participants if they can commit the transaction. Each participant checks if it can proceed and responds with yes (and locks resources) or no (and aborts locally).

Phase 2: Commit or Abort depends on responses. If all participants voted yes, the coordinator sends commit messages, and all participants commit. If any voted no or failed to respond, the coordinator sends abort messages.
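
Putting the two phases together, a stripped-down coordinator might look like the sketch below (the participant interface and names are hypothetical). A production coordinator must also durably log its commit-or-abort decision before phase 2, precisely because of the blocking problem discussed next:

```go
package twophase

// Participant is a hypothetical interface for a resource taking part in a
// distributed transaction.
type Participant interface {
	Prepare(txID string) (vote bool, err error) // phase 1: vote yes/no, locking resources on yes
	Commit(txID string) error                   // phase 2: make the changes permanent
	Abort(txID string) error                    // phase 2: release locks and undo
}

// runTwoPhaseCommit drives one transaction through both phases and reports
// whether it committed. A real coordinator must persist its decision before
// sending phase 2 messages so it can finish the protocol after a crash;
// otherwise participants block while holding locks.
func runTwoPhaseCommit(txID string, participants []Participant) bool {
	// Phase 1: ask every participant whether it can commit.
	for _, p := range participants {
		vote, err := p.Prepare(txID)
		if err != nil || !vote {
			// Any "no" vote or failure aborts the whole transaction.
			for _, q := range participants {
				_ = q.Abort(txID)
			}
			return false
		}
	}
	// Phase 2: all voted yes, so tell everyone to commit.
	for _, p := range participants {
		_ = p.Commit(txID)
	}
	return true
}
```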

2PC ensures atomicity but has significant drawbacks. It’s a blocking protocol: if the coordinator crashes between phases, participants remain blocked, holding locks and unable to proceed. This makes 2PC impractical for systems requiring high availability.

Three-Phase Commit attempts to address 2PC’s blocking problem by adding an intermediate phase, allowing participants to time out and make progress even if the coordinator fails. However, 3PC assumes bounded message delays and breaks down under network partitions, and its added complexity means it is rarely used in practice.

Byzantine Fault Tolerance

Byzantine consensus algorithms tolerate nodes behaving arbitrarily—sending conflicting messages, lying, or colluding. This is essential for systems without trust between participants, such as blockchains.

Practical Byzantine Fault Tolerance (PBFT) reaches consensus despite up to f Byzantine nodes among 3f+1 total nodes. Clients send requests to the primary, which assigns them a sequence order, and the replicas run a three-phase protocol (pre-prepare, prepare, commit) to agree on that order before executing the requests and replying to the client.
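
The fault-tolerance arithmetic is worth spelling out: with n = 3f + 1 replicas, any two quorums of 2f + 1 overlap in at least f + 1 replicas, at least one of which is honest. A tiny illustration (hypothetical helper names):

```go
package pbft

// maxByzantineFaults returns the largest f such that n >= 3f + 1, i.e. the
// number of arbitrarily faulty replicas an n-replica PBFT cluster tolerates.
func maxByzantineFaults(n int) int {
	return (n - 1) / 3
}

// quorumSize returns the number of matching messages (2f + 1) a replica or
// client must collect before trusting a result.
func quorumSize(n int) int {
	return 2*maxByzantineFaults(n) + 1
}

// Example: 4 replicas tolerate f = 1 and need quorums of 3;
//          7 replicas tolerate f = 2 and need quorums of 5.
```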

PBFT requires multiple rounds of all-to-all communication, making it expensive. Throughput is limited, and latency is high compared to non-Byzantine algorithms. This cost is acceptable in blockchain systems where Byzantine tolerance is essential but excessive for trusted environments.

Blockchain Consensus in systems like Bitcoin uses proof-of-work: nodes compete to solve cryptographic puzzles, with the winner proposing the next block. This achieves agreement without trusted parties, though finality is only probabilistic (a block becomes harder to revert as more blocks build on it), and it is extremely expensive in computation and energy.
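
As a rough sketch of the proof-of-work idea (toy difficulty measured in leading hex zeros; the function name and structure are illustrative):

```go
package pow

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"strings"
)

// mine searches for a nonce such that SHA-256(blockData || nonce) begins
// with `difficulty` hexadecimal zeros. This is a toy: real systems use far
// higher difficulty and adjust it so blocks arrive at a roughly steady rate.
func mine(blockData []byte, difficulty int) (nonce uint64, hash string) {
	target := strings.Repeat("0", difficulty)
	for nonce = 0; ; nonce++ {
		nonceBytes := make([]byte, 8)
		binary.BigEndian.PutUint64(nonceBytes, nonce)
		buf := append(append([]byte{}, blockData...), nonceBytes...)
		sum := sha256.Sum256(buf)
		hash = hex.EncodeToString(sum[:])
		if strings.HasPrefix(hash, target) {
			return nonce, hash
		}
	}
}
```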

Practical Considerations

Consensus algorithms trade throughput for consistency. Strong consistency requires coordination, limiting throughput to hundreds or thousands of operations per second. Systems needing higher throughput often use weaker consistency models.

Latency includes network round trips for coordination. Cross-datacenter consensus might take hundreds of milliseconds. Keep consensus-requiring operations on the critical path minimal, using asynchronous replication where strong consistency isn’t necessary.

Leader-based algorithms like Raft and Multi-Paxos provide better performance during normal operation but require leader re-election during failures, causing brief unavailability. Protocols without a stable leader, such as basic Paxos where any proposer can initiate a round, avoid this gap but pay higher per-operation overhead and can suffer contention between competing proposers.

Quorum sizes matter for both correctness and availability. Majority quorums (⌊n/2⌋ + 1 of n nodes) tolerate minority failures while maintaining consistency. Larger quorums strengthen consistency guarantees but reduce availability; smaller quorums increase availability but risk inconsistency.
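
A quick illustration of the majority arithmetic for crash (non-Byzantine) faults, with hypothetical helper names:

```go
package quorum

// majority returns the smallest quorum such that any two quorums overlap,
// i.e. floor(n/2) + 1 for an n-node cluster.
func majority(n int) int {
	return n/2 + 1
}

// faultsTolerated returns how many crash failures an n-node cluster can
// survive while still being able to form a majority.
func faultsTolerated(n int) int {
	return n - majority(n)
}

// Example: 3 nodes -> quorum of 2, tolerates 1 failure;
//          5 nodes -> quorum of 3, tolerates 2 failures.
```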

Implementation Challenges

Correctly implementing consensus algorithms is notoriously difficult. Subtle bugs can violate safety properties, leading to data loss or corruption. Use existing, battle-tested implementations (etcd, ZooKeeper, Consul) rather than building your own unless you have specific requirements and expertise.

Testing consensus implementations requires simulating network partitions, message delays, and node failures. Tools like Jepsen test distributed systems by inducing failures and checking for consistency violations. Many supposed consensus implementations fail under Jepsen testing.

Monitoring should track leader elections, consensus latency, quorum health, and any split-brain detections. High election frequency suggests network issues or improper timeout configuration. Extended consensus latency indicates overload or network problems.

When to Use Consensus

Use consensus algorithms when strong consistency is required: distributed transactions, leader election, distributed locking, or coordinating configuration changes. The cost in complexity and performance is justified when inconsistency causes correctness problems.

Avoid consensus when possible. Many systems overuse consensus, applying it where weaker consistency would suffice. Use event sourcing, CRDTs, or application-level conflict resolution when they meet requirements, reserving consensus for truly critical coordination.

Consensus algorithms are powerful tools for building reliable distributed systems. Understanding their properties, tradeoffs, and appropriate use cases enables designing systems that provide necessary consistency guarantees while maintaining acceptable performance and availability. The key is using consensus judiciously, only where strong coordination is truly required.