Database Replication: High Availability and Read Scaling

Database replication involves maintaining copies of a database across multiple servers, with changes to one copy automatically propagated to others. This fundamental technique serves two primary purposes: providing high availability through redundancy and enabling read scaling by distributing read queries across multiple servers.

Replication Models

Primary-Replica (Master-Slave) Replication is the most common model. One primary database accepts all writes, which are then replicated to one or more replica databases. Replicas serve read queries, distributing read load across multiple servers. This model is straightforward to implement and reason about, with a clear authority for data changes.

The primary handles write queries and maintains the authoritative data state. When data changes occur, they’re recorded in a replication log and sent to replicas. Replicas apply these changes to stay synchronized with the primary. If the primary fails, one replica can be promoted to become the new primary, ensuring continuity.
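
The log-shipping flow described above can be sketched in a few lines. This is a minimal in-memory illustration, not how any particular database implements it; the `Primary` and `Replica` classes and the `applied_index` field are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    data: dict = field(default_factory=dict)
    applied_index: int = 0  # number of log entries applied so far

    def apply(self, log: list) -> None:
        # Apply only the entries this replica has not seen yet, in log order.
        for key, value in log[self.applied_index:]:
            self.data[key] = value
        self.applied_index = len(log)


@dataclass
class Primary:
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # ordered replication log

    def write(self, key: str, value: str) -> None:
        # Update the authoritative state and record the change for replicas.
        self.data[key] = value
        self.log.append((key, value))


primary = Primary()
replica = Replica()
primary.write("user:1", "alice")
primary.write("user:1", "alicia")
replica.apply(primary.log)  # the replica catches up by replaying the log
```

Because the replica replays entries in the order the primary recorded them, it converges on the same state as the primary once it has applied the whole log.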

Primary-Primary (Master-Master) Replication allows multiple databases to accept writes simultaneously. This increases write capacity and allows active-active deployment across geographic regions. However, it introduces complexity in conflict resolution when different primaries receive conflicting updates to the same data.

Conflict resolution strategies include last-write-wins (using timestamps), application-specific merge logic, or maintaining both versions and requiring application-level resolution. Each approach has tradeoffs between consistency, complexity, and data loss risk.
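
As an illustration of the simplest of these strategies, here is a minimal sketch of last-write-wins. The `Versioned` type and the use of a node identifier as a tie-breaker are assumptions made for the example, and the approach relies on reasonably synchronized clocks.

```python
from typing import NamedTuple


class Versioned(NamedTuple):
    value: str
    timestamp: float  # wall-clock write time, assumed comparable across nodes
    node_id: str      # tie-breaker when timestamps are equal


def last_write_wins(a: Versioned, b: Versioned) -> Versioned:
    # Compare (timestamp, node_id) so every node picks the same winner.
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))


us = Versioned("alice@example.com", 1700000000.120, "us-east")
eu = Versioned("alice@example.org", 1700000000.450, "eu-west")
winner = last_write_wins(us, eu)  # the later write, from eu-west
```

The losing write is discarded without notice, which is the data loss risk mentioned above.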

Chain Replication (often called cascading replication) has replicas connected in a chain, where each replica receives updates from its upstream neighbor and forwards them downstream. This reduces load on the primary, which only has to feed the first replica, but increases replication lag for replicas further down the chain, since every hop adds its own delay.
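
To make the lag effect concrete, the following sketch sums per-hop delays along a chain. The delay values are hypothetical, and the model assumes each hop simply adds its delay to the one before it.

```python
from itertools import accumulate

# Hypothetical delay (in ms) added by each hop: primary -> r1 -> r2 -> r3.
hop_delays_ms = [5, 8, 12]

# Lag seen at each replica is the sum of all upstream hop delays.
lag_ms = list(accumulate(hop_delays_ms))
```

Under these assumptions the first replica is 5 ms behind the primary, while the third is 25 ms behind.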

Replication Methods

Synchronous Replication waits for replicas to acknowledge receiving and committing changes before confirming the write to the client. Every confirmed write therefore already exists on the replicas, which provides strong consistency and means a failover loses no confirmed data. However, it slows writes, since each one must wait for a network round trip and replica processing, and a slow or unreachable replica can stall writes altogether.

Asynchronous Replication sends changes to replicas but doesn’t wait for acknowledgment before confirming the write. This provides better write performance but introduces replication lag: replicas might be seconds or minutes behind the primary. If the primary fails before replicas catch up, recent changes can be lost.

Semi-Synchronous Replication waits for at least one replica to acknowledge changes before confirming the write, balancing consistency and performance. This ensures at least one copy exists beyond the primary while maintaining reasonable write performance.
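
The difference between the three methods comes down to how many replica acknowledgments a write waits for. The sketch below models that under simplifying assumptions: acknowledgments arrive in parallel, and the names `Mode`, `required_acks`, and `commit_latency_ms` are hypothetical. Real systems offer finer-grained settings than this.

```python
from enum import Enum


class Mode(Enum):
    SYNC = "sync"
    SEMI_SYNC = "semi_sync"
    ASYNC = "async"


def required_acks(mode: Mode, replica_count: int) -> int:
    # Number of replica acknowledgments needed before confirming a write.
    if mode is Mode.SYNC:
        return replica_count
    if mode is Mode.SEMI_SYNC:
        return min(1, replica_count)
    return 0


def commit_latency_ms(mode: Mode, local_ms: float, replica_ack_ms: list) -> float:
    # Acks arrive in parallel, so the write waits for the k-th fastest one.
    needed = required_acks(mode, len(replica_ack_ms))
    if needed == 0:
        return local_ms
    return local_ms + sorted(replica_ack_ms)[needed - 1]


acks = [4.0, 90.0, 15.0]  # hypothetical per-replica acknowledgment times
latencies = {mode: commit_latency_ms(mode, 2.0, acks) for mode in Mode}
```

With these example numbers, the synchronous write is held up by the slowest replica, the semi-synchronous write only by the fastest one, and the asynchronous write by none.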

Replication Lag and Consistency

Replication lag is the delay between a write on the primary and its appearance on replicas. This introduces read-your-writes consistency issues: a user might write data and immediately query a replica that hasn’t received the update yet, making it appear their write was lost.

Applications can handle this in several ways. Session affinity routes a user’s reads to the same replica or to the primary after they perform a write. Read-after-write consistency directs reads to the primary for recently modified data. Monotonic reads ensure a user doesn’t see older data after seeing newer data by consistently routing to the same replica.
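
One way to approximate read-after-write consistency is to send a user's reads to the primary for a short window after that user writes. The sketch below assumes a hypothetical `ReadRouter` class and a fixed `pin_seconds` window, which should be chosen to exceed the expected replication lag.

```python
import time


class ReadRouter:
    def __init__(self, pin_seconds: float = 5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds  # how long reads stay on the primary
        self.clock = clock              # injectable so the logic can be tested
        self._last_write = {}           # user_id -> time of most recent write

    def record_write(self, user_id: str) -> None:
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id: str) -> str:
        # Users who wrote recently read from the primary; others use a replica.
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"
        return "replica"


router = ReadRouter(pin_seconds=5.0)
router.record_write("user-42")
target = router.target_for_read("user-42")  # "primary" while the window is open
```

A time window is only a heuristic. If lag exceeds `pin_seconds`, the user can still read stale data, so some systems track replication positions instead.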

High Availability through Replication

Replication provides fault tolerance. If the primary database fails, a replica can be promoted to primary, typically with minimal downtime. Automated failover systems monitor primary health and trigger promotion when failures are detected.

Failover challenges include detecting failures reliably (avoiding false positives from network issues), ensuring promoted replicas are sufficiently up-to-date, updating clients and application servers to point to the new primary, and handling the old primary when it recovers to prevent split-brain scenarios.
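
Two of these challenges, picking a sufficiently up-to-date replica and keeping a recovered old primary from accepting writes, can be sketched as follows. The `ReplicaStatus` fields and the epoch counter used for fencing are assumptions for illustration; production failover tools use their own mechanisms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplicaStatus:
    name: str
    healthy: bool
    applied_position: int  # last replication log position applied


def choose_new_primary(replicas: list) -> Optional[ReplicaStatus]:
    # Promote the healthy replica that has applied the most of the log.
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.applied_position)


def accept_write(node_epoch: int, cluster_epoch: int) -> bool:
    # The cluster epoch is bumped on every promotion. A node still holding
    # an older epoch is a stale primary and must refuse writes.
    return node_epoch == cluster_epoch


replicas = [
    ReplicaStatus("replica-a", healthy=True, applied_position=1040),
    ReplicaStatus("replica-b", healthy=True, applied_position=1057),
    ReplicaStatus("replica-c", healthy=False, applied_position=1060),
]
promoted = choose_new_primary(replicas)  # replica-b: healthy and furthest ahead
```

Note that `replica-c` has the most data but is skipped because it is unhealthy, so the writes between positions 1057 and 1060 would be lost in this scenario.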

Read Scaling

By distributing read queries across replicas, replication dramatically increases read capacity. A system with one primary and four replicas can handle up to roughly five times the read load of a single server if the primary also serves reads, or four times if reads go only to the replicas. This is particularly valuable for read-heavy applications like content sites, analytics dashboards, or reporting systems.
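
The capacity arithmetic can be written out as a small helper. The per-server throughput figure is hypothetical, and the model assumes identical servers and perfectly even load distribution, so treat the result as an upper bound.

```python
def read_capacity(per_server_qps: int, replicas: int, primary_serves_reads: bool) -> int:
    # Count the primary as a read server only if it is allowed to serve reads.
    servers = replicas + (1 if primary_serves_reads else 0)
    return per_server_qps * servers


with_primary = read_capacity(10_000, replicas=4, primary_serves_reads=True)
replicas_only = read_capacity(10_000, replicas=4, primary_serves_reads=False)
```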

Load balancers distribute read queries across replicas, often with health checking to avoid sending queries to lagging or unhealthy replicas. Different read workloads can target different replicas: real-time application queries to low-lag replicas, heavy analytics queries to dedicated reporting replicas.
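
A lag-aware selection step might look like the sketch below. The `ReplicaInfo` fields and the `max_lag_seconds` threshold are hypothetical; a real load balancer would obtain them from its own health checks.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplicaInfo:
    name: str
    healthy: bool
    lag_seconds: float


def eligible_replicas(replicas: list, max_lag_seconds: float) -> list:
    # Keep only replicas that are healthy and within the lag budget.
    return [r for r in replicas if r.healthy and r.lag_seconds <= max_lag_seconds]


def pick_replica(replicas: list, max_lag_seconds: float, rng=random) -> Optional[ReplicaInfo]:
    pool = eligible_replicas(replicas, max_lag_seconds)
    if not pool:
        return None  # the caller may fall back to the primary
    return rng.choice(pool)


fleet = [
    ReplicaInfo("replica-1", healthy=True, lag_seconds=0.2),
    ReplicaInfo("replica-2", healthy=True, lag_seconds=45.0),
    ReplicaInfo("replica-3", healthy=False, lag_seconds=0.1),
]
choice = pick_replica(fleet, max_lag_seconds=1.0)
```

Calling `pick_replica` with a looser threshold, such as several minutes, would suit heavy analytics queries that tolerate stale data.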

Geographic Distribution

Replication enables geo-distributed deployments. Replicas in different geographic regions reduce latency for users worldwide, with each user’s reads served from a nearby replica. This also provides disaster recovery capability, with replicas in different data centers or regions protecting against regional outages.
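
Routing a read to a nearby replica can be as simple as choosing the region with the lowest measured latency. The region names and latency figures below are hypothetical.

```python
def nearest_replica(latency_ms_by_region: dict) -> str:
    # Return the region whose replica has the lowest measured latency.
    return min(latency_ms_by_region, key=latency_ms_by_region.get)


measured = {"us-east": 12.0, "eu-west": 85.0, "ap-south": 210.0}
region = nearest_replica(measured)
```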

The challenge is increased replication lag due to inter-region network latency. A write in the US might take 200ms to replicate to a European replica. Applications must be designed to handle this lag gracefully.

Implementation Considerations

Most modern databases provide built-in replication: PostgreSQL has streaming replication, MySQL has binary log replication, and MongoDB has replica sets. Each has specific configuration options, tuning parameters, and operational characteristics.

Monitor replication lag, replica health, and failover behavior. Test failover procedures regularly to ensure they work when needed. Consider replica capacity carefully: when the primary fails, the promoted replica must take over the entire write load, so it needs resources comparable to the primary’s.
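
A lag monitor typically maps a measured lag to an alert level. The following is a minimal sketch; the function name and the default thresholds are assumptions and should be tuned to the application's tolerance for stale reads.

```python
from typing import Optional


def classify_lag(
    lag_seconds: Optional[float],
    warn_after: float = 5.0,
    critical_after: float = 30.0,
) -> str:
    # None means the replica did not report a lag value at all.
    if lag_seconds is None:
        return "unknown"
    if lag_seconds >= critical_after:
        return "critical"
    if lag_seconds >= warn_after:
        return "warning"
    return "ok"


status = classify_lag(12.5)
```

Treating a missing measurement as its own state matters, since a replica that has stopped reporting is often in worse shape than one that is merely behind.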

Replication is fundamental to production database deployment, providing both high availability and scalability. Understanding its characteristics, limitations, and operational requirements is essential for building reliable systems. While it introduces complexity, the benefits of redundancy and read scaling make replication nearly mandatory for any serious production deployment.