System Design Guide

Distributed Transactions: Maintaining Consistency Across Services

Distributed transactions coordinate operations across multiple databases or services, ensuring atomicity: either all operations succeed or all fail, preventing partial updates that leave the system in an inconsistent state. While conceptually straightforward, distributed transactions introduce significant complexity and performance challenges in practice.

The Challenge

In a monolithic application with a single database, transactions are straightforward. The database’s ACID properties ensure atomicity, consistency, isolation, and durability. Wrapping multiple operations in a transaction guarantees they either all commit or all roll back.

Distributed systems fragment data across multiple databases or services. An e-commerce order might update the orders service, decrement inventory in the inventory service, and charge the payment service. These operations must all succeed or all fail—partial success leaves the system inconsistent, such as charging customers for unavailable products.

Achieving transaction semantics across distributed components is fundamentally more difficult than within a single database. Network failures, service crashes, and timing issues create scenarios that don’t exist in single-database transactions.

Two-Phase Commit Protocol

Two-phase commit (2PC) is the traditional approach to distributed transactions. A coordinator orchestrates the transaction across participants, ensuring atomicity through a two-phase protocol.

Prepare Phase: The coordinator asks all participants to prepare the transaction. Each participant performs its portion of the work, locks the affected resources, and writes to a transaction log but doesn’t commit. Participants respond with vote-commit (if prepared successfully) or vote-abort (if unable to proceed).

Commit Phase: If all participants voted commit, the coordinator sends commit messages to all. Each participant commits the transaction and releases locks. If any participant voted abort or failed to respond, the coordinator sends abort messages, and all participants roll back.
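
To make the protocol concrete, here is a minimal coordinator sketch in Python. It assumes each participant exposes prepare, commit, and abort operations; those names are illustrative placeholders, not the API of any particular framework.

    # Two-phase commit coordinator sketch (illustrative, not a real library's API).
    # Assumes each participant exposes prepare(), commit(), and abort().
    class TwoPhaseCommitCoordinator:
        def __init__(self, participants):
            self.participants = participants

        def execute(self, transaction_id):
            # Phase 1: ask every participant to prepare (do the work, lock
            # resources, write the transaction log) and collect votes.
            prepared = []
            for p in self.participants:
                try:
                    if not p.prepare(transaction_id):   # participant votes abort
                        self._abort(transaction_id, prepared)
                        return False
                    prepared.append(p)
                except Exception:                       # timeout or crash counts as a no-vote
                    self._abort(transaction_id, prepared)
                    return False

            # Phase 2: everyone voted commit, so tell all participants to
            # commit and release their locks.
            for p in self.participants:
                p.commit(transaction_id)
            return True

        def _abort(self, transaction_id, prepared):
            # Roll back only the participants that had already prepared.
            for p in prepared:
                p.abort(transaction_id)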

2PC ensures atomicity but has severe limitations. It’s a blocking protocol: if the coordinator fails after the prepare phase, participants remain blocked, holding locks indefinitely. Those held locks block other transactions that touch the same data, and the system cannot make progress until the coordinator recovers.

Performance suffers from multiple synchronous network round trips. Each participant must wait for coordination before proceeding, and locks are held across network delays. This dramatically reduces throughput compared to local transactions.

Availability decreases since a transaction fails if any participant is unavailable. Distributed transactions are only as available as their least available component.

Saga Pattern

Sagas provide an alternative to distributed transactions by breaking a large operation into a sequence of local transactions, each with a compensating transaction that undoes its effects.

For an order, the saga might: (1) create an order record, (2) reserve inventory, (3) process payment. If any step fails, compensating transactions undo the completed steps in reverse order: release the inventory reservation, then cancel the order. The system eventually reaches a consistent state, though it is temporarily inconsistent while the saga executes.

Choreography-based sagas use events to coordinate. Each service listens for events, performs its local transaction, and publishes an event for the next service. There’s no central coordinator; the saga flow emerges from service interactions.
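
As a rough sketch of the choreography style in Python, an inventory service might subscribe to an order event, run its local transaction, and publish the next event; the event names and the bus and database interfaces below are invented for illustration.

    # Choreography sketch: the inventory service reacts to an "OrderCreated"
    # event, runs its local transaction, and publishes the next event.
    # Event names and the bus/database interfaces are hypothetical.
    def handle_order_created(event, inventory_db, bus):
        order_id = event["order_id"]
        sku, quantity = event["sku"], event["quantity"]

        if inventory_db.reserve(sku, quantity):        # local transaction only
            bus.publish("InventoryReserved", {"order_id": order_id})
        else:
            # There is no central coordinator: failure is reported as another
            # event, which the order service consumes to cancel the order.
            bus.publish("InventoryReservationFailed", {"order_id": order_id})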

Orchestration-based sagas use a coordinator that explicitly directs each step. The coordinator calls services sequentially, tracking progress and triggering compensations on failures. This provides better visibility and control but introduces a single point of failure.
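
A sketch of an orchestrator for the order saga above might pair each step with its compensation and run compensations in reverse order on failure; the service calls here are placeholders, not a specific framework’s API.

    # Orchestration sketch: run each local transaction in order and, on
    # failure, run the compensations of the completed steps in reverse.
    # The service functions (create_order, cancel_order, ...) are placeholders.
    def run_order_saga(order, services):
        steps = [
            (services.create_order,      services.cancel_order),
            (services.reserve_inventory, services.release_inventory),
            (services.charge_payment,    services.refund_payment),
        ]
        completed = []
        for action, compensation in steps:
            try:
                action(order)
                completed.append(compensation)
            except Exception:
                # Undo the previously completed steps in reverse order.
                for compensate in reversed(completed):
                    compensate(order)
                return False
        return True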

Sagas accept temporary inconsistency in exchange for better availability and performance. Users might briefly see orders pending payment, but the system eventually reconciles to consistency. For many business processes, this tradeoff is acceptable.

Event Sourcing and CQRS

Event sourcing stores state changes as immutable events rather than current state. To rebuild state, replay events from the beginning. This naturally supports distributed consistency: services consume events and update their state asynchronously.
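
A minimal Python sketch of the replay idea, using an invented account event stream: the current balance is never stored directly, only derived from the log.

    # Event-sourcing sketch: state is derived by folding the immutable event
    # log from the beginning. The event shapes are invented for illustration.
    def rebuild_balance(events):
        balance = 0
        for event in events:
            if event["type"] == "Deposited":
                balance += event["amount"]
            elif event["type"] == "Withdrawn":
                balance -= event["amount"]
        return balance

    events = [
        {"type": "Deposited", "amount": 100},
        {"type": "Withdrawn", "amount": 30},
    ]
    print(rebuild_balance(events))   # 70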

Combined with CQRS (Command Query Responsibility Segregation), writes generate events consumed by read models optimized for queries. The write and read paths are separated, allowing them to be eventually consistent with each other while the event log remains the single, consistent source of truth.
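
Continuing the same sketch, a read-model projection consumes those events asynchronously and keeps a query-optimized view; the structures here are again illustrative.

    # CQRS sketch: the write side appends events; a projection consumes them
    # asynchronously and maintains a denormalized read model for queries.
    # The event shapes and the in-memory read model are illustrative.
    class AccountBalanceProjection:
        def __init__(self):
            self.balances = {}   # read model: account_id -> current balance

        def apply(self, event):
            account = event["account_id"]
            if event["type"] == "Deposited":
                self.balances[account] = self.balances.get(account, 0) + event["amount"]
            elif event["type"] == "Withdrawn":
                self.balances[account] = self.balances.get(account, 0) - event["amount"]

        def balance_of(self, account_id):
            # Queries hit the read model, which may lag the event log slightly.
            return self.balances.get(account_id, 0)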

Distributed transactions aren’t needed since each service maintains its own state derived from the event log. Different services might temporarily have inconsistent views, but they eventually converge by processing the same events.

This approach works well for event-driven systems where eventual consistency is acceptable and audit trails are valuable. The downside is increased complexity in application logic and reasoning about eventual consistency.

Eventual Consistency

Many distributed systems avoid distributed transactions entirely, accepting eventual consistency. Operations succeed locally without coordination, and the system uses conflict resolution or CRDTs (Conflict-free Replicated Data Types) to eventually reach consistency.

Shopping carts are a classic example: adding an item updates a local replica without any distributed transaction. Different replicas might temporarily disagree, but merging them (using set union) produces a consistent cart. Some business rules are relaxed: if two users add the last item simultaneously, both succeed, and the system temporarily oversells.
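
A rough Python sketch of the merge, assuming each replica keeps a simple set of item identifiers; a real CRDT cart (for example, an OR-Set) would also have to handle removals, which this ignores.

    # Eventual-consistency sketch: two replicas accept writes independently
    # and are later merged with set union. Removals are ignored here; a real
    # CRDT such as an OR-Set handles them explicitly.
    replica_a = {"book-42", "mug-7"}           # items added through one replica
    replica_b = {"book-42", "headphones-3"}    # items added through another

    merged_cart = replica_a | replica_b        # union is commutative, associative,
                                               # and idempotent, so merge order doesn't matter
    print(sorted(merged_cart))                 # ['book-42', 'headphones-3', 'mug-7']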

This requires application-level conflict handling but provides better availability and performance than distributed transactions. The key is identifying which consistency requirements are business-critical versus which can be eventually consistent.

Idempotency and Retries

Distributed operations often fail ambiguously: the caller can’t tell whether the remote operation succeeded and only the response was lost, or whether it never executed at all. Retrying might duplicate the operation, violating correctness.

Idempotent operations produce the same result regardless of how many times they execute. Setting a value is idempotent; incrementing isn’t. Designing operations to be idempotent allows safe retries without distributed transactions.

Idempotency keys make non-idempotent operations idempotent by tracking operation identifiers. If a retry uses the same key as a completed operation, the system recognizes it as a duplicate and returns the original result without re-executing.
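
A minimal sketch of key-based deduplication on the server side; the in-memory store and the charge function are placeholders, and a production system would use durable storage and handle concurrent retries atomically.

    # Idempotency-key sketch: the first request with a given key executes and
    # its result is stored; a retry with the same key returns the stored
    # result instead of re-executing. The store and charge function are placeholders.
    processed = {}   # idempotency_key -> result

    def charge_payment(idempotency_key, amount, execute_charge):
        if idempotency_key in processed:
            return processed[idempotency_key]    # duplicate: return the original result
        result = execute_charge(amount)          # performed exactly once per key
        processed[idempotency_key] = result
        return result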

Payment systems extensively use idempotency keys. Retrying a payment with the same idempotency key charges once, not multiple times, even if the original response was lost.

Choosing an Approach

For critical operations requiring strong atomicity (financial transactions, inventory allocation), distributed transactions or sagas with careful compensation logic are necessary. Accept the performance and complexity costs in exchange for correctness.

For less critical operations, eventual consistency with conflict resolution often suffices. User preference updates, social media interactions, and analytics don’t need distributed transactions.

Many systems use a mix: strong consistency for critical paths (checkout, payment) and eventual consistency elsewhere (product catalogs, user profiles). This balances consistency requirements with performance and availability.

Best Practices

Minimize distributed transaction scope. Fewer participants mean better performance and availability. Consider whether operations truly need to be atomic or if eventual consistency suffices.

Design for idempotency from the beginning. This simplifies retry logic and recovery from partial failures. Use unique identifiers for operations and track completion.

Implement compensating transactions carefully in saga patterns. Compensation doesn’t always mean simple rollback; it might mean credit refunds, notification emails, or other business logic. Test compensation paths thoroughly.

Monitor transaction success rates, latency, and failure modes. High failure rates indicate availability issues or improper timeout configuration. Extended latency suggests network problems or overloaded participants.

Use existing frameworks when possible. Implementing distributed transaction protocols correctly is difficult. Libraries and frameworks like Seata, Eventuate, or orchestration platforms provide tested implementations.

Distributed transactions represent fundamental tradeoffs between consistency, performance, and availability. Understanding these tradeoffs and the available patterns enables designing systems that provide appropriate consistency guarantees for each use case while maintaining acceptable performance and reliability. The goal isn’t always perfect ACID transactions, but rather the right consistency model for your specific requirements.