System Design Guide

Apache Kafka: Distributed Event Streaming Platform

Apache Kafka is a distributed event streaming platform designed for high-throughput, fault-tolerant message processing. Originally developed at LinkedIn, Kafka has become the de facto standard for building real-time data pipelines and streaming applications. Its unique architecture based on append-only logs provides capabilities beyond traditional message queues.

Architecture and Concepts

Topics organize messages by category, similar to database tables or folders. A “user-events” topic contains all user-related events. Topics are durable, replicated, and partitioned for scalability and fault tolerance.

Partitions divide topics into parallel streams. Each partition is an ordered, immutable sequence of messages continuously appended to—a structured commit log. Partitioning enables horizontal scaling: different partitions can reside on different brokers and be consumed in parallel.

Brokers are Kafka servers that store data and serve clients. A Kafka cluster consists of multiple brokers distributing partitions across themselves. Brokers handle client requests for publishing and consuming messages.

Producers publish messages to topics, optionally specifying which partition receives each message. Producers can send messages synchronously (waiting for acknowledgment) or asynchronously (fire-and-forget) based on throughput and reliability requirements.
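
A minimal producer sketch in Java (the broker address, topic name, key, and value are illustrative), showing both a synchronous and an asynchronous send:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class UserEventProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("user-events", "user-42", "signed-up");

                // Synchronous: block until the broker acknowledges the write.
                producer.send(record).get();

                // Asynchronous: hand the record off and react to the result in a callback.
                producer.send(record, (metadata, exception) -> {
                    if (exception != null) exception.printStackTrace();
                });
            }
        }
    }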

Consumers read messages from topics, tracking their position (offset) in each partition. Unlike traditional queues where messages are deleted after consumption, Kafka retains messages for a configured retention period, allowing multiple consumers to read independently.

Consumer Groups enable load balancing. Multiple consumers in a group divide partition consumption among themselves. Each partition is consumed by exactly one consumer in the group, providing parallel processing while maintaining partition ordering.
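
A minimal consumer sketch, assuming the same broker and topic; every consumer started with the same group.id joins the same group and shares the topic’s partitions:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class UserEventConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "user-events-processors");  // consumers sharing this id split the partitions
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit manually after processing

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("user-events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                    consumer.commitSync(); // record progress once the batch has been processed
                }
            }
        }
    }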

Key Features

High Throughput: Kafka handles millions of messages per second across thousands of topics and partitions. Its design optimizes for sequential disk I/O and efficient batching, achieving throughput that rivals in-memory systems despite using disk storage.

Message Persistence: Messages are persisted to disk and replicated across brokers. This durability ensures messages aren’t lost even if consumers are slow or offline. Retention is configurable: keep messages for days, weeks, or indefinitely.

Replayability: Since Kafka retains messages, consumers can reset their position and reprocess historical data. This is powerful for recovering from errors, reprocessing with new logic, or creating new consumer applications from existing data streams.
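
Continuing the consumer sketch above (the topic and partition number are illustrative), rewinding is a matter of seeking to an earlier offset:

    import java.util.List;
    import org.apache.kafka.common.TopicPartition;

    // Take explicit ownership of partition 0 of "user-events" and rewind to its start.
    TopicPartition partition = new TopicPartition("user-events", 0);
    consumer.assign(List.of(partition));
    consumer.seekToBeginning(List.of(partition)); // or consumer.seek(partition, someEarlierOffset)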

Scalability: Add more brokers to scale storage and throughput. Add more partitions to increase parallelism. Add more consumers to a group (up to the partition count) to parallelize consumption, or add new consumer groups to process the same messages independently. Kafka scales horizontally without downtime.

Fault Tolerance: Partition replicas ensure data isn’t lost when brokers fail. If a broker dies, leadership for its partitions moves to in-sync replicas on other brokers. Kafka handles this failover automatically, without data loss and with minimal disruption.

Producer Semantics

Acknowledgment Levels control durability guarantees. acks=0 never waits for acknowledgment (fastest, least reliable). acks=1 waits for the leader replica to acknowledge (balanced). acks=all waits for all in-sync replicas to acknowledge (slowest, most reliable).

Idempotent Producers prevent duplicate messages when retries occur. Enable idempotence and Kafka deduplicates retried messages automatically using per-partition sequence numbers, giving exactly-once delivery for publishing within a producer session.
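
A sketch of the corresponding producer settings, added to the Properties from the producer example above (values are illustrative):

    // Durability vs. latency: acks=0 (no wait), acks=1 (leader only), acks=all (all in-sync replicas).
    props.put(ProducerConfig.ACKS_CONFIG, "all");

    // Idempotence: the broker uses per-partition sequence numbers to drop duplicates caused by retries.
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");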

Transactions coordinate messages across multiple topics atomically. All messages in a transaction are visible together or not at all. This supports complex workflows requiring atomic multi-topic updates.
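
A hedged sketch of an atomic write to two topics (the topic names, keys, and transactional id are assumptions), again building on the producer Properties above:

    props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-service-tx-1"); // must be unique per producer instance

    KafkaProducer<String, String> producer = new KafkaProducer<>(props);
    producer.initTransactions();
    try {
        producer.beginTransaction();
        producer.send(new ProducerRecord<>("orders", "order-7", "created"));
        producer.send(new ProducerRecord<>("order-audit", "order-7", "created"));
        producer.commitTransaction(); // both records become visible together
    } catch (Exception e) {
        producer.abortTransaction();  // neither record is visible to read_committed consumers
    }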

Partitioning Strategies determine which partition receives each message. The default partitioner hashes the message key, so records with the same key consistently land on the same partition. Custom partitioners enable specialized routing logic (see the sketch below). Messages without a key are spread across partitions: older clients use round-robin, while newer clients use a sticky partitioner that fills a batch for one partition before moving to the next.
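
A minimal sketch of a custom partitioner; the class name and the routing rule (pinning “vip-” keys to partition 0) are hypothetical. It is registered with the partitioner.class producer setting, e.g. props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, VipAwarePartitioner.class.getName()).

    import java.util.Map;
    import java.util.concurrent.ThreadLocalRandom;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.utils.Utils;

    public class VipAwarePartitioner implements Partitioner {
        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            if (keyBytes == null) {
                return ThreadLocalRandom.current().nextInt(numPartitions); // simple fallback for keyless records
            }
            if (key.toString().startsWith("vip-")) {
                return 0; // hypothetical rule: route VIP traffic to a dedicated partition
            }
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions; // default-style key hashing
        }

        @Override public void configure(Map<String, ?> configs) {}
        @Override public void close() {}
    }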

Consumer Semantics

Offset Management tracks consumer position in each partition. Kafka stores committed offsets in an internal topic, allowing consumers to resume where they left off after restarts. Consumers typically commit offsets after processing messages.

Consumer Groups provide scalability and fault tolerance. Adding consumers to a group distributes partitions among them. Removing consumers triggers rebalancing, reassigning partitions to remaining consumers.

Rebalancing redistributes partition assignments when consumers join or leave. During rebalancing, the group briefly pauses consumption to coordinate the new assignment. Frequent rebalancing impacts throughput, so minimize consumer changes.
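
A short sketch of hooking into rebalances with a ConsumerRebalanceListener (continuing the consumer example above), useful for committing progress before partitions are handed to another consumer:

    import java.util.Collection;
    import java.util.List;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.common.TopicPartition;

    consumer.subscribe(List.of("user-events"), new ConsumerRebalanceListener() {
        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            consumer.commitSync(); // commit progress before these partitions move elsewhere
        }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            // Re-initialize any per-partition state here.
        }
    });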

Exactly-Once Processing combines idempotent producers, transactions, and offsets committed within the transaction. A pipeline that reads from Kafka, processes, and writes back to Kafka can then be exactly-once: despite failures, the output records and the consumed offsets are either committed together or discarded together.
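
A hedged sketch of that read-process-write loop, reusing the producer and consumer (and their imports) from the earlier examples; the output topic and the transform method are assumptions. The consumer should run with isolation.level=read_committed and the producer with a transactional.id:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    producer.initTransactions();
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        if (records.isEmpty()) continue;

        producer.beginTransaction();
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (ConsumerRecord<String, String> record : records) {
            // transform(...) is a placeholder for application logic.
            producer.send(new ProducerRecord<>("output-topic", record.key(), transform(record.value())));
            offsets.put(new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
        }
        // The consumed offsets commit inside the same transaction as the produced records.
        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
        producer.commitTransaction();
    }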

Performance Optimization

Batching groups messages for efficient transmission and storage. Larger batches improve throughput at the cost of latency. Tune batch size and linger time based on throughput versus latency requirements.

Compression reduces network and storage overhead. Kafka supports gzip, snappy, lz4, and zstd compression. Producers compress batches, and consumers decompress automatically. The right codec balances CPU usage and compression ratio.
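
Illustrative producer settings for these two trade-offs (the values are starting points, not recommendations):

    // Batching: wait up to 10 ms to fill batches of up to 64 KB per partition.
    props.put(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(64 * 1024));
    props.put(ProducerConfig.LINGER_MS_CONFIG, "10");

    // Compression: applied per batch by the producer, undone transparently by consumers.
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");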

Partition Count affects parallelism. More partitions enable more concurrent consumers and higher throughput but increase overhead. Start with dozens of partitions per topic and add more as needed.

Replication Factor balances durability and resource usage. Factor of 3 provides good fault tolerance—data survives two broker failures. Higher factors increase durability at the cost of storage and network bandwidth.
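
A sketch of creating a topic with an explicit partition count and replication factor via the admin client (the topic name and numbers are illustrative):

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

    try (AdminClient admin = AdminClient.create(props)) {
        // 12 partitions for consumer parallelism; 3 replicas so data survives two broker failures.
        NewTopic topic = new NewTopic("user-events", 12, (short) 3);
        admin.createTopics(List.of(topic)).all().get();
    }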

Use Cases

Log Aggregation: Kafka collects logs from multiple services, centralizing them for analysis, alerting, or storage. Its high throughput handles vast log volumes efficiently.

Stream Processing: Applications process continuous streams of data—analytics, monitoring, real-time ML inference. Kafka Streams and ksqlDB provide stream processing frameworks built on Kafka.

Event Sourcing: Store application state as a sequence of events in Kafka topics. Rebuild state by replaying events, providing complete audit trails and time-travel debugging.

Metrics Collection: Systems publish metrics to Kafka for aggregation, storage in time-series databases, or real-time dashboards. Kafka handles high-frequency metrics without overwhelming downstream systems.

Change Data Capture: Database changes are published to Kafka topics, enabling other systems to react to data changes. This supports eventual consistency, derived views, and cross-system synchronization.

Kafka Ecosystem

Kafka Connect provides a framework for integrating external systems with Kafka. Connectors for databases, cloud storage, search engines, and more enable building data pipelines with minimal code.

Kafka Streams is a Java library for building stream processing applications. It provides high-level operations like filtering, mapping, joining, and aggregating event streams without requiring separate processing clusters.
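
A minimal Kafka Streams sketch (the topic names and the filtering rule are assumptions) that reads one topic, keeps a subset of events, transforms them, and writes them to another topic:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-events-filter"); // doubles as the consumer group id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> events = builder.stream("user-events");
    events.filter((key, value) -> value.contains("purchase")) // illustrative rule: keep purchase events only
          .mapValues(String::toUpperCase)
          .to("purchase-events");

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();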

ksqlDB enables SQL queries on Kafka streams. Write stream processing logic using familiar SQL syntax, lowering the barrier for building real-time applications.
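
For instance, a hedged sketch in ksqlDB’s SQL dialect (the stream, column, and topic names are assumptions): declare a stream over an existing topic, then maintain a continuously updated count per user:

    CREATE STREAM user_events (user_id VARCHAR, event_type VARCHAR)
        WITH (KAFKA_TOPIC = 'user-events', VALUE_FORMAT = 'JSON');

    CREATE TABLE events_per_user AS
        SELECT user_id, COUNT(*) AS event_count
        FROM user_events
        GROUP BY user_id
        EMIT CHANGES;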

Schema Registry manages message schemas, ensuring producers and consumers agree on data structure. It supports schema evolution with compatibility checking, preventing breaking changes.

Operational Considerations

Monitoring tracks broker health, partition lag, throughput, and consumer group status. Key metrics include bytes in/out per second, partition count, under-replicated partitions, and consumer lag.

Capacity Planning requires considering message rate, message size, retention period, and replication factor. Storage grows with retention time and replication. Network bandwidth must handle producer and consumer traffic plus replication.

Upgrading Kafka clusters requires careful orchestration. Rolling upgrades minimize downtime, but thorough testing in a staging environment before upgrading production is essential.

Security includes encryption (TLS), authentication (SASL), and authorization (ACLs). Configure appropriately based on security requirements and deployment environment.

When to Use Kafka

Choose Kafka for high-throughput messaging, event streaming, log aggregation, when message replay is valuable, or when building data pipelines. Its persistence, scalability, and ecosystem make it excellent for these scenarios.

Don’t choose Kafka for simple request-reply patterns, when guaranteed low latency (single-digit milliseconds) is required, or when operational complexity is a concern. Traditional message queues or RPC might be more appropriate for these cases.

Apache Kafka provides a powerful platform for building real-time data-intensive applications. Its log-based architecture, persistence, scalability, and rich ecosystem enable use cases from simple message queuing to complex stream processing pipelines. Understanding Kafka’s architecture, semantics, and operational requirements enables leveraging it effectively for building modern data platforms and event-driven architectures. The key is recognizing when Kafka’s strengths align with your requirements and accepting its operational complexity in exchange for its capabilities.