System Design Guide

Logging: Recording System Events and Behaviors

Logging records events, errors, and behaviors occurring in systems, providing detailed information about what happened, when, and why. While metrics answer “what is the current state?”, logs answer “what happened?” and “why did it happen?”. Effective logging enables debugging production issues, auditing system activity, and understanding complex system behaviors that metrics alone cannot reveal.

Log Levels

DEBUG provides detailed information useful during development and debugging. Debug logs are verbose and typically disabled in production due to volume and performance impact.

INFO records general information about normal operation: service started, user logged in, request completed. Info logs provide operational context without overwhelming detail.

WARN indicates potential problems that don’t prevent operation: deprecated API usage, fallback to default configuration, retryable failures. Warnings deserve attention but don’t require immediate action.

ERROR records failures: exceptions, failed requests, resource unavailability. Errors require investigation and typically trigger alerts.

FATAL/CRITICAL indicates severe failures causing application termination or critical system malfunctions. These demand immediate attention.

Structured Logging

Key-Value Pairs provide machine-parseable logs rather than free-form text. Instead of “User john logged in”, use {"event": "user_login", "username": "john", "timestamp": "2024-01-15T10:30:00Z"}.

JSON Format is common for structured logs, offering universal parsing support and rich data types. All modern log aggregation systems parse JSON efficiently.

Consistent Schema across services simplifies analysis. Standardize field names: pick one spelling such as user_id and use it everywhere, rather than user_id in one service and userId in another.

Benefits: Structured logs enable filtering (“show all errors for user X”), aggregation (“count login events by hour”), and correlation (“find all logs for request ID Y”) that are difficult or impossible with unstructured text.
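
As an illustration, the following sketch uses Python's standard logging module to emit JSON with a consistent schema; the service name, the specific field names, and the "fields" convention passed via extra are assumptions for the example, not a fixed standard.

    import json
    import logging
    from datetime import datetime, timezone

    class JsonFormatter(logging.Formatter):
        """Render each record as one JSON object with a consistent schema."""
        def format(self, record):
            entry = {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "level": record.levelname,
                "service": "auth-service",                   # assumed service name
                "event": record.getMessage(),
            }
            entry.update(getattr(record, "fields", {}))      # structured fields from extra=
            return json.dumps(entry)

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("auth-service")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    # Emits: {"timestamp": "...", "level": "INFO", "service": "auth-service",
    #         "event": "user_login", "user_id": "john"}
    logger.info("user_login", extra={"fields": {"user_id": "john"}})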

Log Management

Centralized Logging aggregates logs from all services into one system. This eliminates searching individual servers and enables correlating events across services.

Log Aggregation Tools include ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, Datadog, and cloud-native solutions (CloudWatch Logs, Azure Monitor Logs, Google Cloud Logging).

Log Shippers collect logs from applications and forward them to aggregation systems. Filebeat, Fluentd, and Logstash function as shippers, handling buffering, retries, and transformation.

Retention Policies balance storage costs with investigation needs. Recent logs (days to weeks) need fast access; older logs (months) can use cheaper archival storage. Eventually, delete the oldest logs or move them to long-term archives.

What to Log

Request Context: Include request IDs, user IDs, and correlation IDs linking related operations. This enables tracing requests through the system.

Errors and Exceptions: Log full stack traces, error messages, and context about what operation failed. Include parameters and state that might explain the failure.
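
For example, in Python the logger.exception call records the message at ERROR level and appends the full stack trace of the active exception automatically; the payment function and its parameters below are hypothetical.

    import logging

    logger = logging.getLogger("orders-service")

    def process_payment(order_id, amount):
        # Hypothetical downstream call, failing to illustrate error logging.
        raise TimeoutError("connection timeout")

    def charge_customer(order_id, amount):
        try:
            process_payment(order_id, amount)
        except Exception:
            # Log the failed operation with the parameters that explain it;
            # logger.exception includes the full stack trace.
            logger.exception("payment_failed order_id=%s amount=%s", order_id, amount)
            raise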

Important Operations: Log significant business events—orders placed, payments processed, accounts created. These provide audit trails and support investigations.

Performance Markers: Log durations for slow operations. This supplements metrics with detailed timing information for specific requests.
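
A small sketch of such a duration marker; the 500 ms threshold and the query function are assumptions for illustration.

    import logging
    import time

    logger = logging.getLogger("search-service")

    def execute(query):
        # Hypothetical expensive call standing in for a real database query.
        time.sleep(0.6)
        return []

    def run_query(query):
        start = time.monotonic()
        results = execute(query)
        duration_ms = (time.monotonic() - start) * 1000
        if duration_ms > 500:                          # assumed slow-query threshold
            logger.warning("slow_query query=%s duration_ms=%.1f", query, duration_ms)
        return results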

Security Events: Log authentication attempts, authorization failures, and sensitive data access. Security logs support incident response and compliance audits.

What Not to Log

Passwords and Secrets: Never log passwords, API keys, tokens, or other credentials. Even hashed passwords shouldn’t be logged.

Personally Identifiable Information (PII): Credit card numbers, Social Security numbers, and other sensitive personal data often shouldn’t be logged at all, or should be masked or redacted.
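
One way to enforce masking is a logging filter that redacts card-number-like patterns before records are written; the regular expression below is a deliberately simplified assumption, not a complete PII detector.

    import logging
    import re

    CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")    # rough card-number shape

    class RedactPiiFilter(logging.Filter):
        """Mask card-number-like sequences in the final log message."""
        def filter(self, record):
            message = record.getMessage()                    # fold %-style args into the text
            record.msg = CARD_PATTERN.sub("[REDACTED]", message)
            record.args = None                               # args already folded in
            return True                                      # keep the record, now redacted

    logger = logging.getLogger("payments")
    logger.addFilter(RedactPiiFilter())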

Excessive Detail: Don’t log every variable in every function. Too much logging overwhelms systems and obscures important information in noise.

Sensitive Business Data: Proprietary algorithms, pricing details, or confidential business information requires careful handling, potentially excluding it from logs entirely.

Correlation and Context

Correlation IDs track requests across services. Generate unique IDs for each request and propagate them through all services the request touches. When debugging, filter logs by correlation ID to see the complete request flow.
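
A sketch of that propagation in Python using contextvars plus a logging filter; the X-Correlation-ID header name is a common convention assumed here, not a requirement.

    import contextvars
    import logging
    import uuid

    # Correlation ID for the request currently being handled in this context.
    correlation_id = contextvars.ContextVar("correlation_id", default="-")

    class CorrelationIdFilter(logging.Filter):
        """Stamp every record with the current correlation ID."""
        def filter(self, record):
            record.correlation_id = correlation_id.get()
            return True

    def handle_request(headers):
        # Reuse the ID set by an upstream service, or generate one at the edge;
        # forward the same value on every outbound call so downstream logs match.
        correlation_id.set(headers.get("X-Correlation-ID") or str(uuid.uuid4()))

    handler = logging.StreamHandler()
    handler.addFilter(CorrelationIdFilter())
    handler.setFormatter(logging.Formatter("%(asctime)s %(correlation_id)s %(levelname)s %(message)s"))
    logging.getLogger().addHandler(handler)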

Thread/Worker IDs in multi-threaded applications help correlate logs from the same execution thread.

User/Session Context: Include user IDs and session IDs in logs. This enables investigating specific user experiences and detecting patterns in user behavior.

Timestamps: Always include precise timestamps (preferably UTC). Ensure timestamp synchronization across services using NTP.

Log Sampling and Throttling

High-Volume Systems might produce millions of log entries per second. Logging everything is impractical and expensive.

Sampling logs a percentage of events rather than all events. For example, log 1% of successful requests but 100% of errors. This reduces volume while maintaining visibility into problems.
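
A minimal sketch of that policy as a logging filter; the 1% rate and the rule that WARNING-and-above records are never dropped are the assumptions described above.

    import logging
    import random

    class SamplingFilter(logging.Filter):
        """Keep all WARNING-and-above records; sample routine records at a fixed rate."""
        def __init__(self, rate=0.01):
            super().__init__()
            self.rate = rate

        def filter(self, record):
            if record.levelno >= logging.WARNING:
                return True                        # always keep warnings and errors
            return random.random() < self.rate     # keep roughly 1% of the rest

    logging.getLogger("requests").addFilter(SamplingFilter(rate=0.01))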

Dynamic Sampling adjusts rates based on conditions: increase sampling when error rates rise, reduce when everything is healthy.

Rate Limiting prevents log storms where a single issue generates millions of identical log entries. Limit logging frequency for specific events, perhaps logging “this error occurred 10,000 times” instead of individual entries.
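
A sketch of that throttling idea: emit each distinct message at most once per window and summarize how many repeats were suppressed. The one-minute window and keying on the message template are illustrative choices.

    import logging
    import time
    from collections import defaultdict

    class ThrottleFilter(logging.Filter):
        """Drop repeats of the same message within a window, then report the count."""
        def __init__(self, window_seconds=60):
            super().__init__()
            self.window = window_seconds
            self.last_emit = {}                    # message template -> last emit time
            self.suppressed = defaultdict(int)     # message template -> dropped count

        def filter(self, record):
            key = record.msg                       # unformatted template groups identical events
            now = time.monotonic()
            if now - self.last_emit.get(key, 0) < self.window:
                self.suppressed[key] += 1
                return False                       # drop the repeat
            dropped = self.suppressed.pop(key, 0)
            if dropped:
                record.msg = f"{record.getMessage()} (suppressed {dropped} repeats in the last window)"
                record.args = None                 # message is already fully formatted
            self.last_emit[key] = now
            return True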

Log Analysis

Search and Filter capabilities enable finding specific events: “show errors in the payments service in the last hour” or “find all logs for correlation ID X”.

Aggregation and Statistics: Count occurrences (“how many 404 errors?”), group by dimensions (“error rates by service”), and identify trends (“is error rate increasing?”).

Alerting on Log Patterns: Trigger alerts based on log contents: alert when error rates exceed thresholds, specific critical errors appear, or unusual patterns emerge.

Log Visualization: Dashboards displaying log-based metrics (like error counts over time) complement metrics-based monitoring.

Performance Considerations

Asynchronous Logging: Write logs asynchronously to avoid blocking application threads. Queue log messages and write them in background threads.
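
Python's standard library supports this pattern directly with QueueHandler and QueueListener; a minimal sketch, with app.log as an assumed destination:

    import logging
    import logging.handlers
    import queue

    log_queue = queue.Queue(-1)                               # unbounded handoff queue

    # Application threads only enqueue records, which is cheap and non-blocking.
    logging.getLogger().addHandler(logging.handlers.QueueHandler(log_queue))

    # A background thread dequeues records and performs the slow I/O.
    listener = logging.handlers.QueueListener(log_queue, logging.FileHandler("app.log"))
    listener.start()

    logging.getLogger().warning("disk_space_low")             # returns immediately
    # listener.stop() at shutdown drains and flushes remaining records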

Log Level Configuration: Set appropriate log levels for production. Debug logging in production impacts performance and creates excessive data.

Buffering: Buffer log writes rather than flushing immediately. This improves throughput at the cost of potential log loss if applications crash before flushing.
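
The standard-library MemoryHandler illustrates the trade-off: records are held in memory and flushed in batches, or immediately when an ERROR arrives. The capacity and flush level below are tunable assumptions.

    import logging
    import logging.handlers

    target = logging.FileHandler("app.log")

    # Hold up to 200 records; flush when the buffer fills or on ERROR-and-above.
    buffered = logging.handlers.MemoryHandler(capacity=200,
                                              flushLevel=logging.ERROR,
                                              target=target)

    logger = logging.getLogger("worker")
    logger.addHandler(buffered)
    logger.setLevel(logging.INFO)

    logger.info("batch_started")     # buffered in memory; lost if the process crashes now
    # buffered.close() at shutdown flushes whatever remains in the buffer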

Sampling in High-Throughput Paths: For extremely high-throughput code paths, sample logging to avoid performance impact.

Logging in Microservices

Distributed Correlation: In microservices, a single user request might touch dozens of services. Correlation IDs must propagate through all services to reconstruct request flows.

Service Identification: Logs must clearly indicate which service generated them. Include service names and versions in all log entries.

Consistent Formatting: Standardize log formats across services. Mixed formats complicate parsing and analysis.

Centralized Collection: With hundreds of microservices, centralized logging is essential. Individual service logs are impractical to access and correlate.

Compliance and Audit Logging

Regulatory Requirements: Many regulations mandate logging certain events: access to sensitive data, administrative actions, security events.

Tamper-Proof Logs: Some requirements demand logs cannot be modified or deleted. This might require write-once storage or cryptographic techniques ensuring log integrity.

Retention Periods: Regulations often specify minimum log retention (typically months to years). Plan storage capacity and archival strategies accordingly.

Access Controls: Restrict who can view logs, especially those containing sensitive information. Implement role-based access control for log systems.

Best Practices

Log at appropriate levels. Info for normal operation, warn for potential issues, error for failures. Don’t log everything as errors or warnings.

Include context in log messages. “Database query failed” is less useful than “Database query failed: SELECT * FROM users WHERE id=123, error: connection timeout”.

Use structured logging with consistent schemas. This simplifies parsing, searching, and analysis.

Implement correlation IDs from the start. Adding them later is difficult. Propagate them through all services and include in all logs.

Never log sensitive data. Implement automated scanning for credentials or PII in logs.

Monitor log system health. Ensure logs are being shipped, aggregation systems are healthy, and retention policies are working.

Test log collection. Deliberately generate errors and verify they appear in log aggregation systems with proper context.

Review logs regularly. Don’t just collect logs; use them. Investigate errors, identify patterns, and improve systems based on log insights.

Logging provides detailed visibility into system behavior essential for debugging, auditing, and understanding complex interactions. While metrics tell you something is wrong, logs explain what happened and why. Effective logging requires balancing detail with performance, protecting sensitive information while capturing useful context, and centralizing logs from distributed systems for coherent analysis. When implemented thoughtfully, logging transforms opaque systems into transparent ones where problems can be quickly diagnosed and operations thoroughly understood.