Observability
Alerting: Notifying Teams of Problems
Alerting notifies teams when systems experience problems requiring human attention, transforming passive monitoring into active notification of issues. Effective alerting balances sensitivity (detecting real problems quickly) with specificity (avoiding false alarms), ensures alerts reach the appropriate responders, and provides actionable information that enables rapid response. Poor alerting leads to missed incidents or to alert fatigue, where teams learn to ignore notifications because of chronic false alarms.
Alert Sources
Metrics-Based Alerts trigger when measurements cross thresholds: CPU exceeds 90%, error rate above 5%, latency P99 over 1 second. These catch performance degradations and resource exhaustion.
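The threshold examples above can be sketched as a small rule evaluator. This is an illustrative sketch, not any particular alerting system's API; the `Rule` class and `evaluate` function are hypothetical names invented for this example.

```python
# Hypothetical sketch of a metrics-based alert evaluator: each rule fires
# when its measurement crosses a threshold, mirroring the examples above.
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    threshold: float
    above: bool = True  # fire when value > threshold (else when value < threshold)

def evaluate(rule: Rule, value: float) -> bool:
    """Return True if this measurement should trigger the alert."""
    return value > rule.threshold if rule.above else value < rule.threshold

rules = [
    Rule("cpu_percent", 90.0),
    Rule("error_rate_percent", 5.0),
    Rule("latency_p99_seconds", 1.0),
]

sample = {"cpu_percent": 93.2, "error_rate_percent": 1.1, "latency_p99_seconds": 0.4}
firing = [r.name for r in rules if evaluate(r, sample[r.name])]
print(firing)  # only cpu_percent exceeds its threshold in this sample
```

Production systems such as Prometheus express these rules declaratively and add durations ("error rate above 5% for 5 minutes") to avoid firing on momentary spikes.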
Distributed Tracing: Understanding Request Flows
Distributed tracing tracks requests as they flow through microservices architectures, providing end-to-end visibility into complex, multi-service interactions. While logs show what happened in individual services and metrics reveal aggregate system health, distributed tracing connects the dots, showing how a single user request propagates through dozens of services, revealing performance bottlenecks and failure points that would otherwise remain hidden.
The Challenge
In monolithic applications, understanding request processing is straightforward—follow code execution through a single codebase. In microservices, a single user request might trigger calls to dozens of services, each with its own logs and metrics. Understanding which services were involved, in what order, how long each took, and where failures occurred requires correlating information across all services—a nearly impossible task without distributed tracing.
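The core mechanism behind distributed tracing is simple: generate a trace ID at the edge, propagate it on every downstream call, and have each service record a timed span tagged with that ID. The sketch below shows only this core idea under simplified assumptions (the service names and `record_span` helper are invented); real systems such as OpenTelemetry add parent-span links, sampling, and wire-format propagation headers.

```python
# Minimal sketch of trace-ID propagation across services. Every unit of
# work records a span carrying the same trace ID, so the full request
# path can later be reassembled from span data alone.
import time
import uuid

spans = []  # in a real system, spans are exported to a tracing backend

def record_span(trace_id: str, service: str, fn):
    """Run fn, recording a timed span attributed to `service`."""
    start = time.monotonic()
    result = fn()
    spans.append({"trace_id": trace_id, "service": service,
                  "duration_s": time.monotonic() - start})
    return result

def checkout(trace_id: str):
    # Downstream calls reuse the same trace ID (hypothetical services).
    record_span(trace_id, "inventory", lambda: None)
    record_span(trace_id, "payments", lambda: None)

trace_id = uuid.uuid4().hex
record_span(trace_id, "checkout", lambda: checkout(trace_id))

# All three spans share one trace ID, which is what makes correlation possible.
print(sorted(s["service"] for s in spans))
```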
Logging: Recording System Events and Behaviors
Logging records events, errors, and behaviors occurring in systems, providing detailed information about what happened, when, and why. While metrics answer “what is the current state?”, logs answer “what happened?” and “why did it happen?”. Effective logging enables debugging production issues, auditing system activity, and understanding complex system behaviors that metrics alone cannot reveal.
Log Levels
DEBUG provides detailed information useful during development and debugging. Debug logs are verbose and typically disabled in production due to volume and performance impact.
Metrics: Quantifying System Behavior
Metrics are numerical measurements of system behavior collected over time, providing quantitative data about performance, health, and resource utilization. Unlike logs that capture individual events, metrics aggregate measurements, revealing patterns, trends, and anomalies across the system. Effective metrics enable understanding system state at a glance, detecting problems before they impact users, and making data-driven decisions about capacity and architecture.
Types of Metrics
Counters only increase, tracking cumulative totals: total requests served, total errors, total bytes sent. Counters never decrease (except when resetting). Rate of change is often more interesting than absolute values—requests per second rather than total requests ever served.
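The counter semantics above can be sketched in a few lines, including how a per-second rate is derived from two samples. The `Counter` class and `rate` helper are illustrative names, not a specific metrics library; counter resets are ignored here for simplicity.

```python
# Sketch of a monotonic counter and rate-of-change between two samples,
# assuming the counter only increases (resets would need extra handling).
class Counter:
    def __init__(self):
        self.value = 0

    def inc(self, n: int = 1):
        if n < 0:
            raise ValueError("counters only increase")
        self.value += n

def rate(prev: float, curr: float, interval_s: float) -> float:
    """Events per second between two samples taken interval_s apart."""
    return (curr - prev) / interval_s

requests = Counter()
requests.inc(300)   # sample 1: 300 total requests
prev = requests.value
requests.inc(450)   # 15 seconds later, sample 2: 750 total requests
print(rate(prev, requests.value, 15))  # 30.0 requests/second
```

This is why dashboards graph `rate(total_requests)` rather than the raw counter: the cumulative total grows forever, while the rate reveals current load.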
Monitoring: Understanding System Health and Performance
Monitoring collects, analyzes, and visualizes metrics about system health and performance. It enables detecting problems, understanding capacity, and making data-driven decisions about infrastructure and application behavior. Effective monitoring transforms opaque systems into transparent ones, providing visibility needed to operate reliably at scale.
Core Metrics
Availability measures uptime—the percentage of time systems are operational and accessible. High availability (99.9%, or “three nines”) is standard for production systems, permitting about 8.8 hours of downtime per year. Higher availability (99.99%, “four nines”) permits only about 53 minutes of downtime annually.
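The downtime figures for each "nines" target come from simple arithmetic: the unavailable fraction times the hours in a year (a 365-day year is assumed here).

```python
# Arithmetic behind the "nines": allowed downtime per year for a given
# availability target, using a 365-day (8,760-hour) year.
def downtime_hours_per_year(availability_percent: float) -> float:
    year_hours = 365 * 24  # 8,760 hours
    return (1 - availability_percent / 100) * year_hours

print(round(downtime_hours_per_year(99.9), 2))        # ~8.76 hours ("three nines")
print(round(downtime_hours_per_year(99.99) * 60, 1))  # ~52.6 minutes ("four nines")
```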