System Design Guide

Monitoring: Understanding System Health and Performance

Monitoring collects, analyzes, and visualizes metrics about system health and performance. It enables detecting problems, understanding capacity, and making data-driven decisions about infrastructure and application behavior. Effective monitoring transforms opaque systems into transparent ones, providing visibility needed to operate reliably at scale.

Core Metrics

Availability measures uptime: the percentage of time systems are operational and accessible. High availability (99.9%, or “three nines”) is standard for production systems and allows roughly 8.8 hours of downtime per year. Higher availability (99.99%, or “four nines”) permits only about 53 minutes of downtime annually.
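
The arithmetic behind these figures is straightforward; a quick sketch in Python:

```python
# Convert availability targets into allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60

for availability in (0.999, 0.9999):  # "three nines", "four nines"
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.2%}: {downtime_min:.0f} min/year ({downtime_min / 60:.1f} hours)")
```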

Latency measures response time—how long operations take. Track percentiles, not just averages: P50 (median), P95, P99, and P99.9 reveal tail latency affecting some users. A P99 of 500ms means 1% of requests take longer than 500ms—potentially thousands of slow requests per hour in high-traffic systems.
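
To make the percentile math concrete, here is a small sketch that computes P50/P95/P99/P99.9 from a batch of recorded latencies; the simulated data is a stand-in for values your metrics pipeline would collect.

```python
import random

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

# Simulated request latencies in milliseconds (placeholder data).
latencies_ms = [random.lognormvariate(4.0, 0.6) for _ in range(10_000)]

for p in (50, 95, 99, 99.9):
    print(f"P{p}: {percentile(latencies_ms, p):.0f} ms")
```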

Throughput measures request rate—how many operations the system handles per second, minute, or hour. Understanding throughput reveals capacity limits and growth trends.

Error Rate tracks failures—requests returning errors, exceptions thrown, or operations failing. Monitor absolute error counts and error percentages. Even at low percentages, high absolute error counts indicate problems needing attention.
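
A minimal sketch of tracking both views at once; the class and numbers are illustrative and not tied to any particular metrics library.

```python
class ErrorRateTracker:
    """Tracks absolute error count and error percentage over a batch of requests."""

    def __init__(self) -> None:
        self.requests = 0
        self.errors = 0

    def record(self, ok: bool) -> None:
        self.requests += 1
        if not ok:
            self.errors += 1

    @property
    def error_percentage(self) -> float:
        return 100.0 * self.errors / self.requests if self.requests else 0.0

tracker = ErrorRateTracker()
for outcome in [True] * 9_990 + [False] * 10:
    tracker.record(outcome)

# Only a 0.10% error rate, but still 10 failed requests that may deserve attention.
print(tracker.errors, f"{tracker.error_percentage:.2f}%")
```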

Saturation measures resource utilization—CPU, memory, disk, and network usage. High saturation (>80-90%) often leads to performance degradation. Monitoring saturation predicts when capacity additions are needed before performance degrades.

The Four Golden Signals

Google’s Site Reliability Engineering introduces four golden signals that capture most aspects of system health; a brief instrumentation sketch follows the list:

Latency: How long does it take to handle requests? Monitor both successful and failed request latency—failures often complete quickly but indicate problems.

Traffic: How much demand is the system handling? Measured in requests per second, bytes per second, or domain-specific metrics.

Errors: What is the failure rate? Track explicit failures (500 errors) and implicit failures (wrong content returned).

Saturation: How “full” is the system? Monitor resource utilization and queueing that precedes resource exhaustion.
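
As a concrete illustration, the sketch below instruments all four signals in a toy request handler using the Python prometheus_client library; the metric names, port, and simulated work are assumptions made for the example.

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Traffic and errors: one counter, labeled by outcome.
REQUESTS = Counter("app_requests_total", "Requests handled", ["status"])
# Latency: a histogram records observations that percentiles can be derived from.
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")
# Saturation: for example, the depth of an internal work queue.
QUEUE_DEPTH = Gauge("app_queue_depth", "Items waiting in the work queue")

def handle_request() -> None:
    with LATENCY.time():                       # latency
        time.sleep(random.uniform(0.01, 0.1))  # placeholder work
        status = "error" if random.random() < 0.01 else "ok"
    REQUESTS.labels(status=status).inc()       # traffic + errors
    QUEUE_DEPTH.set(random.randint(0, 50))     # saturation (placeholder value)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a pull-based collector
    while True:
        handle_request()
```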

Metrics Collection

Push-Based collection has services actively send metrics to collectors. This works well for short-lived processes and for services behind NAT or firewalls, but it requires services to know collector locations.

Pull-Based collection has collectors scrape metrics from service endpoints. Prometheus exemplifies this approach. Pull-based simplifies service configuration (no need to know collector addresses) and enables collectors to detect scrape failures.
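
For contrast: the golden-signals sketch above exposes a /metrics endpoint for a collector to scrape (pull), whereas a short-lived batch job might push its metrics instead, for example to a Prometheus Pushgateway. The gateway address and job name below are placeholders.

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge(
    "batch_job_last_success_timestamp_seconds",
    "Unix time of the last successful batch run",
    registry=registry,
)

def run_batch_job() -> None:
    pass  # the actual batch work would go here

if __name__ == "__main__":
    run_batch_job()
    last_success.set_to_current_time()
    # Push once at the end of the run; short-lived jobs may not live long enough to be scraped.
    push_to_gateway("pushgateway.example.internal:9091", job="nightly_batch", registry=registry)
```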

Metric Aggregation combines metrics across multiple instances. Summing throughput across servers, averaging CPU usage, or taking maximum latency provides fleet-wide views. However, aggregation can hide problems affecting subsets of instances.
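
A tiny illustration of how a fleet-wide average can hide a hot instance (the numbers are made up):

```python
cpu_by_instance = {"web-1": 35.0, "web-2": 38.0, "web-3": 97.0}  # percent, placeholder values

fleet_avg = sum(cpu_by_instance.values()) / len(cpu_by_instance)
fleet_max = max(cpu_by_instance.values())

# The average (~57%) looks healthy; the maximum (97%) reveals a saturated instance.
print(f"avg {fleet_avg:.0f}%  max {fleet_max:.0f}%")
```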

Time-Series Databases store metrics efficiently. InfluxDB, Prometheus, and TimescaleDB optimize for time-stamped data with append-heavy writes and time-range queries.

Visualization and Dashboards

Real-Time Dashboards display current system state. Operators monitor dashboards during incidents, deployments, or normal operations to spot anomalies quickly.

Historical Analysis reveals trends over days, weeks, or months. Is CPU usage growing? Are error rates increasing? Historical data supports capacity planning and identifying gradual degradations.

Drill-Down Capabilities enable investigating anomalies. Start with high-level dashboards showing overall health, then drill into specific services, instances, or time ranges to isolate problems.

Dashboard Best Practices: Avoid clutter—too many metrics overwhelm. Prioritize actionable metrics that guide decisions. Use colors meaningfully—red for critical issues, yellow for warnings, green for healthy. Standardize dashboards across services for consistency.

Alerting

Alert Thresholds define when notifications trigger. Set thresholds based on SLOs and historical baselines. Thresholds that are too sensitive cause alert fatigue; thresholds that are too lenient miss real problems.
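
One common way to reduce noise is to fire only when a threshold has been breached for several consecutive evaluation intervals; a minimal sketch, with an illustrative threshold and window size:

```python
from collections import deque

class ThresholdAlert:
    """Fires only when the metric exceeds the threshold for `window` consecutive checks."""

    def __init__(self, threshold: float, window: int = 5) -> None:
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def check(self, value: float) -> bool:
        self.recent.append(value > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)

alert = ThresholdAlert(threshold=1.0, window=5)  # e.g. error rate above 1% for 5 checks
for error_rate in [0.2, 1.4, 1.6, 1.2, 1.8, 1.5]:
    if alert.check(error_rate):
        print(f"ALERT: error rate {error_rate}% has stayed above threshold")
```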

Alert Routing sends notifications to appropriate responders via email, SMS, chat, or paging systems. Critical alerts might page on-call engineers; warnings might post to team channels.

Alert Fatigue occurs when too many alerts desensitize teams. Address alert fatigue by eliminating noisy alerts, aggregating related alerts, and only alerting on actionable problems requiring human intervention.

Runbooks document response procedures for common alerts. When an alert fires, runbooks guide responders through diagnosis and remediation, speeding recovery and enabling less experienced engineers to handle incidents.

Service Level Objectives (SLOs)

SLIs (Service Level Indicators) are metrics measuring service quality from the user’s perspective: latency, availability, throughput. Good SLIs capture aspects of the service that directly affect user experience.

SLOs (Service Level Objectives) set targets for SLIs: “99.9% of requests complete within 500ms” or “99.95% availability.” SLOs define reliability targets balancing user expectations with operational cost.

Error Budgets derive from SLOs. If your SLO is 99.9% availability, you have a 0.1% error budget, roughly 43 minutes of downtime per month. When the budget is being consumed too quickly or is exhausted, the team slows feature development and focuses on reliability work.
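
The budget arithmetic is simple; a worked example for the 99.9% case, with a made-up amount of consumed downtime:

```python
SLO = 0.999                       # availability objective
MINUTES_PER_MONTH = 30 * 24 * 60  # 30-day month

budget_minutes = (1 - SLO) * MINUTES_PER_MONTH  # ~43.2 minutes
consumed_minutes = 31.0                         # downtime observed so far this month (example)
remaining = budget_minutes - consumed_minutes

print(f"budget {budget_minutes:.1f} min, remaining {remaining:.1f} min "
      f"({remaining / budget_minutes:.0%} of budget left)")
```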

SLAs (Service Level Agreements) are contractual commitments to customers, often with financial penalties for violations. SLAs should be less strict than SLOs to provide safety margin.

Synthetic Monitoring

Health Checks periodically probe services to verify availability. Simple health checks might just verify HTTP responses; sophisticated ones validate end-to-end functionality.
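
A minimal health-check probe using only the Python standard library; the URL is a placeholder for your service’s health endpoint.

```python
import urllib.request

def check_health(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint responds with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, connection errors, and timeouts
        return False

print(check_health("https://service.example.internal/healthz"))
```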

Synthetic Transactions simulate real user interactions. For e-commerce sites, synthetic monitoring might periodically execute complete checkout flows, detecting problems before real users encounter them.

External Monitoring verifies services from outside your infrastructure, detecting issues users would experience: DNS problems, CDN failures, or region-specific outages.

Distributed Monitoring

Monitoring distributed systems introduces challenges: correlating metrics across services, understanding cascading failures, and attributing problems to root causes.

Service-Level Monitoring tracks each service independently. This identifies which services are unhealthy but doesn’t reveal how failures propagate.

Request Tracing (covered in distributed tracing) tracks individual requests across services, providing end-to-end visibility.

Dependency Monitoring tracks health of service dependencies. If the database is down, all services depending on it will be unhealthy—understanding dependencies prevents misdiagnosing symptoms as causes.

Capacity Planning

Trend Analysis projects future resource needs based on historical growth. If CPU usage increases 2% monthly, when will you exceed capacity? Proactive capacity additions prevent degradations.
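
A back-of-the-envelope projection assuming linear growth; all figures below are invented for illustration, and real growth is rarely this tidy.

```python
import math

current_cpu = 62.0     # percent, latest measurement (example)
monthly_growth = 2.0   # percentage points per month, from trend analysis (example)
capacity_limit = 85.0  # where you want new capacity in place before saturation

months_left = math.floor((capacity_limit - current_cpu) / monthly_growth)
print(f"Add capacity within about {months_left} months")  # ~11 months in this example
```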

Load Testing validates capacity limits. Gradually increase synthetic load until systems degrade, revealing maximum sustainable throughput and identifying bottlenecks.

Right-Sizing matches resource allocations to actual needs. Over-provisioning wastes money; under-provisioning degrades performance. Monitoring informs right-sizing decisions.

Monitoring Tools

Prometheus is popular for metrics collection and alerting, especially in Kubernetes environments. Its pull-based model and rich query language make it powerful for modern infrastructure.

Grafana visualizes metrics from various sources, creating beautiful dashboards for Prometheus, InfluxDB, Elasticsearch, and more.

Datadog provides comprehensive observability-as-a-service, combining metrics, logs, and traces with minimal operational overhead.

New Relic offers application performance monitoring with deep code-level visibility and infrastructure monitoring.

CloudWatch (AWS), Azure Monitor, and Cloud Monitoring (GCP) provide cloud-native monitoring integrated with respective cloud platforms.

Best Practices

Start with the four golden signals: latency, traffic, errors, and saturation. These cover most important aspects of system health.

Alert on symptoms, not causes. Alert when users are affected (high error rates, slow responses) rather than intermediate states (high CPU usage).

Make metrics easily discoverable and understandable. Name metrics consistently, document their meanings, and expose them through standard interfaces.

Monitor continuously, not just during business hours. Problems don’t wait for convenient times.

Automate responses to common problems when possible. Automated remediation (like restarting failed services) can resolve issues faster than human intervention.

Review and refine monitoring regularly. As systems evolve, monitoring must evolve too. Regularly assess whether alerts remain relevant and metrics meaningful.

Test monitoring systems. Deliberately introduce failures to verify monitoring detects them and alerts appropriately.

Monitoring provides the visibility needed to operate complex systems reliably. By collecting metrics, visualizing trends, alerting on problems, and analyzing historical data, monitoring transforms reactive firefighting into proactive system management. Effective monitoring doesn’t just detect problems—it prevents them through capacity planning, enables rapid diagnosis through comprehensive metrics, and guides improvements through data-driven insights. The investment in robust monitoring pays dividends in reliability, performance, and operational efficiency.