System Design Guide

Metrics: Quantifying System Behavior

Metrics are numerical measurements of system behavior collected over time, providing quantitative data about performance, health, and resource utilization. Unlike logs, which capture individual events, metrics aggregate measurements, revealing patterns, trends, and anomalies across the system. Effective metrics let teams understand system state at a glance, detect problems before they impact users, and make data-driven decisions about capacity and architecture.

Types of Metrics

Counters only increase, tracking cumulative totals: total requests served, total errors, total bytes sent. Counters never decrease, except when they reset (typically on process restart). The rate of change is often more interesting than the absolute value: requests per second rather than total requests ever served.

Gauges represent current values that can increase or decrease: current CPU usage, active connections, queue depth. Gauges snapshot current state, making them suitable for resource utilization metrics.

Histograms track distributions of values: request latencies, response sizes, query durations. Rather than just average latency (which hides outliers), histograms reveal P50, P95, P99, and P99.9—showing how latency varies across requests.

Summaries provide quantile calculations like histograms, but the quantiles are computed client-side rather than derived from buckets at query time. They’re less flexible for aggregation across instances but more efficient for certain use cases.
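
As a concrete illustration, here is a minimal sketch of the four types in Python using the prometheus_client library (chosen here only as an example; metric and variable names are illustrative):

    from prometheus_client import Counter, Gauge, Histogram, Summary

    # Counter: cumulative total that only goes up (resets on process restart).
    requests_total = Counter("http_requests_total", "Total HTTP requests served")

    # Gauge: current value that can rise and fall.
    active_connections = Gauge("active_connections", "Currently open connections")

    # Histogram: observations are bucketed, enabling P50/P95/P99 queries later.
    request_latency = Histogram("http_request_duration_seconds",
                                "HTTP request latency in seconds")

    # Summary: quantiles computed client-side rather than from buckets.
    response_size = Summary("http_response_size_bytes", "HTTP response size in bytes")

    # Typical updates from request-handling code:
    requests_total.inc()            # counters only increment
    active_connections.inc()        # gauges can go up...
    active_connections.dec()        # ...and down
    request_latency.observe(0.042)  # record one latency sample in seconds
    response_size.observe(5120)     # record one response size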

Metric Dimensions

Labels/Tags add dimensions to metrics, enabling filtering and aggregation. An http_requests_total counter might have labels for method (GET, POST), status (200, 404, 500), and endpoint (/api/users, /api/orders).

Labels enable powerful queries: “Show request rate for all POST requests” or “Group error rate by endpoint”. However, high-cardinality labels (like user IDs) create excessive unique metric combinations, overwhelming metric systems.

Cardinality Management is critical. Each label should take only a bounded set of values. Acceptable: service names (dozens), HTTP methods (a handful). Problematic: user IDs (millions), request IDs (unbounded).
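
A sketch of keeping cardinality bounded: labels are drawn from small fixed sets, and an unbounded value (a numeric ID embedded in a URL path) is normalized before it ever becomes a label. The names and the normalization rule are illustrative:

    import re

    from prometheus_client import Counter

    # Bounded labels: method, status, and endpoint each have a small set of values.
    http_requests_total = Counter(
        "http_requests_total", "HTTP requests served",
        ["method", "status", "endpoint"],
    )

    def normalize_endpoint(path: str) -> str:
        """Collapse unbounded path segments (numeric IDs) into a template so the
        endpoint label stays bounded: /api/users/12345 -> /api/users/:id."""
        return re.sub(r"/\d+", "/:id", path)

    # Good: every label value comes from a small, known set.
    http_requests_total.labels(
        method="POST", status="200",
        endpoint=normalize_endpoint("/api/users/12345"),
    ).inc()

    # Bad (avoided here): a user_id or request_id label would create one
    # time series per user or per request, overwhelming the metric system.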

Common Application Metrics

Request Rate: Requests per second overall and per endpoint. This measures system load and enables capacity planning.

Error Rate: Errors per second or error percentage. Track both the absolute error count (which might be acceptable at high traffic) and the error percentage (which reveals quality degradation).

Latency: Request duration at various percentiles (P50, P95, P99). Median latency shows typical performance; high percentiles reveal tail latency affecting some users.

Throughput: Data volume processed—bytes sent/received, records processed, transactions completed. This complements request rate for understanding system capacity.
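
A sketch of instrumenting a request handler for these signals, again assuming a prometheus_client-style API; the handler and metric names are illustrative:

    import time

    from prometheus_client import Counter, Histogram

    requests_total = Counter("app_requests_total", "Requests received", ["endpoint"])
    errors_total = Counter("app_errors_total", "Requests that failed", ["endpoint"])
    latency_seconds = Histogram("app_request_duration_seconds",
                                "Request latency in seconds", ["endpoint"])

    def handle(endpoint: str, work) -> None:
        """Wrap one request: count it, time it, and count failures."""
        requests_total.labels(endpoint=endpoint).inc()
        start = time.perf_counter()
        try:
            work()  # the actual request handling, passed in as a callable
        except Exception:
            errors_total.labels(endpoint=endpoint).inc()
            raise
        finally:
            latency_seconds.labels(endpoint=endpoint).observe(
                time.perf_counter() - start)

Request rate and error rate are then derived at query time as per-second rates over these counters, and the histogram yields the latency percentiles.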

Infrastructure Metrics

CPU: Utilization percentage, load average. High CPU indicates compute-bound workloads or inefficient code. Consistently high CPU suggests a need for more capacity.

Memory: Used memory, available memory, swap usage. Memory exhaustion causes crashes or severe performance degradation, so monitor it to catch leaks and pressure before they become incidents.

Disk: Utilization, I/O operations per second (IOPS), read/write throughput. Disk saturation impacts application performance, particularly for database-heavy workloads.

Network: Bytes sent/received, packets dropped, connection counts. Network saturation or errors affect service communication in distributed systems.
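
A sketch of exporting host-level gauges, assuming the third-party psutil package for reading OS statistics; the metric names are illustrative:

    import psutil  # third-party package for reading OS-level statistics

    from prometheus_client import Gauge

    cpu_percent = Gauge("node_cpu_percent", "CPU utilization percentage")
    memory_available = Gauge("node_memory_available_bytes", "Available memory in bytes")
    disk_read_bytes = Gauge("node_disk_read_bytes", "Cumulative bytes read from disk")
    net_bytes_sent = Gauge("node_network_bytes_sent", "Cumulative bytes sent")

    def collect() -> None:
        """Sample current resource usage and update the gauges.
        The disk and network values are cumulative OS counters; they are
        exported as gauges here only to keep the sketch short."""
        cpu_percent.set(psutil.cpu_percent(interval=None))
        memory_available.set(psutil.virtual_memory().available)
        disk_read_bytes.set(psutil.disk_io_counters().read_bytes)
        net_bytes_sent.set(psutil.net_io_counters().bytes_sent)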

Business Metrics

Active Users: Currently logged-in users, concurrent sessions. This reveals actual usage patterns beyond request rates.

Transaction Volume: Orders placed, payments processed, signups completed. Business metrics connect technical performance to business outcomes.

Conversion Rates: Percentage of visitors completing desired actions. Performance affects conversion—slow sites lose customers.

Revenue Metrics: Revenue per minute, average transaction value. These draw a direct connection between system performance and business results.

Aggregation

Sum combines metric values: total requests across all instances. This reveals system-wide throughput.

Average calculates mean values: average CPU across instances. However, averages hide outliers—a few overloaded instances might be missed.

Minimum/Maximum: Reveal extremes. Minimum memory available across instances shows which instance is most constrained. Maximum latency shows worst user experience.

Percentiles characterize distributions: P95 latency means 95% of requests complete faster than that value; P99 shows tail latency. These represent user experience better than averages.
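
A sketch of the nearest-rank percentile calculation over raw latency samples (production systems usually estimate percentiles from histogram buckets rather than raw samples):

    import math

    def percentile(samples, p):
        """Nearest-rank percentile: the smallest sample such that at least
        p percent of all samples are less than or equal to it."""
        if not samples:
            raise ValueError("no samples")
        ordered = sorted(samples)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    latencies_ms = [12, 15, 14, 980, 16, 13, 15, 17, 14, 16]
    print(percentile(latencies_ms, 50))  # 15: the typical request
    print(percentile(latencies_ms, 95))  # 980: the tail that the 111 ms average hides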

Time-Series Nature

Metrics are time-stamped, creating time series—sequences of measurements over time. This temporal aspect enables:

Trend Analysis: Is CPU usage growing over weeks? Are error rates increasing? Trends support capacity planning and early problem detection.

Anomaly Detection: Sudden changes in metrics indicate problems. A request rate dropping to zero suggests an outage; a spiking error rate indicates failures.

Seasonal Patterns: Many metrics have daily or weekly cycles. Understanding these patterns helps distinguish normal variation from problems.

Correlation: Comparing time series reveals relationships—error rates rising when latency increases, or throughput dropping when database connections saturate.
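
A sketch of simple anomaly detection over a time series: flag a sample that deviates sharply from the mean of a recent window. The z-score threshold and the sample values are illustrative:

    import statistics

    def is_anomalous(history, current, z_threshold=3.0):
        """Flag a sample that is more than z_threshold standard deviations
        away from the mean of the recent history window."""
        if len(history) < 2:
            return False
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev == 0:
            return current != mean
        return abs(current - mean) / stdev > z_threshold

    # Recent requests-per-second samples, then a sudden drop toward zero:
    window = [120, 118, 125, 122, 119, 121, 123, 120]
    print(is_anomalous(window, 121))  # False: within normal variation
    print(is_anomalous(window, 2))    # True: likely an outage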

Metric Collection

Push-Based: Applications send metrics to collectors. This works for short-lived processes and firewalled environments but requires apps to know collector addresses.

Pull-Based: Collectors scrape metrics from application endpoints. Prometheus exemplifies pull-based collection. Apps expose metrics at HTTP endpoints (like /metrics), and collectors periodically scrape them.
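
A minimal sketch of the pull-based model with prometheus_client, which serves all registered metrics over HTTP for a collector to scrape; the port and metric name are illustrative:

    import random
    import time

    from prometheus_client import Counter, start_http_server

    requests_total = Counter("demo_requests_total", "Requests handled")

    if __name__ == "__main__":
        # Expose all registered metrics at http://localhost:8000/metrics
        # for the collector to scrape on its own schedule.
        start_http_server(8000)
        while True:
            requests_total.inc()  # simulate work being counted
            time.sleep(random.uniform(0.01, 0.1))

A Prometheus server would then list this address in its scrape configuration and pull the metrics on a fixed interval.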

Service Mesh: Sidecar proxies automatically collect metrics for all traffic, enabling metrics without application code changes. Istio and Linkerd provide automatic metrics for microservices.

Storage and Querying

Time-Series Databases optimize for time-stamped data: Prometheus, InfluxDB, TimescaleDB. They provide efficient storage and powerful queries over time ranges.

Retention: Balance storage costs with analysis needs. Keep high-resolution data (second/minute granularity) for days or weeks, lower-resolution aggregates (hourly) for months, and daily aggregates for years.
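
A sketch of the kind of downsampling this retention policy implies, reducing raw samples to one hourly average (a time-series database does this internally and far more efficiently):

    from collections import defaultdict
    from datetime import datetime, timezone

    def downsample_hourly(samples):
        """Reduce (unix_timestamp, value) samples to one average per hour."""
        buckets = defaultdict(list)
        for ts, value in samples:
            hour = datetime.fromtimestamp(ts, tz=timezone.utc).replace(
                minute=0, second=0, microsecond=0)
            buckets[hour].append(value)
        return {hour: sum(vals) / len(vals)
                for hour, vals in sorted(buckets.items())}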

Querying Languages: PromQL (Prometheus), InfluxQL (InfluxDB), or SQL (TimescaleDB) enable complex queries: aggregations, rate calculations, combining multiple metrics, and forecasting.
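
As an illustration of what a rate calculation does, here is a sketch of deriving a per-second rate from two counter samples, with naive handling of a counter reset; query languages such as PromQL provide a more robust built-in version of this idea:

    def per_second_rate(earlier, later, interval_seconds):
        """Per-second increase of a counter between two samples. If the counter
        went down, assume it reset (e.g. a process restart) and treat the later
        sample as the increase since that reset."""
        increase = later - earlier if later >= earlier else later
        return increase / interval_seconds

    # 1,200 new requests observed over a 60-second window -> 20 requests/second.
    print(per_second_rate(10_000, 11_200, 60))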

Visualization

Dashboards display metrics graphically: line graphs for time series, gauges for current values, heatmaps for distributions. Well-designed dashboards provide at-a-glance system understanding.

Real-Time Updates: Live dashboards with auto-refresh show current system state, essential during incidents or deployments.

Historical Views: Zoom out to hours, days, weeks for trend analysis and capacity planning.

Alerting

Threshold-Based: Alert when metrics cross fixed thresholds, for example CPU above 90%, error rate above 5%, or P99 latency over 1 second.

Rate-of-Change: Alert on sudden changes, such as the error rate doubling within 5 minutes or traffic dropping by 50%.

Anomaly Detection: Use statistical methods or machine learning to detect unusual patterns that wouldn’t trigger simple thresholds.
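
A sketch of evaluating threshold and rate-of-change rules against recent samples; the limits, windows, and values are illustrative:

    def threshold_alert(latest, limit):
        """Fire when the latest sample crosses a fixed limit."""
        return latest > limit

    def rate_of_change_alert(window, factor=2.0):
        """Fire when the newest sample is at least `factor` times the oldest
        sample in the window, e.g. the error rate doubling within 5 minutes."""
        if len(window) < 2 or window[0] <= 0:
            return False
        return window[-1] / window[0] >= factor

    error_rates = [0.8, 0.9, 1.1, 1.9]            # errors/second, oldest first
    print(threshold_alert(0.97, limit=0.90))      # True: CPU at 97% vs a 90% limit
    print(rate_of_change_alert(error_rates))      # True: error rate more than doubled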

Best Practices

Start with the four golden signals: latency, traffic, errors, saturation. Together they cover the most critical aspects of service health.

Name metrics consistently with clear conventions. Use prefixes like http_, db_, cache_ to group related metrics.

Add useful labels but avoid high cardinality. Service names, environments, and regions are good labels; user IDs and request IDs are bad labels.

Measure what matters to users. Metrics about internal behavior only matter if they affect user experience.

Use histograms for latency, not averages. Averages hide outliers that affect user experience.

Set appropriate collection intervals. High-frequency collection (every few seconds) provides granular data but increases overhead and storage. Balance resolution with practical needs.

Document metrics. Teams should understand what metrics measure, their units, and how to interpret them.

Test metric collection. Verify metrics appear in collection systems, queries work correctly, and dashboards update properly.

Review metrics regularly. Are you actually using collected metrics? Remove unused metrics to reduce overhead and simplify systems.

Combine metrics with logs and traces. Metrics show what’s wrong, logs explain why, and traces show where in the request flow problems occur.

Metrics provide the quantitative foundation for observability, making it possible to understand system state through numbers rather than guesswork. By measuring key behaviors, analyzing trends, and alerting on anomalies, metrics transform operations from reactive to proactive. When thoughtfully selected, efficiently collected, and intelligently analyzed, metrics reveal system health, guide capacity decisions, and enable data-driven improvements. The goal isn’t collecting every possible metric but collecting the right metrics that genuinely inform decisions and improve system reliability.