System Design Guide

Alerting: Notifying Teams of Problems

Alerting notifies teams when systems experience problems requiring human attention, transforming passive monitoring into active notification of issues. Effective alerting balances sensitivity (detecting real problems quickly) with specificity (avoiding false alarms), ensures alerts reach appropriate responders, and provides actionable information enabling rapid response. Poor alerting leads to missed incidents or alert fatigue where teams ignore notifications from chronic false alarms.

Alert Sources

Metrics-Based Alerts trigger when measurements cross thresholds: CPU exceeds 90%, error rate above 5%, latency P99 over 1 second. These catch performance degradations and resource exhaustion.

Log-Based Alerts trigger on specific log patterns: particular errors appearing, security events detected, unusual access patterns. These catch issues that metrics might miss.

Synthetic Monitoring Alerts trigger when health checks fail: services unreachable, critical endpoints returning errors, end-to-end tests failing. These detect outages before users report them.
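
For illustration, a minimal synthetic check might look like the sketch below (the /healthz URL and the printed alert are assumptions; production setups usually rely on dedicated probers or hosted checkers rather than hand-rolled scripts):

```python
# Minimal synthetic health check: fail the check when the endpoint is
# unreachable, times out, or returns a non-200 response.
import urllib.request
import urllib.error

def check_endpoint(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the endpoint responds with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

if not check_endpoint("https://api.example.com/healthz"):  # hypothetical endpoint
    print("ALERT: synthetic check failed for api.example.com")
```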

External Monitoring Alerts trigger when external checks fail: DNS resolution problems, SSL certificate expiration, third-party service unavailability. These catch issues internal monitoring might miss.

Alert Severity

Critical/P1 indicates severe impact: complete service outages, data loss, security breaches. These require immediate response, potentially waking on-call engineers any time.

High/P2 indicates significant degradation: elevated error rates, performance significantly degraded, important features unavailable. These require prompt response during business hours or within SLA windows.

Medium/P3 indicates minor issues: non-critical features degraded, warnings about approaching limits, deprecated API usage. These merit investigation but don’t demand immediate response.

Low/P4 indicates informational items: capacity recommendations, minor configuration improvements, successful deployments. These inform rather than demand action.
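
One lightweight way to keep these levels consistent across tools is to encode them once; the sketch below is illustrative, and the response targets are assumptions rather than a standard:

```python
# Severity levels with assumed response expectations.
from enum import Enum

class Severity(Enum):
    CRITICAL = 1  # P1: page immediately, 24/7
    HIGH = 2      # P2: respond within SLA or business hours
    MEDIUM = 3    # P3: investigate, no immediate response
    LOW = 4       # P4: informational only

RESPONSE_TARGET_MINUTES = {
    Severity.CRITICAL: 15,
    Severity.HIGH: 4 * 60,
    Severity.MEDIUM: 24 * 60,
    Severity.LOW: None,  # no response target
}
```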

Alert Routing

On-Call Rotations ensure someone is responsible for responding to alerts. Rotations distribute burden fairly, prevent burnout, and provide 24/7 coverage.

Escalation Policies define what happens when alerts aren’t acknowledged: notify secondary responders after 5 minutes, escalate to managers after 15 minutes, page executives after 30 minutes for critical incidents.
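
The sketch below models such a policy as ordered, time-based steps (the contact names are placeholders; dedicated tools like PagerDuty or OpsGenie express the same idea declaratively):

```python
# A time-based escalation policy for critical alerts.
from dataclasses import dataclass

@dataclass
class EscalationStep:
    after_minutes: int  # escalate if still unacknowledged after this long
    notify: str         # who to notify at this step

CRITICAL_POLICY = [
    EscalationStep(after_minutes=0, notify="primary on-call"),
    EscalationStep(after_minutes=5, notify="secondary on-call"),
    EscalationStep(after_minutes=15, notify="engineering manager"),
    EscalationStep(after_minutes=30, notify="executive on-call"),
]

def due_steps(minutes_unacknowledged: int) -> list[EscalationStep]:
    """Return every escalation step whose deadline has already passed."""
    return [s for s in CRITICAL_POLICY if minutes_unacknowledged >= s.after_minutes]
```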

Channel Selection matches urgency to medium: critical alerts via phone calls or SMS, high priority via dedicated alerting apps (PagerDuty, OpsGenie), medium priority via chat channels, low priority via email.

Team Assignment routes alerts to appropriate teams. Database alerts go to DBAs, application errors to developers, infrastructure issues to SRE. Proper routing ensures alerts reach people who can address them.
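
A simple routing layer can combine both ideas, choosing the channel from severity and the team from a service-ownership map; the mappings below are illustrative assumptions:

```python
# Route an alert to (team, channel) based on its service tag and severity.
SEVERITY_CHANNEL = {
    "critical": "phone",    # phone call or SMS
    "high": "paging-app",   # dedicated alerting app push
    "medium": "chat",       # team chat channel
    "low": "email",
}

TEAM_BY_SERVICE = {
    "postgres": "dba",
    "checkout-api": "app-dev",
    "k8s-cluster": "sre",
}

def route(alert: dict) -> tuple[str, str]:
    """Return (team, channel) for an alert like {"service": ..., "severity": ...}."""
    team = TEAM_BY_SERVICE.get(alert["service"], "sre")         # default owner
    channel = SEVERITY_CHANNEL.get(alert["severity"], "email")  # default channel
    return team, channel

print(route({"service": "postgres", "severity": "critical"}))  # ('dba', 'phone')
```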

Alert Conditions

Threshold-Based: Simple comparisons—CPU > 90%, error rate > 5%. Easy to understand but can be noisy if thresholds aren’t well-tuned.

Anomaly Detection: Statistical methods detect unusual values even without fixed thresholds. Useful for metrics with variable baselines—traffic patterns vary daily, but anomaly detection spots unusual deviations.
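
A rolling z-score is one simple form of this; the sketch below flags values far outside the recent baseline (the three-sigma cutoff and sample data are assumptions, and real systems typically use seasonality-aware models):

```python
# Flag a sample that deviates more than z_threshold standard deviations
# from the recent baseline.
from statistics import mean, stdev

def is_anomalous(history: list[float], current: float, z_threshold: float = 3.0) -> bool:
    """Return True if `current` falls far outside the recent baseline."""
    if len(history) < 10:
        return False  # not enough data to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold

# Example: traffic hovers around 1000 rps, then suddenly collapses.
baseline = [990, 1010, 1005, 998, 1012, 995, 1003, 1001, 997, 1008]
print(is_anomalous(baseline, 400))   # True: unusual drop
print(is_anomalous(baseline, 1004))  # False: normal variation
```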

Rate of Change: Alert on sudden changes—error rate doubled in 5 minutes, traffic dropped 50% suddenly. These catch incidents faster than absolute thresholds since degradations manifest as changes.
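
The sketch below compares consecutive windows and fires when the error rate at least doubles (the window sizes and 2x factor are illustrative assumptions):

```python
# Detect a sudden jump in error rate between consecutive windows.
def error_rate(errors: int, requests: int) -> float:
    return errors / requests if requests else 0.0

def sudden_increase(prev_window: tuple[int, int],
                    curr_window: tuple[int, int],
                    factor: float = 2.0) -> bool:
    """True if the error rate at least doubled from the previous window."""
    prev = error_rate(*prev_window)
    curr = error_rate(*curr_window)
    return prev > 0 and curr >= prev * factor

# Previous 5 minutes: 50 errors / 10,000 requests; current: 130 / 10,000.
print(sudden_increase((50, 10_000), (130, 10_000)))  # True: rate more than doubled
```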

Composite Conditions: Combine multiple metrics—alert when both error rate is high AND latency is elevated, not just one or the other. This reduces false positives from transient spikes in single metrics.
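
In code this is simply a conjunction of signals; the thresholds below are illustrative assumptions:

```python
# Fire only when error rate AND latency are both elevated.
def should_alert(error_rate: float, p99_latency_ms: float) -> bool:
    """Require both signals to be degraded, reducing single-metric noise."""
    return error_rate > 0.05 and p99_latency_ms > 1000

print(should_alert(0.08, 1500))  # True: both degraded
print(should_alert(0.08, 200))   # False: an error spike alone does not page
```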

Time Windows: Require conditions to persist for a set duration: CPU sustained above 90% for 5 minutes, not just a single data point. This eliminates alerting on brief, self-correcting spikes.
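
The sketch below pairs the simple threshold comparison described above with a persistence requirement, assuming one sample per minute:

```python
# Alert only when every sample in a rolling window exceeds the threshold.
from collections import deque

class SustainedThreshold:
    def __init__(self, threshold: float, samples_required: int):
        self.threshold = threshold
        self.window = deque(maxlen=samples_required)

    def observe(self, value: float) -> bool:
        """Record a sample; return True only if the whole window is above threshold."""
        self.window.append(value)
        return (len(self.window) == self.window.maxlen
                and all(v > self.threshold for v in self.window))

cpu_alert = SustainedThreshold(threshold=90.0, samples_required=5)  # 5 one-minute samples
for sample in [95, 97, 40, 96, 98, 99, 97, 95]:
    if cpu_alert.observe(sample):
        print("ALERT: CPU above 90% for 5 consecutive minutes")
```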

Alert Content

What’s Wrong: Clear description of the problem—“API latency P99 exceeded 2 seconds” not just “Latency alert triggered”.

Why It Matters: Explain impact—“Users experiencing slow page loads” connects technical issues to user experience.

Where: Identify affected components—specific services, regions, or instances. “us-east-1 region” is more actionable than “somewhere”.

When: Include timing—when did the alert start? How long has it been ongoing? Context helps responders understand severity.

How to Fix: Link to runbooks with remediation procedures. First responders shouldn’t need to figure out responses from scratch.

Relevant Links: Include dashboard links, log queries, or trace views. Provide tools responders need immediately without hunting for them.
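
Putting these elements together, an alert payload might carry fields like the following (the field names and URLs are illustrative assumptions, not any particular tool’s schema):

```python
# An alert payload carrying what, why, where, when, and how-to-fix context.
alert = {
    "title": "API latency P99 exceeded 2 seconds",                 # what's wrong
    "impact": "Users are experiencing slow page loads",            # why it matters
    "scope": {"service": "checkout-api", "region": "us-east-1"},   # where
    "started_at": "2024-05-01T03:12:00Z",                          # when
    "runbook": "https://wiki.example.com/runbooks/api-latency",    # how to fix
    "links": {                                                     # tools at hand
        "dashboard": "https://grafana.example.com/d/api-latency",
        "logs": "https://logs.example.com/?query=service%3Dcheckout-api",
    },
    "severity": "high",
}
```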

Alert Fatigue

Causes: Too many alerts, frequent false positives, alerts for non-actionable conditions, duplicate alerts for single issues.

Consequences: Teams ignore or delay responding to alerts, real incidents get missed, on-call becomes demoralizing, burnout increases.

Solutions: Aggregate related alerts (send one alert for “service unhealthy” rather than alerts for each unhealthy instance), adjust thresholds to reduce false positives, only alert on actionable problems requiring human intervention, implement alert suppression during known maintenance windows.
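
Aggregation in particular is easy to sketch: collapse per-instance alerts into one per-service notification (the input shape below is an illustrative assumption):

```python
# Group instance-level alerts into a single alert per service.
from collections import defaultdict

def aggregate(instance_alerts: list[dict]) -> list[dict]:
    """Collapse alerts like {"service": ..., "instance": ...} into one per service."""
    by_service = defaultdict(list)
    for a in instance_alerts:
        by_service[a["service"]].append(a["instance"])
    return [
        {"service": svc,
         "summary": f"{len(instances)} unhealthy instances",
         "instances": sorted(instances)}
        for svc, instances in by_service.items()
    ]

raw = [{"service": "checkout-api", "instance": f"i-{n:02d}"} for n in range(12)]
print(aggregate(raw))  # one aggregated alert instead of twelve pages
```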

Actionability

Good Alerts require and enable specific actions. “Database primary failed over” requires verifying the replica promoted correctly and investigating why the primary failed.

Bad Alerts either don’t require action (“FYI: deployment completed successfully”—great for logs, wrong for alerts) or don’t enable action (“something went wrong”—what? where? how to fix?).

The Page-Worthy Test: Would you wake someone at 3 AM for this? If no, it’s not critical severity. If yes, does the alert provide enough information for them to respond effectively?

Alert Testing

Inject Failures: Deliberately cause alert conditions and verify alerts trigger: take services down, exhaust resources, generate errors. This confirms alerting works before real incidents.
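
Even a small automated test helps here; the sketch below injects an error-rate spike and asserts that the alert path fires (the in-memory alert sink is a stand-in for the real paging integration):

```python
# Verify that an injected error spike actually produces an alert.
def test_error_rate_alert_fires():
    fired = []

    def alert_sink(message: str) -> None:  # stand-in for the paging integration
        fired.append(message)

    # Injected failure: error rate climbs past the assumed 5% threshold.
    for observed_error_rate in [0.01, 0.02, 0.09, 0.12]:
        if observed_error_rate > 0.05:
            alert_sink(f"error rate {observed_error_rate:.0%} above threshold")

    assert fired, "alert did not fire for an injected error spike"

test_error_rate_alert_fires()
```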

Test Routing: Ensure alerts reach intended recipients via configured channels. Verify on-call schedules work and escalations function.

Measure Response Time: Track the time between an alert firing and its acknowledgment. Long delays indicate alerts aren’t reaching responders or aren’t being taken seriously.

Runbooks

Problem Description: What does the alert mean? What’s failing or degraded?

Impact Assessment: Who’s affected? How severe? Is it user-facing or internal?

Diagnostic Steps: How to confirm the alert is legitimate rather than a false positive? What to check first?

Remediation: What actions fix the problem? Restart services? Scale up resources? Roll back deployments?

Escalation: When to escalate? Who to involve for complex issues?

Post-Incident: What to document? What follow-up investigation is needed?
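
Keeping runbooks in a predictable shape makes them easier to link from alerts and to review; the structure below is only an assumed skeleton matching the sections above:

```python
# A structured runbook mirroring the sections described above.
from dataclasses import dataclass, field

@dataclass
class Runbook:
    alert_name: str
    problem: str             # what the alert means
    impact: str              # who is affected and how severely
    diagnostics: list[str]   # ordered checks to confirm the alert is real
    remediation: list[str]   # actions that fix the problem
    escalation: str          # when to escalate and to whom
    post_incident: list[str] = field(default_factory=list)  # follow-up items
```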

Alert Analytics

Track Alert Volume: How many alerts fire? Are some services noisier than others? Trends indicate whether alerting is improving or degrading.

Measure Time-to-Acknowledge: How quickly do responders acknowledge alerts? Slow times suggest alerts aren’t reaching responders or lack urgency signals.

Track Resolution Time: How long from alert to resolution? Long times might indicate inadequate runbooks or complex problems.

False Positive Rate: What percentage of alerts don’t represent real problems? High rates cause alert fatigue and erode trust in alerting.

Alert Coverage: What incidents didn’t trigger alerts? Missing alerts indicate gaps in monitoring or alerting.
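
Most of these measures fall out of a simple record of when each alert fired, was acknowledged, and was resolved; the record shape below is an illustrative assumption:

```python
# Summarize alert volume, time-to-acknowledge, resolution time, and
# false-positive rate from a list of alert records.
from datetime import datetime

def minutes_between(start: str, end: str) -> float:
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 60

def summarize(alerts: list[dict]) -> dict:
    acked = [a for a in alerts if a.get("acked_at")]
    resolved = [a for a in alerts if a.get("resolved_at")]
    return {
        "volume": len(alerts),
        "avg_minutes_to_ack": (
            sum(minutes_between(a["fired_at"], a["acked_at"]) for a in acked) / len(acked)
            if acked else None),
        "avg_minutes_to_resolve": (
            sum(minutes_between(a["fired_at"], a["resolved_at"]) for a in resolved) / len(resolved)
            if resolved else None),
        "false_positive_rate": (
            sum(1 for a in alerts if not a["was_real"]) / len(alerts)
            if alerts else None),
    }
```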

Maintenance Windows

Scheduled Suppressions: Disable alerts during planned maintenance. Don’t wake engineers for expected downtime from deployments or upgrades.

Automatic Suppression: Some systems automatically suppress alerts for services marked as under maintenance, eliminating manual suppression management.
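
A suppression check can be as simple as comparing the alert’s service and timestamp against declared windows; the window list below is an illustrative assumption:

```python
# Suppress alerts for services inside a declared maintenance window.
from datetime import datetime, timezone

MAINTENANCE_WINDOWS = [
    {"service": "checkout-api",
     "start": datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc),
     "end":   datetime(2024, 5, 1, 4, 0, tzinfo=timezone.utc)},
]

def is_suppressed(service: str, at: datetime) -> bool:
    """True if the service is under maintenance at the given time."""
    return any(w["service"] == service and w["start"] <= at < w["end"]
               for w in MAINTENANCE_WINDOWS)

print(is_suppressed("checkout-api", datetime(2024, 5, 1, 3, 0, tzinfo=timezone.utc)))  # True
```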

Communication: Ensure teams know about maintenance windows to avoid confusion when alerts don’t fire during expected problems.

Best Practices

Alert on symptoms, not causes. Alert when users are impacted (high error rates, slow responses) rather than intermediate states (high CPU). Symptoms directly correlate with user experience; causes might not.

Make alerts actionable. Every alert should have clear next steps. Non-actionable alerts should be logged, not alerted.

Tune thresholds based on actual experience. Start conservatively and adjust based on false positive rates and missed incidents.

Include context in alerts. Link to dashboards, runbooks, and relevant logs. Responders shouldn’t hunt for information.

Review alerts regularly. Are they still relevant? Do they trigger appropriately? Remove alerts that no longer add value.

Measure alert effectiveness. Track false positives, missed incidents, and responder feedback. Use metrics to improve alerting continuously.

Test alerting frequently. Don’t wait for real incidents to discover alerting issues. Regular testing ensures reliability when needed.

Respect on-call. Only page for genuine emergencies. Everything else can wait for business hours or at least morning.

Document everything. Alert definitions, runbooks, escalation policies should be written and accessible. Institutional knowledge in someone’s head isn’t sufficient.

Iterate based on incidents. Post-mortems revealing missed or delayed alerts should result in alerting improvements.

Alerting transforms monitoring from passive observation into active notification, enabling rapid response to problems. Effective alerting requires balancing sensitivity and specificity, ensuring alerts reach appropriate responders, providing actionable information, and continuously refining based on experience. The goal isn’t alerting on everything but alerting on the right things at the right time to the right people, enabling teams to maintain reliable systems without succumbing to alert fatigue. When done well, alerting provides confidence that real problems will be detected and addressed promptly, even when no one is actively watching dashboards.