Alerting Done Right: Reducing Noise and Writing Actionable Alerts
Most alerts are noise. Learn how to write actionable alerts by focusing on symptoms, implementing hysteresis, using multi-window burn rate alerting, and routing through Alertmanager. Includes a five-question checklist for every alert.

Tuesday, 3:47 AM: Your Pager Goes Off for the Eighth Time This Week
It's a Tuesday at 3:47 AM. The on-call is paged about HighCPUUsage on api-worker-7. They pull up the dashboard on their phone: CPU is at 78%, already trending back down, no customer-facing errors on the SLO board, no change in latency on the checkout path. They acknowledge the page, open the runbook -- there isn't one -- and go back to sleep. Ten minutes later another page fires: DatabaseConnectionsHigh, 102 connections out of a 150 limit. Same thing, no user impact, self-resolves. By the time the week's on-call handover happens on Friday, this engineer has been paged 47 times; three pages correlated with an actual incident, the rest were noise. The incident in the middle, a real checkout failure at 11:14 PM on Wednesday, was acknowledged 22 minutes late because the engineer had muted their phone.
This is what broken alerting looks like in practice. It is not a tooling problem -- Prometheus and Grafana and PagerDuty all ship with the primitives to do alerting well. It is a design problem: the 47 pages above were almost all cause-based alerts (CPU high, memory high, disk busy, connection count rising) which fire constantly because infrastructure metrics fluctuate constantly, and the one page that actually mattered was lost in the noise. The on-call did exactly what humans always do when 90% of their alerts are false alarms: they learned to dismiss pages reflexively.
Alerting done right reverses that failure mode. Every page corresponds to real or imminent user impact. Every notification contains enough context to start diagnosis in under 30 seconds. Every alert links to a runbook. Pages per on-call week drop from dozens to single digits, and the engineer who is on call is actually available to respond when something real breaks. The rest of this article covers the five-question checklist I run every alert through before it ships, the difference between symptom-based and cause-based alerts (with five common cause-based alerts and their symptom-based replacements), how to add hysteresis in Prometheus so alerts stop flapping, what SLO-based burn-rate alerts look like in practice, how to route alerts through Alertmanager, and how to write annotations that link each alert to its runbook and dashboards.
The Five Questions Checklist
Before creating any alert, run it through these five questions. If you can't answer yes to all of them, don't create the alert.
- Does this alert indicate real or imminent user impact? CPU at 80% isn't user impact. An error rate that is burning through your SLO error budget faster than you can afford is. Alert on symptoms (error rate, latency, availability), not on causes (CPU, memory, disk).
- Does this require human action? If your auto-scaler handles it, don't page a human. If Kubernetes restarts the pod and the problem resolves, that's an event for a dashboard, not an alert.
- Is the alert specific enough to diagnose? "Something is wrong" is useless. "Order service error rate is 5.2% (SLO budget burn rate: 14x) in production-us-east-1" gives the responder a starting point.
- Will this alert fire at a sustainable frequency? If it fires more than once a week without representing a genuine incident, it will be ignored. Either fix the underlying issue or remove the alert.
- Does the runbook exist? Every alert should link to a runbook that describes the alert, likely causes, diagnostic steps, and remediation actions. No runbook, no alert.
Pro tip: Review your alert history monthly. Any alert that fired but required no action should be deleted or converted to a dashboard annotation. Any alert that fired and was immediately silenced should be tuned or removed. This ongoing hygiene is what separates good alerting from alert fatigue.
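One rough way to seed that monthly review, assuming the alerts are evaluated by Prometheus: the built-in ALERTS series gets one sample per rule evaluation while an alert is pending or firing, so counting firing samples over 30 days shows which alerts spend the most time active. The group name and the recorded series name below are hypothetical, and the query needs at least 30 days of retention.

```yaml
groups:
  - name: alert-hygiene            # hypothetical group name
    interval: 1h                   # a 30d range query is expensive; evaluate it rarely
    rules:
      # Number of evaluation samples each alert spent in the firing state
      # over the last 30 days. Multiply by the evaluation interval of the
      # alert's own rule group to estimate total firing time.
      - record: alertname:alerts_firing_samples:count30d
        expr: |
          sum by (alertname) (
            count_over_time(ALERTS{alertstate="firing"}[30d])
          )
```

This only ranks alerts by time spent firing. Whether anyone acted on them still has to come from the pager's acknowledgement history or the incident log.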
Alert on Symptoms, Not Causes
This is the single most important principle in alerting, and the one most teams get wrong.
| Cause-Based Alert (Bad) | Symptom-Based Alert (Good) |
|---|---|
| CPU usage > 80% | Request latency p99 > 2s for 5 minutes |
| Memory usage > 90% | Error budget burn rate above the paging threshold |
| Disk usage > 85% | Checkout success rate below 99% |
| Pod restart count > 3 | Availability SLI below target for 10 minutes |
| Database connections > 100 | Query latency p95 > 500ms |
Cause-based alerts are noisy because infrastructure metrics fluctuate constantly. CPU spikes during garbage collection. Memory climbs before a scheduled compaction. Pods restart during rolling deployments. None of these affect users if the system is designed correctly.
Symptom-based alerts fire only when users are actually affected -- or about to be. They also tell you what matters (latency is high, errors are up) rather than what might be contributing (CPU is high, but maybe it's fine).
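As a sketch of the first row of the table, here is what a p99 latency alert could look like as a Prometheus rule. It assumes the service exports a histogram named http_request_duration_seconds with a service label; that metric name, the group name, and the runbook URL are illustrative, not taken from a real system.

```yaml
groups:
  - name: symptom-alerts           # hypothetical group name
    rules:
      - alert: HighRequestLatencyP99
        # p99 computed from histogram buckets, aggregated per service.
        # The le label must be kept for histogram_quantile to work.
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "p99 latency above 2s for 5 minutes on {{ $labels.service }}"
          runbook: "https://wiki.internal/runbooks/high-latency"  # hypothetical URL
```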
Watch out: There are exceptions. Disk at 95% is a cause-based alert worth keeping because the consequence (complete service failure when disk fills) is severe and the remediation (expand volume, clean up) takes time. Keep cause-based alerts only for slow-moving conditions where the consequence is catastrophic and the fix isn't instant.
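A minimal sketch of that exception, assuming node_exporter filesystem metrics are being scraped. The 5% free threshold mirrors the "disk at 95%" example; the 24-hour projection, the durations, and the filesystem filter are illustrative choices rather than recommendations.

```yaml
groups:
  - name: slow-moving-capacity     # hypothetical group name
    rules:
      # Static threshold: less than 5% of the filesystem is still available.
      - alert: DiskAlmostFull
        expr: |
          (
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            /
            node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
          ) < 0.05
        for: 15m
        labels:
          severity: critical
      # Trend-based variant: extrapolate the last 6 hours linearly and warn
      # if available space would reach zero within the next 24 hours.
      - alert: DiskFillingUp
        expr: |
          predict_linear(
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h],
            24 * 3600
          ) < 0
        for: 1h
        labels:
          severity: warning
```

The trend-based rule leaves time for the slow remediation the paragraph above describes, which is the whole reason this cause-based alert earns its place.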
Hysteresis: Preventing Alert Flapping
Hysteresis means requiring a condition to persist for a period before firing, and requiring it to be clearly resolved before clearing. Without hysteresis, an alert that crosses the threshold, dips below for 10 seconds, and crosses again generates three notifications in two minutes. That's noise.
Implementing Hysteresis in Prometheus
Prometheus implements the firing half of hysteresis (the condition must persist before anyone is notified) through the for clause in alert rules:
```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        # Aggregate per service so that {{ $labels.service }} is populated;
        # a bare sum() would drop every label, including service.
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          description: "Service {{ $labels.service }} error rate is {{ $value | humanizePercentage }}."
          runbook: "https://wiki.internal/runbooks/high-error-rate"
```
The for: 5m clause means the condition must be true for 5 continuous minutes before the alert fires. A brief spike that resolves in 2 minutes never pages anyone. This single setting eliminates a massive amount of noise.
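The for clause only delays firing. For the clearing side, Prometheus 2.42 and later also accept keep_firing_for, which holds the alert in the firing state for a while after the expression stops matching. A minimal sketch, assuming http_requests_total carries a service label; the 10-minute value is illustrative:

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
          > 0.01
        for: 5m                # condition must hold for 5 minutes before firing
        keep_firing_for: 10m   # keep firing until the condition has been false for 10 minutes
        labels:
          severity: critical
```

With both settings, a metric hovering around the threshold produces one long-lived alert instead of a string of fire/resolve pairs.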
Choosing the Right Duration
| Alert Severity | Suggested for Duration | Rationale |
|---|---|---|
| Critical (page) | 2-5 minutes | Short enough to catch real incidents, long enough to skip transients |
| Warning (Slack) | 10-15 minutes | Confirms the issue is sustained, not a blip |
| Info (dashboard) | 30+ minutes or no alert | Log as an event, don't notify anyone |
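The table can be expressed as two rules over the same condition. This is a sketch with assumed thresholds (5% to page, 1% to warn) and an assumed service label on http_requests_total. Both rules deliberately share one alert name and differ only in the severity label, so an Alertmanager inhibition rule keyed on alertname and service can mute the warning while the critical alert is firing.

```yaml
groups:
  - name: tiered-error-rate        # hypothetical group name
    rules:
      # Page quickly on a severe error rate.
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
          > 0.05
        for: 2m
        labels:
          severity: critical
      # Notify Slack only when a milder error rate is clearly sustained.
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
          > 0.01
        for: 15m
        labels:
          severity: warning
```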
Multi-Window, Multi-Burn-Rate Alerts
For SLO-based alerting, the gold standard is the multi-window, multi-burn-rate approach from Google's Site Reliability Workbook. Instead of alerting on a simple error rate threshold, you alert when the rate of error budget consumption indicates you'll exhaust the budget prematurely.
How It Works
You define pairs of time windows: a long window for accuracy and a short window for responsiveness. Both must exceed the burn rate threshold for the alert to fire.
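The thresholds in the rules that follow come from simple arithmetic. As a sketch, assuming a 30-day SLO period (720 hours), a 99.9% availability target (so the error budget is 0.1% of requests), and roughly uniform traffic:

```latex
\begin{align*}
\text{burn rate}
  &= \frac{\text{observed error ratio}}{1 - \text{SLO}}
   = \frac{\text{observed error ratio}}{0.001} \\[4pt]
\text{time to exhaustion}
  &= \frac{30\ \text{days}}{\text{burn rate}}
  &&\Rightarrow\quad \frac{30}{14.4} \approx 2.1\ \text{days},
  \qquad \frac{30}{3} = 10\ \text{days} \\[4pt]
\text{budget consumed in a window}
  &= \text{burn rate} \times \frac{\text{window}}{720\ \text{h}}
  &&\Rightarrow\quad 14.4 \times \frac{1\ \text{h}}{720\ \text{h}} = 2\%,
  \qquad 3 \times \frac{72\ \text{h}}{720\ \text{h}} = 30\%
\end{align*}
```

So a 14.4x threshold corresponds to an error ratio of 14.4 × 0.001 = 1.44%, and the 1-hour window crosses it once about 2% of the monthly budget is gone. The 3x threshold corresponds to a 0.3% error ratio, and the 3-day window crosses it only after about 30% of the budget has been spent.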
```yaml
groups:
  - name: slo-burn-rate-alerts
    rules:
      # Page: 14.4x burn rate
      # Long window: 1h, Short window: 5m
      # At this rate, 30-day budget exhausted in ~2 days
      - alert: SLOBudgetBurnCritical
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (14.4 * 0.001)
          and
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[5m]))
              /
              sum(rate(http_requests_total[5m]))
            )
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Critical SLO budget burn -- 14.4x rate"
          description: "At this burn rate, the 30-day error budget will be exhausted in approximately 2 days."
      # Ticket: 3x burn rate
      # Long window: 3d, Short window: 6h
      # At this rate, budget exhausted in ~10 days
      - alert: SLOBudgetBurnSlow
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[3d]))
              /
              sum(rate(http_requests_total[3d]))
            )
          ) > (3 * 0.001)
          and
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[6h]))
              /
              sum(rate(http_requests_total[6h]))
            )
          ) > (3 * 0.001)
        for: 30m
        labels:
          severity: warning
          slo: availability
        annotations:
          summary: "Elevated SLO budget burn -- 3x rate"
          description: "Error budget is being consumed at 3x the sustainable rate. At this rate, the monthly budget will be exhausted in approximately 10 days."
```
The 14.4x burn rate with a 1-hour window catches severe incidents fast -- page the on-call. The 3x burn rate with a 3-day window catches slow degradation -- file a ticket. This layered approach ensures you respond proportionally to the severity.
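Burn-rate expressions repeat the same error ratio at several windows, and long ranges such as 3d are expensive to recompute on every evaluation. One common way to tidy this up is to precompute the ratios with recording rules and alert on the recorded series. The sketch below assumes http_requests_total has a service label; the recorded series names are hypothetical and follow the level:metric:operations naming convention.

```yaml
groups:
  - name: slo-error-ratio-recording      # hypothetical group name
    rules:
      - record: service:http_request_errors:ratio_rate5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
      - record: service:http_request_errors:ratio_rate1h
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum by (service) (rate(http_requests_total[1h]))
  - name: slo-burn-rate-from-recordings  # hypothetical group name
    rules:
      # Same page condition as above: both windows must exceed 14.4x
      # of the 0.1% error budget for the same service.
      - alert: SLOBudgetBurnCritical
        expr: |
          service:http_request_errors:ratio_rate1h > (14.4 * 0.001)
          and
          service:http_request_errors:ratio_rate5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: availability
```

The 6h and 3d ratios for the ticket-level alert would follow the same pattern. Keeping the service label also means the alert carries a service for Alertmanager to group and route on.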
Alertmanager Routing: Getting Alerts to the Right Place
Having good alert rules is half the battle. The other half is routing those alerts to the right team through the right channel at the right urgency.
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: default-slack
  group_by: ['alertname', 'service', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # SLO alerts: dedicated channel. Listed first with continue: true,
    # because SLO alerts also carry a severity label and routing would
    # otherwise stop at the first matching severity route.
    - match_re:
        slo: ".+"
      receiver: slo-channel
      group_by: ['slo', 'service']
      continue: true
    # Critical: page on-call via PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      repeat_interval: 1h
      continue: false
    # Warning: Slack channel, don't page
    - match:
        severity: warning
      receiver: team-slack
      repeat_interval: 4h

receivers:
  - name: default-slack
    slack_configs:
      - channel: '#alerts-default'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: pagerduty
    pagerduty_configs:
      # Alertmanager does not expand environment variables itself. Substitute
      # this placeholder when rendering the file, or use service_key_file
      # if your version supports it.
      - service_key: ${PD_SERVICE_KEY}
        description: '{{ .GroupLabels.alertname }} - {{ .CommonAnnotations.summary }}'
  - name: team-slack
    slack_configs:
      - channel: '#alerts-warnings'
        send_resolved: true
  - name: slo-channel
    slack_configs:
      - channel: '#slo-alerts'
        send_resolved: true

inhibit_rules:
  # If a critical alert fires, suppress the warning that has the same
  # alertname and service. Alerts with different names never inhibit
  # each other under this rule.
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'service']
```
Key Routing Principles
- Group related alerts. If five pods in the same service hit the error threshold simultaneously, send one grouped notification, not five.
- Use inhibition rules. If a critical alert fires, suppress the warning-level alert for the same issue. The on-call already knows.
- Set appropriate repeat intervals. Critical alerts can repeat hourly. Warnings should repeat every 4+ hours. More frequent repeats create noise without adding value.
- Always send resolved notifications. The on-call needs to know when an issue clears, not just when it fires.
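On Alertmanager 0.22 and later, the match, match_re, source_match, and target_match fields are deprecated in favor of matchers, source_matchers, and target_matchers. As a sketch, the severity routes and the inhibition rule could be written in the newer syntax like this. It is a fragment: it reuses the receiver names from the configuration above and omits the receivers section.

```yaml
route:
  receiver: default-slack
  group_by: ['alertname', 'service', 'namespace']
  routes:
    # Critical: page on-call via PagerDuty
    - matchers:
        - severity = "critical"
      receiver: pagerduty
      repeat_interval: 1h
    # Warning: Slack channel, don't page
    - matchers:
        - severity = "warning"
      receiver: team-slack
      repeat_interval: 4h

inhibit_rules:
  # A firing critical alert mutes the warning with the same
  # alertname and service.
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['alertname', 'service']
```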
Building Effective Alert Annotations
An alert that says "HighErrorRate" is incomplete. Good annotations turn an alert into a mini incident brief:
```yaml
annotations:
  summary: "Order service error rate at {{ $value | humanizePercentage }}"
  description: |
    The order-service in {{ $labels.namespace }} has an error rate of
    {{ $value | humanizePercentage }} over the last 5 minutes.
    SLO target: 99.9% (0.1% error budget).
    Current burn rate: approximately 14x.
  impact: "Users may experience failed checkout attempts."
  runbook: "https://wiki.internal/runbooks/order-service-errors"
  dashboard: "https://grafana.internal/d/order-service"
  logs: "https://grafana.internal/explore?query={app=order-service}|=error"
```
Include the current value, the threshold, the impact, a runbook link, a dashboard link, and a log query link. The on-call engineer should be able to go from "phone buzzing" to "actively investigating" in under a minute.
Alerting Tool and Service Costs
Most alerting tooling is bundled with monitoring platforms, but here's what dedicated alerting costs look like:
| Tool/Service | Monthly Cost | What You Get |
|---|---|---|
| Alertmanager (open source) | $0 + infrastructure | Routing, grouping, silencing, inhibition |
| PagerDuty | $21-41/user | On-call scheduling, escalations, incident management |
| Opsgenie | $9-35/user | On-call scheduling, escalations, integrations |
| Grafana OnCall (open source) | $0 + infrastructure | On-call scheduling, escalations, Grafana-native |
| Grafana OnCall (Cloud) | Included in Pro plan | Managed on-call with Grafana integration |
PagerDuty is the industry standard for on-call management, and for good reason -- its escalation policies, scheduling, and mobile app are mature. But at $41/user for the Business plan, costs add up. Grafana OnCall is a strong open-source alternative if you're already in the Grafana ecosystem.
Frequently Asked Questions
How many alerts should a team have?
There's no magic number, but a healthy range is 5-15 alerts per service. Below 5, you're probably missing important failure modes. Above 15, you likely have redundant or overly specific alerts that should be consolidated. The real metric is pages per on-call shift: if your on-call engineer gets paged more than 2-3 times per week for genuine incidents, you have a reliability problem, not an alerting problem.
Should I alert on log patterns or metrics?
Prefer metrics for alerting. Metrics are numeric, cheap to evaluate, and designed for aggregation. Log-based alerts require scanning text, which is slower and more expensive. Use log-based alerts only for conditions that can't be expressed as metrics -- specific error messages, audit events, or security patterns. If you find yourself creating many log-based alerts, that's a sign you need to emit those conditions as metrics instead.
What is alert fatigue and how do I prevent it?
Alert fatigue occurs when engineers receive so many alerts that they start ignoring them. Prevention requires discipline: delete alerts that don't require action, increase hysteresis durations, use grouping to reduce notification volume, and review alert history monthly. The litmus test is simple -- if an on-call engineer routinely ignores alerts without investigating, you have alert fatigue and your alerting system is providing negative value.
How do I handle flapping alerts?
Flapping alerts fire and resolve repeatedly in quick succession. Fix them by increasing the for duration (require the condition to persist longer), widening the aggregation window (use rate(...[10m]) instead of rate(...[1m])), or adding a hysteresis band (fire at 5% error rate, resolve at 2%). If the underlying condition genuinely fluctuates around the threshold, the threshold itself may be wrong.
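Prometheus has no built-in pair of fire and resolve thresholds, but a band can be approximated by letting the rule consult its own state through the ALERTS series. The sketch below assumes a recorded per-service error ratio named service:http_request_errors:ratio_rate5m already exists; that name and the alert name are hypothetical.

```yaml
groups:
  - name: hysteresis-band          # hypothetical group name
    rules:
      - alert: HighErrorRateBand
        expr: |
          # Start firing above the upper threshold (5%) ...
          (service:http_request_errors:ratio_rate5m > 0.05)
          or
          # ... and, once already firing, stay active until the ratio
          # drops below the lower threshold (2%).
          (
            service:http_request_errors:ratio_rate5m > 0.02
            and on (service)
            ALERTS{alertname="HighErrorRateBand", alertstate="firing"}
          )
        for: 2m
        labels:
          severity: critical
```

The second branch reads the alert state written by the previous evaluation, so it only applies to alerts that are already firing, not to pending ones.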
Should I use static thresholds or dynamic baselines?
Static thresholds work for most teams. They're simple, predictable, and easy to debug. Dynamic baselines (anomaly detection) sound appealing but produce false positives during legitimate traffic changes -- deployments, marketing campaigns, seasonal patterns. If you do use dynamic baselines, combine them with static thresholds as a safety net. Alert when the metric is both anomalous AND exceeding a static floor.
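A minimal sketch of the "anomalous AND above a static floor" combination, using a simple z-score as the dynamic baseline. It assumes a recorded per-service error ratio named service:http_request_errors:ratio_rate5m (hypothetical), and the 3-sigma cutoff, 1-day lookback, and 0.5% floor are illustrative values.

```yaml
groups:
  - name: anomaly-with-floor       # hypothetical group name
    rules:
      - alert: ErrorRatioAnomalous
        expr: |
          # Dynamic part: current ratio is more than 3 standard deviations
          # above its own average over the past day.
          (
            (
              service:http_request_errors:ratio_rate5m
              - avg_over_time(service:http_request_errors:ratio_rate5m[1d])
            )
            / stddev_over_time(service:http_request_errors:ratio_rate5m[1d])
            > 3
          )
          and
          # Static floor: ignore anomalies while the ratio is still below 0.5%.
          (service:http_request_errors:ratio_rate5m > 0.005)
        for: 10m
        labels:
          severity: warning
```

Note that $value in annotations for this rule would be the z-score, not the error ratio, because the left-hand side of and supplies the result.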
How do I alert on services with low traffic?
Low-traffic services break standard rate-based alerts because a single error can spike the error rate to 50%. Two approaches work: use absolute error counts instead of rates (alert when errors exceed 5 in 10 minutes), or use longer evaluation windows (rate over 1 hour instead of 5 minutes). For very low traffic, synthetic monitoring (periodic health checks) is more reliable than traffic-based alerting.
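Both approaches can be sketched as Prometheus rules. The first assumes http_requests_total carries a service label and uses a hypothetical service name; the second assumes a blackbox_exporter probe is already scraping the service under a hypothetical job name.

```yaml
groups:
  - name: low-traffic-alerts       # hypothetical group name
    rules:
      # Absolute count instead of a ratio: more than 5 server errors
      # in the last 10 minutes.
      - alert: LowTrafficServiceErrors
        expr: |
          sum by (service) (
            increase(http_requests_total{service="report-export", status=~"5.."}[10m])
          ) > 5
        labels:
          severity: warning
      # Synthetic check: the probe has been failing for 3 minutes,
      # regardless of how much real traffic there is.
      - alert: SyntheticCheckFailing
        expr: probe_success{job="blackbox-report-export"} == 0
        for: 3m
        labels:
          severity: critical
```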
What should my alert severity levels be?
Keep it simple with three levels. Critical: pages the on-call, requires immediate response, represents active user impact. Warning: sends to Slack, requires attention within business hours, represents degradation that will escalate if ignored. Info: logged to a dashboard, no notification, used for tracking trends. Avoid creating more than three levels -- additional granularity just creates confusion about response expectations.
Conclusion
Good alerting is a practice, not a configuration. It requires ongoing maintenance: monthly reviews to delete noisy alerts, post-incident follow-ups to add missing alerts, and continuous tuning of thresholds and durations. The five-question checklist is your starting point: every alert must indicate user impact, require human action, provide enough context for diagnosis, fire at a sustainable frequency, and link to a runbook.
Implement hysteresis on everything. Use multi-window burn rate alerts for SLOs. Route critical alerts to PagerDuty and warnings to Slack. Include dashboard links, log queries, and runbook URLs in every alert annotation. And most importantly, treat alert noise as a bug -- every false positive erodes trust in the system and increases the chance that someone ignores the real thing.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.