Alerting Done Right: Reducing Noise and Writing Actionable Alerts
Most alerts are noise. Learn how to write actionable alerts by focusing on symptoms, implementing hysteresis, using multi-window burn rate alerting, and routing through Alertmanager. Includes a five-question checklist for every alert.

Most Alerts Are Garbage -- Here's How to Fix That
If your on-call engineer gets paged at 2 AM for a CPU spike that resolved itself, that's a bad alert. If they get 47 notifications in a 10-minute window for the same incident, that's a broken alerting system. Alerting done right means every notification is actionable, every page represents real user impact, and the on-call engineer can understand what's wrong and what to do about it within 30 seconds of reading the alert.
I've seen teams where on-call meant ignoring 90% of alerts because they were noise. The engineers who join those teams learn to dismiss pages reflexively -- and then they miss the one that actually matters. Good alerting isn't about catching everything. It's about catching the right things and staying quiet the rest of the time.
What Is Actionable Alerting?
Definition: Actionable alerting is a practice where every alert notification requires human intervention, contains enough context for the responder to begin diagnosis, and represents a condition that will impact users or systems if left unaddressed. Non-actionable alerts -- those that resolve on their own or require no response -- are noise and should be eliminated.
The goal is simple: when a page fires, the on-call engineer should be able to answer three questions immediately: What is broken? Who is affected? What should I do first?
The Five Questions Checklist
Before creating any alert, run it through these five questions. If you can't answer yes to all of them, don't create the alert.
- Does this alert indicate real or imminent user impact? CPU at 80% isn't user impact. Error rate climbing above your SLO budget burn rate is. Alert on symptoms (error rate, latency, availability), not on causes (CPU, memory, disk).
- Does this require human action? If your auto-scaler handles it, don't page a human. If Kubernetes restarts the pod and the problem resolves, that's an event for a dashboard, not an alert.
- Is the alert specific enough to diagnose? "Something is wrong" is useless. "Order service error rate is 5.2% (SLO budget burn rate: 14x) in production-us-east-1" gives the responder a starting point.
- Will this alert fire at a sustainable frequency? If it fires more than once a week without representing a genuine incident, it will be ignored. Either fix the underlying issue or remove the alert.
- Does the runbook exist? Every alert should link to a runbook that describes the alert, likely causes, diagnostic steps, and remediation actions. No runbook, no alert.
Pro tip: Review your alert history monthly. Any alert that fired but required no action should be deleted or converted to a dashboard annotation. Any alert that fired and was immediately silenced should be tuned or removed. This ongoing hygiene is what separates good alerting from alert fatigue.
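To make "the runbook exists" concrete, here is a minimal runbook skeleton. The service names, causes, and steps are illustrative placeholders, not a standard -- adapt the structure to your own incident process:

```markdown
# Runbook: HighErrorRate (order-service)

## What this alert means
Error rate has exceeded the SLO burn-rate threshold for 5 minutes.

## Likely causes
- Recent deployment (check the release channel)
- Downstream dependency failure (payments API, database)

## Diagnostic steps
1. Open the service dashboard linked in the alert.
2. Check error logs for the dominant error code.
3. Compare the alert start time against the last deployment timestamp.

## Remediation
- If correlated with a deploy: roll back.
- If a dependency is down: fail over or enable degraded mode.

## Escalation
Page the service owner if not mitigated within 30 minutes.
```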
Alert on Symptoms, Not Causes
This is the single most important principle in alerting, and the one most teams get wrong.
| Cause-Based Alert (Bad) | Symptom-Based Alert (Good) |
|---|---|
| CPU usage > 80% | Request latency p99 > 2s for 5 minutes |
| Memory usage > 90% | Error rate exceeding SLO burn rate |
| Disk usage > 85% | Checkout success rate below 99% |
| Pod restart count > 3 | Availability SLI below target for 10 minutes |
| Database connections > 100 | Query latency p95 > 500ms |
Cause-based alerts are noisy because infrastructure metrics fluctuate constantly. CPU spikes during garbage collection. Memory climbs before a scheduled compaction. Pods restart during rolling deployments. None of these affect users if the system is designed correctly.
Symptom-based alerts fire only when users are actually affected -- or about to be. They also tell you what matters (latency is high, errors are up) rather than what might be contributing (CPU is high, but maybe it's fine).
Watch out: There are exceptions. Disk at 95% is a cause-based alert worth keeping because the consequence (complete service failure when disk fills) is severe and the remediation (expand volume, clean up) takes time. Keep cause-based alerts only for slow-moving conditions where the consequence is catastrophic and the fix isn't instant.
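As a concrete example, the "latency p99 > 2s for 5 minutes" row from the table above might look like this as a Prometheus rule. The metric name and thresholds are illustrative -- adjust them to your own histogram buckets and SLO:

```yaml
# Sketch of a symptom-based latency alert (metric name assumed to be a
# standard Prometheus histogram; thresholds are examples, not recommendations)
- alert: HighRequestLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    ) > 2
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "p99 request latency above 2s for 5 minutes"
```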
Hysteresis: Preventing Alert Flapping
Hysteresis means requiring a condition to persist for a period before firing, and requiring it to be clearly resolved before clearing. Without hysteresis, an alert that crosses the threshold, dips below for 10 seconds, and crosses again generates three notifications in two minutes. That's noise.
Implementing Hysteresis in Prometheus
Prometheus implements the persistence side of hysteresis -- requiring the condition to hold before firing -- through the `for` clause in alert rules:
```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          description: "Service {{ $labels.service }} error rate is {{ $value | humanizePercentage }}."
          runbook: "https://wiki.internal/runbooks/high-error-rate"
```
The `for: 5m` clause means the condition must be true for 5 continuous minutes before the alert fires. A brief spike that resolves in 2 minutes never pages anyone. This single setting eliminates a massive amount of noise.
Choosing the Right Duration
| Alert Severity | Suggested for Duration | Rationale |
|---|---|---|
| Critical (page) | 2-5 minutes | Short enough to catch real incidents, long enough to skip transients |
| Warning (Slack) | 10-15 minutes | Confirms the issue is sustained, not a blip |
| Info (dashboard) | 30+ minutes or no alert | Log as an event, don't notify anyone |
Multi-Window, Multi-Burn-Rate Alerts
For SLO-based alerting, the gold standard is the multi-window, multi-burn-rate approach from Google's Site Reliability Workbook. Instead of alerting on a simple error rate threshold, you alert when the rate of error budget consumption indicates you'll exhaust the budget prematurely.
How It Works
You define pairs of time windows: a long window for accuracy and a short window for responsiveness. Both must exceed the burn rate threshold for the alert to fire.
```yaml
groups:
  - name: slo-burn-rate-alerts
    rules:
      # Page: 14.4x burn rate
      # Long window: 1h, short window: 5m
      # At this rate, 30-day budget exhausted in ~2 days
      - alert: SLOBudgetBurnCritical
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (14.4 * 0.001)
          and
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[5m]))
              /
              sum(rate(http_requests_total[5m]))
            )
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Critical SLO budget burn -- 14.4x rate"
          description: "At this burn rate, the 30-day error budget will be exhausted in approximately 2 days."

      # Ticket: 3x burn rate
      # Long window: 3d, short window: 6h
      # At this rate, budget exhausted in ~10 days
      - alert: SLOBudgetBurnSlow
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[3d]))
              /
              sum(rate(http_requests_total[3d]))
            )
          ) > (3 * 0.001)
          and
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[6h]))
              /
              sum(rate(http_requests_total[6h]))
            )
          ) > (3 * 0.001)
        for: 30m
        labels:
          severity: warning
          slo: availability
        annotations:
          summary: "Elevated SLO budget burn -- 3x rate"
          description: "Error budget is being consumed at 3x the sustainable rate. At this rate, the monthly budget will be exhausted in approximately 10 days."
```
The 14.4x burn rate with a 1-hour window catches severe incidents fast -- page the on-call. The 3x burn rate with a 3-day window catches slow degradation -- file a ticket. This layered approach ensures you respond proportionally to the severity.
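The arithmetic behind those thresholds is worth making explicit. For a 99.9% SLO, the error budget is 0.1% of requests over 30 days; a burn rate of N means you're consuming the budget N times faster than sustainable. A small sketch (values mirror the rules above):

```python
# Sketch: derive multi-window burn-rate thresholds for a 99.9% / 30-day SLO.
# All names and the two burn-rate tiers follow the rules shown above.

SLO_TARGET = 0.999             # 99.9% availability target
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail
WINDOW_DAYS = 30               # budget period

def alert_threshold(burn_rate: float) -> float:
    """Error-rate threshold used in the alert expression: burn_rate * budget."""
    return burn_rate * ERROR_BUDGET

def days_to_exhaustion(burn_rate: float) -> float:
    """At burn_rate times the sustainable rate, days until the budget is spent."""
    return WINDOW_DAYS / burn_rate

for rate in (14.4, 3.0):
    print(f"{rate}x burn: fire when error rate > {alert_threshold(rate):.4f}, "
          f"budget exhausted in ~{days_to_exhaustion(rate):.1f} days")
```

This is why the critical rule compares against `14.4 * 0.001`: at that burn rate the monthly budget is gone in about two days, which justifies a page rather than a ticket.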
Alertmanager Routing: Getting Alerts to the Right Place
Having good alert rules is half the battle. The other half is routing those alerts to the right team through the right channel at the right urgency.
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: default-slack
  group_by: ['alertname', 'service', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical: page on-call via PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      repeat_interval: 1h
      continue: false
    # Warning: Slack channel, don't page
    - match:
        severity: warning
      receiver: team-slack
      repeat_interval: 4h
    # SLO alerts: dedicated channel
    - match_re:
        slo: ".+"
      receiver: slo-channel
      group_by: ['slo', 'service']

receivers:
  - name: default-slack
    slack_configs:
      - channel: '#alerts-default'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: pagerduty
    pagerduty_configs:
      - service_key: ${PD_SERVICE_KEY}
        description: '{{ .GroupLabels.alertname }} - {{ .CommonAnnotations.summary }}'
  - name: team-slack
    slack_configs:
      - channel: '#alerts-warnings'
        send_resolved: true
  - name: slo-channel
    slack_configs:
      - channel: '#slo-alerts'
        send_resolved: true

inhibit_rules:
  # If a critical alert fires, suppress the warning for the same service
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'service']
```
Key Routing Principles
- Group related alerts. If five pods in the same service hit the error threshold simultaneously, send one grouped notification, not five.
- Use inhibition rules. If a critical alert fires, suppress the warning-level alert for the same issue. The on-call already knows.
- Set appropriate repeat intervals. Critical alerts can repeat hourly. Warnings should repeat every 4+ hours. More frequent repeats create noise without adding value.
- Always send resolved notifications. The on-call needs to know when an issue clears, not just when it fires.
Building Effective Alert Annotations
An alert that says "HighErrorRate" is incomplete. Good annotations turn an alert into a mini incident brief:
```yaml
annotations:
  summary: "Order service error rate at {{ $value | humanizePercentage }}"
  description: |
    The order-service in {{ $labels.namespace }} has an error rate of
    {{ $value | humanizePercentage }} over the last 5 minutes.
    SLO target: 99.9% (0.1% error budget).
    Current burn rate: approximately 14x.
  impact: "Users may experience failed checkout attempts."
  runbook: "https://wiki.internal/runbooks/order-service-errors"
  dashboard: "https://grafana.internal/d/order-service"
  logs: "https://grafana.internal/explore?query={app=order-service}|=error"
```
Include the current value, the threshold, the impact, a runbook link, a dashboard link, and a log query link. The on-call engineer should be able to go from "phone buzzing" to "actively investigating" in under a minute.
Alerting Tool and Service Costs
Most alerting tooling is bundled with monitoring platforms, but here's what dedicated alerting costs look like:
| Tool/Service | Monthly Cost | What You Get |
|---|---|---|
| Alertmanager (open source) | $0 + infrastructure | Routing, grouping, silencing, inhibition |
| PagerDuty | $21-41/user | On-call scheduling, escalations, incident management |
| Opsgenie | $9-35/user | On-call scheduling, escalations, integrations |
| Grafana OnCall (open source) | $0 + infrastructure | On-call scheduling, escalations, Grafana-native |
| Grafana OnCall (Cloud) | Included in Pro plan | Managed on-call with Grafana integration |
PagerDuty is the industry standard for on-call management, and for good reason -- its escalation policies, scheduling, and mobile app are mature. But at $41/user for the Business plan, costs add up. Grafana OnCall is a strong open-source alternative if you're already in the Grafana ecosystem.
Frequently Asked Questions
How many alerts should a team have?
There's no magic number, but a healthy range is 5-15 alerts per service. Below 5, you're probably missing important failure modes. Above 15, you likely have redundant or overly specific alerts that should be consolidated. The real metric is pages per on-call shift: if your on-call engineer gets paged more than 2-3 times per week for genuine incidents, you have a reliability problem, not an alerting problem.
Should I alert on log patterns or metrics?
Prefer metrics for alerting. Metrics are numeric, cheap to evaluate, and designed for aggregation. Log-based alerts require scanning text, which is slower and more expensive. Use log-based alerts only for conditions that can't be expressed as metrics -- specific error messages, audit events, or security patterns. If you find yourself creating many log-based alerts, that's a sign you need to emit those conditions as metrics instead.
What is alert fatigue and how do I prevent it?
Alert fatigue occurs when engineers receive so many alerts that they start ignoring them. Prevention requires discipline: delete alerts that don't require action, increase hysteresis durations, use grouping to reduce notification volume, and review alert history monthly. The litmus test is simple -- if an on-call engineer routinely ignores alerts without investigating, you have alert fatigue and your alerting system is providing negative value.
How do I handle flapping alerts?
Flapping alerts fire and resolve repeatedly in quick succession. Fix them by increasing the for duration (require the condition to persist longer), widening the aggregation window (use rate()[10m] instead of rate()[1m]), or adding a hysteresis band (fire at 5% error rate, resolve at 2%). If the underlying condition genuinely fluctuates around the threshold, the threshold itself may be wrong.
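The hysteresis band mentioned above (fire at 5%, resolve at 2%) is worth seeing as a state machine. Prometheus doesn't support separate fire/clear thresholds natively, so this is a conceptual Python sketch of the behavior, not a Prometheus feature:

```python
class HysteresisAlert:
    """Conceptual hysteresis band: fire above fire_at, clear only below
    clear_at. Values between the two keep the current state, so a metric
    oscillating around a single threshold can't flap."""

    def __init__(self, fire_at: float, clear_at: float):
        assert clear_at < fire_at, "clear threshold must sit below fire threshold"
        self.fire_at = fire_at
        self.clear_at = clear_at
        self.firing = False

    def observe(self, error_rate: float) -> bool:
        if not self.firing and error_rate > self.fire_at:
            self.firing = True
        elif self.firing and error_rate < self.clear_at:
            self.firing = False
        return self.firing

alert = HysteresisAlert(fire_at=0.05, clear_at=0.02)
states = [alert.observe(r) for r in (0.01, 0.06, 0.04, 0.03, 0.01)]
# After firing at 6%, dips to 4% and 3% do not clear the alert; only 1% does.
```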
Should I use static thresholds or dynamic baselines?
Static thresholds work for most teams. They're simple, predictable, and easy to debug. Dynamic baselines (anomaly detection) sound appealing but produce false positives during legitimate traffic changes -- deployments, marketing campaigns, seasonal patterns. If you do use dynamic baselines, combine them with static thresholds as a safety net. Alert when the metric is both anomalous AND exceeding a static floor.
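One way to express "anomalous AND exceeding a static floor" in Prometheus is a z-score comparison against recording rules combined with an absolute minimum. The `job:request_errors:*` recording rule names below are hypothetical -- you would define them yourself:

```yaml
# Sketch: dynamic baseline guarded by a static floor. The recording rules
# (mean and stddev over a week) are assumed to exist; names are illustrative.
- alert: AnomalousErrorRate
  expr: |
    (
      job:request_errors:rate5m
        > job:request_errors:rate5m:mean_1w + 3 * job:request_errors:rate5m:stddev_1w
    )
    and
    job:request_errors:rate5m > 0.01
  for: 10m
  labels:
    severity: warning
```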
How do I alert on services with low traffic?
Low-traffic services break standard rate-based alerts because a single error can spike the error rate to 50%. Two approaches work: use absolute error counts instead of rates (alert when errors exceed 5 in 10 minutes), or use longer evaluation windows (rate over 1 hour instead of 5 minutes). For very low traffic, synthetic monitoring (periodic health checks) is more reliable than traffic-based alerting.
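The absolute-count approach translates directly to PromQL with increase() instead of rate(). Service name and thresholds below are illustrative:

```yaml
# Sketch: absolute error count for a low-traffic service, instead of a
# ratio that one failed request could push to 50%.
- alert: LowTrafficServiceErrors
  expr: sum(increase(http_requests_total{status=~"5..", service="billing"}[10m])) > 5
  labels:
    severity: warning
  annotations:
    summary: "More than 5 errors in 10 minutes on billing"
```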
What should my alert severity levels be?
Keep it simple with three levels. Critical: pages the on-call, requires immediate response, represents active user impact. Warning: sends to Slack, requires attention within business hours, represents degradation that will escalate if ignored. Info: logged to a dashboard, no notification, used for tracking trends. Avoid creating more than three levels -- additional granularity just creates confusion about response expectations.
Conclusion
Good alerting is a practice, not a configuration. It requires ongoing maintenance: monthly reviews to delete noisy alerts, post-incident follow-ups to add missing alerts, and continuous tuning of thresholds and durations. The five-question checklist is your starting point: every alert must indicate user impact, require human action, provide enough context for diagnosis, fire at a sustainable frequency, and link to a runbook.
Implement hysteresis on everything. Use multi-window burn rate alerts for SLOs. Route critical alerts to PagerDuty and warnings to Slack. Include dashboard links, log queries, and runbook URLs in every alert annotation. And most importantly, treat alert noise as a bug -- every false positive erodes trust in the system and increases the chance that someone ignores the real thing.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.