Actionable Alerting: Reduce Noise, Write Better Alerts

It's a Tuesday at 3:47 AM. The on-call is paged about HighCPUUsage on api-worker-7. They pull up the dashboard on their phone: CPU is at 78%, already trending back down, no customer-facing errors on the SLO board, no change in latency on the checkout path. They acknowledge the page, open the runbook -- there isn't one -- and go back to sleep. Ten minutes later another page fires: DatabaseConnectionsHigh, 102 connections out of a 150 limit. Same thing, no user impact, self-resolves. By the time the week's on-call handover happens on Friday, this engineer has been paged 47 times; three pages correlated with an actual incident, the rest were noise. The incident in the middle, a real checkout failure at 11:14 PM on Wednesday, was acknowledged 22 minutes late because the engineer had muted their phone.

This is what broken alerting looks like in practice. It is not a tooling problem -- Prometheus, Grafana, and PagerDuty all ship with the primitives to do alerting well. It is a design problem: the 47 pages above were almost all cause-based alerts (CPU high, memory high, disk busy, connection count rising) which fire constantly because infrastructure metrics fluctuate constantly, and the one page that actually mattered was lost in the noise. The on-call did exactly what humans always do when more than 90% of their alerts are false alarms: they learned to dismiss pages reflexively.

Alerting done right reverses that failure mode. Every page corresponds to real or imminent user impact. Every notification contains enough context to start diagnosis in under 30 seconds. Every alert links to a runbook. Pages per on-call week drop from dozens to single digits, and the engineer who is on call is actually available to respond when something real breaks. The rest of this article covers the five-question checklist I run every alert through before it ships, the difference between symptom-based and cause-based alerts (with five common cause-based alerts and their symptom-based replacements), how to add hysteresis in Prometheus so alerts stop flapping, what SLO-based burn-rate alerts look like in practice, and how to route and annotate alerts so the responder starts with a runbook and context in hand.

The Five Questions Checklist

Before creating any alert, run it through these five questions. If you can't answer yes to all of them, don't create the alert.

1. Does this alert indicate real or imminent user impact? CPU at 80% isn't user impact. An error rate that burns your SLO error budget faster than planned is. Alert on symptoms (error rate, latency, availability), not on causes (CPU, memory, disk).
2. Does this require human action? If your auto-scaler handles it, don't page a human. If Kubernetes restarts the pod and the problem resolves, that's an event for a dashboard, not an alert.
3. Is the alert specific enough to diagnose? "Something is wrong" is useless. "Order service error rate is 5.2% (SLO budget burn rate: 14x) in production-us-east-1" gives the responder a starting point.
4. Will this alert fire at a sustainable frequency? If it fires more than once a week without representing a genuine incident, it will be ignored. Either fix the underlying issue or remove the alert.
5. Does the runbook exist? Every alert should link to a runbook that describes the alert, likely causes, diagnostic steps, and remediation actions. No runbook, no alert.

Pro tip: Review your alert history monthly. Any alert that fired but required no action should be deleted or converted to a dashboard annotation. Any alert that fired and was immediately silenced should be tuned or removed. This ongoing hygiene is what separates good alerting from alert fatigue.
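The monthly review can be partly mechanised. The following is a minimal sketch, assuming you can export alert history as records; the `Firing` type and its fields (`name`, `action_taken`, `silenced`) are hypothetical and do not come from any particular tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Firing:
    """One historical alert firing. Field names are illustrative assumptions."""
    name: str
    action_taken: bool  # did a human have to do something?
    silenced: bool      # was the alert silenced right after it fired?


def triage(history: list[Firing]) -> dict[str, str]:
    """Suggest a follow-up per alert name, following the review rule above."""
    by_name: dict[str, list[Firing]] = defaultdict(list)
    for firing in history:
        by_name[firing.name].append(firing)

    verdicts = {}
    for name, firings in by_name.items():
        if not any(f.action_taken for f in firings):
            # Fired, but nobody ever had to act on it.
            verdicts[name] = "delete or convert to dashboard annotation"
        elif any(f.silenced for f in firings):
            # Sometimes useful, sometimes silenced on sight.
            verdicts[name] = "tune or remove"
        else:
            verdicts[name] = "keep"
    return verdicts


history = [
    Firing("HighCPUUsage", action_taken=False, silenced=False),
    Firing("HighCPUUsage", action_taken=False, silenced=True),
    Firing("HighErrorRate", action_taken=True, silenced=False),
    Firing("DiskAlmostFull", action_taken=True, silenced=True),
]
verdicts = triage(history)
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

The verdict strings are only suggestions for a human reviewer; the point is to surface candidates, not to delete alerts automatically.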

Alert on Symptoms, Not Causes

This is the single most important principle in alerting, and the one most teams get wrong.

Cause-Based Alert (Bad)	Symptom-Based Alert (Good)
CPU usage > 80%	Request latency p99 > 2s for 5 minutes
Memory usage > 90%	Error budget burn rate above threshold
Disk usage > 85%	Checkout success rate below 99%
Pod restart count > 3	Availability SLI below target for 10 minutes
Database connections > 100	Query latency p95 > 500ms

Cause-based alerts are noisy because infrastructure metrics fluctuate constantly. CPU spikes during garbage collection. Memory climbs before a scheduled compaction. Pods restart during rolling deployments. None of these affect users if the system is designed correctly.

Symptom-based alerts fire only when users are actually affected -- or about to be. They also tell you what matters (latency is high, errors are up) rather than what might be contributing (CPU is high, but maybe it's fine).

Watch out: There are exceptions. Disk at 95% is a cause-based alert worth keeping because the consequence (complete service failure when disk fills) is severe and the remediation (expand volume, clean up) takes time. Keep cause-based alerts only for slow-moving conditions where the consequence is catastrophic and the fix isn't instant.

Hysteresis: Preventing Alert Flapping

Hysteresis means requiring a condition to persist for a period before firing, and requiring it to be clearly resolved before clearing. Without hysteresis, an alert that crosses the threshold, dips below for 10 seconds, and crosses again generates three notifications in two minutes. That's noise.

Implementing Hysteresis in Prometheus

Prometheus supports hysteresis on the firing side through the for clause in alert rules (Prometheus 2.42 and later also offer keep_firing_for, which delays clearing):

groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
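        # Aggregate per service so {{ $labels.service }} below is populated
        # (assumes requests carry a `service` label).
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
          > 0.01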
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          description: "Service {{ $labels.service }} error rate is {{ $value | humanizePercentage }}."
          runbook: "https://wiki.internal/runbooks/high-error-rate"

The for: 5m clause means the condition must be true for 5 continuous minutes before the alert fires. A brief spike that resolves in 2 minutes never pages anyone. This single setting eliminates a massive amount of noise.
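To make the pending-then-firing behaviour concrete, here is a small simulation. It is a simplified model of the for clause, not Prometheus code; the function name, the one-minute evaluation interval, and the sample values are assumptions chosen for illustration.

```python
def first_fire_time(samples, threshold, for_seconds, eval_interval=60):
    """Return the time in seconds at which the alert would fire, or None.

    `samples` holds one value per rule evaluation. The alert becomes pending
    when a value exceeds `threshold` and fires once the condition has held
    for `for_seconds`. Any evaluation at or below the threshold resets it.
    """
    pending_since = None
    for i, value in enumerate(samples):
        now = i * eval_interval
        if value > threshold:
            if pending_since is None:
                pending_since = now
            if now - pending_since >= for_seconds:
                return now
        else:
            pending_since = None
    return None


# Error ratio per one-minute evaluation.
spike = [0.002, 0.03, 0.04, 0.003, 0.002, 0.002, 0.002, 0.002]
sustained = [0.002, 0.03, 0.04, 0.03, 0.05, 0.04, 0.03, 0.03]

spike_fired_at = first_fire_time(spike, threshold=0.01, for_seconds=300)
sustained_fired_at = first_fire_time(sustained, threshold=0.01, for_seconds=300)

print(spike_fired_at)      # None: the two-minute spike never pages
print(sustained_fired_at)  # 360: five minutes after the breach began at t=60
```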

Choosing the Right Duration

Alert Severity	Suggested `for` Duration	Rationale
Critical (page)	2-5 minutes	Short enough to catch real incidents, long enough to skip transients
Warning (Slack)	10-15 minutes	Confirms the issue is sustained, not a blip
Info (dashboard)	30+ minutes or no alert	Log as an event, don't notify anyone

Multi-Window, Multi-Burn-Rate Alerts

For SLO-based alerting, the gold standard is the multi-window, multi-burn-rate approach from Google's SRE workbook. Instead of alerting on a simple error rate threshold, you alert when the rate of error budget consumption indicates you'll exhaust the budget prematurely.

How It Works

You define pairs of time windows: a long window for accuracy and a short window for responsiveness. Both must exceed the burn rate threshold for the alert to fire.

groups:
  - name: slo-burn-rate-alerts
    rules:
      # Page: 14.4x burn rate
      # Long window: 1h, Short window: 5m
      # 0.001 is the error budget for a 99.9% SLO, so the threshold is 1.44% errors
      # At this rate, the 30-day budget is exhausted in ~2 days
      - alert: SLOBudgetBurnCritical
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (14.4 * 0.001)
          and
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[5m]))
              /
              sum(rate(http_requests_total[5m]))
            )
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Critical SLO budget burn -- 14.4x rate"
          description: "At this burn rate, the 30-day error budget will be exhausted in approximately 2 days."

      # Ticket: 3x burn rate
      # Long window: 3d, Short window: 6h
      # At this rate, budget exhausted in ~10 days
      - alert: SLOBudgetBurnSlow
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[3d]))
              /
              sum(rate(http_requests_total[3d]))
            )
          ) > (3 * 0.001)
          and
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[6h]))
              /
              sum(rate(http_requests_total[6h]))
            )
          ) > (3 * 0.001)
        for: 30m
        labels:
          severity: warning
          slo: availability
        annotations:
          summary: "Elevated SLO budget burn -- 3x rate"
          description: "Error budget is being consumed at 3x the sustainable rate. At this rate, the monthly budget will be exhausted in approximately 10 days."

The 14.4x burn rate with a 1-hour window catches severe incidents fast -- page the on-call. The 3x burn rate with a 3-day window catches slow degradation -- file a ticket. This layered approach ensures you respond proportionally to the severity.
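The numbers in these rules follow from simple arithmetic. The sketch below reproduces them, assuming a 99.9% SLO over a 30-day period as in the rules above; the function names are mine, not part of any library.

```python
import math


def burn_rate_threshold(slo: float, burn_rate: float) -> float:
    """Error ratio that corresponds to a given burn rate for an SLO."""
    return burn_rate * (1 - slo)


def days_to_exhaustion(burn_rate: float, slo_period_days: float = 30) -> float:
    """Days until the whole error budget is gone at a constant burn rate."""
    return slo_period_days / burn_rate


def budget_consumed(burn_rate: float, window_hours: float,
                    slo_period_days: float = 30) -> float:
    """Fraction of the period's budget spent if the burn rate holds for the window."""
    return burn_rate * window_hours / (slo_period_days * 24)


page_threshold = burn_rate_threshold(slo=0.999, burn_rate=14.4)
ticket_threshold = burn_rate_threshold(slo=0.999, burn_rate=3)

print(f"page threshold:   {page_threshold:.4f}")    # 0.0144, i.e. 14.4 * 0.001
print(f"ticket threshold: {ticket_threshold:.4f}")  # 0.0030, i.e. 3 * 0.001
print(f"14.4x exhausts the budget in {days_to_exhaustion(14.4):.1f} days")
print(f"3x exhausts the budget in {days_to_exhaustion(3):.1f} days")
print(f"14.4x for 1h spends {budget_consumed(14.4, 1):.0%} of the budget")
print(f"3x for 3d spends {budget_consumed(3, 72):.0%} of the budget")
```

One thing the last line makes visible: a burn rate of 3x sustained across the full 3-day window has already spent about 30% of the monthly budget. If that is more than you are willing to lose before a ticket is filed, a lower burn rate or a shorter long window may be worth considering.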

Alertmanager Routing: Getting Alerts to the Right Place

Having good alert rules is half the battle. The other half is routing those alerts to the right team through the right channel at the right urgency.

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: default-slack
  group_by: ['alertname', 'service', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
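  routes:
    # SLO alerts: copy to a dedicated channel first, then keep matching
    # so the severity routes below still apply (routes are evaluated in
    # order and stop at the first match unless continue is true).
    # `matchers` requires Alertmanager 0.22+; it replaces match/match_re.
    - matchers:
        - slo=~".+"
      receiver: slo-channel
      group_by: ['slo', 'service']
      continue: true

    # Critical: page on-call via PagerDuty
    - matchers:
        - severity="critical"
      receiver: pagerduty
      repeat_interval: 1h

    # Warning: Slack channel, don't page
    - matchers:
        - severity="warning"
      receiver: team-slack
      repeat_interval: 4h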

receivers:
  - name: default-slack
    slack_configs:
      - channel: '#alerts-default'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: pagerduty
    pagerduty_configs:
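      # Alertmanager does not expand environment variables in this file;
      # substitute the placeholder at deploy time or use service_key_file.
      - service_key: ${PD_SERVICE_KEY}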
        description: '{{ .GroupLabels.alertname }} - {{ .CommonAnnotations.summary }}'

  - name: team-slack
    slack_configs:
      - channel: '#alerts-warnings'
        send_resolved: true

  - name: slo-channel
    slack_configs:
      - channel: '#slo-alerts'
        send_resolved: true

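inhibit_rules:
  # If a critical alert fires, suppress the warning for the same service.
  # Because 'alertname' is in `equal`, this only applies when the critical
  # and warning rules share an alert name; drop it if they are named differently.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'service']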

Key Routing Principles

Group related alerts. If five pods in the same service hit the error threshold simultaneously, send one grouped notification, not five.
Use inhibition rules. If a critical alert fires, suppress the warning-level alert for the same issue. The on-call already knows.
Set appropriate repeat intervals. Critical alerts can repeat hourly. Warnings should repeat every 4+ hours. More frequent repeats create noise without adding value.
Always send resolved notifications. The on-call needs to know when an issue clears, not just when it fires.
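The grouping principle is easy to see in miniature. This is a simplified model of what group_by does, not Alertmanager's implementation; the alert labels and the `group_alerts` helper are made up for the example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the `group_by` labels.

    Each bucket corresponds to one notification. A label that is missing
    from an alert is treated as an empty string.
    """
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Five pods of one service breach the threshold, plus one pod of another.
alerts = [
    {"alertname": "HighErrorRate", "service": "order-service",
     "namespace": "prod", "pod": f"order-{i}"}
    for i in range(5)
] + [
    {"alertname": "HighErrorRate", "service": "cart-service",
     "namespace": "prod", "pod": "cart-0"},
]

notifications = group_alerts(alerts, group_by=("alertname", "service", "namespace"))

for key, members in notifications.items():
    print(key, "->", len(members), "alert(s) in one notification")
```

Six firing alerts collapse into two notifications because `pod` is not part of the grouping key. Adding `pod` to group_by would bring back one notification per pod.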

Building Effective Alert Annotations

An alert that says "HighErrorRate" is incomplete. Good annotations turn an alert into a mini incident brief:

annotations:
  summary: "Order service error rate at {{ $value | humanizePercentage }}"
  description: |
    The order-service in {{ $labels.namespace }} has an error rate of
    {{ $value | humanizePercentage }} over the last 5 minutes.
    SLO target: 99.9% (0.1% error budget).
    Current burn rate: approximately 14x.
  impact: "Users may experience failed checkout attempts."
  runbook: "https://wiki.internal/runbooks/order-service-errors"
  dashboard: "https://grafana.internal/d/order-service"
  logs: "https://grafana.internal/explore?query={app=order-service}|=error"

Include the current value, the threshold, the impact, a runbook link, a dashboard link, and a log query link. The on-call engineer should be able to go from "phone buzzing" to "actively investigating" in under a minute.
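A check like this can run in CI so that an alert without context never ships. It is a sketch under assumptions: rules are represented as plain dictionaries (as you would get after parsing a rule file), and the required keys mirror the annotation names used in the example above rather than any standard.

```python
# Annotation keys taken from the example above; adjust to your own convention.
REQUIRED_ANNOTATIONS = ("summary", "description", "impact", "runbook", "dashboard", "logs")


def missing_annotations(rule: dict) -> list[str]:
    """Return the required annotation keys that are absent or empty."""
    annotations = rule.get("annotations") or {}
    return [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]


bare_rule = {
    "alert": "HighErrorRate",
    "annotations": {"summary": "Error rate high"},
}
complete_rule = {
    "alert": "OrderServiceErrors",
    "annotations": {
        "summary": "Order service error rate elevated",
        "description": "Error rate over the last 5 minutes exceeds the SLO threshold.",
        "impact": "Users may experience failed checkout attempts.",
        "runbook": "https://wiki.internal/runbooks/order-service-errors",
        "dashboard": "https://grafana.internal/d/order-service",
        "logs": "https://grafana.internal/explore",
    },
}

for rule in (bare_rule, complete_rule):
    missing = missing_annotations(rule)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{rule['alert']}: {status}")
```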

Alerting Tool and Service Costs

Most alerting tooling is bundled with monitoring platforms, but here's what dedicated alerting costs look like:

Tool/Service	Monthly Cost	What You Get
Alertmanager (open source)	$0 + infrastructure	Routing, grouping, silencing, inhibition
PagerDuty	$21-41/user	On-call scheduling, escalations, incident management
Opsgenie	$9-35/user	On-call scheduling, escalations, integrations
Grafana OnCall (open source)	$0 + infrastructure	On-call scheduling, escalations, Grafana-native
Grafana OnCall (Cloud)	Included in Pro plan	Managed on-call with Grafana integration

PagerDuty is the industry standard for on-call management, and for good reason -- its escalation policies, scheduling, and mobile app are mature. But at $41/user for the Business plan, costs add up. Grafana OnCall is a strong open-source alternative if you're already in the Grafana ecosystem.

Frequently Asked Questions

How many alerts should a team have?

There's no magic number, but a healthy range is 5-15 alerts per service. Below 5, you're probably missing important failure modes. Above 15, you likely have redundant or overly specific alerts that should be consolidated. The real metric is pages per on-call shift: if your on-call engineer gets paged more than 2-3 times per week for genuine incidents, you have a reliability problem, not an alerting problem.

Should I alert on log patterns or metrics?

Prefer metrics for alerting. Metrics are numeric, cheap to evaluate, and designed for aggregation. Log-based alerts require scanning text, which is slower and more expensive. Use log-based alerts only for conditions that can't be expressed as metrics -- specific error messages, audit events, or security patterns. If you find yourself creating many log-based alerts, that's a sign you need to emit those conditions as metrics instead.

What is alert fatigue and how do I prevent it?

Alert fatigue occurs when engineers receive so many alerts that they start ignoring them. Prevention requires discipline: delete alerts that don't require action, increase hysteresis durations, use grouping to reduce notification volume, and review alert history monthly. The litmus test is simple -- if an on-call engineer routinely ignores alerts without investigating, you have alert fatigue and your alerting system is providing negative value.

How do I handle flapping alerts?

Flapping alerts fire and resolve repeatedly in quick succession. Fix them by increasing the for duration (require the condition to persist longer), widening the aggregation window (use rate(...[10m]) instead of rate(...[1m])), or adding a hysteresis band (fire at 5% error rate, resolve at 2%). If the underlying condition genuinely fluctuates around the threshold, the threshold itself may be wrong.
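The hysteresis band is the least familiar of the three fixes, so here is a plain-Python model of the idea. It is not a Prometheus feature as written; the function names and the sample error rates are assumptions for illustration.

```python
def single_threshold_states(values, threshold):
    """Firing state per sample with one threshold for both fire and resolve."""
    return [value >= threshold for value in values]


def band_states(values, fire_at, resolve_at):
    """Firing state per sample with separate fire and resolve thresholds."""
    firing = False
    states = []
    for value in values:
        if not firing and value >= fire_at:
            firing = True
        elif firing and value <= resolve_at:
            firing = False
        states.append(firing)
    return states


def notification_count(states):
    """Count state changes, i.e. fire and resolve notifications."""
    count = 0
    previous = False
    for state in states:
        if state != previous:
            count += 1
        previous = state
    return count


# An error rate hovering around 5% before finally recovering.
error_rates = [0.03, 0.051, 0.049, 0.052, 0.048, 0.05, 0.03, 0.015]

flapping = notification_count(single_threshold_states(error_rates, 0.05))
banded = notification_count(band_states(error_rates, fire_at=0.05, resolve_at=0.02))

print(flapping)  # 6 notifications: three fires and three resolves
print(banded)    # 2 notifications: one fire, one resolve
```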

Should I use static thresholds or dynamic baselines?

Static thresholds work for most teams. They're simple, predictable, and easy to debug. Dynamic baselines (anomaly detection) sound appealing but produce false positives during legitimate traffic changes -- deployments, marketing campaigns, seasonal patterns. If you do use dynamic baselines, combine them with static thresholds as a safety net. Alert when the metric is both anomalous AND exceeding a static floor.
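The "anomalous AND above a static floor" rule can be sketched in a few lines. A z-score against recent history stands in for the anomaly detector here; that choice, the function name, and the latency figures are all assumptions, and a real baseline would likely account for seasonality.

```python
from statistics import mean, stdev


def should_alert(history, current, static_floor, z_threshold=3.0):
    """Alert only if `current` is both anomalous and above the static floor.

    "Anomalous" here means more than `z_threshold` standard deviations above
    the mean of `history`, a deliberately simple stand-in for a real detector.
    """
    mu = mean(history)
    sigma = stdev(history)
    anomalous = sigma > 0 and (current - mu) / sigma > z_threshold
    return anomalous and current > static_floor


# p99 latency in milliseconds; the static floor is 500 ms.
quiet_history = [100, 110, 105, 95, 102, 98, 107, 103]
busy_history = [520, 530, 510, 525, 515, 535, 505, 528]

# Anomalous for a quiet period, but still fast in absolute terms: no alert.
print(should_alert(quiet_history, current=150, static_floor=500))  # False
# Anomalous and slow in absolute terms: alert.
print(should_alert(quiet_history, current=900, static_floor=500))  # True
# Slow in absolute terms but normal for this baseline: no alert.
print(should_alert(busy_history, current=531, static_floor=500))   # False
```

The third case shows the trade-off: requiring both conditions suppresses alerts on a service that is consistently slow, so a separate static threshold may still be needed for that situation.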

How do I alert on services with low traffic?

Low-traffic services break standard rate-based alerts because a single error can spike the error rate to 50%. Two approaches work: use absolute error counts instead of rates (alert when errors exceed 5 in 10 minutes), or use longer evaluation windows (rate over 1 hour instead of 5 minutes). For very low traffic, synthetic monitoring (periodic health checks) is more reliable than traffic-based alerting.
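A short comparison shows why the count-based rule behaves better at low volume. The thresholds (1% error ratio, more than 5 errors per 10-minute window) follow the figures in this article; the function names and request counts are invented for the example.

```python
def rate_alert(errors: int, requests: int, max_error_ratio: float) -> bool:
    """Fire when the error ratio in the window exceeds the limit."""
    return requests > 0 and errors / requests > max_error_ratio


def count_alert(errors: int, max_errors: int) -> bool:
    """Fire when the absolute number of errors in the window exceeds the limit."""
    return errors > max_errors


# One failed request out of two in a 10-minute window: a 50% error ratio.
one_unlucky_request = {"errors": 1, "requests": 2}
# Eight failures out of forty: low traffic, but clearly broken.
real_problem = {"errors": 8, "requests": 40}

for name, window in (("one unlucky request", one_unlucky_request),
                     ("real problem", real_problem)):
    by_rate = rate_alert(window["errors"], window["requests"], max_error_ratio=0.01)
    by_count = count_alert(window["errors"], max_errors=5)
    print(f"{name}: rate-based={by_rate}, count-based={by_count}")
```

The rate-based rule fires in both cases, while the count-based rule fires only for the second.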

What should my alert severity levels be?

Keep it simple with three levels. Critical: pages the on-call, requires immediate response, represents active user impact. Warning: sends to Slack, requires attention within business hours, represents degradation that will escalate if ignored. Info: logged to a dashboard, no notification, used for tracking trends. Avoid creating more than three levels -- additional granularity just creates confusion about response expectations.

Conclusion

Good alerting is a practice, not a configuration. It requires ongoing maintenance: monthly reviews to delete noisy alerts, post-incident follow-ups to add missing alerts, and continuous tuning of thresholds and durations. The five-question checklist is your starting point: every alert must indicate user impact, require human action, provide enough context for diagnosis, fire at a sustainable frequency, and link to a runbook.

Implement hysteresis on everything. Use multi-window burn rate alerts for SLOs. Route critical alerts to PagerDuty and warnings to Slack. Include dashboard links, log queries, and runbook URLs in every alert annotation. And most importantly, treat alert noise as a bug -- every false positive erodes trust in the system and increases the chance that someone ignores the real thing.
