Alerting Done Right: Reducing Noise and Writing Actionable Alerts
Most alerts are noise. Learn how to write actionable alerts by focusing on symptoms, implementing hysteresis, using multi-window burn rate alerting, and routing through Alertmanager. Includes a five-question checklist for every alert.

Most Alerts Are Garbage -- Here's How to Fix That
If your on-call engineer gets paged at 2 AM for a CPU spike that resolved itself, that's a bad alert. If they get 47 notifications in a 10-minute window for the same incident, that's a broken alerting system. Alerting done right means every notification is actionable, every page represents real user impact, and the on-call engineer can understand what's wrong and what to do about it within 30 seconds of reading the alert.
I've seen teams where on-call meant ignoring 90% of alerts because they were noise. The engineers who join those teams learn to dismiss pages reflexively -- and then they miss the one that actually matters. Good alerting isn't about catching everything. It's about catching the right things and staying quiet the rest of the time.
What Is Actionable Alerting?
Definition: Actionable alerting is a practice where every alert notification requires human intervention, contains enough context for the responder to begin diagnosis, and represents a condition that will impact users or systems if left unaddressed. Non-actionable alerts -- those that resolve on their own or require no response -- are noise and should be eliminated.
The goal is simple: when a page fires, the on-call engineer should be able to answer three questions immediately: What is broken? Who is affected? What should I do first?
The Five Questions Checklist
Before creating any alert, run it through these five questions. If you can't answer yes to all of them, don't create the alert.
- Does this alert indicate real or imminent user impact? CPU at 80% isn't user impact. Error rate climbing above your SLO budget burn rate is. Alert on symptoms (error rate, latency, availability), not on causes (CPU, memory, disk).
- Does this require human action? If your auto-scaler handles it, don't page a human. If Kubernetes restarts the pod and the problem resolves, that's an event for a dashboard, not an alert.
- Is the alert specific enough to diagnose? "Something is wrong" is useless. "Order service error rate is 5.2% (SLO budget burn rate: 14x) in production-us-east-1" gives the responder a starting point.
- Will this alert fire at a sustainable frequency? If it fires more than once a week without representing a genuine incident, it will be ignored. Either fix the underlying issue or remove the alert.
- Does the runbook exist? Every alert should link to a runbook that describes the alert, likely causes, diagnostic steps, and remediation actions. No runbook, no alert.
Pro tip: Review your alert history monthly. Any alert that fired but required no action should be deleted or converted to a dashboard annotation. Any alert that fired and was immediately silenced should be tuned or removed. This ongoing hygiene is what separates good alerting from alert fatigue.
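To make "the runbook exists" concrete, here is a minimal runbook skeleton. The service names, causes, and steps are illustrative placeholders, not a standard -- adapt the structure to your own incident process:

```markdown
# Runbook: HighErrorRate (order-service)

## What this alert means
Error rate has exceeded the SLO burn-rate threshold for 5 minutes.

## Likely causes
- Recent deployment (check the release channel)
- Downstream dependency failure (payments API, database)

## Diagnostic steps
1. Open the service dashboard linked in the alert.
2. Check error logs for the dominant error code.
3. Compare the alert start time against the last deployment timestamp.

## Remediation
- If correlated with a deploy: roll back.
- If a dependency is down: fail over or enable degraded mode.

## Escalation
Page the service owner if not mitigated within 30 minutes.
```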
Alert on Symptoms, Not Causes
This is the single most important principle in alerting, and the one most teams get wrong.
| Cause-Based Alert (Bad) | Symptom-Based Alert (Good) |
|---|---|
| CPU usage > 80% | Request latency p99 > 2s for 5 minutes |
| Memory usage > 90% | Error rate exceeding SLO burn rate |
| Disk usage > 85% | Checkout success rate below 99% |
| Pod restart count > 3 | Availability SLI below target for 10 minutes |
| Database connections > 100 | Query latency p95 > 500ms |
Cause-based alerts are noisy because infrastructure metrics fluctuate constantly. CPU spikes during garbage collection. Memory climbs before a scheduled compaction. Pods restart during rolling deployments. None of these affect users if the system is designed correctly.
Symptom-based alerts fire only when users are actually affected -- or about to be. They also tell you what matters (latency is high, errors are up) rather than what might be contributing (CPU is high, but maybe it's fine).
Watch out: There are exceptions. Disk at 95% is a cause-based alert worth keeping because the consequence (complete service failure when disk fills) is severe and the remediation (expand volume, clean up) takes time. Keep cause-based alerts only for slow-moving conditions where the consequence is catastrophic and the fix isn't instant.
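As a concrete example, the "latency p99 > 2s for 5 minutes" row from the table above might look like this as a Prometheus rule. The metric name and thresholds are illustrative -- adjust them to your own histogram buckets and SLO:

```yaml
# Sketch of a symptom-based latency alert (metric name assumed to be a
# standard Prometheus histogram; thresholds are examples, not recommendations)
- alert: HighRequestLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    ) > 2
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "p99 request latency above 2s for 5 minutes"
```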
Hysteresis: Preventing Alert Flapping
Hysteresis means requiring a condition to persist for a period before firing, and requiring it to be clearly resolved before clearing. Without hysteresis, an alert that crosses the threshold, dips below for 10 seconds, and crosses again generates three notifications in two minutes. That's noise.
Implementing Hysteresis in Prometheus
Prometheus implements the persistence side of hysteresis -- requiring the condition to hold before firing -- through the `for` clause in alert rules:
```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          description: "Service {{ $labels.service }} error rate is {{ $value | humanizePercentage }}."
          runbook: "https://wiki.internal/runbooks/high-error-rate"
```
The `for: 5m` clause means the condition must be true for 5 continuous minutes before the alert fires. A brief spike that resolves in 2 minutes never pages anyone. This single setting eliminates a massive amount of noise.
Choosing the Right Duration
| Alert Severity | Suggested for Duration | Rationale |
|---|---|---|
| Critical (page) | 2-5 minutes | Short enough to catch real incidents, long enough to skip transients |
| Warning (Slack) | 10-15 minutes | Confirms the issue is sustained, not a blip |
| Info (dashboard) | 30+ minutes or no alert | Log as an event, don't notify anyone |
Multi-Window, Multi-Burn-Rate Alerts
For SLO-based alerting, the gold standard is the multi-window, multi-burn-rate approach from Google's Site Reliability Workbook. Instead of alerting on a simple error rate threshold, you alert when the rate of error budget consumption indicates you'll exhaust the budget prematurely.
How It Works
You define pairs of time windows: a long window for accuracy and a short window for responsiveness. Both must exceed the burn rate threshold for the alert to fire.
```yaml
groups:
  - name: slo-burn-rate-alerts
    rules:
      # Page: 14.4x burn rate
      # Long window: 1h, short window: 5m
      # At this rate, 30-day budget exhausted in ~2 days
      - alert: SLOBudgetBurnCritical
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (14.4 * 0.001)
          and
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[5m]))
              /
              sum(rate(http_requests_total[5m]))
            )
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Critical SLO budget burn -- 14.4x rate"
          description: "At this burn rate, the 30-day error budget will be exhausted in approximately 2 days."

      # Ticket: 3x burn rate
      # Long window: 3d, short window: 6h
      # At this rate, budget exhausted in ~10 days
      - alert: SLOBudgetBurnSlow
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[3d]))
              /
              sum(rate(http_requests_total[3d]))
            )
          ) > (3 * 0.001)
          and
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[6h]))
              /
              sum(rate(http_requests_total[6h]))
            )
          ) > (3 * 0.001)
        for: 30m
        labels:
          severity: warning
          slo: availability
        annotations:
          summary: "Elevated SLO budget burn -- 3x rate"
          description: "Error budget is being consumed at 3x the sustainable rate. At this rate, the monthly budget will be exhausted in approximately 10 days."
```
The 14.4x burn rate with a 1-hour window catches severe incidents fast -- page the on-call. The 3x burn rate with a 3-day window catches slow degradation -- file a ticket. This layered approach ensures you respond proportionally to the severity.
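The arithmetic behind those thresholds is worth making explicit. For a 99.9% SLO, the error budget is 0.1% of requests over 30 days; a burn rate of N means you're consuming the budget N times faster than sustainable. A small sketch (values mirror the rules above):

```python
# Sketch: derive multi-window burn-rate thresholds for a 99.9% / 30-day SLO.
# All names and the two burn-rate tiers follow the rules shown above.

SLO_TARGET = 0.999             # 99.9% availability target
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail
WINDOW_DAYS = 30               # budget period

def alert_threshold(burn_rate: float) -> float:
    """Error-rate threshold used in the alert expression: burn_rate * budget."""
    return burn_rate * ERROR_BUDGET

def days_to_exhaustion(burn_rate: float) -> float:
    """At burn_rate times the sustainable rate, days until the budget is spent."""
    return WINDOW_DAYS / burn_rate

for rate in (14.4, 3.0):
    print(f"{rate}x burn: fire when error rate > {alert_threshold(rate):.4f}, "
          f"budget exhausted in ~{days_to_exhaustion(rate):.1f} days")
```

This is why the critical rule compares against `14.4 * 0.001`: at that burn rate the monthly budget is gone in about two days, which justifies a page rather than a ticket.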
Alertmanager Routing: Getting Alerts to the Right Place
Having good alert rules is half the battle. The other half is routing those alerts to the right team through the right channel at the right urgency.
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: default-slack
  group_by: ['alertname', 'service', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical: page on-call via PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      repeat_interval: 1h
      continue: false
    # Warning: Slack channel, don't page
    - match:
        severity: warning
      receiver: team-slack
      repeat_interval: 4h
    # SLO alerts: dedicated channel
    - match_re:
        slo: ".+"
      receiver: slo-channel
      group_by: ['slo', 'service']

receivers:
  - name: default-slack
    slack_configs:
      - channel: '#alerts-default'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: pagerduty
    pagerduty_configs:
      - service_key: ${PD_SERVICE_KEY}
        description: '{{ .GroupLabels.alertname }} - {{ .CommonAnnotations.summary }}'
  - name: team-slack
    slack_configs:
      - channel: '#alerts-warnings'
        send_resolved: true
  - name: slo-channel
    slack_configs:
      - channel: '#slo-alerts'
        send_resolved: true

inhibit_rules:
  # If a critical alert fires, suppress the warning for the same service
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'service']
```
Key Routing Principles
- Group related alerts. If five pods in the same service hit the error threshold simultaneously, send one grouped notification, not five.
- Use inhibition rules. If a critical alert fires, suppress the warning-level alert for the same issue. The on-call already knows.
- Set appropriate repeat intervals. Critical alerts can repeat hourly. Warnings should repeat every 4+ hours. More frequent repeats create noise without adding value.
- Always send resolved notifications. The on-call needs to know when an issue clears, not just when it fires.
Building Effective Alert Annotations
An alert that says "HighErrorRate" is incomplete. Good annotations turn an alert into a mini incident brief:
```yaml
annotations:
  summary: "Order service error rate at {{ $value | humanizePercentage }}"
  description: |
    The order-service in {{ $labels.namespace }} has an error rate of
    {{ $value | humanizePercentage }} over the last 5 minutes.
    SLO target: 99.9% (0.1% error budget).
    Current burn rate: approximately 14x.
  impact: "Users may experience failed checkout attempts."
  runbook: "https://wiki.internal/runbooks/order-service-errors"
  dashboard: "https://grafana.internal/d/order-service"
  logs: "https://grafana.internal/explore?query={app=order-service}|=error"
```
Include the current value, the threshold, the impact, a runbook link, a dashboard link, and a log query link. The on-call engineer should be able to go from "phone buzzing" to "actively investigating" in under a minute.
Alerting Tool and Service Costs
Most alerting tooling is bundled with monitoring platforms, but here's what dedicated alerting costs look like:
| Tool/Service | Monthly Cost | What You Get |
|---|---|---|
| Alertmanager (open source) | $0 + infrastructure | Routing, grouping, silencing, inhibition |
| PagerDuty | $21-41/user | On-call scheduling, escalations, incident management |
| Opsgenie | $9-35/user | On-call scheduling, escalations, integrations |
| Grafana OnCall (open source) | $0 + infrastructure | On-call scheduling, escalations, Grafana-native |
| Grafana OnCall (Cloud) | Included in Pro plan | Managed on-call with Grafana integration |
PagerDuty is the industry standard for on-call management, and for good reason -- its escalation policies, scheduling, and mobile app are mature. But at $41/user for the Business plan, costs add up. Grafana OnCall is a strong open-source alternative if you're already in the Grafana ecosystem.
Frequently Asked Questions
How many alerts should a team have?
There's no magic number, but a healthy range is 5-15 alerts per service. Below 5, you're probably missing important failure modes. Above 15, you likely have redundant or overly specific alerts that should be consolidated. The real metric is pages per on-call shift: if your on-call engineer gets paged more than 2-3 times per week for genuine incidents, you have a reliability problem, not an alerting problem.
Should I alert on log patterns or metrics?
Prefer metrics for alerting. Metrics are numeric, cheap to evaluate, and designed for aggregation. Log-based alerts require scanning text, which is slower and more expensive. Use log-based alerts only for conditions that can't be expressed as metrics -- specific error messages, audit events, or security patterns. If you find yourself creating many log-based alerts, that's a sign you need to emit those conditions as metrics instead.
What is alert fatigue and how do I prevent it?
Alert fatigue occurs when engineers receive so many alerts that they start ignoring them. Prevention requires discipline: delete alerts that don't require action, increase hysteresis durations, use grouping to reduce notification volume, and review alert history monthly. The litmus test is simple -- if an on-call engineer routinely ignores alerts without investigating, you have alert fatigue and your alerting system is providing negative value.
How do I handle flapping alerts?
Flapping alerts fire and resolve repeatedly in quick succession. Fix them by increasing the for duration (require the condition to persist longer), widening the aggregation window (use rate()[10m] instead of rate()[1m]), or adding a hysteresis band (fire at 5% error rate, resolve at 2%). If the underlying condition genuinely fluctuates around the threshold, the threshold itself may be wrong.
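The hysteresis band mentioned above (fire at 5%, resolve at 2%) is worth seeing as a state machine. Prometheus doesn't support separate fire/clear thresholds natively, so this is a conceptual Python sketch of the behavior, not a Prometheus feature:

```python
class HysteresisAlert:
    """Conceptual hysteresis band: fire above fire_at, clear only below
    clear_at. Values between the two keep the current state, so a metric
    oscillating around a single threshold can't flap."""

    def __init__(self, fire_at: float, clear_at: float):
        assert clear_at < fire_at, "clear threshold must sit below fire threshold"
        self.fire_at = fire_at
        self.clear_at = clear_at
        self.firing = False

    def observe(self, error_rate: float) -> bool:
        if not self.firing and error_rate > self.fire_at:
            self.firing = True
        elif self.firing and error_rate < self.clear_at:
            self.firing = False
        return self.firing

alert = HysteresisAlert(fire_at=0.05, clear_at=0.02)
states = [alert.observe(r) for r in (0.01, 0.06, 0.04, 0.03, 0.01)]
# After firing at 6%, dips to 4% and 3% do not clear the alert; only 1% does.
```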
Should I use static thresholds or dynamic baselines?
Static thresholds work for most teams. They're simple, predictable, and easy to debug. Dynamic baselines (anomaly detection) sound appealing but produce false positives during legitimate traffic changes -- deployments, marketing campaigns, seasonal patterns. If you do use dynamic baselines, combine them with static thresholds as a safety net. Alert when the metric is both anomalous AND exceeding a static floor.
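One way to express "anomalous AND exceeding a static floor" in Prometheus is a z-score comparison against recording rules combined with an absolute minimum. The `job:request_errors:*` recording rule names below are hypothetical -- you would define them yourself:

```yaml
# Sketch: dynamic baseline guarded by a static floor. The recording rules
# (mean and stddev over a week) are assumed to exist; names are illustrative.
- alert: AnomalousErrorRate
  expr: |
    (
      job:request_errors:rate5m
        > job:request_errors:rate5m:mean_1w + 3 * job:request_errors:rate5m:stddev_1w
    )
    and
    job:request_errors:rate5m > 0.01
  for: 10m
  labels:
    severity: warning
```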
How do I alert on services with low traffic?
Low-traffic services break standard rate-based alerts because a single error can spike the error rate to 50%. Two approaches work: use absolute error counts instead of rates (alert when errors exceed 5 in 10 minutes), or use longer evaluation windows (rate over 1 hour instead of 5 minutes). For very low traffic, synthetic monitoring (periodic health checks) is more reliable than traffic-based alerting.
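The absolute-count approach translates directly to PromQL with increase() instead of rate(). Service name and thresholds below are illustrative:

```yaml
# Sketch: absolute error count for a low-traffic service, instead of a
# ratio that one failed request could push to 50%.
- alert: LowTrafficServiceErrors
  expr: sum(increase(http_requests_total{status=~"5..", service="billing"}[10m])) > 5
  labels:
    severity: warning
  annotations:
    summary: "More than 5 errors in 10 minutes on billing"
```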
What should my alert severity levels be?
Keep it simple with three levels. Critical: pages the on-call, requires immediate response, represents active user impact. Warning: sends to Slack, requires attention within business hours, represents degradation that will escalate if ignored. Info: logged to a dashboard, no notification, used for tracking trends. Avoid creating more than three levels -- additional granularity just creates confusion about response expectations.
Conclusion
Good alerting is a practice, not a configuration. It requires ongoing maintenance: monthly reviews to delete noisy alerts, post-incident follow-ups to add missing alerts, and continuous tuning of thresholds and durations. The five-question checklist is your starting point: every alert must indicate user impact, require human action, provide enough context for diagnosis, fire at a sustainable frequency, and link to a runbook.
Implement hysteresis on everything. Use multi-window burn rate alerts for SLOs. Route critical alerts to PagerDuty and warnings to Slack. Include dashboard links, log queries, and runbook URLs in every alert annotation. And most importantly, treat alert noise as a bug -- every false positive erodes trust in the system and increases the chance that someone ignores the real thing.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.