Prometheus and Grafana: Setting Up Your First Monitoring Stack

Deploy Prometheus and Grafana on Kubernetes using Helm. Learn the pull-based scrape model, PromQL essentials (rate, histogram_quantile, aggregation), Grafana dashboard design, recording rules, and Alertmanager routing.

Abhishek Patel | 9 min read

Infrastructure engineer with 10+ years building production systems on AWS, GCP,…

A 13-Year Timeline: How We Ended Up With Prometheus

If you started running servers in the early 2000s, you ran Nagios. It worked, kind of, if you enjoyed writing Perl plugins. By 2010 Graphite had rewritten the graphing story with Carbon and Whisper files, and teams layered StatsD in front to push metrics -- the push model that seemed sensible until your StatsD server became the bottleneck. In 2012 SoundCloud's engineers looked at Graphite plus StatsD plus Ganglia plus Munin plus Nagios and said "no more." They built Prometheus, released it publicly in 2015, and donated it to the CNCF in 2016.

Here is the compressed timeline:

  • 1999 -- Nagios ships (originally under the name NetSaint). Check-based polling through plugins, many of them written in Perl; still running in some hospitals.
  • 2008 -- Graphite, built at Orbitz, is open-sourced. Carbon + Whisper + graphite-web. Pull-nothing; everything pushes in.
  • 2011 -- Etsy open-sources StatsD. The push model hits its ceiling -- cardinality is unbounded, UDP silently drops packets, and the aggregator is a single point of failure.
  • 2012 -- SoundCloud engineers start Prometheus, inspired by Google's internal Borgmon. Pull model, dimensional data, PromQL.
  • 2015 -- Prometheus is announced publicly (version 1.0 follows in 2016). The Kubernetes community adopts it almost immediately.
  • 2016 -- Prometheus becomes the second CNCF project (after Kubernetes). Grafana Labs pivots their entire business around it.
  • 2018 -- Prometheus reaches CNCF graduated status. The Prometheus Operator Helm chart, later renamed kube-prometheus-stack, makes "one Helm install" a reality.
  • 2020-2025 -- Thanos, Cortex, and Grafana Mimir solve the long-term storage problem. OpenTelemetry emerges as a metrics producer that speaks Prometheus natively.
  • 2026 (today) -- Prometheus + Grafana is the default monitoring stack for Kubernetes. If you are not using it, you are either deliberately paying Datadog or you have not gotten around to installing it yet.

Every design choice in Prometheus -- the pull model, the dimensional data model, the deliberately limited local storage, PromQL -- is a reaction to something that hurt in Graphite or Nagios. Knowing that history makes the design make sense. This guide is the 2026 playbook for deploying the stack on Kubernetes, writing PromQL that actually answers production questions, and avoiding the cardinality disasters that still take down many Prometheus deployments.

Setting Up Prometheus with Helm

The fastest path to a production-ready Prometheus installation on Kubernetes is the kube-prometheus-stack Helm chart. It bundles Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics into a single deployment.

Step-by-Step Deployment

  1. Add the Helm repository.
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
  2. Create a values file for your environment.
    # values-production.yaml
    prometheus:
      prometheusSpec:
        retention: 15d
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: gp3
              resources:
                requests:
                  storage: 50Gi
        resources:
          requests:
            memory: 2Gi
            cpu: 500m
          limits:
            memory: 4Gi
    
    grafana:
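      # Helm does not expand ${VAR} inside values files. Either pass the password
      # at install time (--set grafana.adminPassword=...) or reference a Secret
      # created beforehand. The Secret name below is a placeholder.
      admin:
        existingSecret: grafana-admin-credentials
        userKey: admin-user
        passwordKey: admin-password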
      persistence:
        enabled: true
        size: 10Gi
    
    alertmanager:
      alertmanagerSpec:
        retention: 120h
  3. Install the chart.
    helm install monitoring prometheus-community/kube-prometheus-stack \
      --namespace monitoring \
      --create-namespace \
      -f values-production.yaml
  4. Verify the installation.
    kubectl get pods -n monitoring
    kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring
  5. Access Grafana at http://localhost:3000. The kube-prometheus-stack ships with pre-built dashboards for Kubernetes cluster health, node metrics, and pod resource usage.
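
Beyond eyeballing the pod list, it can help to ask Prometheus itself which targets it considers healthy. The following is a minimal sketch, not part of the chart: it assumes Prometheus has been port-forwarded to localhost:9090 (with a release named monitoring the Service is typically monitoring-kube-prometheus-prometheus, but confirm with kubectl get svc -n monitoring), and the helper names query_up and down_targets are hypothetical. It uses the standard /api/v1/query endpoint with the built-in up metric.

```python
import json
import urllib.parse
import urllib.request


def query_up(base_url: str = "http://localhost:9090") -> dict:
    """Run the instant query `up` against the Prometheus HTTP API.

    base_url is an assumption: it presumes a local port-forward to Prometheus.
    """
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": "up"})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def down_targets(payload: dict) -> list[str]:
    """Return "job/instance" for every series whose `up` value is not 1."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # value arrives as a string
        if float(value) != 1.0:
            labels = series["metric"]
            down.append(f'{labels.get("job", "?")}/{labels.get("instance", "?")}')
    return down
```

Calling down_targets(query_up()) after the port-forward returns an empty list when every discovered target was scraped successfully.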

Pro tip: Don't skip the storage configuration. Prometheus without persistent volumes loses all historical data on pod restart. Use a storage class with good IOPS -- Prometheus writes heavily during compaction.

Definition sidebar: Prometheus is a CNCF-graduated, pull-based monitoring system that periodically scrapes HTTP /metrics endpoints from instrumented targets, stores the resulting time-series in a local TSDB, evaluates alert rules and recording rules in PromQL, sends firing alerts to Alertmanager, and exposes an HTTP query API that Grafana consumes. It intentionally does not ship with long-term storage -- that is the job of Thanos, Cortex, or Mimir.

The Scrape Model Explained

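Prometheus discovers targets through service discovery and periodically scrapes their /metrics endpoint. With kube-prometheus-stack, you declare targets through ServiceMonitor (or PodMonitor) custom resources, which the Prometheus Operator turns into scrape configuration. Plain Prometheus installs often rely on pod annotations plus relabeling rules instead.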

# ServiceMonitor for a custom application
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: monitoring
spec:
  namespaceSelector:
    matchNames:
      - production
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

The interval: 15s setting means Prometheus hits this endpoint every 15 seconds. That's a good default for most services. Going lower (5s) increases storage and CPU usage. Going higher (60s) means you miss short-lived spikes.

What a Scrape Target Looks Like

Your application's /metrics endpoint returns plain text in the Prometheus exposition format:

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/users",status="200"} 14523
http_requests_total{method="POST",endpoint="/api/orders",status="201"} 892
http_requests_total{method="GET",endpoint="/api/users",status="500"} 37

# HELP http_request_duration_seconds Request latency histogram
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{endpoint="/api/users",le="0.01"} 11234
http_request_duration_seconds_bucket{endpoint="/api/users",le="0.05"} 13876
http_request_duration_seconds_bucket{endpoint="/api/users",le="0.1"} 14401
http_request_duration_seconds_bucket{endpoint="/api/users",le="+Inf"} 14523
http_request_duration_seconds_sum{endpoint="/api/users"} 312.47
http_request_duration_seconds_count{endpoint="/api/users"} 14523
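
To make the format concrete, here is a small parser for individual sample lines. It is a simplified sketch rather than a full implementation of the exposition format: it skips comment lines, ignores optional timestamps, and does not unescape label values. The function name parse_sample is hypothetical.

```python
import re

# metric_name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Parse one exposition line into (name, labels, value).

    Returns None for blank lines and for # HELP / # TYPE comments.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    # float() accepts "+Inf", "-Inf" and "NaN", which the format allows
    return match.group("name"), labels, float(match.group("value"))
```

Each distinct combination of metric name and label values that this yields is one time series, which is why label choice matters so much later on.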

PromQL: The Query Language

PromQL is what makes Prometheus powerful -- and what trips up most newcomers. It's a functional query language designed specifically for time-series data. Here are the three patterns you'll use constantly.

Rate: Turning Counters into Useful Data

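Counters only go up (apart from resetting to zero when the process restarts). A raw counter value like http_requests_total = 14523 isn't useful by itself. The rate() function calculates the per-second average rate of increase over a time window and compensates for counter resets: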

# Requests per second over the last 5 minutes
rate(http_requests_total[5m])

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

Watch out: rate() needs at least two samples inside the window, so never use a range shorter than twice your scrape interval. If you scrape every 15s, 30s is the absolute floor, and rate(metric[15s]) produces gaps or unreliable results because the window may hold only one data point. A range of about four times the scrape interval (1m here) is a safer default, since it survives a missed scrape.
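
The idea behind rate() can be sketched in a few lines. This is a simplified illustration and not Prometheus's actual algorithm: the real function also extrapolates to the edges of the window, which this version skips. The name simple_rate is hypothetical.

```python
def simple_rate(samples):
    """Approximate per-second rate from (timestamp_seconds, counter_value) pairs.

    samples must be sorted by timestamp. Returns None when fewer than two
    samples fall in the window, which is the situation a too-short range causes.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter went backwards: treat it as a reset and count from zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed
```

With a 15s scrape interval, a 15s window will often contain a single sample, so the function has nothing to compare against. That is the failure the warning above describes.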

histogram_quantile: Latency Percentiles

The most common use of histograms is calculating latency percentiles:

# 99th percentile request duration
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# 95th percentile, broken down by endpoint
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)
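
It helps to see what histogram_quantile does with the buckets. The sketch below reproduces the core idea, linear interpolation inside the bucket that contains the requested rank, under simplifying assumptions: it works on cumulative counts directly (the queries above feed it per-second bucket rates, but the interpolation is the same), it only accepts 0 < q <= 1, and it assumes the lowest bucket starts at 0. It is an illustration, not Prometheus's implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must contain at least one finite bucket plus the le="+Inf" bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need finite buckets plus an le=+Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is that bucket's upper bound.
                return buckets[-2][0]
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
```

Fed the /api/users buckets from the exposition example, the 99th percentile lands between 0.05 and 0.1 seconds. The estimate can never be more precise than the bucket boundaries, so choose boundaries close to your latency objectives.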

Aggregation: Combining Time Series

# Total request rate across all pods
sum(rate(http_requests_total[5m]))

# Request rate per service
sum by (service) (rate(http_requests_total[5m]))

# Top 5 endpoints by error rate
topk(5, sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m])))
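
For readers new to label-based aggregation, the following sketch mimics what sum by (...) and topk(...) do to a set of labelled values. It is a conceptual model only; the helper names and the sample numbers in the usage note are made up for illustration.

```python
from collections import defaultdict


def sum_by(series, keep):
    """Group (labels, value) pairs by the labels in `keep` and sum each group.

    All other labels are dropped, just as `sum by (...)` drops them.
    """
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in keep)
        totals[key] += value
    return dict(totals)


def topk(k, aggregated):
    """Return the k groups with the largest values, largest first."""
    return sorted(aggregated.items(), key=lambda item: item[1], reverse=True)[:k]
```

Given per-status error rates for several endpoints, sum_by(series, ["endpoint"]) collapses the status dimension, and topk(5, ...) keeps the five noisiest endpoints.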

Grafana Dashboards That Actually Help

Grafana is only as good as the dashboards you build. I've seen teams with 50 dashboards where nobody looks at any of them. Here's the approach that works: build three dashboards and make them great.

| Dashboard | Purpose | Key Panels |
|---|---|---|
| Service Overview | RED metrics for all services | Request rate, error rate, p50/p95/p99 latency per service |
| Infrastructure | USE metrics for nodes and pods | CPU utilization, memory usage, disk I/O, network bandwidth |
| Alerts Overview | Current firing and pending alerts | Alert status table, recent alert history, silence management |

Pro tip: Use Grafana's template variables to make dashboards interactive. A single dropdown for namespace and service turns one dashboard into a view for every service in your cluster, instead of duplicating dashboards per team.

Recording Rules: Pre-Computing Expensive Queries

Some PromQL queries are expensive. If your Grafana dashboard computes histogram_quantile across thousands of time series every time someone loads the page, Prometheus will struggle. Recording rules pre-compute these expressions and store the results as new time series.

# recording-rules.yaml
groups:
  - name: service_slis
    interval: 30s
    rules:
      - record: service:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )
      - record: service:http_requests:error_rate_5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
      - record: service:http_requests:rate_5m
        expr: |
          sum(rate(http_requests_total[5m])) by (service)

Now your Grafana dashboards query service:http_requests:rate_5m instead of computing the aggregation on every load. The naming convention level:metric:operations is standard in the Prometheus community -- stick to it.

Alertmanager: Routing Alerts

Prometheus evaluates alert rules and fires them to Alertmanager. Alertmanager handles deduplication, grouping, silencing, and routing to the right channel (Slack, PagerDuty, email, webhooks).

# alertmanager-config.yaml
route:
  receiver: default-slack
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
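    # "matchers" replaces the deprecated "match" field in current Alertmanager releases
    - matchers:
        - severity = "critical"
      receiver: pagerduty-critical
      repeat_interval: 1h
    - matchers:
        - severity = "warning"
      receiver: slack-warnings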

receivers:
  - name: default-slack
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
  - name: pagerduty-critical
    pagerduty_configs:
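      # Alertmanager does not expand environment variables. Substitute this
      # placeholder at deploy time, or use service_key_file to read it from a file.
      - service_key: ${PD_SERVICE_KEY}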
  - name: slack-warnings
    slack_configs:
      - channel: '#alerts-warnings'
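
The routing tree above can be read as "first matching child route wins, otherwise fall back to the root receiver". The sketch below models just that behaviour for this specific config. It deliberately ignores grouping, repeat intervals, nested routes, and the continue flag, and the function name pick_receiver is hypothetical.

```python
# (label matchers, receiver) pairs in the same order as the routes in the config
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "slack-warnings"),
]
DEFAULT_RECEIVER = "default-slack"


def pick_receiver(alert_labels):
    """Return the receiver for an alert: first matching route, else the default."""
    for matchers, receiver in ROUTES:
        if all(alert_labels.get(name) == value for name, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER
```

An alert labelled severity: info, or one with no severity label at all, matches neither child route and ends up at default-slack, so the default receiver needs to be a channel someone actually reads.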

Cost Comparison: Self-Hosted vs. SaaS

Running your own Prometheus stack isn't free. Here's a realistic cost breakdown for a medium-sized deployment (50 nodes, 2M active time series):

| Approach | Monthly Cost (est.) | Operational Burden | Retention |
|---|---|---|---|
| Self-hosted kube-prometheus-stack | $200-500 (compute + storage) | High -- you manage upgrades, scaling, storage | Configurable (15d-1y+) |
| Grafana Cloud Free | $0 | None | 14d metrics, 10K series limit |
| Grafana Cloud Pro | $300-800 | Low | 13 months |
| Datadog Infrastructure | $1,150+ ($23/host) | None | 15 months |
| New Relic | $500-1,500 (data ingest based) | None | Varies by plan |

Self-hosting is cheapest at scale but requires dedicated SRE time. Grafana Cloud hits a sweet spot for teams that want the Prometheus/Grafana ecosystem without the operational burden. Datadog and New Relic are more expensive but provide all-in-one platforms with logs, traces, and APM bundled in.

Failure Modes: What Breaks Prometheus in Production

Prometheus is robust but opinionated. Most production outages I have debugged land in the same handful of buckets.

Cardinality Explosion From a New Label

A developer adds user_id or request_id to a metric label. Overnight, active time series goes from 500K to 50M, Prometheus runs out of memory, and the pod enters CrashLoopBackOff. Prevention: enforce a cardinality budget by alerting on prometheus_tsdb_head_series (the number of active series in the head block), set a per-target sample_limit so a runaway target fails its scrape instead of taking down the server, and run promtool tsdb analyze weekly to find the highest-cardinality metrics and labels.
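
The arithmetic behind an explosion like this is a simple product: the worst-case series count for one metric is the number of distinct values of each label multiplied together. The sketch below uses made-up label counts purely for illustration.

```python
from math import prod


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical counts for a request counter with bounded labels
bounded = {"method": 5, "endpoint": 40, "status": 10}

# The same metric after someone adds a user_id label (hypothetical user count)
with_user_id = {**bounded, "user_id": 25_000}
```

Here worst_case_series(bounded) is a couple of thousand series, while worst_case_series(with_user_id) is in the tens of millions. Real counts are usually below the worst case because not every combination occurs, but an unbounded label still dominates everything else.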

Scrape Timeouts Under Load

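At high scrape rates or under load, a slow /metrics endpoint (e.g., an exporter that computes 10K metrics on demand) exceeds scrape_timeout, which defaults to 10s. Prometheus marks the target up = 0 and alerts fire even though the service is fine. Fix: raise scrape_timeout for that job, keeping in mind that it cannot exceed scrape_interval (a 30s timeout requires an interval of at least 30s), or break the exporter into multiple endpoints.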

Storage Compaction Blocks Writes

Prometheus cuts a new TSDB block from the in-memory head about every 2 hours and then compacts older blocks into larger ones. On underspec'd disks (low IOPS), compaction can take longer than the gap before the next scheduled run and backpressures writes. Symptom: scrape queue fills up and samples drop. Fix: use an SSD-backed storage class with >= 3,000 IOPS, and allocate at least 2 cores and 4 GiB just for Prometheus.

Recording Rule Circular Dependencies

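A recording rule that depends on another recording rule that in turn depends on the first appears to work -- until evaluation order leaves one of them reading stale data. Rules inside a group are evaluated sequentially in the order listed, while separate groups run independently of each other. Put dependent rules in the same group with their inputs first, name rules with the level:metric:operations convention, and keep dependencies one-way.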

Alertmanager Routing Silently Drops Alerts

An alert with severity: critical but no matching route in alertmanager.yml falls through to the root route's default receiver, which is often a channel nobody watches or a "null" receiver that discards notifications. Catch this with a Watchdog alert that fires continuously -- if it stops reaching your phone, your paging pipeline is broken, not your app.

Grafana Dashboard Panics on Long Range Queries

A dashboard with a 30-day range and 1-minute resolution generates a query returning ~43K data points per series. Multiply by 50 series in a panel and Prometheus OOMs. Use recording rules to pre-aggregate, and set a minimum interval or a max-data-points cap on long-range panels so the step grows with the time range.
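
The numbers are easy to check, and the same arithmetic tells you what step to use for a given point budget. This is a back-of-the-envelope sketch: it ignores the inclusive end point of a range query, and the 1,000-point budget in the usage note is only an example figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_series(range_seconds, step_seconds):
    """Approximate number of points one series returns for a range query."""
    return range_seconds // step_seconds


def min_step_seconds(range_seconds, max_points):
    """Smallest step (rounded up to whole seconds) that stays within max_points."""
    return -(-range_seconds // max_points)  # integer ceiling division
```

For a 30-day range, points_per_series(30 * SECONDS_PER_DAY, 60) gives the roughly 43K points mentioned above, and 50 series multiplies that past two million points for a single panel. With a budget of 1,000 points per series, min_step_seconds(30 * SECONDS_PER_DAY, 1000) suggests a step of about 43 minutes.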

Frequently Asked Questions

How much storage does Prometheus need?

A rough formula: bytes per sample (1-2 bytes after compression) multiplied by active time series multiplied by samples per series per day. For 1 million active series scraped every 15 seconds (5,760 samples per series per day), expect roughly 6-12 GB per day after compaction. Retention of 15 days needs about 85-175 GB. Always over-provision by 30% -- cardinality explosions from bad labels can spike storage quickly.
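
The formula is simple enough to keep as a small calculator. The sketch below is a planning aid under the same rough assumptions as the text (a flat bytes-per-sample figure and decimal gigabytes). It does not account for index overhead or the write-ahead log, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def storage_gb_per_day(active_series, scrape_interval_s, bytes_per_sample):
    """Estimated on-disk growth per day, in decimal GB."""
    samples_per_series = SECONDS_PER_DAY / scrape_interval_s
    return active_series * samples_per_series * bytes_per_sample / 1e9


def storage_gb_for_retention(active_series, scrape_interval_s, bytes_per_sample,
                             retention_days, headroom=0.30):
    """Disk to provision for a retention period, including safety headroom."""
    daily = storage_gb_per_day(active_series, scrape_interval_s, bytes_per_sample)
    return daily * retention_days * (1 + headroom)
```

Run it with both ends of the bytes-per-sample range, for example storage_gb_for_retention(1_000_000, 15, 2, retention_days=15), and size the volume from the higher result.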

Should I use Prometheus or Thanos for long-term storage?

Prometheus itself is designed for short-to-medium retention (days to weeks). For long-term storage beyond 30 days, use Thanos, Cortex, or Grafana Mimir (a fork of Cortex maintained by Grafana Labs). Thanos adds a sidecar to Prometheus that uploads completed TSDB blocks to object storage (S3, GCS). Grafana Mimir is the newer option and is simpler to operate. Any of them gives you years of retention at object storage pricing.

What's the difference between Prometheus and InfluxDB?

Prometheus uses a pull model and PromQL, is optimized for reliability (works even when the network is partitioned), and is tightly integrated with Kubernetes. InfluxDB uses a push model and SQL-like queries, handles higher cardinality natively, and works well for IoT or custom metrics use cases. For Kubernetes monitoring, Prometheus is the clear winner due to ecosystem support.

How do I avoid high cardinality problems?

High cardinality occurs when a label has too many unique values -- like a user ID or request ID on a metric. This creates millions of time series and crashes Prometheus. The fix: never put unbounded values in metric labels. Use labels for bounded categories (HTTP method, status code, service name). Move high-cardinality data into logs or trace attributes instead.

Can Prometheus monitor non-Kubernetes workloads?

Absolutely. Prometheus supports static target configuration, DNS-based discovery, Consul, EC2, and dozens of other service discovery mechanisms. For VMs, install node-exporter and point Prometheus at it. For cloud services, use exporters -- there are community exporters for AWS, GCP, databases, message queues, and nearly every popular service.

What is the kube-prometheus-stack and why should I use it?

The kube-prometheus-stack is a Helm chart that bundles Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics with pre-configured dashboards and alert rules. It uses the Prometheus Operator to manage configuration via Kubernetes custom resources (ServiceMonitors, PrometheusRules). It saves weeks of setup and gives you production-ready defaults out of the box.

Conclusion

Prometheus and Grafana aren't going anywhere. The ecosystem is mature, the community is massive, and the operational patterns are well-documented. Start with the kube-prometheus-stack Helm chart to get running in an hour. Learn rate(), histogram_quantile(), and sum by -- those three PromQL patterns cover 80% of what you'll write. Add recording rules early to keep dashboards fast. And route your alerts through Alertmanager so critical pages go to PagerDuty while warnings go to Slack.

The biggest mistake teams make isn't choosing the wrong tool -- it's deploying Prometheus and then never building useful dashboards or alert rules on top of it. The tool is only as good as the queries and alerts you write. Invest the time in PromQL, and it pays back every on-call shift.

Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
