
Progressive Delivery with Argo Rollouts: Canary + Analysis

Argo Rollouts replaces Kubernetes Deployments with a CRD that does weighted canary, metric-gated analysis, and automatic rollback. Production recipe, Prometheus AnalysisTemplates, and a side-by-side with Flagger.

Abhishek Patel · 15 min read

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal.


Why a Plain Kubernetes Deployment Isn't Enough

A vanilla Kubernetes Deployment does a rolling update and calls it a day. There's no traffic shaping, no metric gating, no pause-and-analyze. If the new ReplicaSet crashes on startup, Kubernetes will happily try again on the next pod. If it accepts traffic but silently returns 500s on 3% of requests, the rollout completes and your users suffer until PagerDuty fires. Argo Rollouts — a controller from the CNCF-graduated Argo project — closes this gap: it introduces a Rollout CRD that replaces your Deployment and adds progressive delivery primitives: weighted canary steps, pause windows, AnalysisRun jobs that query Prometheus, and automatic rollback when error-rate or latency thresholds are breached.

This guide walks through a production-grade canary recipe — 5% / 25% / 50% / 100% traffic steps with 5-minute analysis windows, Prometheus queries that actually catch regressions, and a comparison with Flagger so you know which tool fits your stack. The advanced patterns — multi-metric composite gates, experiment strategies, and VPC Lattice traffic splitting — are in a follow-up I send to the newsletter.

What Argo Rollouts Adds on Top of Deployments

Definition: Argo Rollouts is a Kubernetes controller that replaces the native Deployment resource with a Rollout CRD, giving you step-based canary and blue-green strategies, automated metric analysis, and native integration with service meshes and ingress controllers for traffic splitting.

The controller runs as a Deployment in the argo-rollouts namespace and watches Rollout, AnalysisTemplate, and AnalysisRun objects. When you push a new image tag to a Rollout, it doesn't just replace pods — it walks through an ordered strategy.canary.steps list, pausing between steps and optionally firing an AnalysisRun that queries your metrics backend.

Rollout vs Deployment — What Actually Changes

| Capability | Deployment | Rollout |
| --- | --- | --- |
| Weighted traffic splitting | No (pod-count proxy only) | Yes (mesh / ingress driven) |
| Pause between steps | No | Yes (timed or indefinite) |
| Metric-gated promotion | No | Yes (AnalysisRun) |
| Automatic rollback on SLO breach | No | Yes |
| Blue-green with preview service | Manual | Native |
| Experiment (A/B) pods | No | Yes (Experiment CRD) |
| Promotion hooks / manual gate | No | Yes (argo rollouts promote) |

The trade-off is that you migrate your YAML. Every Deployment becomes a Rollout with apiVersion: argoproj.io/v1alpha1. Argo also ships a reference-based migration mode (workloadRef) that keeps your existing Deployment but hands scaling off to the controller — useful for a gradual move, awkward long-term. Most teams rip off the band-aid and convert.
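For reference, the migration mode looks roughly like this — a Rollout that points at the existing Deployment through workloadRef instead of carrying its own pod template (a sketch; the checkout-api names are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-api
spec:
  replicas: 10
  workloadRef:          # hand off the existing Deployment's pod template
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  strategy:
    canary:
      steps:
        - setWeight: 25
        - pause: { duration: 5m }
```

You then scale the original Deployment down and let the Rollout own the replica count.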

Production Canary Recipe: 5% / 25% / 50% / 100% With Analysis

This is the canary ladder I run in production. Four traffic steps, each followed by a 5-minute analysis window that must pass before the controller moves to the next weight. Total wall-clock time from deploy to 100%: about 25 minutes. That's long enough to catch slow-burn regressions (memory leaks, connection-pool exhaustion) but short enough that release cadence doesn't suffer.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-api
  namespace: prod
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: checkout-api-canary
      stableService: checkout-api-stable
      trafficRouting:
        istio:
          virtualService:
            name: checkout-api
            routes:
              - primary
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate-latency
            args:
              - name: service-name
                value: checkout-api-canary
        - setWeight: 25
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate-latency
        - setWeight: 50
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate-latency
        - setWeight: 100
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: api
          image: myregistry/checkout-api:v2.4.1
          ports:
            - containerPort: 8080

The analysis step references an AnalysisTemplate that runs in the background against the canary service. If any metric breaches its threshold, the template marks the AnalysisRun as Failed, which aborts the rollout and rolls weight back to 0% on the canary. Canary deployment fundamentals cover why these specific percentages map well to real error distributions — 5% gives enough signal in 5 minutes without exposing too much blast radius.

Prometheus AnalysisTemplate That Actually Catches Regressions

The default "error rate < 1%" check is a beginner trap. It fires on real incidents but also on every transient blip, and it misses regressions that degrade latency without bumping error count. A good AnalysisTemplate checks three things: success rate, p99 latency, and a business-level signal (checkout completion, payment success — whatever you have).

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-latency
  namespace: prod
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 30s
      count: 10
      successCondition: result[0] >= 0.99
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",code!~"5.."}[1m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[1m]))
    - name: p99-latency
      interval: 30s
      count: 10
      successCondition: result[0] <= 0.500
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[1m]))
              by (le)
            )
    - name: checkout-completion
      interval: 60s
      count: 5
      successCondition: result[0] >= 0.92
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(checkout_completed_total{service="{{args.service-name}}"}[2m]))
            /
            sum(rate(checkout_started_total{service="{{args.service-name}}"}[2m]))

Watch out: count is the number of measurements, not the window length. With interval: 30s and count: 10, you're measuring for 5 minutes. failureLimit: 2 means 2 bad measurements abort the run — set this higher than 1 or normal traffic noise will cause false rollbacks. Running Prometheus and Grafana at 1s scrape for your canary namespace gives you denser data than the default 15s and cuts detection latency.

The business-level metric is what separates a mature canary from a theatrical one. You'll occasionally catch a deploy where HTTP 200s stay flat, p99 latency stays flat, but checkout completion drops 8% because a form field got renamed. That's the regression the first two checks miss.

Traffic Splitting: Istio, Linkerd, NGINX, and Gateway API

Argo Rollouts doesn't move packets itself — it tells your mesh or ingress to reweight. The integration lives under strategy.canary.trafficRouting and supports four production-grade providers. Each has sharp edges worth knowing before you pick.

flowchart LR
  Git[Git push: image v2] --> ArgoCD[ArgoCD sync]
  ArgoCD --> Rollout[Rollout CR]
  Rollout --> Controller[Argo Rollouts controller]
  Controller --> VS[Istio VirtualService]
  Controller --> Analysis[AnalysisRun]
  Analysis --> Prom[(Prometheus)]
  Prom -->|success/fail| Controller
  VS --> Stable[Stable pods]
  VS --> Canary[Canary pods]

Istio

The most feature-complete integration. Argo Rollouts patches an existing VirtualService's HTTP route weights in place. Supports header-based and mirror-based canaries. Works with Ambient mode as of Istio 1.22. Downside: you need Istio control-plane expertise, and the service mesh overhead tax is real — expect 2-4ms added p99 latency and ~100MB memory per sidecar.

Linkerd (via SMI TrafficSplit)

Lighter than Istio, uses the SMI TrafficSplit CRD. Header-based routing is limited compared to Istio's VirtualService. The win is operational simplicity — Linkerd's default config is production-ready without tuning.
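The trafficRouting block for the SMI integration is short — a sketch against the earlier checkout-api example (trafficSplitName and rootService per the SMI integration; service names illustrative):

```yaml
strategy:
  canary:
    canaryService: checkout-api-canary
    stableService: checkout-api-stable
    trafficRouting:
      smi:
        trafficSplitName: checkout-api   # TrafficSplit CR the controller creates and reweights
        rootService: checkout-api        # the parent Service clients actually call
```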

NGINX Ingress

Good choice if you already run NGINX and don't want a mesh. Argo patches the nginx.ingress.kubernetes.io/canary-weight annotation. Header- and cookie-based canaries are supported. Limitation: no traffic mirroring, no mTLS between canary and stable. NGINX Ingress on Kubernetes covers the controller's baseline performance, which applies here.
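With NGINX you point the Rollout at your existing Ingress and the controller manages a canary twin of it — a sketch (names illustrative):

```yaml
trafficRouting:
  nginx:
    stableIngress: checkout-api    # existing Ingress for the stable Service
    additionalIngressAnnotations:
      canary-by-header: X-Canary   # optional: route by header instead of weight alone
```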

Gateway API

The newest option, stable as of Argo Rollouts 1.7. Patches HTTPRoute backend weights. This is the right choice for greenfield clusters — Gateway API vs Ingress explains why the Gateway / HTTPRoute split gives you cleaner RBAC and traffic policy separation. Expect this to be the default in 2027.
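Gateway API routing goes through the argoproj-labs traffic-router plugin rather than a built-in block — a sketch (the plugin key and field names should be checked against the plugin's README; route name illustrative):

```yaml
trafficRouting:
  plugins:
    argoproj-labs/gatewayAPI:
      httpRoute: checkout-api-route   # HTTPRoute whose backend weights get patched
      namespace: prod
```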

Step-by-Step: Installing Argo Rollouts and Running Your First Canary

This is the minimal path from zero to a working canary on a fresh cluster. Assumes you have kubectl configured and a Prometheus install already scraping your services.

  1. Install the controller. kubectl create namespace argo-rollouts then kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml. Takes about 90 seconds to reach Ready.
  2. Install the kubectl plugin. Grab the binary from the Argo Rollouts releases page, chmod +x kubectl-argo-rollouts-darwin-amd64, move to /usr/local/bin/kubectl-argo-rollouts. Now kubectl argo rollouts works.
  3. Create two Services. You need a stable and a canary Service pointing at the same pod selector. The controller injects a rollouts-pod-template-hash selector into each Service so stable and canary traffic land on the correct ReplicaSet.
  4. Convert your Deployment to a Rollout. Change kind: Deployment to kind: Rollout, apiVersion: apps/v1 to apiVersion: argoproj.io/v1alpha1, add the strategy.canary block. The pod spec itself is identical.
  5. Apply the AnalysisTemplate. Ship the Prometheus template from the previous section. Verify it loads with kubectl get analysistemplates -n prod.
  6. Trigger a rollout. Patch the image: kubectl argo rollouts set image checkout-api api=myregistry/checkout-api:v2.4.1. Watch progress: kubectl argo rollouts get rollout checkout-api --watch.
  7. Promote or abort manually if needed. kubectl argo rollouts promote checkout-api skips the next pause. kubectl argo rollouts abort checkout-api forces rollback.
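For step 3, the two Services can be minimal — a sketch (names and ports illustrative; the selector must match the Rollout's pod labels):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: checkout-api-stable
  namespace: prod
spec:
  selector:
    app: checkout-api   # controller adds a rollouts-pod-template-hash selector here
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: checkout-api-canary
  namespace: prod
spec:
  selector:
    app: checkout-api   # patched to select only canary pods during a rollout
  ports:
    - port: 80
      targetPort: 8080
```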

Total setup time for a team already running ArgoCD: about 2 hours, most of it writing AnalysisTemplates for your specific metrics. Teams not yet doing GitOps should read ArgoCD vs FluxCD first — Rollouts pairs naturally with ArgoCD because both are maintained under argoproj.

Argo Rollouts vs Flagger: Which Should You Pick

These two are the only production-credible options. I've run both — Flagger for 14 months at a previous shop on Linkerd, Argo Rollouts for the last 2 years on Istio and NGINX. They solve the same problem from different angles.

Side-by-Side Comparison

| Dimension | Argo Rollouts | Flagger |
| --- | --- | --- |
| Governance | CNCF Graduated (Argo project) | CNCF Graduated (Flux sub-project) |
| Primary ecosystem | Argo (ArgoCD) | Flux (FluxCD) |
| CRD model | Replaces Deployment with Rollout | Reads existing Deployment, adds Canary CR |
| Manifest migration effort | Every Deployment rewritten | Zero manifest changes |
| Step granularity | Explicit (setWeight, pause, analysis) | Declarative (maxWeight, stepWeight, threshold) |
| UI | ArgoCD rollout extension | None (Grafana dashboards only) |
| Mesh support | Istio, Linkerd, SMI, App Mesh, Traefik | Istio, Linkerd, App Mesh, Kuma, NGINX, Gloo, Skipper, Contour, Gateway API |
| Experiment / A/B pods | Yes (Experiment CRD) | No |
| Manual promotion gate | Yes (indefinite pause) | No (automated only) |
| Webhook extensibility | Limited | Extensive (pre-rollout, rollout, confirm, post-rollout) |
| Resource footprint (idle) | ~200MB controller | ~80MB controller |

When Argo Rollouts Wins

Step-based explicit control. You want to say "canary to 5%, wait 5 minutes, run analysis, if good wait for a human to click promote, then go to 25%". Flagger can't pause indefinitely for manual approval — its model is fully automated. Argo also wins on the Experiment CRD (running parallel A/B pods for real comparative analysis, separate from the canary path) and on any team already invested in ArgoCD.
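That human-in-the-loop ladder is expressed with a pause step that has no duration — a sketch extending the earlier recipe (template name illustrative):

```yaml
steps:
  - setWeight: 5
  - pause: { duration: 5m }
  - analysis:
      templates:
        - templateName: success-rate-latency
  - pause: {}   # no duration: hold here until someone runs `kubectl argo rollouts promote`
  - setWeight: 25
```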

When Flagger Still Wins

Zero-manifest adoption is real. If you have 80 microservices and rewriting every Deployment is a political nightmare, Flagger drops in without touching team YAML. It also has broader mesh support (9 providers vs 5) and the webhook system is more powerful if you want custom load-testing or Slack approval hooks. Linkerd shops in particular tend to reach for Flagger because the Buoyant-maintained Linkerd-Flagger integration is tighter than Argo's.

Common Failure Modes and How to Fix Them

Three things will bite you in the first month of running Rollouts. Knowing them in advance saves a weekend.

AnalysisRun Flaps Due to Sparse Traffic

If your canary service sees fewer than ~50 requests in the analysis window, the success-rate query returns NaN or wild swings on single failures. Fix: add a minimum request-rate guard (sum(rate(http_requests_total[1m])) > 0.5) as a prerequisite metric, or increase failureLimit so a single bad sample doesn't abort. For truly low-traffic services (internal admin tools), skip analysis entirely and rely on timed pauses.
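One way to implement the guard is to `and` the success-rate query with a minimum request rate, so the result goes empty (rather than NaN) under sparse traffic, and then treat an empty result as a pass. A sketch — the len(result) == 0 idiom and the 0.5 rps threshold are assumptions to validate against your Argo Rollouts version:

```yaml
metrics:
  - name: success-rate
    interval: 30s
    count: 10
    failureLimit: 2
    # An empty result (traffic below the guard) counts as success instead of aborting.
    successCondition: len(result) == 0 || result[0] >= 0.99
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          (
            sum(rate(http_requests_total{service="{{args.service-name}}",code!~"5.."}[1m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[1m]))
          ) and (sum(rate(http_requests_total{service="{{args.service-name}}"}[1m])) > 0.5)
```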

The Canary Gets Promoted Too Fast

Default pause durations feel short. A 2-minute analysis catches 90% of regressions, but memory leaks and connection-pool exhaustion often take 10-20 minutes to manifest. For user-facing services, I run at least one 15-minute analysis window before hitting 100%. Blue-green deployments are an alternative if your regression pattern is "works for 30 minutes then explodes".

Stable and Canary Pods Diverge in Config

Argo Rollouts uses the pod template hash to route traffic, which means the two ReplicaSets must differ only in the image. If you bump a ConfigMap and the Deployment doesn't reference it by hash, both stable and canary pods pick up the new config simultaneously — defeating the canary. Use ConfigMap hash annotations or switch to immutable ConfigMaps with versioned names.
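Immutable, versioned ConfigMaps are the simplest fix — a sketch (name and data illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: checkout-api-config-v7   # versioned name: bump the suffix on every change
immutable: true                  # the API server rejects in-place edits
data:
  POOL_SIZE: "50"
```

Because the Rollout's pod template references checkout-api-config-v7 by name, a config change means a new name, a new pod template hash, and therefore a normal canary rollout instead of a silent in-place swap.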

Experiment and Blue-Green: The Other Two Strategies

Canary is the workhorse, but Argo Rollouts ships two more strategies worth knowing.

Blue-Green

All-at-once cutover with a preview service. You deploy a full copy of the new version behind previewService, run smoke tests against it, then flip activeService when you're happy. Zero traffic gradient — it's either blue or green. Use it when per-request routing is hard (stateful protocols, websockets with session affinity) or when compliance requires "full parity before cutover".
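In Rollout terms the strategy block looks roughly like this (a sketch; service and template names illustrative):

```yaml
strategy:
  blueGreen:
    activeService: checkout-api-active    # receives live traffic
    previewService: checkout-api-preview  # receives the new version for smoke tests
    autoPromotionEnabled: false           # hold the cutover until a manual promote
    prePromotionAnalysis:
      templates:
        - templateName: success-rate-latency
```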

Experiment (A/B)

A Rollout step can launch an Experiment that runs N parallel pods of an alternative version, independently of the canary pipeline. You use it to compare feature variants, not to ship the new version. Combined with feature flags, this is the right primitive for real A/B testing — keep the code path in a flag, run the experiment to measure, then ship at 100% via canary when the flag wins.
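An experiment step sketched in YAML (field names per the Experiment docs; duration and template names are illustrative):

```yaml
- experiment:
    duration: 30m
    templates:
      - name: variant-a
        specRef: stable   # clone the stable pod spec
      - name: variant-b
        specRef: canary   # clone the new version's pod spec
    analyses:
      - name: compare
        templateName: success-rate-latency
```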

Prometheus Queries You Actually Want in Your Analysis Templates

A starter pack I've refined over four companies. Copy-paste into your AnalysisTemplate, change the service label, and you have something meaningfully better than the default "error rate" check.

# Success rate (non-5xx over total)
sum(rate(http_requests_total{service="$SVC",code!~"5.."}[2m]))
/
sum(rate(http_requests_total{service="$SVC"}[2m]))

# p99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="$SVC"}[2m])) by (le)
)

# Saturation — in-flight request ratio vs configured concurrency limit
avg(
  sum(http_requests_in_flight{service="$SVC"}) by (pod)
  /
  on(pod) max_over_time(container_spec_cpu_quota{container="$SVC"}[5m])
)

# Error-budget burn rate (14-day window) — gates on SLO health not raw error rate
(
  sum(rate(http_requests_total{service="$SVC",code=~"5.."}[2m]))
  /
  sum(rate(http_requests_total{service="$SVC"}[2m]))
) / 0.001  # where 0.001 is your error budget (99.9% SLO)

# Goroutine leak guard (Go services) — go_goroutines is a gauge, so use deriv(), not rate()
deriv(go_goroutines{service="$SVC"}[5m])

The error-budget burn-rate query is what the Google SRE workbook recommends for alerts — use it in analysis too. A burn rate above 14x your budget means you'd exhaust the monthly budget in 2 days at current pace, which is a solid rollback signal. Teams with formal SLOs and error budgets have a natural home for this metric.

Production Operations: Dashboards, Alerts, and Runbooks

Shipping Rollouts without operational hygiene is the "we have Kubernetes" mistake repeated. Three things you want in place before the first production canary:

  • Grafana dashboard tracking rollout_info, rollout_phase, and analysis_run_metric_phase. The official dashboard JSON is a starting point; add a panel per environment.
  • Alert on stuck rollouts — a Rollout sitting in Paused for more than 2 hours usually means a human forgot to promote. An expression like kube_rollout_phase{phase="Paused"} == 1 and time() - kube_rollout_updated_timestamp > 7200 works if you export rollout phase via a kube-state-metrics custom-resource config; adjust the metric names to what your exporter actually exposes.
  • Runbook for aborted rollouts — when an AnalysisRun fails, on-call needs a 60-second answer for "what do I look at". Link the runbook from the failure alert, include the Rollout name, the failing metric, and the two dashboards (service and cluster).

Teams already running Prometheus and Grafana have most of this plumbing; Rollouts adds about six custom metrics you'll want in one dashboard panel. I've seen the investment pay back within the first month — caught two silent regressions that would have shipped under plain Deployments.

FAQ

How does Argo Rollouts canary analysis work?

You define an AnalysisTemplate with Prometheus (or Datadog, CloudWatch, Wavefront) queries and a successCondition. During a canary step, the controller creates an AnalysisRun that executes the queries at the specified interval. If failureLimit bad measurements accumulate, the rollout aborts and traffic returns to the stable version. Success promotes to the next step.

What is the difference between Argo Rollouts and ArgoCD?

ArgoCD is a GitOps controller that syncs cluster state to match a Git repo. Argo Rollouts is a progressive delivery controller that handles canary and blue-green strategies. They are separate projects under the argoproj umbrella. ArgoCD tells the cluster "apply this YAML"; Rollouts then decides how to roll it out safely. Most teams run both.

Is Argo Rollouts better than Flagger?

Neither is strictly better. Argo Rollouts wins on explicit step-based control and tight ArgoCD integration. Flagger wins on zero-manifest adoption (keeps your Deployments intact) and broader service-mesh coverage. Choose Rollouts if you want manual promotion gates and Experiment pods; choose Flagger if you are already a Flux shop or can't rewrite every Deployment.

Does Argo Rollouts need a service mesh?

No. You can run Argo Rollouts with just NGINX Ingress or the newer Gateway API and get weighted traffic splitting without a full mesh. A mesh (Istio, Linkerd) unlocks header-based canaries, mTLS between canary and stable, and traffic mirroring — but the baseline canary workflow works fine without one.

How long should a canary analysis window be?

5 minutes per step for typical web services, 15 minutes before 100% promotion for user-facing critical paths. Shorter windows miss slow-burn regressions (memory leaks, connection-pool exhaustion). Longer windows slow release velocity without catching more real issues. Adjust based on your service's error-budget burn rate — faster burn means you can commit faster.

Can Argo Rollouts do blue-green deployments?

Yes. The strategy.blueGreen block replaces strategy.canary and requires a previewService and activeService. The new version deploys fully behind preview, you run smoke tests, then promote flips the active service. No gradual traffic shift — it's all-or-nothing, which is the right model for stateful protocols and strict compliance environments.

What metrics providers does Argo Rollouts support?

Prometheus, Datadog, New Relic, Wavefront, CloudWatch, Graphite, InfluxDB, Kayenta (Spinnaker-style multi-source analysis), SkyWalking, and custom Kubernetes Jobs or web provider (arbitrary HTTP endpoints). Prometheus is the most common; the others are useful when your observability stack is already committed elsewhere.

Wrap Up: When to Adopt Argo Rollouts

Progressive delivery with Argo Rollouts is the right move when you've outgrown the "deploy and pray" model of plain Kubernetes Deployments — specifically, when a bad deploy costs you real money or real users and your team has the operational maturity to run Prometheus-backed analysis. The CRD migration is real work, but the payoff is measured in prevented incidents. If you already run ArgoCD, adding Rollouts is the next natural step. If you run FluxCD, Flagger fits better. And if you're still on manual kubectl apply, fix that first — progressive delivery on top of ad-hoc deploys is putting a turbo on a car with no brakes.
