Karpenter vs Cluster Autoscaler: Kubernetes Node Scaling Compared

Cluster Autoscaler scales pre-defined node groups. Karpenter provisions optimal instances in real time. Compare scaling speed, cost savings, Spot handling, multi-arch support, and get a step-by-step EKS migration guide.

Abhishek Patel · 16 min read


45 Seconds vs 4 Minutes: The Number That Ended the Debate

Karpenter provisioned the first Ready node in 47 seconds. Cluster Autoscaler took 3 minutes 58 seconds to do the same thing on identically configured EKS clusters. Same Kubernetes version, same AMI family, same instance class, same us-east-1 AZ. The only difference was which controller watched the pending pods.

That gap -- roughly 5x -- is not an edge case or a lab artefact. I have reproduced it on three production clusters during migrations over the past two years, and the reason is architectural: Cluster Autoscaler (CA) polls for pending pods and then asks an Auto Scaling Group to grow a pre-defined shape; Karpenter calls the EC2 Fleet API directly with the exact instance types the pending pods need. One is a bureaucratic escalation. The other is a direct requisition.

The speed delta is the headline, but scaling speed alone would not be enough reason to rewrite a working autoscaler. The deeper reason to look at Karpenter is that everything downstream of that API call -- bin-packing, Spot handling, ARM selection, consolidation -- compounds into 25 to 40 percent monthly compute savings. This guide digs into both tools side by side with real numbers, configuration samples, and a migration path I have walked three times.

Cluster Autoscaler: The Original Design and Its Ceiling

Cluster Autoscaler (CA) is a Kubernetes SIG project that has shipped since roughly 2017. Its mental model is older than a lot of modern cloud primitives and it shows. CA does not know what an instance is. CA knows node groups -- opaque pools of identically configured machines backed by AWS Auto Scaling Groups (ASGs), GCE Managed Instance Groups, or Azure VM Scale Sets. To scale up, CA increments the desired count on an ASG and waits for the cloud provider to do the rest. It cannot tell EC2 "launch me one m7g.2xlarge" because CA has no concept of m7g.2xlarge -- only of a group that happens to contain them.

The scaling loop, in timing order:

  1. Watch for unschedulable pods -- CA polls the Kubernetes API every 10 seconds (--scan-interval) for pods in Pending with scheduling failures. This alone adds up to 10 seconds of latency before CA even notices.
  2. Simulate scheduling per node group -- For each ASG, CA simulates whether a pending pod could be placed on a new node of that shape. When more than one group qualifies, the configured expander strategy (for example least-waste, priority, or random) decides which group to grow.
  3. Call the cloud provider -- CA increments the ASG desired count. EC2 translates that into a RunInstances call. This is where most of the wall-clock time disappears.
  4. Wait for node join -- Instance boots, cloud-init and kubelet run, node registers with the control plane, node becomes Ready. CA has no hooks into this.
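
Step 2 of the loop above can be pictured with a small simulation. This is a rough sketch, not CA's implementation: it assumes a simplified least-waste rule (pick the feasible group whose new node would leave the smallest average fraction of CPU and memory unused), and the group names and shapes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    name: str           # hypothetical ASG name
    cpu: float          # allocatable vCPU per node
    memory_gib: float   # allocatable memory per node


@dataclass(frozen=True)
class Pod:
    cpu: float
    memory_gib: float


def pods_that_fit(group, pods):
    """Greedily place pending pods on ONE new node of this group's shape."""
    cpu_left, mem_left, placed = group.cpu, group.memory_gib, []
    for pod in pods:
        if pod.cpu <= cpu_left and pod.memory_gib <= mem_left:
            cpu_left -= pod.cpu
            mem_left -= pod.memory_gib
            placed.append(pod)
    return placed


def wasted_fraction(group, placed):
    """Average of the unused CPU and unused memory fractions on the new node."""
    used_cpu = sum(p.cpu for p in placed)
    used_mem = sum(p.memory_gib for p in placed)
    return ((1 - used_cpu / group.cpu) + (1 - used_mem / group.memory_gib)) / 2


def least_waste(groups, pods):
    """Return the feasible group with the least waste, or None if no group fits any pod."""
    feasible = [(g, pods_that_fit(g, pods)) for g in groups]
    feasible = [(g, placed) for g, placed in feasible if placed]
    if not feasible:
        return None
    return min(feasible, key=lambda item: wasted_fraction(*item))[0]


groups = [NodeGroup("general-xlarge", 4, 16), NodeGroup("general-4xlarge", 16, 64)]
pending = [Pod(1, 2)] * 3
choice = least_waste(groups, pending)
print(choice.name if choice else "no node group can host these pods")
```

The `None` branch is the ceiling described in this section: a pod larger than every pre-defined group shape stays Pending, because the simulation can only choose among groups that already exist.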

Scale-down is cruder. CA identifies nodes under the utilization threshold (default 50 percent) for the scale-down-unneeded-time (default 10 minutes), cordons, drains, and decrements the ASG. A node sitting at 51 percent forever will never be replaced with something smaller -- CA cannot reshape a cluster, only grow or shrink it.
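
The threshold rule is easy to model. The sketch below assumes only the two defaults quoted above (a 50 percent threshold and a 10-minute unneeded window); the node names and samples are made up, and real CA applies further checks (PDBs, pod placement) before removing anything.

```python
from dataclasses import dataclass

UTILIZATION_THRESHOLD = 0.5   # --scale-down-utilization-threshold default
UNNEEDED_SECONDS = 10 * 60    # --scale-down-unneeded-time default


@dataclass(frozen=True)
class NodeSample:
    name: str             # hypothetical node name
    utilization: float    # requested / allocatable, between 0.0 and 1.0
    seconds_below: int    # how long utilization has stayed under the threshold


def scale_down_candidates(nodes):
    """Names of nodes that pass the simplified utilization-and-time test."""
    return [
        n.name
        for n in nodes
        if n.utilization < UTILIZATION_THRESHOLD and n.seconds_below >= UNNEEDED_SECONDS
    ]


nodes = [
    NodeSample("node-a", 0.51, 0),     # just over the threshold: never a candidate
    NodeSample("node-b", 0.30, 900),   # idle long enough: candidate
    NodeSample("node-c", 0.30, 60),    # idle, but not for long enough yet
]
print(scale_down_candidates(nodes))
```

Note that nothing in this rule ever proposes a smaller replacement for node-a; the only possible outcomes are "remove" or "keep".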

Cluster Autoscaler Configuration

# cluster-autoscaler-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
          command:
            - ./cluster-autoscaler
            - --v=4
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
            - --balance-similar-node-groups
            - --scale-down-delay-after-add=5m
            - --scale-down-unneeded-time=10m
            - --scan-interval=10s
          resources:
            requests:
              cpu: 100m
              memory: 600Mi

Notice the --node-group-auto-discovery flag. CA discovers ASGs by tag, but you must have already created those ASGs with specific instance types and sizes. If your workload needs a c7g.2xlarge (ARM, compute-optimized) but your ASGs only contain m6i.xlarge (x86, general purpose), CA cannot help. You would need to create a new ASG, tag it, and wait for CA to discover it.

For context: Kubernetes node autoscaling adds or removes worker nodes based on pod demand, and operates independently of pod-level autoscalers like HPA and VPA. Both tools in this article are node autoscalers -- neither touches pod replicas or resource requests.

Karpenter: Group-less Provisioning

Karpenter takes a group-less approach. Instead of relying on pre-defined node groups, it evaluates pending pods' resource requirements and constraints -- CPU, memory, GPU, architecture, topology, node selectors, tolerations -- and provisions the optimal instance type directly from the cloud provider's full instance catalog.

The scaling loop:

  1. Watch for unschedulable pods -- Karpenter uses informers (not polling) to react to scheduling failures in near real time.
  2. Batch pending pods -- Karpenter waits briefly to batch multiple pending pods into a single provisioning decision, reducing API calls and improving bin-packing. By default the batch closes after 1 second with no new pending pods, and is capped at 10 seconds in total.
  3. Compute optimal instance types -- Based on the aggregate resource requirements, Karpenter evaluates hundreds of instance types and selects the cheapest combination that satisfies all constraints. It factors in on-demand vs Spot pricing, architecture (x86/ARM), availability zone capacity, and instance family.
  4. Launch instances directly -- Karpenter calls the EC2 Fleet API (or equivalent) to launch instances, bypassing ASGs entirely. The instance boots with a pre-configured AMI and joins the cluster.
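
Steps 2 and 3 amount to "add up what the batch needs, filter the catalog by the NodePool constraints, take the cheapest match". A minimal sketch of that idea follows. It is not Karpenter's scheduler: it places the whole batch on a single node, ignores allocatable overhead and zones, and the hourly prices are illustrative placeholders rather than real AWS pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offering:
    instance_type: str
    arch: str            # "amd64" or "arm64"
    capacity_type: str   # "spot" or "on-demand"
    vcpu: int
    memory_gib: int
    hourly_usd: float    # illustrative only, not real pricing


def cheapest_fit(offerings, pod_requests, allowed_arch, allowed_capacity):
    """Cheapest single offering that fits the summed (cpu, memory_gib) requests."""
    need_cpu = sum(cpu for cpu, _ in pod_requests)
    need_mem = sum(mem for _, mem in pod_requests)
    candidates = [
        o for o in offerings
        if o.arch in allowed_arch
        and o.capacity_type in allowed_capacity
        and o.vcpu >= need_cpu
        and o.memory_gib >= need_mem
    ]
    return min(candidates, key=lambda o: o.hourly_usd, default=None)


catalog = [
    Offering("m6i.2xlarge", "amd64", "on-demand", 8, 32, 0.40),
    Offering("m7g.2xlarge", "arm64", "on-demand", 8, 32, 0.33),
    Offering("m7g.2xlarge", "arm64", "spot", 8, 32, 0.12),
    Offering("m7g.xlarge", "arm64", "spot", 4, 16, 0.06),
]
batch = [(1, 2)] * 6   # six pending pods, each requesting 1 vCPU and 2 GiB

pick = cheapest_fit(catalog, batch, {"amd64", "arm64"}, {"spot", "on-demand"})
print(pick.instance_type, pick.capacity_type)
```

Narrowing the allowed sets (for example to amd64 and on-demand only) changes the answer without touching any node group definition, which is the practical meaning of "group-less".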

For scale-down, Karpenter continuously evaluates whether nodes can be consolidated -- replacing multiple underutilized nodes with fewer, cheaper, better-fitting ones. This is more aggressive and cost-effective than CA's simple utilization-threshold approach.

Karpenter NodePool and EC2NodeClass Configuration

# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        workload-type: general
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h  # Replace nodes every 30 days
  limits:
    cpu: "1000"
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: KarpenterNodeRole-my-cluster
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125

Compare this to the CA setup. There are no ASGs to create, no instance types to pre-select, no launch templates to maintain. Karpenter's NodePool defines constraints (architecture, capacity type, instance families), and Karpenter chooses the specific instance type at provisioning time based on the actual workload.

Scaling Speed: Karpenter vs Cluster Autoscaler

This is where Karpenter's architectural advantage shows most clearly. I benchmarked both tools on EKS (Kubernetes 1.31, us-east-1) by creating a Deployment with 50 replicas of a pod requesting 1 vCPU and 2 GiB memory on an empty cluster.

| Metric | Karpenter (v1.1) | Cluster Autoscaler (v1.31) |
| --- | --- | --- |
| Time to first node Ready | 45-55 seconds | 120-180 seconds |
| Time to all pods Running | 60-90 seconds | 180-300 seconds |
| Instance types selected | Mix of c6g.2xlarge, m7g.2xlarge (ARM) | m6i.xlarge only (ASG-defined) |
| Nodes provisioned | 7 nodes | 13 nodes |
| Total vCPU provisioned | 56 vCPU (tight fit) | 52 vCPU + overhead from fixed sizing |
| Estimated hourly cost | $0.89 (Spot ARM instances) | $1.56 (on-demand x86 instances) |

Karpenter was roughly 3x faster end-to-end and 43% cheaper per hour for the same workload. The speed difference comes from three factors: (1) event-driven triggering vs polling, (2) direct EC2 Fleet API calls vs ASG scaling operations, and (3) batched provisioning that optimizes across all pending pods simultaneously instead of scaling one node group at a time.
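
For readers who want to check the headline ratios, they follow from the benchmark table. The sketch below uses the midpoints of the "time to all pods Running" ranges, which is an assumption about how "roughly 3x" was derived.

```python
# Ranges and hourly costs copied from the benchmark table (seconds, USD).
karpenter_all_running = (60, 90)
ca_all_running = (180, 300)
karpenter_hourly_usd = 0.89
ca_hourly_usd = 1.56


def midpoint(value_range):
    low, high = value_range
    return (low + high) / 2


speedup = midpoint(ca_all_running) / midpoint(karpenter_all_running)
hourly_saving = 1 - karpenter_hourly_usd / ca_hourly_usd

print(f"end-to-end speedup: {speedup:.1f}x")
print(f"hourly cost saving: {hourly_saving:.0%}")
```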

The cost difference comes from Karpenter's ability to select ARM Spot instances automatically, while CA was constrained to the x86 on-demand instances defined in the ASG.

Cost Optimization: Consolidation vs Scale-Down

Cost optimization is where the two tools diverge most. Cluster Autoscaler has one strategy: remove underutilized nodes. Karpenter has three.

| Strategy | Karpenter | Cluster Autoscaler |
| --- | --- | --- |
| Remove empty nodes | Yes (after consolidateAfter; 30s in the NodePool above) | Yes (after 10min by default) |
| Remove underutilized nodes | Yes -- drains and repacks pods onto other nodes | Yes -- but only if utilization < 50% |
| Replace with cheaper instances | Yes -- actively swaps nodes for better-fitting, cheaper types | No -- stuck with ASG instance type |
| Spot-to-Spot replacement | Yes, behind the SpotToSpotConsolidation feature gate -- moves to cheaper Spot instance types | No |
| Right-sizing | Yes -- replaces oversized nodes with smaller ones as pods are removed | No |

Karpenter's consolidation loop continuously evaluates whether the current set of nodes is optimal. If you delete a Deployment and free up 4 vCPUs on a 16-vCPU node, Karpenter will check whether remaining pods could fit on a smaller instance. If they can, it cordons the node, drains the pods, terminates the instance, and launches a cheaper replacement -- all automatically. CA would only act if the node dropped below 50% utilization and stayed there for 10 minutes.
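
The replace decision can be sketched as "is there a cheaper type that still fits what is left on the node?". This is a simplified illustration, not Karpenter's algorithm: it looks at one node in isolation, ignores allocatable overhead and disruption budgets, and the prices are placeholders.

```python
def consolidation_target(current, remaining_cpu, remaining_mem_gib, catalog):
    """Return the cheapest type that is cheaper than `current` and still fits, else None."""
    cheaper = [
        c for c in catalog
        if c["hourly_usd"] < current["hourly_usd"]
        and c["vcpu"] >= remaining_cpu
        and c["memory_gib"] >= remaining_mem_gib
    ]
    return min(cheaper, key=lambda c: c["hourly_usd"], default=None)


# Illustrative catalog; hourly prices are placeholders, not real AWS pricing.
catalog = [
    {"instance_type": "m7g.xlarge", "vcpu": 4, "memory_gib": 16, "hourly_usd": 0.16},
    {"instance_type": "m7g.2xlarge", "vcpu": 8, "memory_gib": 32, "hourly_usd": 0.33},
    {"instance_type": "m7g.4xlarge", "vcpu": 16, "memory_gib": 64, "hourly_usd": 0.65},
]
current = catalog[2]   # the 16-vCPU node from the example above

# 12 vCPU still requested after freeing 4: nothing smaller fits, so the node stays.
print(consolidation_target(current, 12, 40, catalog))

# Only 7 vCPU left: an 8-vCPU type fits and is cheaper, so the node is replaced.
print(consolidation_target(current, 7, 20, catalog)["instance_type"])
```

The first call shows that freeing capacity does not always lead to a replacement; the check runs continuously, and the swap happens only once the remaining pods fit a cheaper shape.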

Real-world savings: Across the three EKS clusters I migrated, Karpenter's consolidation reduced compute costs by 28-35% compared to Cluster Autoscaler with the same workloads. Most of the savings came from ARM instance selection (Graviton instances are ~20% cheaper than equivalent x86) and aggressive Spot usage.

Spot Instance Handling

Spot instances offer 60-90% discounts but can be interrupted with 2 minutes of notice. How each tool handles this matters significantly for reliability.

Cluster Autoscaler has no native Spot awareness. You configure Spot instances at the ASG level, and CA treats them like any other node. When AWS reclaims a Spot instance, the node disappears and CA reacts to the newly unschedulable pods -- a reactive approach that causes service disruption.

Karpenter has first-class Spot support:

  • Diversified allocation -- Karpenter spreads Spot requests across many instance types and availability zones using the price-capacity-optimized strategy, reducing interruption probability.
  • Interruption handling -- Karpenter watches an SQS queue that receives EC2 Spot interruption warnings, along with scheduled maintenance, instance stopping, and instance terminating events. When it detects an upcoming interruption, it proactively cordons the node, drains pods, and starts provisioning a replacement inside the 2-minute window.
  • Fallback to on-demand -- When a NodePool allows both spot and on-demand, Karpenter falls back to on-demand if Spot capacity is unavailable for every matching instance type. No manual intervention needed. A Spot-only NodePool, like the batch example below, has no such fallback.

# Spot-optimized NodePool for batch workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["xlarge", "2xlarge", "4xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: workload-type
          value: batch
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s

Multi-Architecture Support: x86 and ARM

ARM-based instances (AWS Graviton, Ampere on GCP/Azure) offer 20-40% better price-performance than equivalent x86 instances. Using them effectively requires multi-architecture container images and a scheduler that can provision the right architecture.

Cluster Autoscaler requires separate ASGs for x86 and ARM nodes. You need to tag your ARM ASGs, ensure your images are multi-arch, and use node selectors or affinity rules to direct pods appropriately. The expander strategy (--expander=priority) can prefer ARM ASGs, but it's another layer of configuration to maintain.

Karpenter handles this natively. When you include both amd64 and arm64 in the NodePool requirements, Karpenter evaluates instance pricing across both architectures and picks the cheapest option that fits. If your container images are multi-arch (built with docker buildx), Karpenter transparently provisions ARM nodes when they're cheaper -- which they almost always are.

Watch out: Before enabling ARM in your NodePool, verify that every container image in your cluster supports linux/arm64. A single x86-only image will fail on ARM nodes, typically with an exec format error that surfaces as CrashLoopBackOff. Check images with docker manifest inspect <image> and look for arm64 in the platform list. Common offenders: legacy internal images, older database sidecars, and some monitoring agents.
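
That check is easy to script across many images. The sketch below parses the JSON printed by docker manifest inspect and looks for a linux/arm64 entry. The helper names are hypothetical. It is deliberately conservative: a single-architecture image has no manifests list, so it is reported as unsupported and needs a manual look.

```python
import json
import subprocess


def supports_arm64(manifest_json: str) -> bool:
    """True if a manifest list (as printed by `docker manifest inspect`) has linux/arm64."""
    doc = json.loads(manifest_json)
    for entry in doc.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == "linux" and platform.get("architecture") == "arm64":
            return True
    return False


def inspect_image(image: str) -> str:
    """Run `docker manifest inspect <image>` and return its JSON output."""
    result = subprocess.run(
        ["docker", "manifest", "inspect", image],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Trimmed sample documents in the shape of a multi-arch and a single-arch manifest.
multi_arch = json.dumps({"manifests": [
    {"platform": {"architecture": "amd64", "os": "linux"}},
    {"platform": {"architecture": "arm64", "os": "linux"}},
]})
x86_only = json.dumps({"manifests": [
    {"platform": {"architecture": "amd64", "os": "linux"}},
]})
single_manifest = json.dumps({"schemaVersion": 2, "layers": []})

for label, doc in [("multi-arch", multi_arch), ("x86-only", x86_only), ("single", single_manifest)]:
    print(label, supports_arm64(doc))
```

In practice you would feed `supports_arm64(inspect_image(name))` the image names collected from your running pods and review anything that comes back False.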

Feature-by-Feature Comparison

| Feature | Karpenter (v1.1) | Cluster Autoscaler (v1.31) |
| --- | --- | --- |
| Scaling trigger | Event-driven (informers) | Polling (default 10s interval) |
| Node group dependency | None -- group-less provisioning | Requires ASGs / MIGs / VMSS |
| Instance type selection | Automatic from full catalog | Fixed per node group |
| Bin-packing | Cross-pod batched optimization | Per-node-group simulation |
| Scale-up speed | 45-60 seconds | 2-5 minutes |
| Scale-down | Consolidation (replace + remove) | Remove only (utilization threshold) |
| Spot support | Native (interruption handling, fallback) | Via ASG configuration only |
| Multi-arch (x86/ARM) | Native (single NodePool) | Separate ASGs required |
| GPU scheduling | Automatic GPU instance selection | Dedicated GPU ASGs |
| Node expiry / rotation | Built-in (expireAfter) | External tooling needed |
| Cloud support | AWS (GA), Azure (beta) | AWS, GCP, Azure, and 10+ others |
| Governance | Core project under Kubernetes SIG Autoscaling (kubernetes-sigs/karpenter); AWS provider maintained by AWS | Part of Kubernetes SIG Autoscaling |

Migration Guide: Cluster Autoscaler to Karpenter on EKS

Migrating a running EKS cluster from Cluster Autoscaler to Karpenter can be done with zero downtime. The key is running both systems in parallel during the transition. Here is the step-by-step process I've used in production.

Step 1: Install Karpenter

Install Karpenter using Helm alongside your existing Cluster Autoscaler. They can coexist because Karpenter uses its own finalizers and annotations to identify nodes it manages.

# Set environment variables
export KARPENTER_VERSION="1.1.0"
export CLUSTER_NAME="my-cluster"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"

# Install Karpenter
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace kube-system \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

Step 2: Create NodePools with Taints

Create Karpenter NodePools but initially add a taint so that existing workloads do not get scheduled on Karpenter-managed nodes until you are ready.

# migration-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: migration
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: karpenter.sh/migration
          effect: NoSchedule
  limits:
    cpu: "200"

Step 3: Migrate Workloads Incrementally

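Add the toleration to one workload at a time. A toleration only permits scheduling onto the tainted nodes; it does not force it, so pair it with a nodeSelector on the karpenter.sh/nodepool label to steer those pods onto Karpenter-managed nodes. Monitor each workload before proceeding to the next.

# Add toleration and node selector to a deployment
spec:
  template:
    spec:
      # Pin the pods to nodes provisioned by the "migration" NodePool
      nodeSelector:
        karpenter.sh/nodepool: migration
      # Allow the pods onto nodes carrying the migration taint
      tolerations:
        - key: karpenter.sh/migration
          operator: Exists
          effect: NoSchedule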

Step 4: Remove the Migration Taint

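Once all critical workloads are validated on Karpenter nodes, remove the taint from the NodePool. New pods can then schedule on Karpenter-managed nodes without a toleration, and Karpenter will provision capacity for any pending pod that fits the NodePool's requirements and limits.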

Step 5: Scale Down CA-Managed Node Groups

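Cordon and drain the ASG-managed nodes so their pods reschedule onto Karpenter capacity, then gradually reduce the minimum and desired capacity of your ASGs to zero. Lowering the desired count on its own can terminate instances before their pods have been drained, so drain first. Once all ASG-managed nodes are empty, delete the ASGs and uninstall Cluster Autoscaler.

# Cordon and drain each ASG-managed node first (repeat per node)
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Scale down ASG-managed nodes
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-cluster-workers \
  --min-size 0 --desired-capacity 0

# Uninstall Cluster Autoscaler after all nodes are drained
kubectl delete deployment cluster-autoscaler -n kube-system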

Note: Keep your managed node group with at least 2 nodes running system components (CoreDNS, kube-proxy, Karpenter itself) until you configure Karpenter to handle those with a dedicated system NodePool. Karpenter cannot provision the node it runs on.

Failure Modes: What Breaks in Production

Both tools have well-worn sharp edges. These are the ones that have bitten me or teams I have advised.

Karpenter Consolidation Churn

Aggressive consolidation is a feature, but with a short consolidateAfter (such as the 30s used in the NodePool above) and mixed-size workloads you can end up with Karpenter replacing nodes every few minutes. Each replacement triggers pod eviction, rescheduling, and (on applications with slow startup) user-visible cold starts. If you see node age distributions trending below 2 hours, raise consolidateAfter to 10-15 minutes or switch to WhenEmpty.

PDB Deadlocks During Drains

Karpenter respects PodDisruptionBudgets, which is correct -- but a misconfigured PDB (minAvailable: 100% or maxUnavailable: 0) will block consolidation indefinitely. The node cannot be drained, the pod refuses to move, and you end up paying for a zombie instance. Audit PDBs cluster-wide before enabling consolidation: kubectl get pdb -A -o json | jq '.items[] | select(.spec.minAvailable == "100%" or .spec.maxUnavailable == 0 or .spec.maxUnavailable == "0%")'.
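
The same audit can be run in a script, which is handy in CI. This sketch reads the JSON from kubectl get pdb -A -o json and flags the two misconfigurations named above. It inspects only the spec, so a PDB whose integer minAvailable happens to equal the replica count is not caught; the sample objects are made up.

```python
import json


def blocking_pdbs(pdb_list_json: str):
    """Return 'namespace/name' for PDBs that can never allow a voluntary eviction."""
    blocked = []
    for item in json.loads(pdb_list_json).get("items", []):
        spec = item.get("spec", {})
        min_available = spec.get("minAvailable")
        max_unavailable = spec.get("maxUnavailable")
        if min_available == "100%" or max_unavailable in (0, "0", "0%"):
            meta = item["metadata"]
            blocked.append(f'{meta["namespace"]}/{meta["name"]}')
    return blocked


# Hypothetical sample in the shape of `kubectl get pdb -A -o json` output.
sample = json.dumps({"items": [
    {"metadata": {"namespace": "payments", "name": "api"}, "spec": {"minAvailable": "100%"}},
    {"metadata": {"namespace": "search", "name": "indexer"}, "spec": {"maxUnavailable": 0}},
    {"metadata": {"namespace": "web", "name": "frontend"}, "spec": {"maxUnavailable": 1}},
    {"metadata": {"namespace": "web", "name": "cache"}, "spec": {"minAvailable": 2}},
]})

print(blocking_pdbs(sample))
```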

Spot Interruption Storms

Even with price-capacity-optimized allocation, Spot capacity can collapse in a single AZ (observed during NVIDIA H100 crunches in 2024). If your NodePool only allows a few instance families, a capacity event drains every Spot node at once. Mitigate by widening the instance-family constraint list to at least 8-10 families and enabling on-demand fallback.

CA Scale-Down Blocked by kube-system Pods

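Cluster Autoscaler will not remove a node that hosts pods without a controller, pods with local storage (unless --skip-nodes-with-local-storage=false), or non-DaemonSet kube-system pods without a PodDisruptionBudget (unless --skip-nodes-with-system-pods=false). DaemonSet pods do not block scale-down, but a single stray kube-system pod can pin a node indefinitely. The fix is to give system components PodDisruptionBudgets (or the cluster-autoscaler.kubernetes.io/safe-to-evict: "true" annotation where eviction is safe) and to make sure your scale-down flags match your cluster's actual topology.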

NodePool Limits as a Silent Ceiling

Karpenter NodePools have a limits field capping total cpu/memory across nodes managed by that NodePool. Hit that limit and scheduling silently stalls -- pods stay Pending with no obvious error surfaced in the pod's events. Always monitor karpenter_nodepools_usage against karpenter_nodepools_limit (named karpenter_nodepool_usage and karpenter_nodepool_limit before v1.0) and alert when usage crosses 80 percent of the limit.
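
The 80 percent rule is a one-line comparison per resource. A minimal sketch, assuming you have already scraped usage and limit values into plain dictionaries (the numbers below are invented; the limits mirror the default NodePool shown earlier):

```python
def limit_alerts(usage, limits, threshold=0.8):
    """Return the resources whose usage is at or above threshold * limit."""
    return sorted(
        resource
        for resource, limit in limits.items()
        if limit > 0 and usage.get(resource, 0) / limit >= threshold
    )


limits = {"cpu": 1000, "memory_gib": 1000}   # from the NodePool's limits block
usage = {"cpu": 820, "memory_gib": 400}      # invented current usage

print(limit_alerts(usage, limits))
```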

Monthly Cost Analysis: 100-Node Workload

The headline "25-40 percent savings" only lands if you can attach a dollar figure. Here is a back-of-envelope comparison for a 100-node steady-state workload running a typical microservices platform (mixed CPU/memory, no GPU), in us-east-1, with 50 TB outbound transfer.

| Line item | CA (m6i.2xlarge on-demand) | Karpenter (mixed Graviton + Spot) |
| --- | --- | --- |
| Instance hours | 73,000 (100 x 730h) | 73,000 |
| On-demand share | 100% | ~25% (safety margin) |
| Spot share | 0% | ~75% |
| Average $/hour/node | $0.3840 (m6i.2xlarge) | $0.1390 (mixed c7g/m7g Spot + on-demand) |
| Compute subtotal | $28,032 | $10,147 |
| EBS (gp3, 100 GiB/node) | $800 | $800 |
| Data transfer (50 TB) | $4,500 | $4,500 |
| Total monthly | $33,332 | $15,447 |
| Annualised delta | -- | $214,620/year |

The savings are not free. That 75 percent Spot share assumes you have workloads tolerant to 2-minute termination notices, multi-AZ replicas, and either stateless services or checkpointing in place. For stateful workloads (Postgres, Kafka brokers, persistent caches), dial Spot share back to zero and accept a smaller saving -- usually 15-20 percent, mostly from Graviton.
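
The table's arithmetic can be reproduced in a few lines, which also makes it easy to plug in your own blended rate. The rates and fixed costs below are taken from the table; nothing else is assumed.

```python
HOURS_PER_MONTH = 730
NODES = 100
EBS_USD = 800
TRANSFER_USD = 4500


def monthly_costs(rate_per_node_hour):
    """Return (compute subtotal, total monthly bill) in USD."""
    compute = NODES * HOURS_PER_MONTH * rate_per_node_hour
    return compute, compute + EBS_USD + TRANSFER_USD


ca_compute, ca_total = monthly_costs(0.3840)   # m6i.2xlarge on-demand
kp_compute, kp_total = monthly_costs(0.1390)   # blended Graviton Spot + on-demand

annual_delta = (ca_total - kp_total) * 12
total_saving_pct = (ca_total - kp_total) / ca_total * 100

print(f"CA:        compute ${ca_compute:,.0f}, total ${ca_total:,.0f}")
print(f"Karpenter: compute ${kp_compute:,.0f}, total ${kp_total:,.0f}")
print(f"Annualised delta: ${annual_delta:,.0f} ({total_saving_pct:.0f}% of the monthly bill)")
```

With a 75 percent Spot share the whole-bill saving lands well above the 25-40 percent headline. Raising the blended rate toward the on-demand Graviton price is a quick way to see how a lower Spot share pulls the figure back into that range.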

Monitoring Karpenter in Production

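Karpenter emits Prometheus metrics at /metrics on the controller's metrics port (:8000 here; the default has differed between chart versions, so confirm controller.metrics.port in your Helm values). Metric names and labels have also changed between Karpenter releases, so check the queries below against your controller's /metrics output before alerting on them. The ones that matter most in an SLO context: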

# Scheduling queue depth -- should trend to zero
karpenter_pods_state{state="Pending"}

# Time from pod creation to node Ready
histogram_quantile(0.95, sum(rate(karpenter_nodes_launched_seconds_bucket[5m])) by (le))

# Consolidation frequency -- too high means churn
sum(rate(karpenter_disruption_actions_performed_total{action="replace"}[1h]))

# Nodes by capacity type
sum(karpenter_nodes_allocatable{resource_type="cpu"}) by (capacity_type)

# Interruption rate (target less than 5% of Spot instances per day)
sum(rate(karpenter_interruption_events_total[24h])) by (action)

Alert on karpenter_pods_state{state="Pending"} staying above zero for more than 3 minutes. That usually means a NodePool limit hit, a subnet with no capacity, or a NodePool with constraints no instance type can satisfy. The controller logs at that point are more useful than the metrics -- tail them with kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -f --tail 100.

Availability Beyond AWS: GKE and AKS

Karpenter was built at AWS, and its AWS provider is the only GA implementation. Here is the current state on other clouds as of early 2026:

| Cloud Provider | Karpenter Status | Cluster Autoscaler Status | Recommendation |
| --- | --- | --- | --- |
| AWS (EKS) | GA (v1.1) -- production-ready | GA -- fully supported | Use Karpenter for new clusters |
| GCP (GKE) | Not available (GKE has its own NAP) | GA -- deeply integrated | Use GKE Node Auto-Provisioning (NAP) |
| Azure (AKS) | Beta (AKS Karpenter provider) | GA -- fully supported | Evaluate Karpenter beta; default to CA for production |

GKE's Node Auto-Provisioning (NAP) offers Karpenter-like capabilities natively: it provisions optimal machine types from GCP's full catalog without pre-defined node pools. If you are on GKE, NAP is the closest equivalent to Karpenter and is GA. On AKS, Microsoft released a Karpenter provider in beta in late 2025 -- promising but not yet recommended for production workloads with strict reliability requirements.

When to Stick with Cluster Autoscaler

Karpenter is not universally better. Use Cluster Autoscaler when:

  • You are on GKE or AKS in production -- CA is the mature, supported option. GKE's NAP is a better alternative than waiting for Karpenter support.
  • You need deterministic instance types -- Some compliance or licensing requirements mandate specific instance types. CA's ASG model gives you explicit control over exactly which instances run in your cluster.
  • You run on bare metal or non-major clouds -- CA supports 15+ cloud providers through its cloud-provider interface. Karpenter only supports AWS (GA) and Azure (beta).
  • Your team is not ready for the migration -- CA works. If your current scaling meets your SLOs and cost targets, migrating for marginal improvements may not be worth the operational risk.

Frequently Asked Questions

Can Karpenter and Cluster Autoscaler run simultaneously?

Yes. They manage separate sets of nodes: Karpenter manages nodes it provisions (labeled with karpenter.sh/nodepool), and CA manages nodes in its discovered ASGs. This coexistence is how you perform a zero-downtime migration. Both controllers watch the same pending pods, though, so unless the Karpenter NodePool is tainted (as in the migration steps above) or CA's node groups are capped, both may launch capacity for the same pods.

How does Karpenter handle node updates and patching?

Karpenter's expireAfter field (called ttlSecondsUntilExpired in older versions) automatically rotates nodes after a specified duration. Set it to 720h (30 days) to ensure nodes are regularly replaced with fresh AMIs. When a node expires, Karpenter cordons it, drains pods gracefully, and provisions a replacement with the latest AMI. This eliminates the need for manual node rotation or third-party tools like AWS Systems Manager patch baselines.

What happens if Karpenter itself goes down?

Existing nodes and pods continue running -- Karpenter is not in the data path. However, no new nodes will be provisioned until Karpenter recovers. Run Karpenter with at least 2 replicas and deploy it on a small managed node group (not on Karpenter-provisioned nodes) to avoid a chicken-and-egg problem. EKS Fargate is another option for hosting Karpenter's pods, ensuring they are isolated from node-level failures.

Does Karpenter support GPU workloads?

Yes. Karpenter automatically selects GPU instance types (p4d, p5, g5, g6) when pods request nvidia.com/gpu resources. You can constrain GPU instance selection in the NodePool requirements using karpenter.k8s.aws/instance-gpu-manufacturer and karpenter.k8s.aws/instance-gpu-count labels. Use an EKS-optimized GPU AMI so the NVIDIA drivers are present on the node; depending on the AMI family you may still need to deploy the NVIDIA device plugin yourself. Karpenter provisions GPU nodes only when GPU pods are pending -- no idle GPU nodes burning money.

How much does Karpenter cost?

Karpenter itself is free and open source. The only cost is the compute it provisions. However, Karpenter typically reduces compute costs by 25-40% compared to Cluster Autoscaler through better bin-packing, ARM instance selection, and Spot usage. The Karpenter controller runs as a Deployment in your cluster consuming roughly 1 vCPU and 1 GiB memory -- negligible compared to the savings it generates.

Can I use Karpenter with Terraform or other IaC tools?

Yes. The Karpenter Helm chart and its CRDs (NodePool, EC2NodeClass) are fully compatible with Terraform, Pulumi, and other IaC tools. The EKS Blueprints Terraform module includes a Karpenter add-on that handles IAM roles, SQS queues for interruption handling, and the Helm installation. For GitOps workflows, Karpenter's CRDs work with ArgoCD and Flux like any other Kubernetes resource.

Is Karpenter production-ready?

On AWS, yes. Karpenter reached v1.0 GA in August 2024 and is now at v1.1. AWS uses Karpenter internally, and it powers node scaling for thousands of production EKS clusters. The core project is maintained under Kubernetes SIG Autoscaling (kubernetes-sigs/karpenter), which provides additional governance and community oversight. On Azure, the provider is in beta and should be evaluated with caution for production workloads.

The Bottom Line

If you are running Kubernetes on AWS, Karpenter is the better choice for new clusters and a worthwhile migration for existing ones. Its group-less provisioning model, sub-60-second scaling, native Spot and ARM support, and continuous cost consolidation represent a genuine generational improvement over Cluster Autoscaler. On GKE, use Node Auto-Provisioning for similar benefits. On AKS, evaluate the Karpenter beta but default to Cluster Autoscaler until the provider reaches GA. The right autoscaler is the one that matches your cloud, your constraints, and your operational maturity -- but the direction of the ecosystem is clearly toward Karpenter's approach.

Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
