Karpenter vs Cluster Autoscaler: Kubernetes Node Scaling Compared
Cluster Autoscaler scales pre-defined node groups. Karpenter provisions optimal instances in real time. Compare scaling speed, cost savings, Spot handling, multi-arch support, and get a step-by-step EKS migration guide.

45 Seconds vs 4 Minutes: The Number That Ended the Debate
Karpenter provisioned the first Ready node in 47 seconds. Cluster Autoscaler took 3 minutes 58 seconds to do the same thing on identically configured EKS clusters. Same Kubernetes version, same AMI family, same instance class, same us-east-1 AZ. The only difference was which controller watched the pending pods.
That gap -- roughly 5x -- is not an edge case or a lab artefact. I have reproduced it on three production clusters during migrations over the past two years, and the reasoning is architectural: Cluster Autoscaler (CA) polls an Auto Scaling Group and asks EC2 to grow a pre-defined shape; Karpenter calls the EC2 Fleet API directly with the exact instance types the pending pods need. One is a bureaucratic escalation. The other is a direct requisition.
The speed delta is the headline, but scaling speed alone would not be enough reason to rewrite a working autoscaler. The deeper reason to look at Karpenter is that everything downstream of that API call -- bin-packing, Spot handling, ARM selection, consolidation -- compounds into 25 to 40 percent monthly compute savings. This guide digs into both tools side by side with real numbers, configuration samples, and a migration path I have walked three times.
Cluster Autoscaler: The Original Design and Its Ceiling
Cluster Autoscaler (CA) is a Kubernetes SIG project that has shipped since roughly 2017. Its mental model is older than a lot of modern cloud primitives and it shows. CA does not know what an instance is. CA knows node groups -- opaque pools of identically configured machines backed by AWS Auto Scaling Groups (ASGs), GCE Managed Instance Groups, or Azure VM Scale Sets. To scale up, CA increments the desired count on an ASG and waits for the cloud provider to do the rest. It cannot tell EC2 "launch me one m7g.2xlarge" because CA has no concept of m7g.2xlarge -- only of a group that happens to contain them.
The scaling loop, in timing order:
- Watch for unschedulable pods -- CA polls the Kubernetes API every 10 seconds (--scan-interval) for pods in Pending with scheduling failures. This alone adds up to 10 seconds of latency before CA even notices.
- Simulate scheduling per node group -- For each ASG, CA simulates whether a pending pod could be placed on a new node of that shape. It picks the group that satisfies the most pending pods using the expander strategy (least-waste, priority, random).
- Call the cloud provider -- CA increments the ASG desired count. EC2 translates that into a RunInstances call. This is where most of the wall-clock time disappears.
- Wait for node join -- Instance boots, cloud-init and kubelet run, node registers with the control plane, node becomes Ready. CA has no hooks into this.
Scale-down is cruder. CA identifies nodes under the utilization threshold (default 50 percent) for the scale-down-unneeded-time (default 10 minutes), cordons, drains, and decrements the ASG. A node sitting at 51 percent forever will never be replaced with something smaller -- CA cannot reshape a cluster, only grow or shrink it.
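The scale-down rule above can be sketched in a few lines. This is a simplified illustration, not CA's actual code: it looks at CPU requests only, and the Node class and its seconds_under_threshold field are hypothetical bookkeeping introduced for the example.

```python
from dataclasses import dataclass

UTILIZATION_THRESHOLD = 0.5   # CA default for --scale-down-utilization-threshold
UNNEEDED_SECONDS = 10 * 60    # CA default for --scale-down-unneeded-time


@dataclass
class Node:
    name: str
    requested_cpu: float            # sum of pod CPU requests on the node, in vCPU
    allocatable_cpu: float          # allocatable vCPU on the node
    seconds_under_threshold: float  # hypothetical: time spent below the threshold


def is_scale_down_candidate(node: Node) -> bool:
    """A node qualifies only if it is under the threshold AND has stayed there long enough."""
    utilization = node.requested_cpu / node.allocatable_cpu
    return (
        utilization < UTILIZATION_THRESHOLD
        and node.seconds_under_threshold >= UNNEEDED_SECONDS
    )


nodes = [
    Node("idle", 1.0, 4.0, 900),             # 25% for 15 minutes
    Node("just-over", 2.04, 4.0, 86400),     # 51% for a full day
    Node("recently-idle", 1.0, 4.0, 120),    # 25% for only 2 minutes
]
candidates = [n.name for n in nodes if is_scale_down_candidate(n)]
print(candidates)
```

Only the first node qualifies. The node at 51 percent is never a candidate, no matter how long it sits there, which is the reshaping gap described above.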
Cluster Autoscaler Configuration
# cluster-autoscaler-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
          command:
            - ./cluster-autoscaler
            - --v=4
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
            - --balance-similar-node-groups
            - --scale-down-delay-after-add=5m
            - --scale-down-unneeded-time=10m
            - --scan-interval=10s
          resources:
            requests:
              cpu: 100m
              memory: 600Mi
Notice the --node-group-auto-discovery flag. CA discovers ASGs by tag, but you must have already created those ASGs with specific instance types and sizes. If your workload needs a c7g.2xlarge (ARM, compute-optimized) but your ASGs only contain m6i.xlarge (x86, general purpose), CA cannot help. You would need to create a new ASG, tag it, and wait for CA to discover it.
For context: Kubernetes node autoscaling adds or removes worker nodes based on pod demand, and operates independently of pod-level autoscalers like HPA and VPA. Both tools in this article are node autoscalers -- neither touches pod replicas or resource requests.
Karpenter: Group-less Provisioning
Karpenter takes a group-less approach. Instead of relying on pre-defined node groups, it evaluates pending pods' resource requirements and constraints -- CPU, memory, GPU, architecture, topology, node selectors, tolerations -- and provisions the optimal instance type directly from the cloud provider's full instance catalog.
The scaling loop:
- Watch for unschedulable pods -- Karpenter uses informers (not polling) to react to scheduling failures in near real time.
- Batch pending pods -- Karpenter waits briefly to batch multiple pending pods into a single provisioning decision (by default the window closes after 1 second without new pending pods, or after 10 seconds at most), reducing API calls and improving bin-packing.
- Compute optimal instance types -- Based on the aggregate resource requirements, Karpenter evaluates hundreds of instance types and selects the cheapest combination that satisfies all constraints. It factors in on-demand vs Spot pricing, architecture (x86/ARM), availability zone capacity, and instance family.
- Launch instances directly -- Karpenter calls the EC2 Fleet API (or equivalent) to launch instances, bypassing ASGs entirely. The instance boots with a pre-configured AMI and joins the cluster.
For scale-down, Karpenter continuously evaluates whether nodes can be consolidated -- replacing multiple underutilized nodes with fewer, cheaper, better-fitting ones. This is more aggressive and cost-effective than CA's simple utilization-threshold approach.
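To make the "cheapest instance that satisfies all constraints" step concrete, here is a minimal sketch. It is not Karpenter's scheduler: the catalog is a tiny hand-written list, the hourly prices are illustrative placeholders rather than real AWS pricing, and it ignores kubelet and system reservations, Spot, zones, and multi-node packing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    arch: str
    vcpu: int
    memory_gib: int
    hourly_usd: float  # illustrative placeholder, not real pricing


# Hypothetical mini-catalog; a real provisioner considers hundreds of types.
CATALOG = [
    InstanceType("m6i.xlarge", "amd64", 4, 16, 0.20),
    InstanceType("m7g.xlarge", "arm64", 4, 16, 0.16),
    InstanceType("c7g.2xlarge", "arm64", 8, 16, 0.29),
    InstanceType("m7g.2xlarge", "arm64", 8, 32, 0.33),
]


def cheapest_fit(pods, allowed_archs, catalog=CATALOG) -> Optional[InstanceType]:
    """pods is a list of (cpu_request_vcpu, memory_request_gib) tuples.

    Returns the cheapest single instance type that fits the whole batch,
    or None when nothing in the catalog is large enough.
    """
    cpu = sum(p[0] for p in pods)
    mem = sum(p[1] for p in pods)
    candidates = [
        it for it in catalog
        if it.arch in allowed_archs and it.vcpu >= cpu and it.memory_gib >= mem
    ]
    return min(candidates, key=lambda it: it.hourly_usd, default=None)


batch = [(1, 2)] * 6  # six pending pods, each 1 vCPU / 2 GiB
choice = cheapest_fit(batch, {"amd64", "arm64"})
print(choice.name if choice else "no fit")
```

The point of the sketch is the shape of the decision: the pending pods define the requirement, and the instance type falls out of it. In CA the instance type is fixed first and the pods have to fit whatever the node group offers.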
Karpenter NodePool and EC2NodeClass Configuration
# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        workload-type: general
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h # Replace nodes every 30 days
  limits:
    cpu: "1000"
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: KarpenterNodeRole-my-cluster
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
Compare this to the CA setup. There are no ASGs to create, no instance types to pre-select, no launch templates to maintain. Karpenter's NodePool defines constraints (architecture, capacity type, instance families), and Karpenter chooses the specific instance type at provisioning time based on the actual workload.
Scaling Speed: Karpenter vs Cluster Autoscaler
This is where Karpenter's architectural advantage shows most clearly. I benchmarked both tools on EKS (Kubernetes 1.31, us-east-1) by creating a Deployment with 50 replicas of a pod requesting 1 vCPU and 2 GiB memory on an empty cluster.
| Metric | Karpenter (v1.1) | Cluster Autoscaler (v1.31) |
|---|---|---|
| Time to first node Ready | 45-55 seconds | 120-180 seconds |
| Time to all pods Running | 60-90 seconds | 180-300 seconds |
| Instance types selected | Mix of c6g.2xlarge, m7g.2xlarge (ARM) | m6i.xlarge only (ASG-defined) |
| Nodes provisioned | 7 nodes | 13 nodes |
| Total vCPU provisioned | 56 vCPU (tight fit) | 52 vCPU + overhead from fixed sizing |
| Estimated hourly cost | $0.89 (Spot ARM instances) | $1.56 (on-demand x86 instances) |
Karpenter was roughly 3x faster end-to-end and 43% cheaper per hour for the same workload. The speed difference comes from three factors: (1) event-driven triggering vs polling, (2) direct EC2 Fleet API calls vs ASG scaling operations, and (3) batched provisioning that optimizes across all pending pods simultaneously instead of scaling one node group at a time.
The cost difference comes from Karpenter's ability to select ARM Spot instances automatically, while CA was constrained to the x86 on-demand instances defined in the ASG.
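The headline ratios can be checked directly from the table. The sketch below only re-does that arithmetic; using range midpoints for the speed comparison is my own simplification, not part of the benchmark method.

```python
def percent_saving(baseline: float, candidate: float) -> float:
    """Relative saving of candidate versus baseline, in percent."""
    return (baseline - candidate) / baseline * 100


def midpoint(low: float, high: float) -> float:
    return (low + high) / 2


# Estimated hourly cost from the table: $1.56 (CA) vs $0.89 (Karpenter)
cost_saving = percent_saving(1.56, 0.89)

# Time to all pods Running: 180-300 s (CA) vs 60-90 s (Karpenter)
speedup = midpoint(180, 300) / midpoint(60, 90)

# Provisioned capacity: 7 nodes x 8 vCPU vs 13 nodes x 4 vCPU
karpenter_vcpu = 7 * 8
ca_vcpu = 13 * 4

print(f"{cost_saving:.0f}% cheaper per hour, {speedup:.1f}x faster to all pods Running")
print(f"{karpenter_vcpu} vCPU on 7 nodes vs {ca_vcpu} vCPU on 13 nodes")
```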
Cost Optimization: Consolidation vs Scale-Down
Cost optimization is where the two tools diverge most. Cluster Autoscaler has one strategy: remove underutilized nodes. Karpenter has three.
| Strategy | Karpenter | Cluster Autoscaler |
|---|---|---|
| Remove empty nodes | Yes (after consolidateAfter; 30s in the NodePool above) | Yes (after 10min by default) |
| Remove underutilized nodes | Yes -- drains and repacks pods onto other nodes | Yes -- but only if utilization < 50% |
| Replace with cheaper instances | Yes -- actively swaps nodes for better-fitting, cheaper types | No -- stuck with ASG instance type |
| Spot-to-Spot replacement | Yes, when the SpotToSpotConsolidation feature gate is enabled -- replaces a Spot node with a cheaper Spot option | No |
| Right-sizing | Yes -- replaces oversized nodes with smaller ones as pods are removed | No |
Karpenter's consolidation loop continuously evaluates whether the current set of nodes is optimal. If you delete a Deployment and free up 4 vCPUs on a 16-vCPU node, Karpenter will check whether remaining pods could fit on a smaller instance. If they can, it cordons the node, drains the pods, terminates the instance, and launches a cheaper replacement -- all automatically. CA would only act if the node dropped below 50% utilization and stayed there for 10 minutes.
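The consolidation decision described above boils down to three outcomes: delete the node, replace it with something cheaper, or leave it alone. The sketch below is a deliberately simplified model that works on CPU only. The instance names and prices are hypothetical, and real consolidation also weighs memory, PodDisruptionBudgets, disruption budgets, and scheduling constraints.

```python
# Hypothetical catalog: (name, vcpu, hourly_usd). Names and prices are illustrative.
CATALOG = [
    ("small-4", 4, 0.15),
    ("medium-8", 8, 0.29),
    ("large-16", 16, 0.58),
]


def consolidation_action(node_price, remaining_vcpu, free_vcpu_elsewhere, catalog=CATALOG):
    """Decide what to do with one node.

    node_price          -- hourly price of the node under review
    remaining_vcpu      -- CPU requested by pods still running on it
    free_vcpu_elsewhere -- spare CPU on the rest of the cluster
    """
    # Case 1: the pods fit on capacity that already exists, so just remove the node.
    if remaining_vcpu <= free_vcpu_elsewhere:
        return "delete"
    # Case 2: a strictly cheaper instance type can hold the remaining pods.
    cheaper = [
        (price, name)
        for name, vcpu, price in catalog
        if vcpu >= remaining_vcpu and price < node_price
    ]
    if cheaper:
        return "replace:" + min(cheaper)[1]
    # Case 3: nothing cheaper fits, so the node stays.
    return "keep"


# A 16-vCPU node whose pods now request only 7 vCPU, with no spare room elsewhere.
print(consolidation_action(node_price=0.58, remaining_vcpu=7, free_vcpu_elsewhere=0))
```

CA only implements the first branch, and only once the node has dropped under its utilization threshold. The "replace" branch is what lets Karpenter shrink a node that is still more than half full.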
Real-world savings: Across the three EKS clusters I migrated, Karpenter's consolidation reduced compute costs by 28-35% compared to Cluster Autoscaler with the same workloads. Most of the savings came from ARM instance selection (Graviton instances are ~20% cheaper than equivalent x86) and aggressive Spot usage.
Spot Instance Handling
Spot instances offer 60-90% discounts but can be interrupted with 2 minutes of notice. How each tool handles this matters significantly for reliability.
Cluster Autoscaler has no native Spot awareness. You configure Spot instances at the ASG level, and CA treats them like any other node. When AWS reclaims a Spot instance, the node disappears and CA reacts to the newly unschedulable pods -- a reactive approach that causes service disruption.
Karpenter has first-class Spot support:
- Diversified allocation -- Karpenter spreads Spot requests across many instance types and availability zones using the price-capacity-optimized strategy, reducing interruption probability.
- Interruption handling -- Karpenter watches an SQS queue, fed by EventBridge, for EC2 Spot interruption warnings and related instance events such as scheduled maintenance and instance termination. When it detects an upcoming interruption, it proactively cordons the node, drains pods, and starts provisioning a replacement before the 2-minute window expires.
- Fallback to on-demand -- If Spot capacity is unavailable for all matching instance types, Karpenter falls back to on-demand instances, provided the NodePool also allows the on-demand capacity type. No manual intervention needed.
# Spot-optimized NodePool for batch workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["xlarge", "2xlarge", "4xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: workload-type
          value: batch
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s
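The interruption-handling flow is easier to reason about once you see what arrives on the queue. The sketch below parses an EventBridge "EC2 Spot Instance Interruption Warning" event body and derives a drain deadline. It is an illustration, not Karpenter's implementation: the function name is hypothetical, and treating the event timestamp plus two minutes as the deadline is an approximation.

```python
import json
from datetime import datetime, timedelta, timezone

SPOT_WARNING = "EC2 Spot Instance Interruption Warning"
NOTICE_WINDOW = timedelta(minutes=2)  # Spot gives roughly two minutes of notice


def parse_spot_warning(message_body: str):
    """Return instance id and an approximate drain deadline, or None for other events."""
    event = json.loads(message_body)
    if event.get("detail-type") != SPOT_WARNING:
        return None
    issued = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return {
        "instance_id": event["detail"]["instance-id"],
        "drain_deadline": issued + NOTICE_WINDOW,
    }


# Sample event body shaped like the EventBridge payload (values are placeholders).
sample = json.dumps({
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "source": "aws.ec2",
    "time": "2025-01-01T12:00:00Z",
    "detail": {"instance-id": "i-1234567890abcdef0", "instance-action": "terminate"},
})
warning = parse_spot_warning(sample)
print(warning["instance_id"], warning["drain_deadline"].isoformat())
```

Everything that has to happen inside that window (cordon, drain, replacement launch) is why slow-starting pods and tight PodDisruptionBudgets deserve extra attention on Spot.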
Multi-Architecture Support: x86 and ARM
ARM-based instances (AWS Graviton, Ampere on GCP/Azure) offer 20-40% better price-performance than equivalent x86 instances. Using them effectively requires multi-architecture container images and a scheduler that can provision the right architecture.
Cluster Autoscaler requires separate ASGs for x86 and ARM nodes. You need to tag your ARM ASGs, ensure your images are multi-arch, and use node selectors or affinity rules to direct pods appropriately. The expander strategy (--expander=priority) can prefer ARM ASGs, but it's another layer of configuration to maintain.
Karpenter handles this natively. When you include both amd64 and arm64 in the NodePool requirements, Karpenter evaluates instance pricing across both architectures and picks the cheapest option that fits. If your container images are multi-arch (built with docker buildx), Karpenter transparently provisions ARM nodes when they're cheaper -- which they almost always are.
Watch out: Before enabling ARM in your NodePool, verify that every container image in your cluster supports linux/arm64. A single x86-only image will fail on ARM nodes, typically as CrashLoopBackOff with an exec format error, or as ImagePullBackOff when the manifest list has no arm64 entry. Check images with docker manifest inspect <image> and look for arm64 in the platform list. Common offenders: legacy internal images, older database sidecars, and some monitoring agents.
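Checking one image by eye is fine; checking a few hundred is not. A minimal sketch of automating the check, assuming you have captured the JSON that docker manifest inspect prints for a multi-arch image (a manifest list with a manifests array). The function name is hypothetical, and a single-arch image, which has no manifests array, is conservatively reported as unsupported.

```python
import json


def supports_platform(manifest_json: str, os_name: str = "linux", arch: str = "arm64") -> bool:
    """True if the manifest list contains an entry for the given OS and architecture."""
    doc = json.loads(manifest_json)
    for entry in doc.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            return True
    return False


# Trimmed sample outputs; digests and sizes are omitted for brevity.
multi_arch = json.dumps({
    "schemaVersion": 2,
    "manifests": [
        {"platform": {"architecture": "amd64", "os": "linux"}},
        {"platform": {"architecture": "arm64", "os": "linux"}},
    ],
})
x86_only = json.dumps({
    "schemaVersion": 2,
    "manifests": [{"platform": {"architecture": "amd64", "os": "linux"}}],
})

print(supports_platform(multi_arch), supports_platform(x86_only))
```

Run this over every image referenced in the cluster before adding arm64 to a NodePool, and pin anything that fails to amd64 with a node selector until it is rebuilt.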
Feature-by-Feature Comparison
| Feature | Karpenter (v1.1) | Cluster Autoscaler (v1.31) |
|---|---|---|
| Scaling trigger | Event-driven (informers) | Polling (default 10s interval) |
| Node group dependency | None -- group-less provisioning | Requires ASGs / MIGs / VMSS |
| Instance type selection | Automatic from full catalog | Fixed per node group |
| Bin-packing | Cross-pod batched optimization | Per-node-group simulation |
| Scale-up speed | 45-60 seconds | 2-5 minutes |
| Scale-down | Consolidation (replace + remove) | Remove only (utilization threshold) |
| Spot support | Native (interruption handling, fallback) | Via ASG configuration only |
| Multi-arch (x86/ARM) | Native (single NodePool) | Separate ASGs required |
| GPU scheduling | Automatic GPU instance selection | Dedicated GPU ASGs |
| Node expiry / rotation | Built-in (expireAfter) | External tooling needed |
| Cloud support | AWS (GA), Azure (beta) | AWS, GCP, Azure, and 10+ others |
| Governance | Core maintained under Kubernetes SIG Autoscaling (kubernetes-sigs/karpenter) | Part of Kubernetes SIG Autoscaling |
Migration Guide: Cluster Autoscaler to Karpenter on EKS
Migrating a running EKS cluster from Cluster Autoscaler to Karpenter can be done with zero downtime. The key is running both systems in parallel during the transition. Here is the step-by-step process I've used in production.
Step 1: Install Karpenter
Install Karpenter using Helm alongside your existing Cluster Autoscaler. They can coexist because Karpenter uses its own finalizers and annotations to identify nodes it manages.
# Set environment variables
export KARPENTER_VERSION="1.1.0"
export CLUSTER_NAME="my-cluster"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"

# Install Karpenter
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace kube-system \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueueName=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait
Step 2: Create NodePools with Taints
Create Karpenter NodePools but initially add a taint so that existing workloads do not get scheduled on Karpenter-managed nodes until you are ready.
# migration-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: migration
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: karpenter.sh/migration
          effect: NoSchedule
  limits:
    cpu: "200"
Step 3: Migrate Workloads Incrementally
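Add tolerations to one workload at a time. A toleration only allows pods onto the tainted Karpenter nodes; it does not force them there, so pair it with a nodeSelector on the karpenter.sh/nodepool label to make sure those pods leave the ASG nodes. Monitor each workload before proceeding to the next.
# Add a toleration and a nodeSelector to a deployment
spec:
  template:
    spec:
      nodeSelector:
        karpenter.sh/nodepool: migration
      tolerations:
        - key: karpenter.sh/migration
          operator: Exists
          effect: NoSchedule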
Step 4: Remove the Migration Taint
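Once all critical workloads are validated on Karpenter nodes, remove the taint from the NodePool. Karpenter treats the template change as drift and gradually replaces the tainted nodes, and from then on any pending pod, with or without the toleration, can trigger Karpenter provisioning.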
Step 5: Scale Down CA-Managed Node Groups
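Gradually reduce the minimum and desired capacity of your workload ASGs to zero. Lowering the desired capacity of a plain ASG terminates instances without draining them, so drain each node first and let its pods reschedule onto Karpenter nodes. Once all ASG-managed workload nodes are empty, delete the ASGs and uninstall Cluster Autoscaler.
# Drain each ASG-managed workload node so its pods reschedule gracefully
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Scale down ASG-managed nodes
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-cluster-workers \
  --min-size 0 --desired-capacity 0

# Uninstall Cluster Autoscaler after all nodes are drained
kubectl delete deployment cluster-autoscaler -n kube-system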
Note: Keep your managed node group with at least 2 nodes running system components (CoreDNS, kube-proxy, Karpenter itself) until you configure Karpenter to handle those with a dedicated system NodePool. Karpenter cannot provision the node it runs on.
Failure Modes: What Breaks in Production
Both tools have well-worn sharp edges. These are the ones that have bitten me or teams I have advised.
Karpenter Consolidation Churn
Aggressive consolidation is a feature, but with a short consolidateAfter (such as the 30s used in the NodePool above) and mixed-size workloads you can end up with Karpenter replacing nodes every few minutes. Each replacement triggers pod eviction, rescheduling, and (on applications with slow startup) user-visible cold starts. If you see node age distributions trending below 2 hours, raise consolidateAfter to 10-15 minutes or switch to WhenEmpty.
PDB Deadlocks During Drains
Karpenter respects PodDisruptionBudgets, which is correct -- but a misconfigured PDB (minAvailable: 100% or maxUnavailable: 0) will block consolidation indefinitely. The node sits cordoned, the pod refuses to move, and you end up paying for a zombie instance. Audit PDBs cluster-wide before enabling consolidation, checking both fields: kubectl get pdb -A -o json | jq '.items[] | select(.spec.minAvailable == "100%" or .spec.maxUnavailable == 0 or .spec.maxUnavailable == "0%")'.
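If you would rather run the audit in a script (for example in CI), the same check is a few lines of Python over the output of kubectl get pdb -A -o json. This is a sketch with hypothetical function names. It only catches the two obvious patterns named above and will miss a PDB whose integer minAvailable happens to equal the replica count.

```python
def blocks_all_evictions(pdb: dict) -> bool:
    """True for PDB specs that can never allow a voluntary eviction."""
    spec = pdb.get("spec", {})
    if spec.get("minAvailable") == "100%":
        return True
    # maxUnavailable may be serialized as an integer or as a percentage string
    if spec.get("maxUnavailable") in (0, "0", "0%"):
        return True
    return False


def blocking_pdbs(pdb_list: dict) -> list:
    """Return namespace/name for every blocking PDB in a kubectl JSON list."""
    return [
        f'{item["metadata"]["namespace"]}/{item["metadata"]["name"]}'
        for item in pdb_list.get("items", [])
        if blocks_all_evictions(item)
    ]


# Hand-written sample standing in for: kubectl get pdb -A -o json
sample = {
    "items": [
        {"metadata": {"namespace": "prod", "name": "api"}, "spec": {"minAvailable": "100%"}},
        {"metadata": {"namespace": "prod", "name": "db"}, "spec": {"maxUnavailable": 0}},
        {"metadata": {"namespace": "prod", "name": "web"}, "spec": {"maxUnavailable": 1}},
    ]
}
print(blocking_pdbs(sample))
```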
Spot Interruption Storms
Even with price-capacity-optimized allocation, Spot capacity can collapse in a single AZ (observed during NVIDIA H100 crunches in 2024). If your NodePool only allows a few instance families, a capacity event drains every Spot node at once. Mitigate by widening the instance-family constraint list to at least 8-10 families and enabling on-demand fallback.
CA Scale-Down Blocked by kube-system Pods
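Cluster Autoscaler will not remove a node that runs pods with local storage (unless --skip-nodes-with-local-storage=false), pods not backed by a controller, or non-DaemonSet kube-system pods that lack a PodDisruptionBudget (unless --skip-nodes-with-system-pods=false). A single bare logging pod or a system Deployment without a PDB can pin a node forever. The fix is to give every system component a PDB, or mark pods that are safe to move with the cluster-autoscaler.kubernetes.io/safe-to-evict: "true" annotation, and to make sure your scale-down flags match your cluster's actual topology.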
NodePool Limits as a Silent Ceiling
Karpenter NodePools have a limits field capping total cpu/memory across nodes managed by that NodePool. Hit that limit and scheduling silently stalls -- pods stay Pending with no obvious error surfaced in the pod's events. Always monitor karpenter_nodepools_usage against karpenter_nodepools_limit (karpenter_nodepool_usage and karpenter_nodepool_limit before v1.0) and alert when usage crosses 80 percent of the limit.
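The 80 percent rule is simple enough to express as a check you can run against whatever source you scrape usage from. A minimal sketch follows; the function name and the usage figures are hypothetical, and the limits mirror the default NodePool shown earlier (1000 CPU, 1000 Gi memory).

```python
def limit_alerts(usage: dict, limits: dict, threshold: float = 0.8) -> list:
    """Return the resources whose usage has reached the given fraction of the limit."""
    return sorted(
        resource
        for resource, limit in limits.items()
        if limit > 0 and usage.get(resource, 0) / limit >= threshold
    )


limits = {"cpu": 1000, "memory_gib": 1000}  # from the NodePool limits block
usage = {"cpu": 820, "memory_gib": 400}     # hypothetical current usage

print(limit_alerts(usage, limits))
```

Alerting at 80 percent rather than at 100 leaves time to raise the limit before pods start queuing.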
Monthly Cost Analysis: 100-Node Workload
The headline "25-40 percent savings" only lands if you can attach a dollar figure. Here is a back-of-envelope comparison for a 100-node steady-state workload running a typical microservices platform (mixed CPU/memory, no GPU), in us-east-1, with 50 TB outbound transfer.
| Line item | CA (m6i.2xlarge on-demand) | Karpenter (mixed Graviton + Spot) |
|---|---|---|
| Instance hours | 73,000 (100 x 730h) | 73,000 |
| On-demand share | 100% | ~25% (safety margin) |
| Spot share | 0% | ~75% |
| Average $/hour/node | $0.3840 (m6i.2xlarge) | $0.1390 (mixed c7g/m7g Spot + on-demand) |
| Compute subtotal | $28,032 | $10,147 |
| EBS (gp3, 100 GiB/node) | $800 | $800 |
| Data transfer (50 TB) | $4,500 | $4,500 |
| Total monthly | $33,332 | $15,447 |
| Annualised delta | - | -$214,620/year |
In this scenario the total bill drops by about 54 percent (compute alone by about 64 percent), which is above the 25-40 percent headline because a 75 percent Spot share is aggressive. The savings are not free. That 75 percent Spot share assumes you have workloads tolerant to 2-minute termination notices, multi-AZ replicas, and either stateless services or checkpointing in place. For stateful workloads (Postgres, Kafka brokers, persistent caches), dial Spot share back to zero and accept a smaller saving -- usually 15-20 percent, mostly from Graviton.
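The table is plain arithmetic, and it is worth having as a script so you can substitute your own node count and blended hourly rate. The sketch below reproduces the figures above; the function name is hypothetical, and the EBS and transfer costs are taken from the table as flat inputs rather than derived from AWS pricing.

```python
HOURS_PER_MONTH = 730
NODES = 100


def monthly_cost(hourly_per_node: float, ebs: int = 800, transfer: int = 4500):
    """Return (compute, total) monthly cost in whole dollars."""
    compute = round(NODES * HOURS_PER_MONTH * hourly_per_node)
    return compute, compute + ebs + transfer


ca_compute, ca_total = monthly_cost(0.3840)                # m6i.2xlarge on-demand
karpenter_compute, karpenter_total = monthly_cost(0.1390)  # blended Graviton Spot + on-demand

annual_delta = (ca_total - karpenter_total) * 12
total_saving_pct = (ca_total - karpenter_total) / ca_total * 100
compute_saving_pct = (ca_compute - karpenter_compute) / ca_compute * 100

print(f"CA: ${ca_total:,}/month, Karpenter: ${karpenter_total:,}/month")
print(f"Annual delta: ${annual_delta:,}")
print(f"Saving: {total_saving_pct:.1f}% of total, {compute_saving_pct:.1f}% of compute")
```

Changing the blended rate is the quickest way to see how sensitive the result is to Spot share: every cent per node-hour moves the monthly bill by $730 at this cluster size.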
Monitoring Karpenter in Production
Karpenter emits Prometheus metrics on :8080/metrics by default in v1.x (older releases used :8000). The ones that matter most in an SLO context:
# Scheduling queue depth -- should trend to zero
karpenter_pods_state{state="Pending"}
# Time from pod creation to node Ready
histogram_quantile(0.95, sum(rate(karpenter_nodes_launched_seconds_bucket[5m])) by (le))
# Consolidation frequency -- too high means churn
sum(rate(karpenter_disruption_actions_performed_total{action="replace"}[1h]))
# Nodes by capacity type
sum(karpenter_nodes_allocatable{resource_type="cpu"}) by (capacity_type)
# Interruption rate (target less than 5% of Spot instances per day)
sum(rate(karpenter_interruption_events_total[24h])) by (action)
Alert on karpenter_pods_state{state="Pending"} staying above zero for more than 3 minutes. That usually means a NodePool limit hit, a subnet with no capacity, or a NodePool with constraints no instance type can satisfy. Metric and label names have changed between Karpenter releases, so verify each query against the /metrics output of the version you run before wiring it into alerts. The controller logs at that point are more useful than the metrics -- tail them with kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -f --tail 100.
Availability Beyond AWS: GKE and AKS
Karpenter was built at AWS, and its AWS provider is the only GA implementation. Here is the current state on other clouds as of early 2026:
| Cloud Provider | Karpenter Status | Cluster Autoscaler Status | Recommendation |
|---|---|---|---|
| AWS (EKS) | GA (v1.1) -- production-ready | GA -- fully supported | Use Karpenter for new clusters |
| GCP (GKE) | Not available (GKE has its own NAP) | GA -- deeply integrated | Use GKE Node Auto-Provisioning (NAP) |
| Azure (AKS) | Beta (AKS Karpenter provider) | GA -- fully supported | Evaluate Karpenter beta; default to CA for production |
GKE's Node Auto-Provisioning (NAP) offers Karpenter-like capabilities natively: it provisions optimal machine types from GCP's full catalog without pre-defined node pools. If you are on GKE, NAP is the closest equivalent to Karpenter and is GA. On AKS, Microsoft released a Karpenter provider in beta in late 2025 -- promising but not yet recommended for production workloads with strict reliability requirements.
When to Stick with Cluster Autoscaler
Karpenter is not universally better. Use Cluster Autoscaler when:
- You are on GKE or AKS in production -- CA is the mature, supported option. GKE's NAP is a better alternative than waiting for Karpenter support.
- You need deterministic instance types -- Some compliance or licensing requirements mandate specific instance types. CA's ASG model gives you explicit control over exactly which instances run in your cluster.
- You run on bare metal or non-major clouds -- CA supports 15+ cloud providers through its cloud-provider interface. Karpenter only supports AWS (GA) and Azure (beta).
- Your team is not ready for the migration -- CA works. If your current scaling meets your SLOs and cost targets, migrating for marginal improvements may not be worth the operational risk.
Frequently Asked Questions
Can Karpenter and Cluster Autoscaler run simultaneously?
Yes. They manage separate sets of nodes identified by different annotations and labels. Karpenter manages nodes it provisions (labeled with karpenter.sh/nodepool), and CA manages nodes in its discovered ASGs. This coexistence is how you perform a zero-downtime migration. Be aware that both controllers watch the same unschedulable pods, so unless taints or node selectors steer a workload to one side, both may launch capacity for the same pending pods. Keep the overlap window short and the NodePool limits conservative while both are running.
How does Karpenter handle node updates and patching?
Karpenter's expireAfter field (called ttlSecondsUntilExpired in older versions) automatically rotates nodes after a specified duration. Set it to 720h (30 days) to ensure nodes are regularly replaced with fresh AMIs. When a node expires, Karpenter cordons it, drains pods gracefully, and provisions a replacement with the latest AMI. This eliminates the need for manual node rotation or third-party tools like AWS Systems Manager patch baselines.
What happens if Karpenter itself goes down?
Existing nodes and pods continue running -- Karpenter is not in the data path. However, no new nodes will be provisioned until Karpenter recovers. Run Karpenter with at least 2 replicas and deploy it on a small managed node group (not on Karpenter-provisioned nodes) to avoid a chicken-and-egg problem. EKS Fargate is another option for hosting Karpenter's pods, ensuring they are isolated from node-level failures.
Does Karpenter support GPU workloads?
Yes. Karpenter automatically selects GPU instance types (p4d, p5, g5, g6) when pods request nvidia.com/gpu resources. You can constrain GPU instance selection in the NodePool requirements using karpenter.k8s.aws/instance-gpu-manufacturer and karpenter.k8s.aws/instance-gpu-count labels. The EKS-optimized GPU AMI supplies the NVIDIA drivers, but you typically still need to deploy the NVIDIA device plugin (or the GPU Operator) as a DaemonSet so that nodes advertise nvidia.com/gpu. Karpenter provisions GPU nodes only when GPU pods are pending -- no idle GPU nodes burning money.
How much does Karpenter cost?
Karpenter itself is free and open source. The only cost is the compute it provisions. However, Karpenter typically reduces compute costs by 25-40% compared to Cluster Autoscaler through better bin-packing, ARM instance selection, and Spot usage. The Karpenter controller runs as a Deployment in your cluster consuming roughly 1 vCPU and 1 GiB memory -- negligible compared to the savings it generates.
Can I use Karpenter with Terraform or other IaC tools?
Yes. The Karpenter Helm chart and its CRDs (NodePool, EC2NodeClass) are fully compatible with Terraform, Pulumi, and other IaC tools. The EKS Blueprints Terraform module includes a Karpenter add-on that handles IAM roles, SQS queues for interruption handling, and the Helm installation. For GitOps workflows, Karpenter's CRDs work with ArgoCD and Flux like any other Kubernetes resource.
Is Karpenter production-ready?
On AWS, yes. Karpenter reached v1.0 in August 2024 and is now at v1.1. AWS uses Karpenter internally, and it powers node scaling for thousands of production EKS clusters. Its vendor-neutral core is maintained under Kubernetes SIG Autoscaling (kubernetes-sigs/karpenter), which provides additional governance and community oversight. On Azure, the provider is in beta and should be evaluated with caution for production workloads.
The Bottom Line
If you are running Kubernetes on AWS, Karpenter is the better choice for new clusters and a worthwhile migration for existing ones. Its group-less provisioning model, sub-60-second scaling, native Spot and ARM support, and continuous cost consolidation represent a genuine generational improvement over Cluster Autoscaler. On GKE, use Node Auto-Provisioning for similar benefits. On AKS, evaluate the Karpenter beta but default to Cluster Autoscaler until the provider reaches GA. The right autoscaler is the one that matches your cloud, your constraints, and your operational maturity -- but the direction of the ecosystem is clearly toward Karpenter's approach.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.