Kubernetes GPU Scheduling: DRA, KAI Scheduler, MIG
Dynamic Resource Allocation replaced device plugins for GPU claims in Kubernetes 1.34. KAI Scheduler adds gang scheduling and queues. MIG slices H100s into 7 isolated tenants. Full production setup with the NVIDIA GPU Operator, topology-aware training, and when to use MIG vs MPS vs time-slicing.

GPU Scheduling on Kubernetes Finally Grew Up
Kubernetes GPU scheduling used to mean "set nvidia.com/gpu: 1 on your Pod and hope." That model -- the device-plugin framework shipped in 2018 -- assumed one workload, one whole GPU, and no concept of topology, sharing, or priority. It was fine for a few Jupyter notebooks. It is catastrophically wrong for the 2026 reality of mixed training-inference-notebook clusters where an 8x H100 node costs more per hour than a mid-sized engineering team.
At KubeCon North America 2025 NVIDIA donated the k8s-dra-driver to the CNCF, Dynamic Resource Allocation (DRA) went GA in Kubernetes 1.34, and the KAI Scheduler became the de facto reference for AI-workload-aware scheduling. This article is the practitioner's map of that stack: what DRA, KAI, MIG, MPS, and time-slicing each do, when to combine them, and the production-install walkthrough I've run three times now on mixed A100/H100 clusters.
The short version: DRA replaces device plugins, KAI replaces the default scheduler for AI queues, MIG slices big GPUs into hard-isolated tenants, MPS provides soft, software-level sharing, and time-slicing is the dev-cluster default. Pick right and a $400K GPU cluster feeds three teams cleanly. Pick wrong and you pay for idle silicon while pods sit pending. The edge cases I've hit in production -- MIG reconfiguration races, KAI queue starvation under DRA, driver-version drift across nodes -- I send to the newsletter.
Last updated: April 2026 -- verified against Kubernetes 1.34, NVIDIA GPU Operator 24.9, KAI Scheduler 0.5, and CUDA 12.6 driver stack.
What Is Kubernetes GPU Scheduling?
Definition: Kubernetes GPU scheduling is the process by which the control plane assigns GPU devices -- whole, partitioned, or shared -- to Pods, and the kubelet on each node exposes those devices to containers via the NVIDIA container runtime. In 2026 the primary mechanism is Dynamic Resource Allocation (DRA): Pods declare ResourceClaim objects that a DRA driver materializes at bind time, replacing the old device-plugin model where GPUs were exposed as opaque extended resources.
The old model extended nvidia.com/gpu the same way it extended cpu or memory: an integer count the kube-scheduler bin-packed onto nodes. That worked when every GPU was identical and whole. It broke the moment you wanted two workloads on one A100, NVLink topology for tensor-parallel training, or a MIG slice for a notebook alongside a full GPU. DRA makes the device a first-class API object with its own lifecycle and parameters.
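For contrast, the legacy device-plugin request looked like this -- an opaque integer the scheduler bin-packed, with no way to express model, memory, or topology (the image name here is illustrative):

```yaml
# Legacy device-plugin model: GPU as an opaque extended resource
apiVersion: v1
kind: Pod
metadata:
  name: trainer-legacy
spec:
  containers:
  - name: trainer
    image: myorg/trainer:v1      # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1        # "one GPU" -- which GPU? scheduler doesn't know
```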
Dynamic Resource Allocation (DRA): The New Default
DRA graduated to GA in Kubernetes 1.34 (released March 2026). It replaces the device-plugin framework with a ResourceClaim API, the same way PersistentVolumeClaim replaced hardcoded volume mounts. You no longer request "one GPU" -- you request a claim against a DeviceClass (for example gpu.nvidia.com), optionally parameterized with constraints (MIG profile, minimum VRAM, interconnect). A DRA driver DaemonSet on each node resolves that claim to a real device at bind time.
The practical difference: DRA claims can be shared across Pods, reserved before scheduling, and parameterized with structured constraints. A training job can claim "4 H100s on the same NVLink island with at least 80GB HBM" and the scheduler honors topology. Under device plugins you got four GPUs that happened to land on a node, with no topology awareness.
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-h100
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            expression: "device.attributes['gpu.nvidia.com'].productName == 'NVIDIA H100 80GB HBM3'"
---
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-h100
```
Notice what's missing: no nvidia.com/gpu: 1. The claim carries the intent -- "I want an H100 80GB" -- and the DRA driver matches. Same inversion that made Kubernetes resource requests and limits sane a decade ago: declarative intent, not opaque counts.
Why DRA Replaces Device Plugins
- Topology awareness: DRA drivers report NVLink groups, PCIe root complexes, and NUMA affinity. Training workloads can constrain placement to avoid cross-socket traffic.
- Fine-grained sharing: One ResourceClaim with adminAccess: true can be attached to multiple Pods, giving MPS or driver-level sharing semantics the old model couldn't express.
- Parameterized selection: CEL expressions on device attributes let you pick by product name, memory size, compute capability, or vendor-specific labels without hardcoding node selectors.
- Lifecycle control: Claims have their own object lifetime, so a notebook can hold a GPU across Pod restarts -- impossible with device plugins.
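To make the parameterized-selection point concrete, here's a sketch of a custom DeviceClass that pre-bakes a CEL selector so teams can claim "any 80GB-class GPU" without repeating the expression. The class name is hypothetical and the capacity attribute path is an assumption -- verify against what your DRA driver actually publishes in its ResourceSlices:

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu-80gb                 # hypothetical class name
spec:
  selectors:
  - cel:
      # assumes the driver publishes GPU memory as a capacity quantity
      expression: "device.capacity['gpu.nvidia.com'].memory.compareTo(quantity('80Gi')) >= 0"
```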
Watch out: DRA is only GA on Kubernetes 1.34+. On 1.32 and 1.33 it ships as beta behind the DynamicResourceAllocation feature gate. Most managed offerings (EKS 1.34, GKE 1.34, AKS 1.34) shipped DRA-capable control planes in Q1-Q2 2026. If your cluster is older, you're still on device plugins until you upgrade -- there is no backport.
KAI Scheduler: Replacing the Default Scheduler for AI
DRA handles the device abstraction; KAI handles the workload abstraction. The default kube-scheduler schedules Pods one at a time using filter-and-score. That breaks for distributed training where a 16-GPU job either runs all-at-once or doesn't run -- schedule 14 of 16 Pods, leave 2 pending, and you've burned $300/hour on idle GPUs waiting for capacity that never arrives. KAI adds gang scheduling, hierarchical queues, and priority-based preemption built for ML workloads.
KAI runs alongside kube-scheduler, not as a replacement. Pods opt in per-Pod via schedulerName: kai-scheduler. I've run this split on three production clusters -- inference on kube-scheduler, training on KAI -- and it scales cleanly.
```yaml
apiVersion: kai.scheduler/v1alpha1
kind: Queue
metadata:
  name: team-ml-research
spec:
  parent: root
  resources:
    gpu:
      quota: 16
      limit: 32
      overQuotaWeight: 2
  priority: 100
---
apiVersion: scheduling.run.ai/v1
kind: PodGroup
metadata:
  name: llama-finetune
spec:
  minMember: 8
  queue: team-ml-research
  priorityClassName: train-high
---
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-finetune
spec:
  parallelism: 8
  completions: 8
  template:
    metadata:
      labels:
        pod-group-name: llama-finetune
    spec:
      schedulerName: kai-scheduler
      containers:
      - name: trainer
        image: myorg/llama-trainer:v2
        resources:
          claims:
          - name: gpu
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: h100-nvlink-group
```
The PodGroup says "schedule all 8 Pods atomically, or none." The Queue says "team-ml-research has 16 guaranteed GPUs, can burst to 32 if idle, and borrows at priority 100." When another team fires a higher-priority job, KAI preempts burst capacity cleanly. NVIDIA open-sourced the scheduler (built on Run:ai's engine) in 2025 and it's now the reference implementation for GPU fairness on Kubernetes.
KAI vs Kueue vs Volcano
KAI isn't the only option. Kueue (Kubernetes SIG, thin queue layer delegating to kube-scheduler) and Volcano (CNCF batch scheduler, older, HPC-oriented) also do gang scheduling. The practical split:
- KAI: best when workloads are mostly GPU AI/ML and you want topology-aware gang scheduling with DRA integration. NVIDIA-aligned.
- Kueue: best when you want a thin queue layer on top of the default scheduler and workloads mix CPU and GPU batch.
- Volcano: best for HPC-style MPI jobs and Slurm-trained operators.
MIG, MPS, and Time-Slicing: Three Ways to Share One GPU
Even with DRA and KAI, you still need to answer: does one Pod own a whole GPU, or do multiple Pods share one? NVIDIA offers three sharing models, each with different isolation, performance, and operational cost.
| Mode | Isolation | GPUs Supported | Max Tenants | Best For |
|---|---|---|---|---|
| MIG (Multi-Instance GPU) | Hardware -- separate memory, SM partitions, fault domains | A100, H100, H200, B100 | 7 per GPU | Production multi-tenant inference, SaaS |
| MPS (Multi-Process Service) | Software -- shared SMs, shared memory bus | Any CUDA GPU | ~48 clients (driver-capped) | Concurrent small inference jobs, low-latency batching |
| Time-slicing | None -- time-share via CUDA context switch | Any CUDA GPU | Unlimited (oversubscribe) | Dev clusters, Jupyter notebooks, CI |
| Whole GPU | Exclusive | Any | 1 | Large-model training, latency-critical inference |
These aren't mutually exclusive. A typical production cluster runs MIG on inference nodes, whole-GPU on training nodes, and time-slicing on the dev namespace. DRA lets you express all three as different DeviceClasses without separate node pools.
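As a sketch, whole-GPU and MIG pools can live side by side as separate DeviceClasses that claims target. The class names and the MIG-profile attribute below are assumptions -- check the attribute set your DRA driver reports before relying on them:

```yaml
# Hypothetical DeviceClass pair: whole H100s vs MIG 1g.10gb slices
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: h100-whole
spec:
  selectors:
  - cel:
      expression: "device.attributes['gpu.nvidia.com'].productName.startsWith('NVIDIA H100')"
---
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: mig-1g-10gb
spec:
  selectors:
  - cel:
      # assumes the driver exposes a MIG profile attribute for slices
      expression: "device.attributes['gpu.nvidia.com'].profile == '1g.10gb'"
```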
MIG: When Seven Slices of an A100 Beats One Whole GPU
MIG physically partitions a supported GPU into up to 7 isolated instances with dedicated SMs, L2 cache, memory channels, and fault domains. An H100 80GB can be sliced as 7 x 10GB, 4 x 20GB, or other valid profiles. Each slice appears as a separate CUDA device -- a Pod claims one slice and never sees the others. NVIDIA's MIG user guide documents the profile geometries.
MIG shines for many small inference services -- a Llama 3 8B at int4 fits in a 10GB slice and saturates it. Seven such services on one H100 deliver roughly 7x the throughput of running them serially, with memory isolation. I migrated a SaaS inference fleet from whole-A100s to MIG 7x in late 2024 and cut GPU cost per request by 68%.
```yaml
# MIG configuration applied by NVIDIA GPU Operator
apiVersion: v1
kind: ConfigMap
metadata:
  name: default-mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
      - devices: all
        mig-enabled: true
        mig-devices:
          "1g.10gb": 7
      mixed-inference:
      - devices: [0, 1]
        mig-enabled: true
        mig-devices:
          "2g.20gb": 3
          "1g.10gb": 1
      - devices: [2, 3]
        mig-enabled: false   # whole GPUs for training
```
Watch out: MIG reconfiguration requires a GPU reset, which means draining the node. Don't change MIG profiles on a running cluster without draining pods first. The GPU Operator's mig-manager handles the drain-reconfigure-uncordon cycle when you update the nvidia.com/mig.config node label (it reports progress via nvidia.com/mig.config.state), but I've seen it race with fast-reconciling controllers. Schedule MIG reconfigs during maintenance windows.
MPS: Concurrent Small Jobs on One GPU
MPS runs a server process that multiplexes CUDA streams from multiple clients onto one GPU, sharing SMs concurrently instead of time-slicing. Performance beats time-slicing for kernel-launch-heavy workloads but isolation is weaker than MIG -- a crash in one client can take down the MPS server. Use MPS for low-latency inference batching where MIG's 7-slice cap limits you; avoid it for multi-tenant production.
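If you do enable MPS through the GPU Operator, recent versions of the NVIDIA device plugin accept an MPS sharing stanza in their config file. A minimal sketch -- field layout follows the plugin's config schema, so confirm against the plugin version you're running:

```yaml
# Device-plugin config enabling MPS with 4 clients per GPU
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 4      # each GPU advertised as 4 MPS-shared replicas
```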
Time-Slicing: Dev Cluster Default
Time-slicing exposes one GPU as N replicas; the driver context-switches between clients. No isolation, no performance guarantees -- a compute-heavy notebook can freeze its neighbors. But for dev clusters where the GPU sits idle 90% of the time, letting 10 engineers each claim "a GPU" via time-slicing keeps costs reasonable. Configure it through the GPU Operator's device-plugin time-slicing config at 4-8 replicas per GPU.
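A minimal time-slicing sketch advertising each GPU as 8 replicas -- this uses the device plugin's documented config format, passed to the GPU Operator via devicePlugin.config; verify the keys against your operator version:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 8   # each physical GPU appears as 8 schedulable GPUs
```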
Production Setup: NVIDIA GPU Operator, DRA, and KAI
Getting this stack production-ready takes roughly 90 minutes on a fresh cluster. The sequence below is validated on EKS 1.34, GKE 1.34, and on-prem kubeadm with A100 and H100 nodes.
Step 1: Label GPU Nodes and Install the NVIDIA GPU Operator
The NVIDIA GPU Operator handles driver install, container runtime config, device plugin (legacy), MIG manager, DCGM exporter, and the NVIDIA container toolkit. Install it before DRA -- the operator provisions the kernel modules DRA relies on.
```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait \
  -n gpu-operator --create-namespace \
  gpu-operator nvidia/gpu-operator \
  --version v24.9.2 \
  --set driver.version=550.127.05 \
  --set toolkit.enabled=true \
  --set migManager.enabled=true \
  --set dcgmExporter.enabled=true
```
Verify DaemonSets reach Ready: nvidia-driver-daemonset, nvidia-container-toolkit, nvidia-mig-manager, nvidia-dcgm-exporter. If drivers fail, check kernel version against NVIDIA's compatibility matrix -- mismatches are the #1 cause of bricked GPU nodes.
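A quick way to script that readiness check: the helper below compares the DESIRED and READY columns of `kubectl get ds --no-headers` output. The sample line is a stand-in for live cluster output (an assumption -- on a real cluster you'd pipe kubectl into it):

```shell
#!/bin/sh
# Compare DESIRED (col 2) vs READY (col 4) for one DaemonSet status line.
check_ready() {
  echo "$1" | awk '{ exit ($2 == $4) ? 0 : 1 }'
}

# Live usage: kubectl get ds -n gpu-operator --no-headers | while read -r line; do ...
sample="nvidia-driver-daemonset 4 4 4 4 4 <none> 12d"
if check_ready "$sample"; then
  echo "nvidia-driver-daemonset: all replicas ready"
fi
```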
Step 2: Install the DRA Driver
The Kubernetes DRA docs cover the generic API; NVIDIA's driver implements it for GPUs. Install after the GPU Operator has deployed drivers.
```shell
helm install --wait \
  -n nvidia-dra-driver --create-namespace \
  nvidia-dra-driver nvidia/nvidia-dra-driver \
  --version v25.3.0 \
  --set resources.gpus.enabled=true \
  --set resources.computeDomains.enabled=true
```
The driver registers a DeviceClass named gpu.nvidia.com and advertises every GPU on each node through ResourceSlice objects. Check registration with kubectl get resourceslices -- each node's slice should list one entry per physical GPU (or per MIG slice if MIG is enabled).
Step 3: Install KAI Scheduler
KAI runs as a second scheduler alongside kube-scheduler. Pods opt in via schedulerName: kai-scheduler.
```shell
helm install --wait \
  -n kai-scheduler --create-namespace \
  kai-scheduler nvidia/kai-scheduler \
  --version v0.5.1 \
  --set global.registry=nvcr.io/nvidia/kai \
  --set queueController.enabled=true \
  --set podGrouper.enabled=true
```
Create at least one root Queue before submitting jobs -- PodGroups reference a queue and KAI rejects ungrouped GPU Pods. The queue hierarchy can mirror your team/project structure; I recommend a two-level tree (team -> project) for clusters serving 5+ teams.
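A minimal two-level tree to start from, reusing the Queue shape shown earlier -- quota numbers are placeholders; adjust per team:

```yaml
apiVersion: kai.scheduler/v1alpha1
kind: Queue
metadata:
  name: root
spec:
  resources:
    gpu:
      quota: 32          # total GPUs governed by KAI
      limit: 32
---
apiVersion: kai.scheduler/v1alpha1
kind: Queue
metadata:
  name: team-ml-research
spec:
  parent: root
  resources:
    gpu:
      quota: 16          # guaranteed
      limit: 32          # may burst into idle capacity
```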
Step 4: Submit a Gang-Scheduled Training Job
Combine DRA claim, PodGroup, and Job. The job fans out 8 replicas; KAI gang-schedules them atomically against an NVLink-correlated claim.
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: h100-nvlink-group
spec:
  spec:
    devices:
      requests:
      - name: gpus
        deviceClassName: gpu.nvidia.com
        count: 8
        selectors:
        - cel:
            expression: "device.attributes['gpu.nvidia.com'].productName.startsWith('NVIDIA H100')"
      constraints:
      - requests: ["gpus"]
        matchAttribute: "gpu.nvidia.com/nvlinkDomain"
```
The matchAttribute constraint forces all 8 GPUs onto the same NVLink domain -- critical for tensor-parallel training where cross-node bandwidth would otherwise bottleneck the job. For distributed training patterns across multiple nodes with InfiniBand, see deploying ML models in production on Kubernetes.
Step 5: Verify and Monitor
Use DCGM Exporter metrics scraped by Prometheus and Grafana to monitor GPU utilization, memory usage, and temperature. The key metric is DCGM_FI_DEV_GPU_UTIL -- if it's below 70% during training, you're likely bottlenecked on CPU dataloading or disk I/O, not GPU compute.
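That 70% threshold translates directly into an alert rule. A sketch assuming the standard DCGM Exporter metric and label names and a plain Prometheus rule file -- tune the for: window to your typical job length:

```yaml
groups:
- name: gpu-health
  rules:
  - alert: GPUUnderutilized
    # average compute utilization per node below 70% for 30 minutes
    expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL) < 70
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "GPU utilization under 70% -- check dataloader/disk I/O bottlenecks"
```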
Troubleshooting "No GPUs Available"
This is the error that eats your week. Root causes, ranked by frequency from my on-call rotations:
- Driver not loaded on the node. Check kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset. If it's CrashLoopBackOff, check journalctl -u containerd on the node for NVIDIA module load errors.
- DRA driver failed to register devices. kubectl get resourceslices should list every GPU. If it's empty, the NVIDIA DRA driver DaemonSet pod is probably stuck initializing.
- ResourceClaim selector matches nothing. CEL expressions fail silently when the attribute name is wrong. Log the full device attribute set with kubectl get resourceslices -o yaml and compare.
- MIG profile mismatch. The Pod requests a 1g.10gb slice but the node is partitioned as 2g.20gb. MIG config is per-node; use node labels to segment.
- KAI queue at quota. kubectl describe queue team-ml-research shows current usage vs quota. Burst capacity can be preempted if higher-priority work arrives.
- PodGroup minMember not met. KAI holds all replicas Pending until enough GPUs are available for the full group. Check kubectl describe podgroup for the reason.
Pro tip: Install the nvidia-smi static binary on all GPU nodes and alias a kubectl-debug pod that mounts /dev/nvidia*. When a training job reports "CUDA out of memory" or "no GPUs available," you can shell in and run nvidia-smi to see what's actually holding VRAM -- zombie processes from OOM-killed Pods are common on older kernels.
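A sketch of such a debug Pod -- privileged, pinned to the affected node, with the host's /dev mounted so nvidia-smi can see the devices. The node name and image tag are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-debug
spec:
  nodeName: gpu-node-01            # placeholder: the node you're debugging
  hostPID: true                    # see host processes holding VRAM
  containers:
  - name: debug
    image: nvcr.io/nvidia/cuda:12.6.2-base-ubuntu22.04   # placeholder tag
    command: ["sleep", "infinity"]
    securityContext:
      privileged: true
    volumeMounts:
    - name: dev
      mountPath: /dev
  volumes:
  - name: dev
    hostPath:
      path: /dev
```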
Cost and Economics: Why This Stack Pays for Itself
An 8x H100 SXM node on AWS (p5.48xlarge) runs roughly $98/hour on-demand in Q1 2026. Left idle for a day, that's roughly $2,350 burned. The DRA + KAI + MIG stack addresses this waste at three layers:
- DRA topology awareness: stops cross-socket training that runs at half speed. Typical recovery: 20-35% wall-clock reduction on tensor-parallel jobs.
- KAI gang scheduling and preemption: keeps expensive nodes saturated. Typical utilization lift: 60% baseline -> 85% sustained.
- MIG on inference nodes: 7x tenants per GPU for small models. Typical cost-per-request reduction: 60-70%.
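The arithmetic behind those numbers, using the article's assumed $98/hour rate and 8 GPUs per node:

```shell
#!/bin/sh
# Idle burn: $98/hour x 24 hours
hourly=98
daily=$((hourly * 24))
echo "daily idle burn: \$$daily"

# Effective cost per *utilized* GPU-hour at 60% vs 85% utilization:
# node rate / (GPUs per node x utilization fraction)
at60=$(awk 'BEGIN { printf "%.2f", 98 / (8 * 0.60) }')
at85=$(awk 'BEGIN { printf "%.2f", 98 / (8 * 0.85) }')
echo "cost per utilized GPU-hour: \$$at60 at 60%, \$$at85 at 85%"
```

A utilization lift from 60% to 85% cuts the effective price of every useful GPU-hour by roughly 30% -- before counting any MIG or topology gains.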
For capacity strategy beyond one cluster -- when to buy, when to rent -- see best GPU cloud for AI training and best GPU for LLMs. Own baseline, rent burst, never run training without checkpointing.
Recommended Stack by Workload Type
| Workload | Scheduler | Sharing Mode | Typical Claim |
|---|---|---|---|
| Large-model training (70B+) | KAI (PodGroup) | Whole GPU + NVLink constraint | 8x H100 NVLink island |
| Fine-tuning (7-13B) | KAI (PodGroup) | Whole GPU | 1-4x A100/H100 |
| Production inference (small models) | kube-scheduler | MIG 1g.10gb | 1 MIG slice |
| Production inference (large models) | kube-scheduler | Whole GPU or MIG 3g.40gb | 1x H100 or slice |
| Batched low-latency inference | kube-scheduler | MPS | Shared GPU, N clients |
| Dev notebook / Jupyter | kube-scheduler | Time-slicing | 1 of 8 time-slices |
| CI / model validation | kube-scheduler | Time-slicing or MIG | 1 slice, short TTL |
Frequently Asked Questions
What is DRA in Kubernetes?
Dynamic Resource Allocation is a Kubernetes API (GA in 1.34, released March 2026) that replaces the device-plugin framework for exposing specialized hardware to Pods. Pods declare ResourceClaim objects against a DeviceClass, and a DRA driver materializes the claim to a real device at bind time. It supports topology constraints, fine-grained sharing, and parameterized selection -- capabilities the old nvidia.com/gpu: 1 model couldn't express.
Do I need KAI Scheduler if I already have Kueue?
Not necessarily. Kueue provides queue-level admission and quotas on top of the default scheduler; KAI goes further with gang scheduling, topology-aware placement, and DRA integration specifically for GPU AI workloads. If your cluster runs mostly CPU batch jobs with occasional GPU use, Kueue is sufficient. If GPUs are the primary workload and you run distributed training that needs atomic scheduling, KAI is the stronger fit.
When should I use MIG vs time-slicing?
Use MIG for production multi-tenant workloads where isolation matters -- separate memory, SMs, and fault domains prevent one tenant from starving or crashing another. Use time-slicing for dev clusters where the GPU sits idle most of the time and oversubscription is fine. MIG only works on A100, H100, H200, and B100-class GPUs; time-slicing works on any CUDA GPU. Never rely on time-slicing for production SLOs.
Does the NVIDIA GPU Operator support DRA?
The GPU Operator installs the driver stack and manages MIG, but DRA is delivered by a separate NVIDIA component -- the k8s-dra-driver -- installed via its own Helm chart. Both are required in 2026: the GPU Operator for drivers and MIG management, and the DRA driver for exposing devices through the ResourceClaim API. NVIDIA plans tighter integration but as of GPU Operator 24.9 they remain separate charts.
Can I mix MIG and whole-GPU nodes in one cluster?
Yes, and you should. Configure MIG per-node via the GPU Operator's MIG manager -- inference nodes run with 7-slice profiles, training nodes stay whole. Use node labels and DeviceClasses to segment; DRA claims then match the right pool automatically. This is the standard production layout on clusters I've run, and it avoids the fragmentation you'd get from forcing one mode across the whole fleet.
How do I monitor GPU utilization on Kubernetes?
Install NVIDIA DCGM Exporter (ships with the GPU Operator) and scrape its Prometheus endpoint. Key metrics are DCGM_FI_DEV_GPU_UTIL (compute utilization), DCGM_FI_DEV_FB_USED (memory in use), and DCGM_FI_DEV_GPU_TEMP (thermal). Build Grafana dashboards per-node and per-Pod. If utilization is below 70% during training, the bottleneck is usually CPU dataloading or disk I/O, not GPU compute.
What happens when a DRA claim can't be satisfied?
The Pod stays in Pending and an event on the ResourceClaim describes why -- usually "no device matches selector" or "quota exceeded." Unlike device-plugin scheduling failures, DRA exposes structured reasons via the claim object, which makes automation cleaner. Under KAI, an unsatisfiable claim in a PodGroup blocks the entire group from scheduling until capacity appears or the group's priority lets it preempt lower-priority work.
The Bottom Line
Production-grade Kubernetes GPU scheduling in 2026 is three layers, not one. DRA gives you the device abstraction -- declarative claims instead of opaque counts, with real topology awareness. KAI Scheduler gives you the workload abstraction -- gang scheduling and queues built for AI, not retrofitted from CPU batch. MIG, MPS, and time-slicing give you the sharing mechanics -- hardware-isolated tenants on A100/H100 for production, soft-shared GPUs for dev. Getting all three working together takes a day of setup and years off your GPU cost curve. If you're still on nvidia.com/gpu: 1 in 2026, the Kubernetes platform moved on without you -- and your finance team will notice before your SREs do.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.