
Best Tools for Managing GPU Usage in Kubernetes in 2025 and 2026

A practical guide to the Kubernetes GPU management stack in 2026, covering operator setup, metrics, autoscaling, and the commands teams actually run.

GPU waste is the quiet budget problem that compounds in silence. Production Kubernetes clusters run at 30–40% average GPU utilization, according to infrastructure benchmarks from cloud providers and MLOps teams. An idle H100 on GKE costs roughly $30 per day — before you add storage, networking, and operator overhead. At scale, that idle capacity turns into a six-figure annual burn without a single additional workload running.
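The arithmetic behind that claim is easy to sketch. The numbers below reuse the article's rough estimates ($30/day per idle H100, 30–40% average utilization); the fleet size is illustrative:

```python
# Back-of-envelope estimate of annual burn from idle GPU capacity.
DAILY_COST_PER_GPU = 30    # USD/day for an H100 on GKE (the article's rough figure)
AVG_UTILIZATION = 0.35     # midpoint of the 30-40% range cited above

def annual_idle_burn(gpu_count: int) -> float:
    """Annual cost of the idle fraction of a GPU fleet, in USD."""
    idle_fraction = 1 - AVG_UTILIZATION
    return gpu_count * DAILY_COST_PER_GPU * idle_fraction * 365

# A modest 16-GPU fleet at 35% average utilization:
print(round(annual_idle_burn(16)))  # -> 113880: six figures from idle time alone
```

The point is not the exact number but the shape: the idle fraction multiplies linearly with fleet size and runs every day whether or not workloads ship.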

The good news: the tooling stack for GPU management on Kubernetes has largely stabilized by 2025 and 2026. The NVIDIA GPU Operator handles installation end-to-end. DCGM Exporter feeds utilization metrics to Prometheus. Karpenter handles node-level autoscaling with GPU-aware instance selection. And two techniques that spent years in beta — GPU time-slicing and Multi-Instance GPU (MIG) — are now production-stable and in widespread use on A100 and H100 hardware.

This article walks through each layer of that stack, with the actual kubectl commands engineers run to install, verify, debug, and operate every tool.


Layer 1: NVIDIA GPU Operator — Installation and Verification

The GPU Operator is the starting point for any managed GPU stack on Kubernetes. Before it existed, enabling GPUs on a node required manually installing drivers, the NVIDIA container toolkit, the device plugin, and the DCGM exporter — often in the wrong order, with version mismatches causing silent failures. The GPU Operator collapses that into a single Helm chart.

The Operator runs as a set of DaemonSets and manages the full lifecycle: driver installation, container runtime patching, device plugin configuration, and metrics export. If you are starting fresh on a GPU cluster in 2025 or 2026, the GPU Operator is the correct install path. Manual component installs are now a legacy pattern.

Install the GPU Operator via Helm:

# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator — handles driver, toolkit, device plugin, and DCGM in one chart
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true

Verify all components are running:

# Check all GPU Operator pods are healthy
kubectl get pods -n gpu-operator

# Confirm DaemonSets are fully scheduled
kubectl get daemonset -n gpu-operator

# Verify GPU resources are visible on nodes
kubectl get nodes -o json | \
  jq '.items[] | {name: .metadata.name, gpu: .status.capacity["nvidia.com/gpu"]}'

# Check node labels applied by GPU Feature Discovery
kubectl get nodes --show-labels | grep "nvidia.com"

A healthy install shows every DaemonSet pod in Running state. If a node shows 0 GPU capacity after installation, the most common causes are: driver version mismatch for the kernel, container runtime not configured to use the NVIDIA runtime, or a node that needs a reboot after driver install. Check kubectl describe pod <driver-pod> -n gpu-operator for the root cause before assuming a chart issue.

The device plugin component is what makes GPUs schedulable as nvidia.com/gpu resources. Without it, pods requesting GPUs stay Pending indefinitely. The standalone nvidia-device-plugin DaemonSet predates the Operator; the Operator now manages that component automatically, so do not run both.
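As a quick sketch, a minimal pod that exercises the device plugin (the pod name and image tag here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvidia/cuda:12.0.0-base-ubuntu22.04
    command: ["nvidia-smi"]       # prints the GPU the device plugin allocated
    resources:
      limits:
        nvidia.com/gpu: "1"       # schedulable only where the plugin advertises GPUs
```

If this pod stays Pending, the device plugin is not advertising capacity on any node; check the Operator DaemonSets before debugging the workload.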


Layer 2: GPU Time-Slicing — Configuration and Validation

Time-slicing allows multiple pods to share a single physical GPU through workload-level context switching. It is the right tool when you have inference workloads with bursty, low-sustained GPU utilization — a common pattern for serving smaller models where each request saturates the GPU briefly and then goes idle.

The tradeoff is isolation: time-slicing does not enforce memory limits between pods. Two pods sharing one GPU can run each other out of VRAM. For multi-tenant environments or training workloads, use MIG instead. For dedicated inference clusters with trusted workloads, time-slicing with a replica factor of 2–4 reliably improves GPU utilization without hardware changes.

Create and apply a time-slicing ConfigMap:

# Define the time-slicing configuration
cat > time-slicing-config.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4
EOF
kubectl apply -f time-slicing-config.yaml

# Point the GPU Operator's cluster policy at the new config
# (the Helm install creates a cluster-scoped ClusterPolicy named "cluster-policy")
kubectl patch clusterpolicy cluster-policy \
  --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'

# Verify — the node should now advertise 4x its physical GPU count
kubectl get node <gpu-node> -o jsonpath='{.status.capacity.nvidia\.com/gpu}'

# Confirm the ConfigMap is stored correctly
kubectl get configmap -n gpu-operator time-slicing-config -o yaml

Test scheduling multiple pods against one physical GPU:

# Schedule four test pods, each requesting one time-slice of the same physical GPU
for i in 1 2 3 4; do
  kubectl run test-gpu-$i --image=nvidia/cuda:12.0.0-base-ubuntu22.04 --restart=Never \
    --overrides='{"spec":{"containers":[{"name":"test","image":"nvidia/cuda:12.0.0-base-ubuntu22.04","command":["sleep","300"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'
done

# Watch scheduling; all four pods should land on the same node
kubectl get pods -o wide -w

When replicas: 4 is set and you have one physical GPU, kubectl get node <gpu-node> -o jsonpath='{.status.capacity.nvidia\.com/gpu}' should return 4. That is the expected behavior — not a bug. The node is advertising four schedulable units backed by one physical device.


Layer 3: MIG (Multi-Instance GPU) — A100 and H100 Partitioning

MIG provides hardware-level GPU partitioning. Unlike time-slicing, MIG creates isolated compute and memory instances at the hardware level — each MIG slice has guaranteed memory bandwidth and isolation from other slices on the same card. This makes it suitable for multi-tenant training workloads, regulated environments, and any scenario where VRAM contention would be unacceptable.

MIG is available on A100 and H100 GPUs. The most common MIG profiles in 2025 and 2026 for the A100 80GB are 1g.10gb (seven instances per card, 10GB each) and 3g.40gb (two instances per card, 40GB each). The H100 80GB supports analogous profiles.

Enable MIG and request a slice in a pod:

# Label the node to enable MIG with the all-1g.10gb profile
kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb

# Verify MIG labels were applied by the MIG Manager
kubectl get node <gpu-node> -o jsonpath='{.metadata.labels}' | \
  jq 'with_entries(select(.key | startswith("nvidia.com/mig")))'

# Check MIG capacity visible in node Capacity section
kubectl describe node <gpu-node> | grep -A10 "Capacity:"

# Deploy a pod requesting a specific MIG slice
cat > mig-pod.yaml << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: mig-test
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.0.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/mig-1g.10gb: "1"
EOF
kubectl apply -f mig-pod.yaml

# Confirm the pod scheduled and is running
kubectl describe pod mig-test | grep -E "Node:|Status:|nvidia.com/mig"

MIG profiles show up as distinct resource types in kubectl describe node — you will see nvidia.com/mig-1g.10gb, nvidia.com/mig-2g.20gb, and so on, depending on the active profile. If a node has seven 1g.10gb instances configured, it can run seven fully isolated CUDA workloads simultaneously. The DCGM Exporter reports per-instance metrics for each MIG slice, so observability is maintained.

For teams deciding between time-slicing and MIG, the rule of thumb is straightforward: time-slicing for inference, MIG for training or multi-tenancy.


Layer 4: DCGM Exporter — GPU Metrics to Prometheus

The Data Center GPU Manager (DCGM) Exporter is the standard mechanism for GPU metrics in Kubernetes. It ships as a DaemonSet, queries the NVIDIA management library on each node, and exposes metrics on port 9400 in Prometheus format. When you install the GPU Operator with dcgmExporter.enabled=true, the exporter is already running.

The DCGM-to-Prometheus pipeline requires no additional instrumentation in your workloads. The metrics are hardware-level: utilization, memory, power draw, temperature, and error counters. Grafana dashboards for DCGM are available in NVIDIA's dashboard catalog and integrate directly with these metrics.

Verify and query DCGM metrics:

# Confirm the DCGM exporter DaemonSet is healthy
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

# Check GPU utilization directly via dcgmi inside the pod
kubectl exec -n gpu-operator <dcgm-pod> -- dcgmi dmon -e 203,204

# Port-forward the metrics endpoint locally and query it
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &
curl http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL

Key metrics for GPU monitoring in Kubernetes:

Metric                      Description
DCGM_FI_DEV_GPU_UTIL        GPU compute utilization (0–100%)
DCGM_FI_DEV_MEM_COPY_UTIL   Memory bandwidth utilization
DCGM_FI_DEV_FB_USED         Framebuffer (GPU memory) used, in MB
DCGM_FI_DEV_FB_FREE         Framebuffer free, in MB
DCGM_FI_DEV_POWER_USAGE     Power draw in watts
DCGM_FI_DEV_XID_ERRORS      XID error counter (driver/hardware faults)

A practical Prometheus alert for GPU idle detection: fire when DCGM_FI_DEV_GPU_UTIL < 5 persists for 30 minutes. That threshold identifies nodes where a workload has crashed or stalled but the pod remains running and holding the GPU resource. Combined with a DeploymentAvailable alert, this catches the common failure mode of a model server that started but never received traffic.
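As a sketch, that idle-detection alert can be expressed as a PrometheusRule, assuming Prometheus Operator CRDs are installed and DCGM metrics are being scraped; the rule name, namespace, and severity label are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-idle-alert
  namespace: gpu-operator
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUIdle
      # Fires when a GPU averages under 5% utilization for 30 minutes straight
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]) < 5
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} idle for 30m"
```

The gpu and Hostname labels follow the DCGM exporter's defaults; verify them against your exporter version before relying on the annotation.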


Layer 5: Karpenter — GPU-Aware Autoscaling on EKS

Karpenter is the node autoscaler that has largely replaced the Kubernetes Cluster Autoscaler as the standard approach on EKS as of 2024 and 2025. For GPU workloads specifically, Karpenter's instance-type awareness is the critical differentiator: you can express GPU requirements at the hardware level (instance family, GPU model, count) and Karpenter will select the right EC2 instance and provision it in minutes.

Karpenter's GPU autoscaling works with both spot and on-demand capacity. For batch training jobs that tolerate interruption, spot GPU instances reduce cost by 60–70% on typical workloads. For latency-sensitive inference, on-demand capacity with a consolidation policy keeps nodes active but rightsizes them as load changes.

Install Karpenter and configure a GPU NodePool:

# Install Karpenter via Helm on EKS
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version v0.37.0 \
  -n karpenter --create-namespace \
  --set settings.clusterName=my-cluster \
  --set settings.interruptionQueue=my-cluster

# Define a GPU-specific NodePool
cat > gpu-nodepool.yaml << 'EOF'
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    spec:
      requirements:
      - key: karpenter.k8s.aws/instance-gpu-name
        operator: In
        values: ["a100", "h100"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      nodeClassRef:                          # v1beta1 NodePools require a nodeClassRef
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default                        # assumes an EC2NodeClass named "default" exists
  limits:
    nvidia.com/gpu: "16"
EOF
kubectl apply -f gpu-nodepool.yaml

# Watch Karpenter logs for GPU node provisioning activity
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --follow | grep gpu

# Check nodes Karpenter has provisioned
kubectl get nodes -l karpenter.sh/nodepool=gpu-nodepool

The limits field on the NodePool (nvidia.com/gpu: "16") caps total GPU allocation managed by this pool. This is a hard guardrail against runaway workloads consuming unbounded GPU capacity — set it to a value that matches your cost ceiling, not your theoretical maximum.


Finding and Fixing GPU Waste with kubectl

Idle GPU nodes are the highest-impact cost problem in most GPU-enabled clusters. The commands below form a practical audit workflow — run them on a schedule or before a billing review.

Identify GPU workloads and idle nodes:

# Find all pods requesting GPU resources, across all namespaces
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(.spec.containers[].resources.requests["nvidia.com/gpu"] != null) |
    {ns: .metadata.namespace, name: .metadata.name,
     gpu: .spec.containers[].resources.requests["nvidia.com/gpu"],
     status: .status.phase}'

# Find GPU nodes with no GPU pods scheduled (idle GPU nodes burning money)
kubectl get nodes -l "nvidia.com/gpu.present=true" -o name | while read node; do
  pods=$(kubectl get pods --all-namespaces \
    --field-selector=spec.nodeName=${node#node/} \
    -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | wc -w)
  echo "$node: $pods pods"
done

# Check for pending pods blocked on GPU resource constraints
kubectl get pods --all-namespaces --field-selector=status.phase=Pending -o json | \
  jq '.items[] | select(.status.conditions[]?.reason == "Unschedulable") |
    {ns: .metadata.namespace, name: .metadata.name}'

# Get GPU-related scheduling failure events
kubectl get events --all-namespaces \
  --field-selector reason=FailedScheduling | grep -i gpu

# Check GPU node capacity at a glance
kubectl describe node <node-name> | grep -A5 "Capacity:"

The idle node check is the most actionable. A node with zero pods and nvidia.com/gpu.present=true is consuming full instance cost. If Karpenter's consolidation policy is enabled (consolidationPolicy: WhenEmptyOrUnderutilized), that node should drain and terminate automatically. If it is not terminating, check whether a DaemonSet pod is pinned to it — that prevents consolidation.
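For reference, consolidation lives in the NodePool's disruption block. A minimal fragment of the gpu-nodepool spec from the Karpenter section (note the enum spelling varies by API version):

```yaml
spec:
  disruption:
    # Drain and terminate GPU nodes whose remaining pods fit elsewhere.
    # On the v1beta1 API used above this value is WhenUnderutilized;
    # WhenEmptyOrUnderutilized is the v1 spelling.
    consolidationPolicy: WhenUnderutilized
```

DaemonSet pods do not block consolidation by themselves; it is non-DaemonSet pods without a safe eviction path (e.g. blocking PodDisruptionBudgets) that typically pin a node.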


Tool Comparison Table

Tool                 What it does                                               Install method                      K8s-native       Best for
NVIDIA GPU Operator  Full-stack install (driver, toolkit, device plugin, DCGM)  Helm chart                          Yes              All K8s platforms
DCGM Exporter        GPU metrics to Prometheus                                  Included in GPU Operator            Yes (DaemonSet)  Prometheus / Grafana stacks
Time-Slicing         Share one GPU across multiple pods                         ConfigMap + ClusterPolicy patch     Yes              Inference serving, low-contention workloads
MIG                  Hardware-level (hard) GPU partitioning                     Node label (nvidia.com/mig.config)  Yes              Multi-tenant training, A100/H100 clusters
Karpenter            GPU-aware node autoscaling                                 Helm                                EKS-native       EKS clusters with variable GPU demand

Clanker Cloud for GPU Operations

The kubectl commands above work, but they require knowing the right query for each situation, piping to jq, and reading through node descriptions manually. For day-to-day GPU operations, Clanker Cloud wraps the same operations in natural language queries that execute against your live clusters.

# Install the Clanker CLI
brew tap clankercloud/tap && brew install clanker

# Query GPU node status across clusters
clanker ask "show me all GPU nodes across my clusters and their current utilization"

# Find blocked GPU workloads
clanker ask "find pods requesting GPU resources that have been pending for more than 10 minutes"

# Namespace-level GPU memory breakdown
clanker ask "which namespaces have the highest GPU memory usage right now"

# Cross-cloud GPU cost query
clanker ask "what is my total GPU spend across EKS and GKE this month"

For cluster-wide audits, the Deep Research feature fans out across every connected provider simultaneously and returns severity-ranked findings:

clanker ask "run a deep scan of my GPU infrastructure — idle nodes, pending pods, driver version mismatches, cost anomalies"

This is particularly useful before billing reviews or when onboarding a new cluster where GPU utilization baselines are unknown. Clanker Cloud supports AWS, GCP, Azure, and Kubernetes simultaneously, so cross-cloud GPU inventory queries return results from all providers in a single response.

Platform teams using Clanker Cloud alongside the tools described here report it as a complement to their existing kubectl workflows, not a replacement — see the AI DevOps for teams overview for how the two layers fit together. The vibe coding to production guide covers how Clanker Cloud integrates with deployment workflows more broadly.

Full documentation for connecting clusters and configuring providers is at docs.clankercloud.ai. MCP server configuration for agent-based GPU monitoring is documented at /for-ai-agents.md.


FAQ

How do I check GPU utilization in Kubernetes?

The standard approach uses DCGM Exporter, which is included in the NVIDIA GPU Operator. Once installed, port-forward the metrics service and query Prometheus-format metrics:

kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400
curl http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL

The metric DCGM_FI_DEV_GPU_UTIL returns per-GPU utilization as a value from 0 to 100. In a Prometheus + Grafana stack, this feeds directly into GPU dashboards without additional configuration.
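A few example PromQL queries over this metric; the Hostname and gpu label names follow the DCGM exporter's defaults and may vary by version:

```promql
# Average compute utilization per node over the past hour
avg by (Hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]))

# GPUs currently above 90% utilization (possible saturation)
DCGM_FI_DEV_GPU_UTIL > 90

# Fraction of GPU memory in use, per GPU
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
```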

What is the difference between GPU time-slicing and MIG in Kubernetes?

Time-slicing is software-level sharing: multiple pods share the same GPU through context switching, with no memory isolation between them. It is straightforward to enable via a ConfigMap and works on any NVIDIA GPU. MIG (Multi-Instance GPU) is hardware-level partitioning available only on A100 and H100 GPUs. Each MIG slice has guaranteed, isolated compute and memory resources. Use time-slicing for inference workloads with trusted tenants. Use MIG for multi-tenant environments or any workload where VRAM isolation is required.

How do I find idle GPU nodes in Kubernetes?

# List GPU nodes and the number of pods running on each
kubectl get nodes -l "nvidia.com/gpu.present=true" -o name | while read node; do
  pods=$(kubectl get pods --all-namespaces \
    --field-selector=spec.nodeName=${node#node/} \
    -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | wc -w)
  echo "$node: $pods pods"
done

Nodes reporting zero pods with GPU capacity are idle. Combine this with DCGM utilization metrics to identify nodes where a pod is running but the GPU is not being used.

How do I set up GPU autoscaling in Kubernetes with Karpenter?

Install Karpenter via Helm on EKS, then create a NodePool resource that specifies GPU instance requirements using the karpenter.k8s.aws/instance-gpu-name label. Set a limits block to cap total GPU allocation. When a pod requesting nvidia.com/gpu cannot be scheduled, Karpenter reads the NodePool requirements, selects the appropriate GPU instance, and provisions the node automatically. For full configuration, see the kubectl commands in the Karpenter section above.


Start Managing GPU Costs

The best tools for managing GPU usage in Kubernetes in 2025 and 2026 work as a layered stack: the GPU Operator for installation, DCGM for observability, time-slicing or MIG for utilization improvement, and Karpenter for capacity management. Each layer addresses a distinct failure mode — installation drift, blind spots in utilization, GPU waste from over-provisioning, and idle node cost from static capacity.

Running the idle node audit and DCGM alert setup described here typically surfaces immediate cost reduction opportunities in clusters that have not been actively instrumented.

To see how Clanker Cloud handles GPU operations queries across multi-cloud environments, book a demo or create a free account. For common questions about setup and supported providers, see the FAQ.


Need the product-level answer?

Use the DevOps page for the canonical product answer on Kubernetes investigation, cost context, and local-first ops workflows.

Read the DevOps workflow page