
Best Tools for Managing GPU Usage in Kubernetes in 2026

Merged into the canonical Kubernetes GPU management guide to keep one stable tools page for the topic.


GPU clusters are expensive. GPU clusters that sit idle are more expensive. And yet, across production ML environments in 2026, average GPU utilization hovers between 30 and 40 percent — a figure consistent with industry observations from teams running training and inference at scale. The remaining 60 to 70 percent is largely waste: batch training jobs that leave GPUs idle between epochs, inference services over-provisioned to hit SLA targets, and scheduling decisions made without visibility into actual device utilization.

The math is blunt. A single idle H100 on GKE costs roughly $30 per day at on-demand rates. A cluster of 32 H100s running at 35% utilization leaves 65% of its capacity unused, which works out to 32 × $30 × 0.65 ≈ $624 per day. At that scale, "managing GPU usage" stops being an operational nicety and becomes a meaningful budget item.

Managing GPU usage in Kubernetes covers five distinct problems: allocation (which pods get GPUs and how), scheduling (where workloads land), monitoring (what the devices are actually doing), right-sizing (requesting only what workloads need), and chargeback (who is responsible for what spend). Each layer has dedicated tooling, and most production environments need all five.

This article walks through that stack layer by layer, then shows how Clanker Cloud connects them into a single queryable interface.


Layer 1: Allocation — NVIDIA Device Plugin

The starting point for any GPU workload on Kubernetes is the NVIDIA Device Plugin. Deployed as a DaemonSet, it registers nvidia.com/gpu as a schedulable Kubernetes resource, making GPUs visible to the scheduler the same way CPU and memory are. Without it, pods cannot request GPU capacity at all.

The default behavior is whole-GPU allocation. A pod requesting nvidia.com/gpu: 1 gets an entire physical GPU — even if it only uses 5% of its compute capacity. For many inference serving workloads, this is pure waste.
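As a minimal sketch of what whole-GPU allocation looks like in a manifest, the pod below requests one nvidia.com/gpu. The pod name, container name, and image are placeholders chosen for illustration, not taken from any real deployment.

```yaml
apiVersion: v1
kind: Pod
metadata:
    name: inference-example                # hypothetical name
spec:
    restartPolicy: Never
    containers:
        - name: model-server               # hypothetical container name
          image: registry.example.com/model-server:latest   # placeholder image
          resources:
              limits:
                  # Extended resources are requested in whole units; Kubernetes
                  # sets the request equal to this limit automatically.
                  nvidia.com/gpu: 1
```

However little of the device the container actually uses, the scheduler treats that GPU as fully consumed until the pod exits.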

Time-slicing addresses this by allowing multiple pods to share a single physical GPU through temporal multiplexing. Configuration is done in a ConfigMap passed to the Device Plugin:

version: v1
sharing:
    timeSlicing:
        resources:
            - name: nvidia.com/gpu
              replicas: 4

With replicas: 4, a single GPU appears as four schedulable units. Pods sharing the device take turns accessing it in time slices. There is no memory isolation — all pods share the full VRAM — which means this approach is appropriate for inference serving with lightweight models, not for multi-tenant training where one job's memory footprint cannot be allowed to affect another.
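One way to deliver that configuration is to wrap it in a ConfigMap. The sketch below assumes the NVIDIA GPU Operator is managing the Device Plugin; the ConfigMap name, the namespace, and the key name `any` are illustrative assumptions.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
    name: time-slicing-config      # assumed name
    namespace: gpu-operator        # assumed namespace of the GPU Operator
data:
    # The key name is arbitrary; it is the handle used to select this config.
    any: |-
        version: v1
        sharing:
            timeSlicing:
                resources:
                    - name: nvidia.com/gpu
                      replicas: 4
```

With the GPU Operator, the ClusterPolicy's devicePlugin.config field is what points the plugin at this ConfigMap. The standalone Device Plugin Helm chart wires configuration differently, so check the chart values for the version you run.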

MIG (Multi-Instance GPU) is the alternative for A100 and H100 hardware. It partitions a single physical GPU into up to 7 isolated instances, each with its own dedicated compute and memory. A 7-instance MIG partition on an A100 80GB gives each tenant roughly 10GB of dedicated VRAM with hard isolation between workloads. The tradeoff: MIG requires pre-configuring partition profiles before workloads run, and changing profiles requires the GPU to be free of running workloads, which in practice means draining the node.
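For illustration, a pod can request one of those isolated instances by name. This sketch assumes the Device Plugin runs with the "mixed" MIG strategy, where each profile is advertised as its own resource; under the "single" strategy the slices appear as plain nvidia.com/gpu instead. The pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
    name: mig-tenant-example               # hypothetical name
spec:
    restartPolicy: Never
    containers:
        - name: trainer
          image: registry.example.com/trainer:latest   # placeholder image
          resources:
              limits:
                  # One 1g.10gb slice of an A100 80GB (mixed strategy naming)
                  nvidia.com/mig-1g.10gb: 1
```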

The practical decision rule: use time-slicing for inference serving where jobs are short-lived and memory-light; use MIG for multi-tenant training environments where isolation between jobs is a hard requirement.


Layer 2: Monitoring — DCGM Exporter + Prometheus

Knowing that GPUs are allocated is not the same as knowing they are being used. The standard Kubernetes GPU monitoring stack in 2026 is DCGM Exporter feeding Prometheus, with Grafana for visualization.

DCGM (Data Center GPU Manager) is NVIDIA's official telemetry and management library. The DCGM Exporter runs as a DaemonSet and exposes per-GPU metrics in Prometheus format. The three metrics every GPU monitoring setup should track:

  • DCGM_FI_DEV_GPU_UTIL — GPU compute utilization as a percentage (0–100)
  • DCGM_FI_DEV_MEM_COPY_UTIL — Memory copy engine utilization, a proxy for data transfer activity
  • DCGM_FI_DEV_FB_USED — Framebuffer (VRAM) used in megabytes

A minimal alert rule to catch idle GPUs:

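groups:
    - name: gpu-idle
      rules:
          - alert: GPUIdle
            # Average utilization over the trailing 30 minutes, per GPU
            expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 5
            # The condition must then hold for a further 30 minutes before firing
            for: 30m
            labels:
                severity: warning
            annotations:
                # Hostname is the node label DCGM Exporter attaches by default
                summary: "GPU {{ $labels.gpu }} on node {{ $labels.Hostname }} has averaged under 5% utilization for at least 30 minutes"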

This catches the most common case: a training job that crashed or finished without releasing its GPU reservation.
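The rule above also fires for GPUs that nothing has reserved. A hedged variant that narrows it to GPUs currently attached to a pod is sketched below. It assumes DCGM Exporter's Kubernetes pod mapping is enabled and that the labels arrive as `pod` and `namespace`; some scrape configurations rename them to `exported_pod` and `exported_namespace`. The group and alert names are made up for the example.

```yaml
groups:
    - name: gpu-idle-allocated             # hypothetical group name
      rules:
          - alert: GPUAllocatedButIdle     # hypothetical alert name
            # pod!="" keeps only GPUs that the exporter has mapped to a pod
            expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL{pod!=""}[30m]) < 5
            for: 30m
            labels:
                severity: warning
            annotations:
                summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} holds GPU {{ $labels.gpu }} but has averaged under 5% utilization"
```

Routing on the pod and namespace labels makes it easier to send the alert to the team that owns the reservation rather than to the platform on-call.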

GKE Managed Prometheus has DCGM pre-integrated through the GKE GPU telemetry stack — enabling it is a cluster-level flag, no DaemonSet management required. EKS and AKS require manual DCGM Exporter deployment and Prometheus scrape configuration. The setup is straightforward but adds operational surface area that managed offerings eliminate.

For teams also instrumenting application-level infrastructure, DCGM fits naturally into the same Prometheus stack used for CPU and memory monitoring. If you are already running Prometheus for general cluster monitoring, adding DCGM scraping is a configuration change, not a new system.
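On clusters without a managed integration, that configuration change might look like the following. It is a sketch that assumes Prometheus is run by the Prometheus Operator; the namespaces, the Service label, and the port name are assumptions that need to match your DCGM Exporter installation.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
    name: dcgm-exporter
    namespace: monitoring                  # assumed namespace watched by Prometheus
spec:
    namespaceSelector:
        matchNames:
            - gpu-monitoring               # assumed namespace of the exporter Service
    selector:
        matchLabels:
            app.kubernetes.io/name: dcgm-exporter   # assumed Service label
    endpoints:
        - port: metrics                    # assumed name of the Service port
          interval: 30s
```

The Prometheus resource must also select this ServiceMonitor through its serviceMonitorSelector, or the target will never be scraped.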


Layer 3: Visibility — Node Feature Discovery + GPU Feature Discovery

Scheduling the right workload to the right node requires accurate node labeling. Node Feature Discovery (NFD) and GPU Feature Discovery (GFD) work together to label Kubernetes nodes with hardware and driver metadata at startup.

GFD applies labels including:

  • nvidia.com/gpu.product — GPU model (e.g., A100-SXM4-80GB)
  • nvidia.com/cuda.runtime.major / nvidia.com/cuda.runtime.minor — CUDA runtime version
  • nvidia.com/driver.major — Driver major version

Without these labels, a pod compiled against CUDA 12.x can land on a node running CUDA 11.x and fail at runtime — a silent scheduling failure that only surfaces in logs after a job has been queued and started. With GFD-applied labels, node affinity rules can enforce nvidia.com/cuda.runtime.major: "12" as a scheduling constraint, catching the mismatch before the pod starts.
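A minimal sketch of such a constraint follows. The pod name and image are placeholders, and the label key is the one named above; GFD versions differ in the exact CUDA label keys they emit, so confirm against `kubectl get nodes --show-labels` before relying on it.

```yaml
apiVersion: v1
kind: Pod
metadata:
    name: cuda12-job-example               # hypothetical name
spec:
    restartPolicy: Never
    affinity:
        nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                    - matchExpressions:
                          # Label values are strings, so the version is quoted
                          - key: nvidia.com/cuda.runtime.major
                            operator: In
                            values:
                                - "12"
    containers:
        - name: trainer
          image: registry.example.com/trainer-cuda12:latest   # placeholder image
          resources:
              limits:
                  nvidia.com/gpu: 1
```

If no node carries a matching label, the pod stays Pending with a scheduling event explaining why, which is far easier to diagnose than a CUDA error inside a running container.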

This becomes particularly important in heterogeneous clusters running mixed GPU generations across node pools — a common configuration in 2026 as teams upgrade hardware incrementally without replacing entire clusters.


Layer 4: Right-Sizing and Chargeback

Allocation and monitoring tell you what is happening. Right-sizing changes what happens next.

The Vertical Pod Autoscaler (VPA) does not natively understand GPU resources. It handles CPU and memory, but nvidia.com/gpu is outside its resource model. GPU right-sizing in Kubernetes is therefore still a manual or semi-automated process in most clusters. The practical approach:

  1. Collect DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_GPU_UTIL over a 7-day rolling window
  2. Calculate p95 utilization per workload
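  3. Adjust resources.requests and resources.limits in pod specs to match p95 plus a headroom buffer (typically 20%). Because nvidia.com/gpu is requested in whole units, this usually means moving the workload to a smaller MIG profile, a time-sliced GPU, or fewer GPUs per pod.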

This removes systematic over-provisioning without exposing workloads to resource contention.
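The first two steps can be approximated with Prometheus recording rules. The sketch below is one possible shape, not a standard: the rule names are invented, the `pod` and `namespace` labels assume DCGM Exporter's pod mapping is enabled, and a 7-day range query requires at least 7 days of retention and can be expensive to evaluate.

```yaml
groups:
    - name: gpu-rightsizing                # hypothetical group name
      rules:
          # p95 compute utilization per pod (busiest GPU for multi-GPU pods)
          - record: workload:gpu_util:p95_7d
            expr: max by (namespace, pod) (quantile_over_time(0.95, DCGM_FI_DEV_GPU_UTIL{pod!=""}[7d]))
          # p95 framebuffer use per pod, in the exporter's native unit
          - record: workload:gpu_fb_used:p95_7d
            expr: max by (namespace, pod) (quantile_over_time(0.95, DCGM_FI_DEV_FB_USED{pod!=""}[7d]))
          # Step 3 target: p95 memory plus 20% headroom
          - record: workload:gpu_fb_used:p95_7d_headroom
            expr: 1.2 * workload:gpu_fb_used:p95_7d
```

Pod names change on every restart, so grouping by a stable workload label, for example by joining against kube-state-metrics, gives a more useful history for long-running services.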

Kubecost adds the financial layer. Its GPU cost allocation feature maps GPU hours consumed per namespace, label set, or team to dollar amounts, using real-time cloud pricing data. This enables show-back reports: which team consumed what GPU capacity, at what cost, over a given period. Show-back reporting drives behavior change without enforcement — teams with cost visibility consistently right-size more aggressively than teams without it.

For teams also working on AI DevOps more broadly, chargeback visibility at the namespace level is typically the first lever that meaningfully raises GPU utilization.


Layer 5: Autoscaling GPU Node Pools

Right-sizing individual pods addresses utilization at the workload level. Autoscaling addresses it at the node level — scaling capacity to match actual demand rather than holding idle nodes in reserve.

The Cluster Autoscaler supports GPU node groups but is slow by design. Adding a new node typically takes 3 to 5 minutes: the CA detects unschedulable pods, requests a new node from the cloud provider, waits for it to join the cluster, and allows time for the GPU driver and device plugin to initialize. For latency-sensitive inference workloads, this cold-start cost is often unacceptable.

Karpenter (AWS) is faster and more GPU-aware. It performs bin packing at provisioning time — analyzing pending pod GPU requests and selecting instance types that minimize wasted capacity. Combined with spot-first provisioning, Karpenter can reduce GPU node costs by 60–70% for workloads tolerant of spot interruption, primarily batch training. For a detailed comparison of Kubernetes autoscaling options in ML environments, the FAQ covers the common decision points.
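As an illustration of spot-first GPU provisioning, a NodePool along these lines lets Karpenter choose among NVIDIA instance types and prefer spot capacity when it is available. The sketch assumes the Karpenter v1 API on AWS and an existing EC2NodeClass named `default`; the pool name, GPU limit, and consolidation delay are arbitrary example values.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
    name: gpu-batch                        # hypothetical name
spec:
    template:
        spec:
            nodeClassRef:
                group: karpenter.k8s.aws
                kind: EC2NodeClass
                name: default              # assumed to exist already
            requirements:
                # Allowing both lets Karpenter favor spot and fall back to on-demand
                - key: karpenter.sh/capacity-type
                  operator: In
                  values: ["spot", "on-demand"]
                - key: karpenter.k8s.aws/instance-gpu-manufacturer
                  operator: In
                  values: ["nvidia"]
            taints:
                # Keeps non-GPU pods off these nodes; GPU pods need a matching toleration
                - key: nvidia.com/gpu
                  effect: NoSchedule
    limits:
        nvidia.com/gpu: 16                 # example cap on total GPUs in this pool
    disruption:
        consolidationPolicy: WhenEmptyOrUnderutilized
        consolidateAfter: 5m
```

The GPU limit acts as a budget guardrail: once the pool holds that many GPUs, further pending pods stay unschedulable instead of triggering more nodes.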

KEDA (Kubernetes Event-Driven Autoscaling) handles the inference autoscaling case that Cluster Autoscaler and Karpenter do not: scaling GPU pods based on queue depth rather than CPU or memory. An inference service consuming jobs from an SQS queue or Redis stream can scale its GPU pod count proportionally to queue length, keeping individual GPU utilization high by distributing load across exactly as many pods as the backlog justifies — no more, no less.
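A hedged sketch of the SQS case is shown below. The Deployment name, queue URL, account ID, region, and TriggerAuthentication name are all placeholders, and the target of five messages per replica is an arbitrary starting point rather than a recommendation.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
    name: inference-queue-scaler           # hypothetical name
spec:
    scaleTargetRef:
        name: inference-worker             # hypothetical Deployment whose pods each request a GPU
    minReplicaCount: 0                     # scale to zero when the queue is empty
    maxReplicaCount: 8
    cooldownPeriod: 300                    # seconds to wait before scaling back to zero
    triggers:
        - type: aws-sqs-queue
          authenticationRef:
              name: keda-aws-credentials   # hypothetical TriggerAuthentication
          metadata:
              queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/inference-jobs   # placeholder
              queueLength: "5"             # target messages per replica
              awsRegion: us-east-1
```

Since every replica holds a GPU, maxReplicaCount is effectively a GPU budget. New replicas still wait on node provisioning if the pool has no free devices.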


Clanker Cloud as the Management Layer

Individual tools handle individual layers. The operational problem in 2026 is not a shortage of GPU monitoring tools — it is the absence of a unified interface to query across all of them simultaneously.

Clanker Cloud is a local-first AI workspace for infrastructure that connects to GKE, EKS, AKS, and bare-metal Kubernetes clusters. Once connected, the full GPU toolchain becomes queryable in natural language:

clanker ask "show GPU utilization across all nodes for the last 7 days"
clanker ask "which namespaces are over-provisioning GPU memory relative to actual usage"
clanker ask "show me all nodes with DCGM alerts firing right now"

These queries reach into Prometheus metrics, Kubernetes resource specs, and node labels simultaneously — returning a single answer instead of requiring separate context-switching across Grafana dashboards, kubectl commands, and cost reports.

For write operations, Maker mode plans and previews changes before execution:

clanker ask --maker "update GPU time-slicing config to 8 replicas on inference nodes"

The --maker flag generates a plan. The --apply flag executes it. Destructive operations require --destroyer, making the intent explicit in the command and auditable in logs.

For teams using offline or air-gapped environments, BYOK support means Clanker Cloud can run entirely on local models. Gemma 4 via Ollama (gemma4:31b, gemma4:26b, or gemma4:e4b) runs locally without external API calls. For teams preferring hosted frontier models, Claude Code and Codex are available via MCP. No credentials leave the machine regardless of model choice — the local-first architecture is consistent across all BYOK configurations.

Clanker Cloud is also relevant for teams working on vibe coding to production workflows where infrastructure changes need to be agent-driven rather than manually scripted.


Deep Research: Full GPU Infrastructure Audit

Single-question queries address known problems. The Deep Research feature in Clanker Cloud addresses unknown ones.

clanker ask "run a deep scan of my GPU infrastructure — idle nodes, over-provisioned pods, scheduling failures, driver version mismatches"

This triggers an agent swarm that runs simultaneously across all connected providers. For a Kubernetes cluster with GPU workloads, it checks: DCGM metric anomalies, node-level GPU allocations vs. actual utilization, pending pods with unsatisfied GPU requests, nvidia-device-plugin DaemonSet health, NFD/GFD label consistency, and per-namespace cost attribution. The result is a single structured report covering all findings across all providers — GKE, EKS, AKS, and bare-metal — without requiring separate dashboards for each.

Everything runs on the user's machine. No credentials are transmitted to external services. The architecture is described in detail at /for-ai-agents.md for teams evaluating agent-driven infrastructure tooling.

For teams managing multi-cloud GPU fleets, this eliminates the most time-consuming part of GPU cost optimization: manually correlating data from five or six separate monitoring systems to find which clusters have the most recoverable waste. See the Clanker Cloud documentation for configuration details.


Tool Comparison

| Tool | Layer | What It Solves | Setup Effort | GKE | EKS | AKS |
| --- | --- | --- | --- | --- | --- | --- |
| NVIDIA Device Plugin | Allocation | Exposes GPUs as K8s resources; time-slicing and MIG support | Low (Helm chart) | Native | Native | Native |
| DCGM Exporter | Monitoring | Per-GPU utilization, memory, and health metrics for Prometheus | Medium (DaemonSet + scrape config) | Pre-integrated | Manual | Manual |
| NFD + GFD | Visibility | Node labeling for GPU model, CUDA version, driver version | Low (Helm chart) | Manual | Manual | Manual |
| Kubecost | Chargeback | GPU cost allocation by namespace, team, and label | Medium (agent deployment) | Supported | Supported | Supported |
| Karpenter | Autoscaling | GPU-aware bin packing and spot provisioning | Medium (IAM + Helm) | Limited | Native | No |
| Clanker Cloud | Management layer | Unified query interface across all layers and providers | Low (CLI + connect) | Supported | Supported | Supported |

FAQ

How do you monitor GPU utilization in Kubernetes?

Deploy DCGM Exporter as a DaemonSet alongside Prometheus. DCGM Exporter surfaces per-GPU metrics including DCGM_FI_DEV_GPU_UTIL (compute utilization), DCGM_FI_DEV_MEM_COPY_UTIL (memory transfer activity), and DCGM_FI_DEV_FB_USED (VRAM consumption). Prometheus scrapes these metrics and makes them queryable in Grafana or via PromQL alert rules. GKE Managed Prometheus includes DCGM integration out of the box; EKS and AKS require manual setup.

What is the difference between GPU time-slicing and MIG in Kubernetes?

Time-slicing allows multiple pods to share a single physical GPU through temporal multiplexing — the GPU services one pod at a time in rapid succession. There is no memory isolation; all pods share the full VRAM. MIG (Multi-Instance GPU), available on A100 and H100 hardware, creates physically partitioned GPU instances with dedicated compute and dedicated memory. Use time-slicing for lightweight inference workloads where memory sharing is acceptable. Use MIG for multi-tenant training environments where workload isolation is a hard requirement.

How do you right-size GPU requests in Kubernetes pods?

The standard process: collect DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_GPU_UTIL over a 7-day rolling window using Prometheus, calculate p95 utilization per workload, and update resources.requests and resources.limits in pod specs to match p95 plus approximately 20% headroom. VPA does not support GPU resources natively, so this process remains manual or requires custom automation built on top of Prometheus query results.

How do you reduce GPU waste in Kubernetes clusters?

GPU waste reduction in Kubernetes has three levers. First, enable time-slicing or MIG partitioning on the NVIDIA Device Plugin to allow multiple workloads to share physical GPUs rather than holding whole devices idle. Second, set DCGM_FI_DEV_GPU_UTIL < 5% alert rules in Prometheus to catch idle GPU reservations within 30 minutes. Third, use Kubecost GPU cost allocation to surface which namespaces and teams are consuming GPU capacity relative to their actual utilization, and use show-back reports to drive right-sizing behavior across teams.


Get Started

The toolchain described in this article — Device Plugin, DCGM Exporter, NFD/GFD, Kubecost, Karpenter — handles individual layers. Connecting them into a single view is where most GPU cost optimization efforts stall.

Clanker Cloud installs in under two minutes and connects to existing Kubernetes clusters without changes to your monitoring stack:

brew tap clankercloud/tap && brew install clanker

View a live demo at /demo or create an account at clankercloud.ai/account. Beta access is currently free.

Next step

Ask Clanker Cloud what your cluster is doing

Install the local app, connect your kubeconfig, and turn cluster state, workload health, cost context, and safe next steps into one readable answer.
