NVIDIA Kubernetes Cost Optimization in 2025 and 2026 — Stop Wasting GPU Budget

Merged into the canonical NVIDIA GPU operations guide to keep one stable page for Kubernetes cost optimization coverage.

GPU compute is now the single largest line item for most ML and AI teams running on Kubernetes. Yet industry data consistently shows average GPU utilization sitting between 30 and 40 percent on Kubernetes clusters. The gap between what you allocate and what you actually use is not a GPU problem — it is a Kubernetes allocation problem, and it has a set of concrete, production-tested solutions in 2025 and 2026. This article walks through each one with real cost numbers, and shows how Clanker Cloud surfaces GPU waste automatically across every connected cluster.


The GPU cost problem is a Kubernetes allocation problem

An idle H100 on GKE costs roughly $30 per day. An idle GPU on AWS p5.48xlarge runs to approximately $4 per hour per GPU. When you multiply that across a team of ten ML engineers, each running their own training or inference workload, the idle cost compounds fast.

Three root causes drive most GPU waste on Kubernetes:

  1. Over-provisioning from SLA fear. Platform teams size GPU pods for worst-case traffic, not median usage. A model serving container requesting one full A100 might use 8% of its GPU capacity during off-peak hours.
  2. Batch training patterns. Training jobs consume GPU intensively during forward and backward passes but can sit near zero utilization while waiting on data loading, checkpoint writes, and other host-side work between steps. The GPU is allocated continuously; utilization is intermittent.
  3. Binary allocation by default. The NVIDIA device plugin, in its default configuration, assigns whole GPUs to pods. A pod requesting nvidia.com/gpu: 1 gets an entire physical GPU, whether it uses 3% or 95% of it.

The 2025 and 2026 shift worth knowing: NVIDIA time-slicing and MIG (Multi-Instance GPU) are now production-stable. Fractional GPU allocation is no longer experimental — it is the expected baseline for inference workloads and multi-tenant environments.


How Kubernetes bills you for GPU

Cloud providers charge by node, not by GPU utilization. When a pod requests nvidia.com/gpu: 1, that GPU is marked as allocated and blocked from scheduling any other workload — even if that pod is consuming 3% of its compute capacity.

The gap between "GPU requested" and "GPU used" is precisely where budget disappears.

The key metrics to watch:

  • DCGM_FI_DEV_GPU_UTIL — actual GPU utilization (0–100), exposed by NVIDIA DCGM and scraped by Prometheus
  • nvidia.com/gpu requests in pod specs — what Kubernetes has allocated

When DCGM_FI_DEV_GPU_UTIL is consistently below 20 percent for a pod that holds an entire GPU allocation, you are paying for hardware that is mostly parked. The five tactics below, plus the cost attribution practice that follows them, fix that.
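The comparison between requested and used GPU can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring tool: the `PodGpuUsage` record, the pod names, and the sample readings are hypothetical, and in practice the utilization samples would come from DCGM via Prometheus.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PodGpuUsage:
    pod: str                   # hypothetical pod name
    gpus_requested: int        # value of nvidia.com/gpu in the pod spec
    util_samples: list         # DCGM_FI_DEV_GPU_UTIL readings, 0-100


def underutilized(pods, threshold=20.0):
    """Return (pod, mean utilization, idle GPU-equivalents) for pods below threshold."""
    flagged = []
    for p in pods:
        if not p.util_samples:
            continue  # no data for this pod: skip rather than guess
        avg = mean(p.util_samples)
        if avg < threshold:
            idle = p.gpus_requested * (1 - avg / 100)
            flagged.append((p.pod, round(avg, 1), round(idle, 2)))
    return flagged


pods = [
    PodGpuUsage("serve-a", 1, [3, 5, 4, 8]),
    PodGpuUsage("train-b", 4, [92, 88, 95, 90]),
    PodGpuUsage("notebook-c", 1, [0, 0, 12, 0]),
]
report = underutilized(pods)
for row in report:
    print(row)
```

The 20 percent threshold mirrors the rule of thumb above; the "idle GPU-equivalents" column is simply the requested GPU count scaled by the unused fraction, which makes it easy to multiply by an hourly rate.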


Tactic 1: GPU time-slicing

Time-slicing is software-level GPU sharing: multiple pods share one physical GPU, each receiving a scheduled time slice. No memory isolation between tenants, but workloads that do not need full GPU capacity run concurrently on the same hardware.

Implementation requires the NVIDIA GPU Operator and a ConfigMap that sets the replicas count:

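apiVersion: v1
kind: ConfigMap
metadata:
    name: time-slicing-config
data:
    any: |-  # "any" is an arbitrary config name that the ClusterPolicy references
        version: v1
        sharing:
            timeSlicing:
                resources:
                    - name: nvidia.com/gpu
                      replicas: 4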

With replicas: 4, Kubernetes sees four nvidia.com/gpu resources on a node that has one physical GPU. Four separate pods can be scheduled against them simultaneously.

Cost math: A g5.xlarge on AWS (1× A10G) runs $1.006/hr on-demand. Split four ways with time-slicing, each inference workload effectively costs about $0.25/hr, compared to $1.006/hr if each claimed its own node.
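The per-workload figure only holds when every time-sliced replica is actually occupied. A small sketch of that arithmetic, using the $1.006/hr node price and four replicas from above (the function name and the `occupied` parameter are illustrative assumptions):

```python
def cost_per_workload(node_price_per_hour, replicas, occupied=None):
    """Hourly cost per workload on one time-sliced GPU node.

    `occupied` is how many of the advertised replicas have a pod on them;
    it defaults to all of them.
    """
    occupied = replicas if occupied is None else occupied
    if not 1 <= occupied <= replicas:
        raise ValueError("occupied must be between 1 and replicas")
    return node_price_per_hour / occupied


full = cost_per_workload(1.006, replicas=4)               # all four slots used
half = cost_per_workload(1.006, replicas=4, occupied=2)   # only two slots used
print(f"full: ${full:.4f}/hr  half: ${half:.3f}/hr")
```

With only two of four slots filled, the effective cost doubles, which is why time-slicing pays off most when paired with bin-packing.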

Use time-slicing for:

  • Inference serving where per-request GPU usage is low
  • Development and testing environments
  • Batch jobs with low concurrent GPU demand

Do not use time-slicing for:

  • Training jobs that saturate GPU memory — time-slicing does not partition memory, and VRAM contention causes OOM errors or slowdowns
  • Multi-tenant environments where you need hard isolation guarantees — use MIG instead

Tactic 2: MIG (Multi-Instance GPU)

MIG is hard partitioning. An A100 or H100 can be divided into up to seven GPU instances, each with its own dedicated memory partition and compute engines. Unlike time-slicing, MIG instances cannot contend for each other's resources at runtime.

Partition example: 1× A100 80GB → 7× 1g.10gb MIG slices — seven isolated workloads on one physical GPU.

Cost math: An AWS p4d.24xlarge provides 8× A100 40GB at $32.77/hr. With MIG enabled at the seven-way profile (1g.5gb on the 40GB card), that becomes 56 isolated GPU instances. Effective per-instance cost: ~$0.59/hr. Without MIG, each A100 must be claimed by a single pod, pushing per-workload cost to about $4.10/hr.
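Not every workload fits the smallest slice, so the per-instance cost depends on which profile a workload needs. The sketch below picks the smallest standard profile that satisfies a memory requirement and apportions node cost by compute slices. It assumes the standard A100/H100 layout of 7 compute slices and 8 memory slices per GPU; apportioning cost by compute slice is a modelling choice, not something MIG or a cloud bill does for you, and newer drivers expose additional profile variants not listed here.

```python
# (compute slices, memory slices) for the standard MIG profile shapes.
PROFILE_SHAPES = [(1, 1), (2, 2), (3, 4), (4, 4), (7, 8)]


def mig_profiles(gpu_memory_gb):
    """Profile names for a GPU, e.g. 80 -> 1g.10gb ... 7g.80gb."""
    per_slice = gpu_memory_gb // 8
    return [(f"{c}g.{m * per_slice}gb", c, m * per_slice) for c, m in PROFILE_SHAPES]


def smallest_profile(required_gb, gpu_memory_gb=80):
    """Smallest profile whose memory covers the workload's requirement."""
    for name, compute, mem in mig_profiles(gpu_memory_gb):
        if mem >= required_gb:
            return name, compute
    raise ValueError("workload does not fit on a single GPU")


def hourly_share(node_price, gpus_per_node, compute_slices):
    """Node cost attributed to one MIG instance, split by sevenths of a GPU."""
    return node_price / gpus_per_node * compute_slices / 7


name, slices = smallest_profile(8)            # an 8 GB inference model
print(name, round(hourly_share(32.77, 8, slices), 3))
name, slices = smallest_profile(18)           # an 18 GB model needs a bigger slice
print(name, round(hourly_share(32.77, 8, slices), 3))
```

A workload that needs a 2g slice costs twice the headline per-instance figure, so it is worth checking actual VRAM requirements before quoting a density number.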

Use MIG for:

  • Multi-tenant platforms where SLA isolation per workload is required
  • Environments where you cannot risk one tenant's workload impacting another's latency
  • A100 and H100 hardware — MIG is not available on older GPUs

Configuring MIG is managed through the NVIDIA GPU Operator and requires selecting the correct MIG strategy (single or mixed) in the operator's ClusterPolicy.


Tactic 3: Spot and preemptible GPU nodes for training

Training jobs with checkpointing are the natural fit for spot instances. They tolerate interruption as long as the model checkpoint is written to durable storage (S3, GCS) at regular intervals.

Spot savings on GPU nodes:

  • AWS p3.2xlarge (V100): ~$0.92/hr spot vs. ~$3.06/hr on-demand — 70% cheaper
  • GKE spot A100 nodes: 60–70% cheaper than on-demand equivalents

Karpenter NodePool pattern — target spot first, fall back to on-demand:

spec:
    template:
        spec:
            requirements:
                - key: karpenter.sh/capacity-type
                  operator: In
                  values: ["spot", "on-demand"]

Karpenter provisions spot-first and automatically falls back when spot capacity is unavailable.

Checkpoint strategy: Save model state every N training steps to object storage. On interruption, the job restarts from the last checkpoint. For large models, set N conservatively — losing 10 minutes of training is acceptable; losing 8 hours is not.
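The save-every-N-steps and resume-on-restart pattern looks roughly like this. It is a toy sketch: the "model" is a single number, the checkpoint is a local JSON file standing in for S3 or GCS, and `fail_at` is a hypothetical hook used only to simulate a spot interruption.

```python
import json
import os
import tempfile


def train(total_steps, checkpoint_every, path, fail_at=None):
    """Toy training loop that resumes from the checkpoint at `path` if one exists."""
    state = {"step": 0, "loss": 1.0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)          # resume from the last saved step
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated spot interruption")
        state["step"] += 1
        state["loss"] *= 0.99             # placeholder for a real optimizer update
        if state["step"] % checkpoint_every == 0:
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, path)         # atomic rename: never a half-written file
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(1000, 100, ckpt, fail_at=250)   # interrupted at step 250
except RuntimeError:
    pass
with open(ckpt) as f:
    resumed_from = json.load(f)["step"]
final = train(1000, 100, ckpt)            # restart picks up from the checkpoint
print(resumed_from, final["step"])        # prints: 200 1000
```

Here the interruption at step 250 costs 50 steps of rework, because the last checkpoint was written at step 200. Writing to a temporary file and renaming it guards against an interruption that lands in the middle of a checkpoint write.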

Real cost example: Fine-tuning Llama 3.3 70B for 48 hours on four 8× A100 nodes at $32.77/hr each costs approximately $6,300 on-demand (4 × $32.77 × 48). At a spot discount of about two-thirds, the same job comes to roughly $2,100, a $4,200 difference on a single training run.
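Spot estimates are more honest when they include the work redone after each interruption. The sketch below uses the p3.2xlarge rates listed earlier ($3.06/hr on-demand, $0.92/hr spot) and assumes each interruption loses, on average, half a checkpoint interval. The interruption count is a made-up input, and node restart time is ignored.

```python
def job_cost(hours, rate_per_hour, interruptions=0, checkpoint_interval_hours=0.0):
    """Cost of a training job, including rework after interruptions.

    Assumption: an interruption lands, on average, halfway between checkpoints.
    """
    rework_hours = interruptions * checkpoint_interval_hours / 2
    return (hours + rework_hours) * rate_per_hour


on_demand = job_cost(48, 3.06)
spot = job_cost(48, 0.92, interruptions=6, checkpoint_interval_hours=0.5)
savings = 1 - spot / on_demand
print(f"on-demand ${on_demand:.2f}, spot ${spot:.2f}, savings {savings:.0%}")
```

Even with six interruptions over 48 hours, the rework adds only 1.5 billable hours when checkpoints are 30 minutes apart, which is why a short checkpoint interval matters more than the interruption rate.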


Tactic 4: Scale GPU nodes to zero

GPU nodes cost money even when no workloads are scheduled on them. Scale-to-zero removes idle GPU nodes from your account entirely when no pods are requesting GPU resources.

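GKE: In Autopilot mode, GPU nodes are provisioned only while GPU pods are pending or running, with no node pools to manage. On Standard clusters, node auto-provisioning, or a GPU node pool with a minimum size of 0 under the cluster autoscaler, scales GPU capacity to zero when no GPU pods are scheduled.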

EKS with Karpenter: Enable consolidation in the NodePool:

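spec:
    disruption:
        # Karpenter v1 name; the v1beta1 API called this policy WhenUnderutilized
        consolidationPolicy: WhenEmptyOrUnderutilized
        consolidateAfter: 30s

This removes nodes that run only DaemonSet pods, including empty GPU nodes after training jobs complete, and repacks underutilized nodes when their pods fit elsewhere.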

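Savings example: A team that trains Monday through Friday, 9 to 5 (40 hours/week), needs GPU nodes for about 24% of the week (40 of 168 hours), so scale-to-zero can cut the bill by up to roughly 76% compared to always-on GPU nodes, before node startup and drain overhead. Two A100 nodes on GKE at $3.06/hr each cost about $4,470/month if always on (2 × $3.06 × 730 hours). At 40 hours/week (about 173 hours/month), the monthly cost drops to approximately $1,060, a savings of roughly $3,400/month from a single configuration change.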
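The scale-to-zero arithmetic is easy to rerun for your own schedule. This sketch uses only the inputs stated above (two nodes, $3.06/hr, 40 hours a week) and assumes a 730-hour month; it ignores node startup and drain time.

```python
HOURS_PER_MONTH = 730          # 8,760 hours per year / 12
WEEKS_PER_MONTH = 52 / 12


def monthly_cost(nodes, rate_per_hour, hours_per_week=None):
    """Monthly cost of `nodes` GPU nodes; always-on when hours_per_week is None."""
    if hours_per_week is None:
        hours = HOURS_PER_MONTH
    else:
        hours = hours_per_week * WEEKS_PER_MONTH
    return nodes * rate_per_hour * hours


always_on = monthly_cost(2, 3.06)
scaled = monthly_cost(2, 3.06, hours_per_week=40)
saved = always_on - scaled
print(f"always on ${always_on:,.0f}, scaled ${scaled:,.0f}, "
      f"saved ${saved:,.0f} ({saved / always_on:.0%})")
```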


Tactic 5: GPU bin-packing over spreading

The default Kubernetes scheduler uses a spreading strategy — it distributes pods across nodes to balance resource usage. For GPU workloads, this results in many GPU nodes with one or two pods each and low overall utilization. Each underloaded node still incurs full node cost.

Bin-packing schedules GPU pods tightly onto the fewest possible nodes, leaving other nodes empty so they can be scaled down.

Karpenter handles this natively through its consolidation logic. The WhenEmptyOrUnderutilized consolidation policy (named WhenUnderutilized before Karpenter v1) identifies nodes whose workloads can be moved to other nodes, evicts those pods, and removes the node from the cluster.

For custom scheduling requirements, the GPU-aware scheduler extender approach allows bin-packing logic that accounts for GPU memory capacity alongside CPU and RAM — useful when workloads have varied GPU memory requirements that the default resource accounting does not capture well.

Expected savings from bin-packing with consolidation: 15–30% reduction in GPU node count for clusters running mixed-size GPU workloads.
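The difference between spreading and bin-packing is easiest to see on a toy cluster. The sketch below is not how kube-scheduler or Karpenter are implemented: it models spreading as "place on the node with the most free GPUs" and bin-packing as first-fit decreasing, and counts GPUs only. The pod sizes and node shape are hypothetical.

```python
def spread(requests, gpus_per_node, node_count):
    """Place each pod on the node with the most free GPUs; return nodes in use."""
    free = [gpus_per_node] * node_count
    for r in requests:
        i = max(range(node_count), key=lambda n: free[n])
        if free[i] < r:
            raise ValueError("not enough capacity")
        free[i] -= r
    return sum(1 for f in free if f < gpus_per_node)


def bin_pack(requests, gpus_per_node):
    """First-fit decreasing: largest pods first, each on the first node that fits."""
    free = []
    for r in sorted(requests, reverse=True):
        if r > gpus_per_node:
            raise ValueError("pod larger than a node")
        for i, f in enumerate(free):
            if f >= r:
                free[i] -= r
                break
        else:
            free.append(gpus_per_node - r)   # open a new node
    return len(free)


requests = [1, 1, 2, 1, 4, 1, 2]             # GPUs per pod, 12 in total
print(spread(requests, 8, 4), bin_pack(requests, 8))   # prints: 4 2
```

On this input, spreading touches all four 8-GPU nodes while bin-packing fits the same 12 GPUs of demand onto two, leaving two nodes empty and eligible for scale-down.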


Finding GPU waste with Clanker Cloud

The manual approach to GPU cost analysis looks like this:

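# GPU capacity per node
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpu: .status.capacity["nvidia.com/gpu"]}'
# Pods with at least one container requesting a GPU (one row per pod)
kubectl get pods --all-namespaces -o json | jq '.items[] | select(any(.spec.containers[]; .resources.requests["nvidia.com/gpu"] != null)) | {namespace: .metadata.namespace, pod: .metadata.name}'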

Then cross-reference pod GPU requests against DCGM utilization metrics in Prometheus, map those to namespaces, identify underutilized nodes, and calculate potential savings. Across multiple clusters, this process takes hours and produces a snapshot that is stale before you finish writing the report.

Clanker Cloud is a local-first AI workspace for infrastructure — a desktop application that connects to your live Kubernetes clusters and cloud accounts, keeps credentials on your machine, and lets you query your actual infrastructure in plain English.

For GPU cost analysis, the workflow looks different:

  • "What are my most underutilized GPU nodes across all clusters right now?"
  • "Show me pods that have requested GPU resources but used less than 10% GPU utilization in the last 24 hours"
  • "What would I save this month if I switched my training jobs to spot GPU nodes?"
  • "Which namespaces are over-provisioning GPU requests relative to actual usage?"

Each query runs against live infrastructure data — not synthetic estimates or static configuration files. Clanker Cloud pulls from DCGM metrics, node capacity APIs, and cloud billing data simultaneously and returns an answer grounded in what is actually happening in your cluster.

For a more thorough audit, Deep Research fans out across every connected provider, runs parallel analysis with multiple AI subagents, and returns prioritized findings — idle GPU nodes, pods with near-zero utilization holding full GPU allocations, missing time-slicing configuration on inference nodes, and spot savings opportunities — all exported as structured Markdown or JSON.

If you are also managing non-GPU infrastructure concerns — deployment pipelines, Helm releases, incident response — the same workspace applies. See AI DevOps for Teams and Vibe Coding to Production for how platform teams are using Clanker Cloud beyond cost.

For teams embedding Clanker Cloud into automated pipelines and agent workflows, the for-ai-agents.md page covers the MCP integration.


GPU cost attribution — who is spending what

Before you can enforce GPU budget discipline, you need visibility into which teams and projects are responsible for which spend.

The standard approach:

  • Use Kubernetes namespaces as team/project boundaries
  • Label pods consistently: team: ml-research, project: llm-finetune
  • Use Kubecost or OpenCost to generate GPU cost breakdowns by namespace and label

In Clanker Cloud: "Show me GPU cost breakdown by namespace for the last 30 days across all my clusters" returns a structured view without requiring a separate FinOps tool.

The practical impact of show-back (visibility without enforcement) is measurable: teams that see their GPU spend attributed to their namespace and project typically reduce waste by 20–30% before any policy changes are implemented. Awareness alone changes behavior.

Chargeback — actually billing teams for their GPU usage — drives further optimization but requires organizational process changes beyond tooling. Start with show-back.
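A show-back report is, at its core, a group-by over GPU-hours. The sketch below is a simplified illustration rather than what Kubecost or OpenCost do internally: the usage records are invented sample data, the flat $4.10 per GPU-hour rate is borrowed from the MIG cost math above, and real tools also account for idle capacity and shared costs.

```python
from collections import defaultdict


def gpu_cost_by(records, key, rate_per_gpu_hour):
    """Sum GPU cost per value of `key` (e.g. namespace or team), largest first."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["gpus"] * r["hours"] * rate_per_gpu_hour
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical usage records, one per workload.
records = [
    {"namespace": "ml-research", "team": "ml-research", "project": "llm-finetune", "gpus": 4, "hours": 48},
    {"namespace": "serving", "team": "platform", "project": "chat-api", "gpus": 1, "hours": 720},
    {"namespace": "ml-research", "team": "ml-research", "project": "eval", "gpus": 1, "hours": 10},
]

by_namespace = gpu_cost_by(records, "namespace", 4.10)
for ns, cost in by_namespace.items():
    print(f"{ns:12s} ${cost:,.2f}")
```

Grouping by `team` or `project` instead of `namespace` is a one-argument change, which is the practical reason to label pods consistently.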


Priority order for NVIDIA GPU cost optimization

Priority | Tactic                         | Typical savings         | Effort
1        | Spot/preemptible for training  | 60–70% on training      | Medium
2        | Scale GPU nodes to zero        | 50–65% for batch teams  | Low
3        | Time-slicing for inference     | 50–75% per GPU          | Low
4        | MIG for multi-tenant           | Isolation + density     | Medium
5        | Bin-packing with Karpenter     | 15–30% node reduction   | Medium
6        | Cost attribution + show-back   | 20–30% waste reduction  | Low

Start with spot for training and scale-to-zero — both require minimal code changes and deliver immediate savings. Time-slicing for inference workloads is a low-effort ConfigMap change that can cut per-inference GPU cost by 50–75%. MIG and bin-packing require more configuration but matter significantly in multi-tenant or high-density environments.

Cost attribution is the ongoing foundation. You cannot optimize what you cannot see.


FAQ

How do I reduce GPU costs on Kubernetes?

The highest-impact steps in order: switch training jobs to spot/preemptible GPU nodes (60–70% savings), configure scale-to-zero for GPU node pools so idle nodes are removed (50–65% savings for batch teams), and enable time-slicing on inference nodes so multiple pods share one GPU (50–75% per-GPU cost reduction). Use Karpenter with consolidationPolicy: WhenEmptyOrUnderutilized (WhenUnderutilized before Karpenter v1) to handle the scheduling and node lifecycle automatically.

What is GPU time-slicing in Kubernetes and how does it save money?

GPU time-slicing, configured via the NVIDIA GPU Operator, allows multiple Kubernetes pods to share a single physical GPU by advertising it as N schedulable GPU resources. A node with one A10G can appear to Kubernetes as having four nvidia.com/gpu resources, scheduling four inference pods instead of one. This reduces per-pod GPU cost proportionally: one g5.xlarge at $1.006/hr serves four workloads at an effective $0.25/hr each. Time-slicing does not partition GPU memory, so it is suited for inference workloads with modest VRAM requirements, not training jobs.

Should I use spot GPU instances for Kubernetes training jobs?

Yes, if your training jobs implement checkpointing. Spot GPU instances on AWS are 60–70% cheaper than on-demand for comparable hardware (e.g., p3.2xlarge at ~$0.92/hr spot vs. ~$3.06/hr on-demand). Saving model state to S3 or GCS every N steps means a spot interruption results in at most N steps of lost work. Karpenter's spot-first NodePool configuration with on-demand fallback handles provisioning automatically. For long training runs, spot savings typically outweigh the occasional restart cost significantly.

How do I find idle GPU nodes in Kubernetes?

The core metric is DCGM_FI_DEV_GPU_UTIL from NVIDIA DCGM, scraped into Prometheus. A Prometheus query like avg_over_time(DCGM_FI_DEV_GPU_UTIL[24h]) < 10 surfaces GPUs that have been near-idle over the last 24 hours. Cross-referencing with kube_pod_container_resource_requests{resource="nvidia_com_gpu"} shows which pods hold those GPU allocations. Alternatively, Clanker Cloud does this across all connected clusters in a single query without Prometheus query construction — ask "Which GPU nodes have had less than 10% utilization in the last 24 hours?" and get structured results against live data.
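If you want to script the Prometheus side, the instant-query endpoint (`/api/v1/query`) returns JSON that is straightforward to parse. The sketch below builds the query URL and parses a response body. The Prometheus address and the sample payload are made up, and the `Hostname`, `gpu`, and `pod` label names are assumptions about how your DCGM exporter is configured, so check your own series before relying on them.

```python
import json
from urllib.parse import urlencode

QUERY = "avg_over_time(DCGM_FI_DEV_GPU_UTIL[24h]) < 10"


def query_url(base_url, query=QUERY):
    """URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def idle_gpus(response_text):
    """Parse an instant-query response into (host, gpu, pod, utilization) rows."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise RuntimeError(body.get("error", "query failed"))
    rows = []
    for series in body["data"]["result"]:
        labels = series["metric"]
        _timestamp, value = series["value"]   # value arrives as a string
        rows.append((labels.get("Hostname", "?"), labels.get("gpu", "?"),
                     labels.get("pod", ""), float(value)))
    return sorted(rows, key=lambda r: r[3])


# Hypothetical response body, shaped like a real instant-vector result.
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"Hostname": "gpu-node-1", "gpu": "0", "pod": "serve-a"},
         "value": [1700000000, "4.2"]},
        {"metric": {"Hostname": "gpu-node-2", "gpu": "3"},
         "value": [1700000000, "0"]},
    ]},
})
url = query_url("http://prometheus:9090/")
rows = idle_gpus(sample)
for row in rows:
    print(row)
```

A GPU with an empty `pod` field has no workload attached at all, which usually points to a node that should have been scaled down rather than a pod that should be resized.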


Get started

Request a demo to see Clanker Cloud query live GPU utilization and cost attribution across your clusters.

Create an account — setup takes under a minute, and credentials stay on your machine.

Have questions about GPU monitoring setup or MIG configuration? The FAQ and docs cover common Kubernetes GPU operator configurations.
