Most teams running GPU workloads on Kubernetes deploy the NVIDIA device plugin, confirm that nvidia.com/gpu shows up as an allocatable resource, and call it done. That covers the minimum required to schedule GPU pods. It does not cover driver lifecycle management, metric collection, MIG profile automation, or inference throughput optimization — the four levers that actually move the cost-per-inference number.
NVIDIA built an answer to all four: the GPU Operator ecosystem, the DCGM exporter, and NIM (NVIDIA Inference Microservices) containers. Together they form a complete toolchain for NVIDIA Kubernetes cost optimization in 2026. This article walks through each component, shows the install commands, and builds a cost attribution model that connects raw GPU utilization percentages to actual dollar amounts per team.
Why Most Teams Stop Too Early
The NVIDIA device plugin exposes GPU capacity to the Kubernetes scheduler. That is its entire job. It does not manage driver updates across heterogeneous node pools, it does not export metrics to Prometheus, and it does not help you answer the question a platform engineering lead asks every month: "Which team is consuming the most GPU time, and what did it cost?"
Without DCGM metrics, that question gets answered with cost allocation tags at best and a spreadsheet at worst. Without NIM containers, inference throughput is whatever vanilla vLLM produces for a given model configuration — which leaves significant performance headroom unused, especially on H100 SXM5 hardware.
The GPU Operator installs and manages the full stack as a single Helm release. That is the starting point.
GPU Operator: One Helm Chart, Seven Components
The GPU Operator is a Kubernetes operator that manages the lifecycle of every NVIDIA software component a GPU node needs. Installing it replaces a manual process that previously required custom daemonsets, driver init containers, and ad-hoc toolkit configuration.
The operator manages seven components:
- NVIDIA driver daemonset — installs and upgrades GPU drivers on nodes without requiring a node image rebuild
- Container toolkit — configures the container runtime (containerd or CRI-O) to mount GPU devices into pods
- Device plugin — exposes nvidia.com/gpu as a Kubernetes allocatable resource
- DCGM exporter — collects GPU metrics and exposes them on a Prometheus endpoint
- MIG manager — applies MIG partition profiles to nodes via Kubernetes labels
- Node Feature Discovery (NFD) — detects hardware features and labels nodes
- GPU Feature Discovery (GFD) — adds GPU-specific labels (model, memory, CUDA version)
Install the operator with the following Helm commands:
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set dcgmExporter.enabled=true \
  --set mig.strategy=mixed

kubectl get pods -n gpu-operator
kubectl get nodes -o json | jq '.items[].status.allocatable | select(."nvidia.com/gpu" != null)'
```
The mig.strategy=mixed flag allows nodes in the same cluster to run different MIG profiles simultaneously — a prerequisite for shared inference clusters where different models have different memory footprint requirements.
Once the operator is running, every node with NVIDIA hardware automatically gets drivers, toolkit configuration, and metric collection without manual intervention. Driver upgrades become a Helm values update rather than a rolling node reimage.
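Once the component pods report Running, a quick end-to-end check is to schedule a pod that requests a GPU. This is a minimal smoke-test sketch; the pod name and CUDA base image are illustrative, and any CUDA-capable image works.

```bash
# Minimal GPU smoke test: request one GPU and print nvidia-smi output
# (pod name and image are illustrative)
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-check
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# After the pod completes, the logs should show the nvidia-smi device table
kubectl logs pod/gpu-smoke-test
```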
DCGM Exporter: From GPU Utilization to Dollar Cost Attribution
The Data Center GPU Manager (DCGM) exporter runs as a daemonset pod on every GPU node. It pulls metrics directly from the NVIDIA management library and exposes them at /metrics on port 9400, ready for Prometheus to scrape.
Key metrics for NVIDIA Kubernetes cost optimization:
| Metric | What It Measures |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization (%) |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization (%) |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (MB) |
| DCGM_FI_DEV_POWER_USAGE | Power draw (W) |
| DCGM_FI_DEV_SM_CLOCK | SM clock speed (MHz) |
Verify the exporter is running and serving metrics:
```bash
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400
curl localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```
These metrics include pod-level labels when the exporter is configured with Kubernetes decorators, which means each metric carries the namespace, pod name, and container name of the workload consuming the GPU. That label set is what makes chargeback possible.
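As a sketch of the aggregation that feeds chargeback, the query below averages GPU utilization by namespace over the last 24 hours. The Prometheus service name, the monitoring namespace, and the exact workload label depend on your monitoring stack and scrape configuration, so treat them as assumptions.

```bash
# Average GPU utilization per workload namespace over the last 24h
# (Prometheus service name and label names are assumptions; adjust to your setup)
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &
sleep 2

curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=avg by (namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[24h]))' \
  | jq '.data.result[] | {namespace: .metric.namespace, avg_util: .value[1]}'
```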
Cost Attribution Example: Multi-Team Shared GPU Cluster
Consider a shared inference cluster running on AWS g5.xlarge instances. Each g5.xlarge carries one A10G GPU and costs $1.006/hr on-demand.
DCGM reports the following average utilization over an 8-hour business day:
- Team A (recommendation model): 60% average DCGM_FI_DEV_GPU_UTIL
- Team B (text classification): 25% average
- Team C (image generation): 10% average
- Idle/overhead: 5%
GPU cost attribution per team per day:
- Team A: $1.006 × 0.60 × 8 = $4.83/day
- Team B: $1.006 × 0.25 × 8 = $2.01/day
- Team C: $1.006 × 0.10 × 8 = $0.80/day
Without DCGM metrics, this cluster shows up as a single $8.05/day line item in AWS Cost Explorer. With DCGM, each team sees its actual share. That visibility alone drives behavior change — teams that see their GPU cost per inference start right-sizing their resource requests and batching inference calls.
The same math scales to p4d.24xlarge instances (8× A100 40GB at $32.77/hr, approximately $4.10/hr per GPU) for larger model deployments.
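The attribution arithmetic is simple enough to script once per-team utilization averages have been pulled from Prometheus. A minimal sketch, with the utilization shares hard-coded as inputs rather than fetched:

```bash
# Per-team GPU cost attribution: hourly rate × utilization share × hours in the window
HOURLY_RATE=1.006   # on-demand cost of the GPU node
HOURS=8             # attribution window in hours

for entry in "team-a:0.60" "team-b:0.25" "team-c:0.10"; do
  team="${entry%%:*}"
  util="${entry##*:}"
  cost=$(echo "$HOURLY_RATE * $util * $HOURS" | bc -l)
  printf '%s: $%.2f/day\n' "$team" "$cost"
done
```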
NIM Containers: The Economics of Pre-Optimized Inference
NIM (NVIDIA Inference Microservices) containers package a model together with a TensorRT-LLM engine pre-compiled for a specific GPU architecture. The result is a container that achieves near-peak throughput on the target hardware without requiring the operator to build or tune the engine manually.
On H100 SXM5 hardware, Llama 3.3 70B NIM delivers approximately 2,400 tokens/sec. The same model on the same hardware with vanilla vLLM produces approximately 1,200 tokens/sec. The difference is the TensorRT-LLM backend — compiled for H100 tensor core utilization with quantization and pipelining configured by NVIDIA for that specific model-hardware pairing.
Deploy a Llama 3.3 70B NIM on Kubernetes:
```bash
helm repo add nim https://helm.ngc.nvidia.com/nim
helm install nim-llama33 nim/meta-llama3-3-70b-instruct \
  --namespace nim \
  --create-namespace \
  --set model.ngcAPIKey=$NGC_API_KEY \
  --set resources.limits."nvidia.com/gpu"=1

kubectl get pods -n nim
kubectl logs -n nim deployment/nim-llama33 -f
```
The NGC API key authenticates the container pull from the NVIDIA NGC registry. Once running, the NIM pod exposes an OpenAI-compatible /v1/completions endpoint, so existing application code requires no changes.
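A quick way to exercise the endpoint once the pod is ready is a port-forward plus a single completion request. The service name, port, and model identifier below are assumptions; check kubectl get svc -n nim and the model card for the exact values your deployment exposes.

```bash
# Send a test request to the OpenAI-compatible completions endpoint
# (service name, port, and model id are assumptions; verify against your deployment)
kubectl port-forward -n nim svc/nim-llama33 8000:8000 &
sleep 2

curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "meta/llama-3.3-70b-instruct",
        "prompt": "Summarize MIG partitioning in one sentence.",
        "max_tokens": 64
      }' | jq '.choices[0].text'
```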
NIM vs Vanilla vLLM: Throughput and Cost Per Token
The throughput gap between NIM and vanilla vLLM translates directly into cost per inference token. Using an illustrative per-GPU rate of $4.10/hr for an H100 (the $32.77/hr ÷ 8 GPUs figure from the attribution example above; substitute your actual H100 hourly rate):
NIM on H100:
```
Cost per token = $4.10/hr ÷ 2,400 tokens/sec ÷ 3,600 sec/hr
               ≈ $0.000000475/token
               ≈ $0.475 per million tokens
```
Vanilla vLLM on H100:
```
Cost per token = $4.10/hr ÷ 1,200 tokens/sec ÷ 3,600 sec/hr
               ≈ $0.00000095/token
               ≈ $0.95 per million tokens
```
For reference, GPT-4o API input pricing is approximately $0.0000025/token ($2.50 per million tokens). NIM self-hosted on H100 comes in at roughly one-fifth of that cost at scale, and at roughly half the per-token cost of vanilla vLLM on the same hardware.
The breakeven point for self-hosted inference versus API calls depends on utilization. At low request volumes, the API wins on operational simplicity. Above roughly 50 million tokens per day on a consistent basis, self-hosted NIM on H100 hardware becomes the lower-cost option — and at 100 million tokens per day, the cost delta becomes significant enough to justify the operational overhead of running the cluster.
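The breakeven arithmetic is easy to reproduce from the figures above. This sketch compares the fixed daily cost of one continuously running GPU at the illustrative $4.10/hr rate against the $2.50 per million token API price; it ignores operational overhead, which is why the practical threshold quoted above sits higher.

```bash
# Rough breakeven: daily cost of one always-on GPU vs API price per million tokens
GPU_HOURLY=4.10        # illustrative per-GPU rate from the example above
API_PER_MILLION=2.50   # API price per million tokens

daily_gpu_cost=$(echo "$GPU_HOURLY * 24" | bc -l)
breakeven_millions=$(echo "$daily_gpu_cost / $API_PER_MILLION" | bc -l)
printf 'Breakeven: ~%.0f million tokens/day per GPU (before operational overhead)\n' "$breakeven_millions"
```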
Teams already invested in GPU Kubernetes infrastructure for training should evaluate NIM for inference workloads before defaulting to managed API endpoints. The marginal cost of adding a NIM deployment to an existing cluster is low; the throughput and cost-per-token improvement is not marginal.
MIG Manager: Automated Profile Management via GPU Operator
The GPU Operator includes a MIG manager that applies Multi-Instance GPU profiles through Kubernetes node labels. This eliminates the need to SSH into nodes and run nvidia-smi mig commands directly.
Apply a MIG profile to a node:
```bash
kubectl label node <node-name> nvidia.com/mig.config=all-1g.10gb
```
The MIG manager detects the label change and applies the all-1g.10gb profile, which partitions an A100 80GB into seven 1g.10gb MIG slices. Each slice appears as an independent nvidia.com/mig-1g.10gb resource in the Kubernetes scheduler.
Verify active MIG instances:
```bash
kubectl get nodes -l 'nvidia.com/mig.config!=none' -o json | \
  jq '.items[].status.allocatable | with_entries(select(.key | startswith("nvidia.com/mig")))'
```
MIG profiles are particularly valuable in shared inference clusters where multiple models have different memory requirements. A 7B model in FP16 needs roughly 14GB; a 70B model in INT4 needs roughly 40GB. Running both on the same A100 80GB via MIG — a 3g.40gb slice for the 70B model and two 2g.20gb slices for 7B instances — improves physical GPU utilization compared to dedicating full GPUs to each workload.
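Requesting a MIG slice from a workload looks the same as requesting a full GPU, just with the slice-specific resource name that the mixed strategy exposes. The pod name and image in this sketch are illustrative.

```bash
# Pod requesting a single 1g.10gb MIG slice instead of a full GPU
# (name and image are illustrative)
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-demo
spec:
  restartPolicy: Never
  containers:
  - name: inference
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
EOF
```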
The GPU Operator's MIG manager automates the profile application through GitOps-compatible node label changes, which integrates cleanly with existing Kubernetes cluster management workflows. For teams who have read about vibe coding to production approaches, the label-based workflow means MIG configuration can be expressed declaratively alongside other cluster state.
Clanker Cloud: GPU Cost Visibility in Plain English
Configuring the GPU Operator and DCGM exporter produces a rich stream of per-GPU, per-pod metrics in Prometheus. The operational question is how quickly a platform engineer can answer cost and utilization questions without writing PromQL queries against that data.
Clanker Cloud connects to your Kubernetes cluster — EKS, GKE, or AKS — and surfaces infrastructure state in plain English. With DCGM metrics in the cluster, queries like the following return actionable answers without opening Grafana:
- "Show me GPU utilization across all nodes in the inference cluster"
- "Which namespaces consumed the most GPU time in the last 24 hours?"
- "What is the estimated cost for the NIM deployment in the nim namespace this week?"
- "Are any GPU nodes running below 20% utilization during business hours?"
The Deep Research feature fans out across all connected providers — AWS, GCP, and Kubernetes simultaneously — and returns severity-graded findings. For GPU cost audits, this means a single query can surface underutilized g5 instances on AWS alongside over-provisioned GKE node pools without context-switching between consoles.
For AI DevOps teams managing shared GPU clusters with multiple inference workloads, this matters at 2am when a cost alert fires and the on-call engineer needs to identify the responsible deployment without navigating through four separate observability dashboards.
Clanker Cloud supports BYOK model configuration. Use Gemma 4 via Ollama (gemma4:31b) for fully local analysis where GPU cluster data should not leave the machine. Claude Code (claude-opus-4-6) and Codex provide stronger reasoning for complex cost attribution queries across multi-cluster environments. Hermes (hermes3:70b via Ollama) works well for structured output generation when you need cost reports in a specific format. All model keys stay local — they never transit through Clanker Cloud servers. See the full documentation for BYOK configuration.
For teams moving AI workloads from prototype to production Kubernetes clusters, the vibe coding to production guide covers the deployment patterns that complement the GPU Operator setup described here.
FAQ
What does the NVIDIA GPU Operator actually install on a Kubernetes node?
The GPU Operator deploys seven components as daemonsets or managed pods: the NVIDIA driver (avoiding the need to bake drivers into node images), the container toolkit (configures containerd or CRI-O to expose GPU devices), the device plugin (registers nvidia.com/gpu allocatable resources), DCGM exporter (Prometheus metrics), MIG manager (automated MIG profile application via node labels), Node Feature Discovery, and GPU Feature Discovery. All seven are managed by a single Helm release and updated together.
How does DCGM exporter enable GPU cost attribution per team?
DCGM exporter exposes per-GPU metrics with Kubernetes pod and namespace labels attached. When scraped by Prometheus, each metric time series carries the identity of the pod consuming the GPU. Aggregating DCGM_FI_DEV_GPU_UTIL by namespace over a billing period, then multiplying by the hourly instance cost and hours consumed, produces a per-team GPU cost breakdown. This is the foundation of GPU chargeback in multi-tenant shared clusters.
What is the performance difference between NIM and vanilla vLLM for inference on Kubernetes?
On H100 SXM5 hardware, Llama 3.3 70B NIM delivers approximately 2,400 tokens/sec compared to approximately 1,200 tokens/sec with vanilla vLLM. The difference comes from NIM's pre-compiled TensorRT-LLM engine, which is optimized for the specific model-hardware combination. This roughly 2× throughput improvement halves the cost per inference token on the same hardware.
When should a team use MIG versus full GPU allocation for inference workloads on Kubernetes?
Use MIG when running multiple models simultaneously on the same physical GPU and those models have memory footprints significantly smaller than the full GPU memory. For example, partitioning an A100 80GB into a 3g.40gb slice (for a 70B INT4 model) and two 2g.20gb slices (for concurrent 7B model instances) improves GPU utilization compared to allocating full A100s to each workload separately. Use full GPU allocation for single large models where the workload can saturate the full memory bandwidth, or when the model does not fit cleanly into the available MIG slice sizes.
Next Steps
The GPU Operator, DCGM exporter, and NIM containers address different layers of the NVIDIA Kubernetes cost optimization stack. The Operator handles the operational complexity of managing GPU software on nodes. DCGM makes utilization measurable at the pod level. NIM improves the throughput ceiling before additional hardware is needed.
For teams managing multiple GPU node pools across AWS and GCP, the manual overhead of correlating DCGM metrics with cloud billing data across providers adds up. Clanker Cloud's Deep Research runs that correlation automatically — connecting Kubernetes metrics, node costs, and workload attribution into a single query result.
Start with the demo to see how GPU cluster queries work in practice, review common questions about Kubernetes and AI workload support, or read the for-agents documentation if you are integrating Clanker Cloud into an MCP-compatible agent workflow.
Create an account at clankercloud.ai/account to connect your first Kubernetes cluster. The GPU Operator setup documented here works with EKS, GKE, and AKS — connect whichever cluster is running your current GPU workloads and run the utilization query from there.
Need the product-level answer?
Use the DevOps page for the stable product answer on Kubernetes operations, cost context, and provider-aware investigation.
