The case for self-hosting language models is straightforward on paper: no per-token billing, data never leaves your network, and you can fine-tune or swap models freely. The reality is more complicated. GPU hardware, inference engineering, serving infrastructure, and ongoing maintenance are all real costs — and at low token volumes, they easily exceed what you would pay OpenAI or Anthropic.
This article works through the math. The crossover point exists and is calculable. Many engineering teams are on the wrong side of it: either overpaying for APIs they should replace with internal infrastructure, or spending $6,000/month on GPUs to run a workload that would cost $300/month via API.
What Hyperscaler API Costs Actually Include
The posted per-token rate is only the starting point. As of 2026, OpenAI GPT-5 is priced at approximately $2.50 per million input tokens and $10.00 per million output tokens. That asymmetry matters: if your workload generates long outputs (code generation, document drafting, multi-step reasoning), the effective cost skews sharply toward the output rate.
Several line items compound the headline rate:
Input/output asymmetry. Output tokens typically cost 4–10x more than input tokens, depending on the model and provider. At the 4x GPT-5 ratio above, a workload that generates 30 output tokens per 100 input tokens pays a blended per-token rate roughly 70% higher than the input rate alone (the sketch after this list works the arithmetic).
Context window costs. A call using 100K tokens of context is billed for all 100K tokens, not just the new tokens added. Applications that pass long document contexts or maintain large conversation histories accumulate input costs quickly.
Burst pricing. High-tier rate limits (which enterprise traffic spikes require) come with higher pricing tiers at some providers. The baseline rate assumes smooth, predictable volume.
Compliance and data egress. Sending proprietary business data — code, contracts, customer records — to a third-party API endpoint is a compliance event. For regulated industries, this may require DPA agreements, geographic data residency controls, or is simply prohibited. That constraint removes the API option entirely, regardless of price.
Vendor dependency. GPT-4 was deprecated. Pricing changed multiple times. SLA guarantees on inference latency are soft. Teams that built hard dependencies on specific model versions have paid in migration time, not dollars, but it is still a real cost.
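To make the asymmetry concrete, here is a minimal sketch of the blended-rate arithmetic, using the GPT-5 rates quoted above and the 30:100 output ratio from the first line item; both figures are this section's assumptions, not universal constants.

```python
# Blended per-token rate with asymmetric pricing. Rates are the GPT-5
# figures quoted above; token volumes are in millions of tokens.
INPUT_RATE = 2.50    # $ per 1M input tokens
OUTPUT_RATE = 10.00  # $ per 1M output tokens

def blended_rate(input_m: float, output_m: float) -> float:
    """Effective $ per 1M tokens across everything billed."""
    cost = input_m * INPUT_RATE + output_m * OUTPUT_RATE
    return cost / (input_m + output_m)

rate = blended_rate(100, 30)  # 30M output tokens per 100M input tokens
print(f"blended rate: ${rate:.2f} per 1M tokens")               # ~$4.23
print(f"premium over input rate: {rate / INPUT_RATE - 1:.0%}")  # ~69%
```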
What Internal Model Costs Actually Include
Self-hosting is not just renting a server and running ollama pull. A production inference deployment has layers:
GPU hardware. An NVIDIA H100 SXM5 runs $30,000–$35,000 new; an A100 80GB is $10,000–$15,000 on the used market. Cloud GPU rental on AWS or GCP costs $3–$4/hour for H100 instances, adding up to roughly $2,200–$2,900/month for continuous single-GPU operation. Bare-metal providers like Hetzner offer 2x RTX 3090 configurations (AX102) at approximately €189/month — the most cost-efficient option for models in the 7B–31B parameter range with quantization.
Inference optimization. Raw model execution via Hugging Face's transformers library is not production throughput. Teams running serious inference workloads use vLLM, Text Generation Inference (TGI), or llama.cpp. Setting up, tuning, and maintaining these systems (continuous batching, KV cache sizing, tensor parallelism across multiple GPUs) requires engineering time; a minimal vLLM sketch follows this list. A reasonable estimate for initial setup is 1–2 weeks of senior engineering, and ongoing maintenance runs roughly 0.1 FTE (approximately $1,500–$2,000/month at a $180K fully-loaded salary).
Serving infrastructure. Production inference requires Kubernetes or equivalent orchestration, load balancing, health checks, autoscaling policies, and observability tooling. If this infrastructure doesn't already exist, it needs to be built and maintained.
Model selection and fine-tuning. Llama 3.3 70B, Gemma 4 31B (gemma4:31b), and Mistral Large 2 are capable models, but they need evaluation for your specific task distribution. Fine-tuning adds GPU time and data engineering cost.
Ongoing maintenance. CUDA driver updates, NCCL compatibility, model version updates, security patches to the serving stack — these are recurring costs that do not show up in a single month's bill but accumulate over a year.
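To give a sense of the serving layer those items describe, here is a minimal vLLM sketch using its offline Python API. The model name, parallelism, and memory settings are illustrative assumptions rather than a tuned configuration; production deployments more often run vLLM's OpenAI-compatible HTTP server with the same knobs.

```python
# Minimal vLLM example: tensor parallelism across 2 GPUs with a bounded
# context length to keep the KV cache within VRAM. Settings are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumes the weights are accessible
    tensor_parallel_size=2,        # shard the model across two GPUs
    gpu_memory_utilization=0.90,   # fraction of VRAM for weights + KV cache
    max_model_len=8192,            # cap context length to bound KV cache size
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Summarize this incident report: ..."], params)
print(outputs[0].outputs[0].text)
```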
The Crossover Calculation: Three Worked Examples
Scenario A: 50M tokens/month (small enterprise, internal tooling)
API cost estimate:
- 50M input tokens × $2.50/1M = $125
- 15M output tokens × $10/1M = $150
- Total: ~$275/month
Internal cost estimate:
- 1x Hetzner AX102 (2x RTX 3090): €189/month (~$205)
- Engineering overhead (0.1 FTE): ~$1,500/month
- Total: ~$1,705/month
Verdict: API wins. At this scale, internal infrastructure costs roughly 6x more than the API once engineering time is included. Comparing hardware alone ($205 vs. $275) makes self-hosting look close to break-even, but that comparison ignores the labor overhead, which dominates at this volume. Teams at this scale who choose self-hosting are optimizing for data residency or control, not cost.
Scenario B: 500M tokens/month (mid-market, customer-facing AI feature)
API cost estimate:
- 500M input tokens × $2.50/1M = $1,250
- 150M output tokens × $10/1M = $1,500
- Total: ~$2,750/month
Internal cost estimate:
- 1x Hetzner AX102 (2x RTX 3090) with vLLM running Gemma 4 31B (gemma4:31b): €189/month (~$205)
- Engineering setup amortized over 12 months (at this token volume the server runs at meaningful utilization): ~$400/month
- Total: ~$605/month
Verdict: Internal wins. Roughly 4–5x cheaper at this volume. The key assumption is that Gemma 4 31B meets the quality threshold for the task. If it does, the math is decisive.
Scenario C: 5B tokens/month (large enterprise, multiple AI features)
API cost estimate:
- 5B input tokens × $2.50/1M = $12,500
- 1.5B output tokens × $10/1M = $15,000
- Total: ~$27,500/month
Internal cost estimate:
- 2x H100 bare metal (Hetzner or colocation): ~$3,500–$5,000/month
- Inference engineering (0.25 FTE at the $180K fully-loaded rate above): ~$3,750/month
- Observability, infra tooling: ~$500/month
- Total: ~$7,750–$9,250/month
Verdict: Internal wins decisively. Roughly 3–3.5x cheaper, a saving of $18,000–$20,000/month, or about $220,000–$235,000/year. At this volume, dedicated inference infrastructure run by a competent team is the default business decision, not an optimization.
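The three scenarios reduce to one comparison: API spend scales linearly with token volume, while a self-hosted stack is a roughly fixed monthly cost. A short sketch of the break-even calculation, using this article's rates and cost estimates:

```python
# Break-even sketch: API spend grows linearly with volume; self-hosting is
# approximately a fixed monthly cost. All figures are this article's estimates.
INPUT_RATE, OUTPUT_RATE = 2.50, 10.00   # $ per 1M tokens (GPT-5 rates above)

def api_cost(input_m: float, output_ratio: float = 0.3) -> float:
    """Monthly API spend for input_m million input tokens."""
    return input_m * (INPUT_RATE + output_ratio * OUTPUT_RATE)

def breakeven_volume(fixed_monthly: float, output_ratio: float = 0.3) -> float:
    """Million input tokens/month at which API spend equals self-hosting cost."""
    return fixed_monthly / (INPUT_RATE + output_ratio * OUTPUT_RATE)

print(f"Scenario A stack (~$1,705/mo): {breakeven_volume(1705):.0f}M")  # ~310M
print(f"Scenario C stack (~$9,250/mo): {breakeven_volume(9250):.0f}M")  # ~1,680M
```

At the 30% output ratio used throughout, the Scenario A stack breaks even around 310M input tokens/month, which is exactly why the 100M–500M band in the decision framework below is the ambiguous one.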
Open-Weight Model Performance in 2026
The cost math only works if the self-hosted model is good enough. In 2026, the gap between frontier API models and open-weight alternatives has narrowed substantially.
Gemma 4 31B (gemma4:31b): The 31B variant of Google's Gemma 4 family runs on 2x RTX 3090 (24GB VRAM each) with 4-bit quantization at usable throughput. Performance on coding, instruction following, and structured output tasks is within 10–15% of GPT-4o on most benchmarks. For internal tooling, document analysis, and code review workflows, the gap is often imperceptible in practice. Lighter variants (gemma4:26b, gemma4:e4b) trade capability for throughput and run on single-GPU hardware.
Llama 3.3 70B: Meta's 70B model requires at least an A100 80GB or 2x RTX 4090 for reasonable inference throughput. It is the strongest general-purpose open-weight option for enterprises that need broad coverage across task types. Expect 20–40 tokens/second on A100 with vLLM.
Mistral Large 2: Competitive on instruction following and multilingual tasks. Available as both a self-hosted model and via Mistral's own API, which offers a middle path between full self-hosting and hyperscaler pricing.
Where frontier APIs still lead: Tasks requiring 200K+ token context windows, cutting-edge multimodal reasoning, and the latest GPT-5 or Claude Sonnet capabilities have not been replicated at open-weight scale. If your use case depends on reasoning over massive document sets or requires the absolute latest model capability, the API case remains strong regardless of cost.
The Hidden Costs of Getting Internal Inference Wrong
The scenarios above assume competent infrastructure execution. Many teams do not have that, and the math inverts quickly.
Under-provisioning is the most common failure mode. A 70B model running on hardware with insufficient VRAM causes out-of-memory kills under load. Inference queue depth grows. Latency SLAs break. The engineering team spends days debugging KV cache eviction policies instead of shipping product. The cost of that engineering time exceeds months of API spend.
Over-provisioning is equally expensive in the other direction. At the $3–$4/hour cloud rates cited above, an idle H100 costs roughly $72–$96/day, or $2,200–$2,900/month per unused GPU. A cluster sized for peak traffic that runs at 20% average utilization is wasting most of its budget.
No observability is the silent cost multiplier. Not knowing which services are calling the model, with what token counts, and at what latency makes every optimization guesswork. Teams that skip inference observability often discover months later that a single runaway process was responsible for 40% of GPU utilization, or that their model was returning 3,000-token outputs when the application only used the first 100 tokens.
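As a sketch of what minimal inference observability can look like, the wrapper below attributes token counts to the calling service on every request. The in-memory dict is a stand-in for a real metrics pipeline, and the client works the same whether it points at a hyperscaler API or a self-hosted OpenAI-compatible endpoint.

```python
# Per-caller token accounting: enough to answer "which service is burning
# tokens?" The dict is a placeholder for a proper metrics backend.
from collections import defaultdict
from openai import OpenAI

client = OpenAI()  # or OpenAI(base_url=...) for a self-hosted endpoint
usage = defaultdict(lambda: {"input": 0, "output": 0})

def tracked_completion(service: str, model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    # The API reports per-call token counts; attribute them to the caller.
    usage[service]["input"] += resp.usage.prompt_tokens
    usage[service]["output"] += resp.usage.completion_tokens
    return resp.choices[0].message.content
```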
These failure modes are why the crossover point for well-run inference infrastructure (Scenarios B and C above) looks different from the crossover point for average internal deployments. Getting internal inference right requires investment in tooling and operational discipline that is not reflected in the raw hardware cost.
Clanker Cloud and BYOK: Making the Math Practical
If you have already decided to run internal models, Clanker Cloud's BYOK architecture lets you route AI workspace queries through your own models at zero additional token cost.
The workflow is direct: install Ollama, pull Gemma 4, and configure Clanker Cloud to use it as the agent model.
ollama pull gemma4:31b
Once configured, Clanker Cloud's infrastructure queries — checking cloud costs, analyzing Kubernetes cluster state, scanning for misconfigurations — run against your local Gemma 4 instance. No tokens leave your machine. No API calls to OpenAI or Anthropic unless you explicitly choose them.
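A quick way to confirm the local model is actually serving is to query Ollama's OpenAI-compatible endpoint directly. This sketch uses the official openai Python client pointed at Ollama's default local port; the prompt is illustrative.

```python
# Sanity-check the local Gemma 4 instance via Ollama's OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's default local endpoint
    api_key="ollama",                      # required by the client, ignored by Ollama
)

resp = client.chat.completions.create(
    model="gemma4:31b",
    messages=[{"role": "user", "content": "List three common Kubernetes misconfigurations."}],
)
print(resp.choices[0].message.content)
```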
For teams using Claude Code or Codex as the agent reasoning layer via MCP, the architecture is different: the agent layer makes API calls only for reasoning steps, while infrastructure data access happens through Clanker Cloud's local tooling. If you want to reduce that cost further, Hermes (also available as a BYOK option) provides strong function-calling capability with lower per-call overhead. See the Clanker Cloud documentation for BYOK configuration details.
This is the practical case for the 500M token/month tier: use cheap local Gemma 4 for structured infrastructure queries and routine analysis, reserve API calls for tasks that genuinely require frontier capability. The blended cost drops significantly compared to routing everything through a hyperscaler API.
For teams building AI into their development-to-production pipeline, the same BYOK model applies: local inference for the high-volume, lower-complexity steps; API models for the tasks where the quality difference is measurable.
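A minimal sketch of that split, assuming a local Ollama endpoint for routine queries and an OpenAI key for frontier calls. The routing flag and model names are illustrative; in practice the routing decision would come from task type rather than a hand-set boolean.

```python
# Tiered routing: local model for high-volume routine work, frontier API
# only when the task needs it. Model names are illustrative assumptions.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
frontier = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(prompt: str, needs_frontier: bool = False) -> str:
    client, model = (frontier, "gpt-5") if needs_frontier else (local, "gemma4:31b")
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

answer("Summarize yesterday's cost anomalies")   # local: no per-token bill
answer("Draft a cross-region migration plan", needs_frontier=True)  # paid API
```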
Deep Research: Analyzing Your Own AI Infrastructure Spend
One of Clanker Cloud's more immediately useful features for this decision is the deep research capability. You can run:
clanker ask "analyze my AI infrastructure spend — what am I paying for GPU, what am I paying for API calls, and where is the waste"
The agent swarm connects to all configured providers simultaneously — AWS, GCP, Azure, Kubernetes, Hetzner, Cloudflare, DigitalOcean — and surfaces GPU utilization, idle instance costs, and API call patterns from logs where accessible. It runs entirely on your machine; your cloud provider credentials and your Anthropic or OpenAI API keys never leave the device.
The output is a single report showing where your AI spend is concentrated and where the optimization opportunities are. For teams that have not audited their inference infrastructure recently, this report frequently surfaces idle GPUs, over-sized instances, or API call patterns that would justify the migration to self-hosted inference.
Visit /faq for more on how the deep research feature accesses and processes provider data.
Decision Framework
The crossover is not at a single number — it depends on your workload characteristics, engineering capacity, and operational maturity. But as a starting point:
| Monthly token volume | Recommendation |
|---|---|
| Under 100M | Use hyperscaler API. Internal infrastructure costs more once engineering time is included. |
| 100M–500M | Analyze your specific workload. Quality match, output ratio, and engineering capacity all affect the answer. |
| Over 500M | Self-hosted almost certainly wins. 3–5x cost savings are achievable with competent infrastructure. |
| Any volume with strict data residency | Self-hosted required regardless of cost comparison. |
| Cutting-edge reasoning or 200K+ context | Frontier API models. No open-weight equivalent at comparable quality. |
The 100M–500M band is genuinely ambiguous. A team with existing Kubernetes infrastructure, strong SRE capacity, and a workload that maps well to Gemma 4 31B quality should seriously evaluate self-hosting in this range. A team without that foundation should not — the hidden costs will flip the math.
FAQ
At what scale does self-hosting an LLM become cheaper than OpenAI API?
At approximately 500M tokens/month, self-hosted infrastructure on bare-metal hardware (such as Hetzner AX102 at ~€189/month) becomes meaningfully cheaper than OpenAI GPT-5 API pricing. Below that threshold, engineering overhead and infrastructure cost typically exceed API spend. Above 1B tokens/month, the savings are substantial enough that self-hosting is the default recommendation for most workloads.
What is the cheapest way to run Llama or Gemma 4 at enterprise scale?
Bare-metal rental from providers like Hetzner, combined with vLLM for serving and 4-bit quantized model weights, gives the best price-to-throughput ratio. For Gemma 4 31B (gemma4:31b), a 2x RTX 3090 configuration (Hetzner AX102) handles moderate enterprise throughput at under €200/month. For Llama 3.3 70B, an A100 80GB or 2x RTX 4090 is the minimum practical configuration. Cloud GPU rental is more expensive per token at consistent utilization but avoids upfront commitment and simplifies autoscaling.
What are the hidden costs of running internal AI models?
The three most significant hidden costs are: (1) engineering time for inference optimization and ongoing maintenance, often 0.1–0.25 FTE for a production deployment; (2) over-provisioned hardware that sits idle during off-peak hours; and (3) lack of observability tooling, which makes it impossible to identify wasteful call patterns or runaway processes. These costs are not reflected in hardware rental rates and frequently flip the cost comparison at lower token volumes.
How does Clanker Cloud's BYOK feature work for enterprise AI?
Clanker Cloud supports BYOK (bring your own key/model) at two levels. For local model inference, you configure an Ollama endpoint (running Gemma 4 via ollama pull gemma4:31b, for example) and Clanker Cloud routes agent queries through it — no external API calls, no tokens leaving your network. For API models, you supply your own OpenAI, Anthropic, or other API keys; they are stored locally and never transmitted to Clanker Cloud's servers. Both options are available starting from the free beta tier. Full configuration documentation is at docs.clankercloud.ai.
Next Steps
If you want to audit your current AI infrastructure spend before making an infrastructure decision, the deep research feature is the fastest starting point. If you are evaluating Clanker Cloud's BYOK capabilities in a team environment, the AI DevOps for teams overview covers the multi-user configuration and provider connection setup.
Request a demo to see the infrastructure cost analysis in a live environment, or create a free account and connect your first provider in under five minutes.
Run the cost check against your own infrastructure
Download the desktop app, keep credentials local, and ask Clanker Cloud to connect spend, topology, and recent changes across the providers you already use.
