AI Agent Infrastructure Context: Closing the Gap Between Training Data and Live State

AI agents fail at infrastructure tasks without live context. Learn the five context types agents need and how Clanker Cloud's MCP workspace provides them.

An AI agent asked to write a deployment script knows Kubernetes syntax, understands Helm charts, and can generate valid YAML from memory. What it does not know is anything real about your infrastructure right now.

It does not know that your prod cluster has 6 nodes, 3 of which are at 85% memory utilization. It does not know that checkout-api currently has 3 replicas responding at 22ms p95, while billing-worker has 4 replicas averaging 3% CPU. It does not know that orders-postgres is at 2,100 queries per second and approaching a connection limit, or that session-cache is DEGRADED because redis is under hot key pressure.

So the agent writes a script that deploys 2 more billing-worker pods. It's well-formed, it follows best practices, and it will make your infrastructure measurably worse. The idle service gets more capacity; the actual bottleneck gets none.

This is the context gap: the difference between what an agent was trained on — general infrastructure knowledge, documentation, conventions — and the current state of your actual running systems. The context gap is not a model quality problem. It is an information problem. And it is the root cause of most AI automation failures on real infrastructure.


The Five Types of Infrastructure Context Agents Need

Agents operating on infrastructure without live context are not hallucinating. They are reasoning correctly from incomplete data. The solution is not a smarter agent — it is giving the agent the information it needs to reason about your specific environment.

There are five distinct types of context that matter:

1. Topology Context

What services exist in this environment. What talks to what. Which services are upstream of which. Which are synchronous dependencies versus async consumers.

Without topology context, an agent that decides to restart orders-postgres to clear a connection backlog does not know that three services (checkout-api, orders-api, and billing-worker) are actively querying it, or that checkout-api depends on it synchronously and will immediately start returning 500s.
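A minimal sketch of that check, assuming the topology has already been fetched into a plain dictionary. The `Dependent` type, the `TOPOLOGY` snapshot, and both helper functions are hypothetical names for illustration; the sync/async labels mirror the sample answer shown later in this article, not real output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependent:
    service: str
    mode: str  # "sync" or "async"


# Hypothetical topology snapshot: datastore name -> services that query it.
TOPOLOGY = {
    "orders-postgres": [
        Dependent("checkout-api", "sync"),
        Dependent("orders-api", "sync"),
        Dependent("billing-worker", "async"),
    ],
}


def sync_dependents(topology, target):
    """Return the services that call `target` synchronously."""
    return [d.service for d in topology.get(target, []) if d.mode == "sync"]


def restart_is_safe(topology, target):
    """Treat a restart as safe only if nothing depends on the target synchronously."""
    return not sync_dependents(topology, target)


blocked_by = sync_dependents(TOPOLOGY, "orders-postgres")
print(blocked_by)
print(restart_is_safe(TOPOLOGY, "orders-postgres"))
```

The point of the sketch is the shape of the decision: the restart is gated on a fact the agent can only get from the live environment.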

2. State Context

What is healthy right now. What is degraded. What changed in the last hour. State context is not static documentation — it is a live snapshot that goes stale in minutes.

An agent checking whether it is safe to scale down a pod needs to know whether that pod's health checks are passing, whether there are active connections draining, and whether anything upstream has flagged it as degraded.
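The three conditions above can be sketched as a single pre-check. The `PodState` fields and the pod name are assumptions made for illustration; a real snapshot would come from whatever the live query returns.

```python
from dataclasses import dataclass


@dataclass
class PodState:
    name: str
    health_checks_passing: bool
    draining_connections: int
    flagged_degraded_upstream: bool


def scale_down_blockers(pod):
    """List every reason the pod should not be removed right now."""
    blockers = []
    if not pod.health_checks_passing:
        blockers.append("health checks failing (investigate before removing)")
    if pod.draining_connections > 0:
        blockers.append(f"{pod.draining_connections} connections still draining")
    if pod.flagged_degraded_upstream:
        blockers.append("flagged as degraded by an upstream service")
    return blockers


# Hypothetical snapshot values.
pod = PodState("billing-worker-3", True, 12, False)
print(scale_down_blockers(pod))
```

An empty list means none of the three conditions blocks the scale-down. Because state goes stale quickly, the snapshot would need to be fetched immediately before the check.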

3. Cost Context

What resources are running and what each one costs per month. Which are idle versus under load. Where the billing concentration is.

Without cost context, a scaling decision that looks operationally neutral might be financially significant in the wrong direction. An agent recommending scale-down for billing-worker needs to know it costs $31/month running 4 replicas at 3% CPU — and that scaling to 2 replicas saves real money with no service risk.
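As a worked example of the arithmetic, the sketch below estimates the saving under the simplifying assumption that cost scales linearly with replica count. The function name is hypothetical, and real billing may not be linear.

```python
def monthly_saving(current_cost, current_replicas, target_replicas):
    """Estimate the monthly saving, assuming cost scales linearly with replicas."""
    if current_replicas <= 0 or target_replicas < 0:
        raise ValueError("current_replicas must be > 0 and target_replicas >= 0")
    per_replica = current_cost / current_replicas
    return round(per_replica * (current_replicas - target_replicas), 2)


# billing-worker: $31/month across 4 replicas, scaled down to 2.
saving = monthly_saving(31.0, 4, 2)
print(saving)  # 15.5
```

Under that assumption, halving the replica count saves roughly $15 per month.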

4. Config Context

Current resource specs, limits, replica counts, and instance types — not the values in a Terraform file from three months ago, but what is actually deployed and running right now.

An agent writing a Terraform module for the production cluster that defaults to t3.large because that is the most common instance type in its training data will generate an incorrect configuration if prod is running g5.2xlarge nodes. The gap between the training-data default and the live config is where the bug lives.
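One way to make that gap visible is to diff the declared values against the live ones before generating anything. This is a minimal sketch; `config_drift` and the dictionary keys are hypothetical, and both inputs are assumed to be flat key-value maps.

```python
def config_drift(declared, live):
    """Return {key: (declared_value, live_value)} for every key that differs."""
    keys = sorted(set(declared) | set(live))
    return {
        key: (declared.get(key), live.get(key))
        for key in keys
        if declared.get(key) != live.get(key)
    }


# Hypothetical inputs: a training-data default versus what is actually deployed.
declared = {"instance_type": "t3.large", "node_count": 6}
live = {"instance_type": "g5.2xlarge", "node_count": 6}

drift = config_drift(declared, live)
print(drift)
```

Any non-empty result is a signal to regenerate the module from the live values rather than the default.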

5. History Context

What changed in the last N hours or days. Recent deployments, recent incidents, recent config modifications. History context lets an agent correlate cause and effect rather than treating every anomaly as unexplained.

An incident response agent that can see "a new version of checkout-api was deployed 14 minutes before the latency spike began" can generate an accurate root cause analysis. Without history context, it can only describe the current symptom.
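The correlation step can be sketched as a time-window filter over recent changes. The function name, the 30-minute default window, the calendar date, and the unrelated billing-worker entry are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def changes_before(anomaly_start, changes, window=timedelta(minutes=30)):
    """Return changes that landed within `window` before the anomaly, newest first."""
    candidates = [
        (ts, description)
        for ts, description in changes
        if timedelta(0) <= anomaly_start - ts <= window
    ]
    return sorted(candidates, reverse=True)


utc = timezone.utc
# Hypothetical change log; only the 14-minute gap comes from the example above.
changes = [
    (datetime(2026, 1, 15, 9, 5, tzinfo=utc), "billing-worker config change"),
    (datetime(2026, 1, 15, 14, 32, tzinfo=utc), "checkout-api deployed"),
]
anomaly_start = datetime(2026, 1, 15, 14, 46, tzinfo=utc)

suspects = changes_before(anomaly_start, changes)
print([description for _, description in suspects])
```

A change inside the window is a candidate cause rather than a proven one, so the agent's analysis should still present it as a correlation.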


How Clanker Cloud Provides Each Context Type via MCP

Clanker Cloud runs as a local MCP server that any compatible agent can register against. The primary MCP tool — clanker_route_question — accepts natural language queries and routes them to the appropriate provider APIs, returning structured live data.

Start the MCP server with:

clanker mcp --transport http --listen 127.0.0.1:39393

For agents using stdio transport:

clanker mcp --transport stdio
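For custom agents, a tool invocation in MCP is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below only builds that message and does not send it. The argument key `question` is an assumption, not a documented parameter name, and a real client would also need to complete the MCP initialization handshake first.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = build_tool_call(
    1,
    "clanker_route_question",
    # Assumption: "question" is a guessed argument key; check the tool's schema.
    {"question": "what services talk to orders-postgres?"},
)
print(json.dumps(request, indent=2))
```

In practice, an MCP client library handles this framing, and the tool's actual input schema can be discovered through the protocol's `tools/list` method.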

Here is how each context type maps to a concrete MCP interaction:

Topology Context via MCP

Agent query:    clanker_route_question("what services talk to orders-postgres?")
Answer:         "checkout-api (3 pods, 22ms p95, synchronous),
                 orders-api (4 pods, gRPC, synchronous),
                 billing-worker (4 replicas, async queue consumer)"

Agent decision with context:   "I should not restart orders-postgres during checkout-api
                                 peak hours. Recommend off-hours maintenance window."
Agent decision without context: "Restart orders-postgres to clear the connection pool."
                                 [will cause immediate checkout-api failures]

State Context via MCP

Agent query:    clanker_route_question("what services are currently degraded in production?")
Answer:         "session-cache DEGRADED — redis under hot key pressure.
                 All other services healthy. checkout-api latency elevated
                 (22ms p95, up from 9ms baseline) as a downstream effect."

Agent decision with context:   "Root cause is redis, not checkout-api.
                                 Recommend investigating hot key distribution."
Agent decision without context: "checkout-api appears slow. Recommend scaling up."
                                 [treats symptom, not cause]

Cost Context via MCP

Agent query:    clanker_route_question("what is the monthly cost and CPU utilization
                                        of billing-worker?")
Answer:         "billing-worker: 4 replicas, $31/mo, CPU avg 3% over 30 days,
                 memory avg 12%. Flagged HIGH by Deep Research:
                 scale down or enable HPA, save ~$15/mo."

Agent decision with context:   "billing-worker is the correct scale-down target.
                                 Recommend reducing to 2 replicas or enabling HPA
                                 with minReplicas: 1."
Agent decision without context: "billing-worker appears to have headroom.
                                 Scale up checkout-api instead."
                                 [misses the cost optimization signal entirely]

Config Context via MCP

Agent query:    clanker_route_question("what instance type is the production
                                        cluster running?")
Answer:         "prod-cluster: 6 nodes, g5.2xlarge, us-east-1.
                 3 nodes at 85%+ memory utilization."

Agent decision with context:   Generates Terraform with correct instance type,
                                flags memory pressure as a follow-up concern.
Agent decision without context: Generates Terraform defaulting to t3.large.
                                 Config drift from the first apply.

History Context via MCP

Agent query:    clanker_route_question("what changed in production in the
                                        last two hours?")
Answer:         "checkout-api v2.4.1 deployed at 14:32 UTC.
                 orders-postgres connection count increased from 180 to 340
                 starting at 14:34 UTC. No other deployments in window."

Agent decision with context:   Correlates deployment with connection spike.
                                Generates RCA: v2.4.1 likely introduced a
                                connection pool regression.
Agent decision without context: "orders-postgres is at elevated connections.
                                  Recommend scaling database tier."
                                 [treats symptom, misses the deployment cause]

Agent Patterns That Use Context Well

For practical guidance on using AI agents with live infrastructure data, these four patterns show what context-aware agent behavior looks like across different tools.

Claude Code Mid-Session Config Generation

A developer working in Claude Code is writing a Terraform module for a new service that will deploy into the existing production cluster. Before generating the module, Claude Code calls clanker_route_question("what instance type and node count is prod-cluster using?") and gets back g5.2xlarge, 6 nodes, 3 at 85% memory. It generates the correctly-sized compute spec and adds a note that the cluster is memory-constrained — a new deployment should include explicit memory limits to avoid contributing to the pressure.

Without live context, Claude Code would generate a module with a t3.large default or whatever the documentation example showed. The operator would catch it during terraform plan — or not until the first apply.

OpenClaw HEARTBEAT.md Health Monitoring

OpenClaw's HEARTBEAT.md pattern runs an autonomous task checklist every 30 minutes. Registered against the Clanker Cloud MCP workspace (openclaw mcp set clanker-cloud --url http://127.0.0.1:39393), the HEARTBEAT task queries current service health before deciding whether to alert.

Without live context, the agent has no fresh state to compare against, only stale assumptions about what the environment looked like. It cannot distinguish a genuine new degradation from a state it already knew about. With live context, it queries clanker_route_question("show me all degraded services in production") and only fires an alert if the returned state differs from the previous check. This eliminates false alerts from stale baseline assumptions.
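The alert-on-change rule can be sketched as a comparison between two consecutive checks. Both function names are hypothetical, and the inputs are assumed to be lists of degraded service names already extracted from the query result.

```python
def should_alert(previous, current):
    """Alert only when the set of degraded services changed between checks."""
    return set(previous) != set(current)


def new_degradations(previous, current):
    """Return services degraded now that were not degraded at the previous check."""
    return sorted(set(current) - set(previous))


# Hypothetical results from two consecutive heartbeat runs.
previous_check = ["session-cache"]
current_check = ["session-cache", "orders-postgres"]

print(should_alert(previous_check, current_check))
print(new_degradations(previous_check, current_check))
```

Comparing sets rather than lists keeps the check insensitive to the order in which services are returned.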

Hermes Incident Response Correlation

Hermes (hermes3:70b via Ollama, MIT license) is well-suited for agentic tool use workflows. In an incident response pattern, Hermes queries both history context ("what changed in the last hour?") and state context ("what is currently degraded?") via MCP, then correlates the two.

The query clanker_route_question("show me deployments in the last two hours and current pod restarts") returns both the deployment timestamp and the pod restart timeline. Hermes can generate an accurate root cause analysis — "checkout-api v2.4.1 deployment at 14:32 correlates with orders-postgres connection spike at 14:34" — and file the incident ticket before a human has opened the first dashboard.

Codex Scaling Script Generation

Codex, asked to write a horizontal pod autoscaler configuration for checkout-api, queries clanker_route_question("what is the current HPA ceiling for checkout-api?") before generating any YAML. If the current configuration has maxReplicas: 10 and the cluster is already at 85% memory on 3 of 6 nodes, Codex writes a configuration that respects that ceiling and flags the memory constraint rather than setting an unconstrained scale target that could trigger node-level OOM events.
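A minimal sketch of that constraint logic follows. The function name, the requested value of 20, and the memory figures for the three unpressured nodes are assumptions for illustration.

```python
def constrained_max_replicas(requested, current_ceiling, node_memory_pct, threshold=85):
    """Clamp a requested maxReplicas to the existing ceiling and collect warnings."""
    warnings = []
    max_replicas = min(requested, current_ceiling)
    if max_replicas < requested:
        warnings.append(
            f"requested {requested} exceeds current ceiling {current_ceiling}"
        )
    pressured = sum(1 for pct in node_memory_pct if pct >= threshold)
    if pressured:
        warnings.append(
            f"{pressured} of {len(node_memory_pct)} nodes at or above {threshold}% memory"
        )
    return max_replicas, warnings


# Hypothetical inputs: 3 of 6 nodes at 85% memory, existing ceiling of 10.
max_replicas, warnings = constrained_max_replicas(20, 10, [85, 85, 85, 40, 40, 40])
print(max_replicas)
for warning in warnings:
    print("WARNING:", warning)
```

The warnings are returned alongside the value so that they can be surfaced in the plan the operator reviews.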


The Context → Decision → Action Pipeline

Context does not eliminate the need for human oversight — it makes human oversight meaningful. The pipeline in Clanker Cloud is:

Agent: query live context via MCP
            ↓
Agent: makes decision based on actual current state
            ↓
Agent: generates a plan (explicitly states what it intends to do and why)
            ↓
Clanker Cloud: presents plan for operator review
            ↓
Operator: approves
            ↓
Maker Mode: executes

The plan step is where live context makes the biggest difference. An agent with context generates a plan that references real data: "Scale billing-worker from 4 to 2 replicas. Current CPU avg: 3% over 30 days. Current cost: $31/month. Estimated saving: ~$15/month. Risk: LOW — no downstream synchronous consumers." An agent without context generates: "Scale billing-worker to match predicted load." The operator reviewing the second plan has no basis for judgment. The first plan is reviewable.
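The difference between the two plans can be sketched as a data structure with an approval gate. The `Plan` class, its fields, and the `execute` function are hypothetical and do not represent Clanker Cloud's actual plan format or Maker Mode interface.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    action: str
    evidence: list = field(default_factory=list)
    risk: str = "UNKNOWN"
    approved: bool = False

    def is_reviewable(self):
        """A plan is reviewable only if it cites evidence and states a risk level."""
        return bool(self.evidence) and self.risk != "UNKNOWN"


def execute(plan, runner):
    """Run the action only for a reviewable plan that an operator has approved."""
    if not plan.is_reviewable():
        raise ValueError("plan lacks evidence or a risk level")
    if not plan.approved:
        raise PermissionError("operator approval required")
    return runner(plan.action)


plan = Plan(
    action="scale billing-worker from 4 to 2 replicas",
    evidence=["CPU avg 3% over 30 days", "cost $31/month"],
    risk="LOW",
)
try:
    execute(plan, runner=lambda action: f"executed: {action}")
except PermissionError as err:
    print(err)  # blocked until the operator approves

plan.approved = True
result = execute(plan, runner=lambda action: f"executed: {action}")
print(result)
```

A plan such as "Scale billing-worker to match predicted load" carries no evidence, so in this sketch it would be rejected before it ever reaches the approval step.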

This is the pattern covered in depth in the vibe coding to production guide: AI-assisted development moves fast, and the review-before-apply gate is what keeps fast development from creating production incidents.

For teams that want to understand the full organizational workflow, AI DevOps for teams covers how different roles interact with the same context layer.


Without Context vs. With Context

| Scenario | Without live context | With live context (via MCP) |
| --- | --- | --- |
| Writing a Terraform compute config | Defaults to t3.large (training data convention) | Queries: prod uses g5.2xlarge. Generates correct spec. |
| Scaling decision for billing-worker | Scales up, relying on stale utilization data | Queries: 3% CPU avg, $31/mo, 4 replicas. Recommends scale DOWN. |
| Restarting a database to clear connections | Restarts immediately as standard remediation | Queries: 3 services depend on this DB with active traffic, 2 of them synchronously. Recommends off-hours. |
| Writing an HPA config | Sets unconstrained maxReplicas | Queries: current cluster is 85% memory on 3 nodes. Writes constrained config with warning. |
| Incident RCA | Describes current symptoms only | Correlates deployment timestamp with anomaly start. Names the cause. |

Setting Up the MCP Workspace for Agents

The Clanker Cloud MCP workspace is a local surface — credentials stay on your machine, the MCP server runs at 127.0.0.1:39393, and agents register against it without gaining direct access to cloud credentials.

Install the CLI:

brew tap clankercloud/tap && brew install clanker

Start the MCP server:

clanker mcp --transport http --listen 127.0.0.1:39393

Register OpenClaw:

openclaw mcp set clanker-cloud --url http://127.0.0.1:39393

For Claude Code and Codex, add to your MCP config:

{
    "mcpServers": {
        "clanker-cloud": {
            "url": "http://127.0.0.1:39393"
        }
    }
}

The three MCP tools available to registered agents are:

  • clanker_version — returns the current workspace version and connected providers
  • clanker_route_question — routes a natural language question to the appropriate provider APIs and returns live data
  • clanker_run_command — executes a command against the infrastructure (requires --maker flag and operator approval)

Full setup documentation is at docs.clankercloud.ai. You can see the workspace in action at the live demo.


BYOK Context: Local Models for Continuous Context Queries

A practical concern with MCP-driven agents is query volume. An agent that queries live infra context before every decision will make dozens of calls per session. At frontier model prices, this adds up.

The answer is model routing by task type. Routine context queries, such as "what is the current replica count for billing-worker?" or "which services are healthy?", do not require frontier reasoning capability. They require fast, accurate retrieval of live data. These queries run efficiently on local models with no per-token API cost.

Gemma 4 via Ollama (gemma4:27b or gemma4:e4b for faster inference) handles routine context queries well and runs entirely on local hardware. Hermes via Ollama (hermes3:70b or hermes3:8b, MIT license) is particularly strong at structured tool-use patterns — the kind of reasoning an agent needs to correctly interpret MCP responses and form next queries.

Reserve Claude Opus 4.6 or GPT-5.4 Thinking for the decisions that require it: complex root cause analysis, cross-provider cost investigations, or multi-step incident correlation. The Deep Research feature — which fans out across all connected providers simultaneously — benefits from the deeper reasoning these models provide.

This model-routing approach is practical with BYOK: you configure each model key directly in Clanker Cloud, and your AI costs go to the providers at their published rates with no markup. Routine context queries on local models incur no API charges; complex analysis costs what the analysis is worth.
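Routing by task type can be sketched as a small lookup. The task labels and the two tier names are hypothetical; how a tier maps to a concrete model depends on your own configuration.

```python
# Hypothetical task labels for the work this article reserves for frontier models.
COMPLEX_TASKS = {
    "root_cause_analysis",
    "cost_investigation",
    "incident_correlation",
}


def pick_tier(task_type):
    """Route complex analysis to a frontier model and everything else to a local one."""
    return "frontier" if task_type in COMPLEX_TASKS else "local"


for task in ("replica_count_lookup", "health_summary", "root_cause_analysis"):
    print(task, "->", pick_tier(task))
```

Defaulting unknown task types to the local tier keeps costs low; a stricter setup might default to the frontier tier instead when accuracy matters more than cost.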


FAQ

What is AI agent infrastructure context and why does it matter?

AI agent infrastructure context is live, structured data about the current state of your infrastructure — service topology, health status, resource costs, configuration values, and recent change history. It matters because AI agents trained on documentation and general infrastructure knowledge have no information about your specific running environment. Without live context, agents make decisions based on defaults and conventions that may not match your actual deployment. With live context, agents can reason accurately about your specific situation and generate plans that reflect current state.

How does MCP provide live infrastructure context to AI agents?

The Model Context Protocol (MCP) is an open protocol that lets AI agents call external tools and data sources in a structured way. Clanker Cloud runs as a local MCP server that agents register against. When an agent calls clanker_route_question with a natural language query, the server routes that query to the appropriate cloud provider APIs — Kubernetes, AWS, GCP, and others — and returns live data. The agent gets real current state without having direct access to cloud credentials, which remain on the local machine.

Which AI agents support infrastructure context via MCP?

Any MCP-compatible agent can connect to the Clanker Cloud MCP workspace. Tested integrations include OpenClaw (68,000+ GitHub stars), Claude Code, Codex, and Hermes. Custom agents written in Python or Node.js can also connect. The MCP server supports both HTTP (--transport http) and stdio (--transport stdio) transports to match different agent architectures.

What is the context gap and how does it cause infrastructure automation failures?

The context gap is the difference between what an AI agent knows from training — documentation, conventions, general patterns — and the current state of a specific live infrastructure environment. Most AI automation failures on real infrastructure are context gap failures: an agent writes a correct script for a generic environment that is wrong for yours. Common examples include wrong instance types, scale recommendations that ignore current utilization, restarts of services that have active dependent traffic, and config changes that conflict with existing limits. Closing the context gap with live MCP queries eliminates this class of failure.


Start With Live Context

The agents and models available in 2026 are capable of genuine infrastructure reasoning — but only when they can reason about your actual environment, not a generic one. The context gap is not a future problem to solve after agents improve further. It is present in every agent interaction that lacks live data, and it produces confident, well-formed output that makes your infrastructure worse.

Clanker Cloud provides the MCP workspace that closes the gap: five context types, live from your connected providers, queryable in natural language, with operator review before any action executes.

Create a free account to connect your providers and start the MCP server. Review the FAQ for common setup questions, or read the "For Agents" page for detailed integration guides.

The docs cover every provider integration, MCP transport option, and agent configuration pattern in detail.

Next step

Give your agent live infrastructure context

Download Clanker Cloud, expose the local MCP surface, and let coding agents work from current cloud, Kubernetes, GitHub, and cost state instead of guesses.
