Review Before Apply: The Infrastructure Automation Philosophy That Prevents Incidents

Review-before-apply infrastructure automation shows exactly what will change before execution. The Terraform plan/apply model, applied to AI-driven infra.

The Incident Pattern Nobody Talks About

An AI agent identifies a pod as idle. CPU average over 30 days: 3.1%. Memory usage: under 12%. The agent scales the deployment from four replicas to two. Cost drops by $62 per month. The agent moves on.

Three weeks later, on the last day of the month, invoice processing fails. The billing team files a P1 incident. The on-call engineer traces it back: billing-worker is a batch job. It processes end-of-month invoices in a burst window that appears only once a month. For the rest of the month it genuinely sits idle. The AI agent saw exactly what it was trained to see, low average CPU, and acted on it.
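To see how a 30-day average can hide a month-end burst, here is a small sketch. The daily figures are made up for illustration (29 quiet days and one burst day, chosen so the mean matches the 3.1% in the story), and the 5% idle threshold is an assumption, not a value taken from any tool.

```python
from statistics import mean

# Hypothetical daily CPU averages (%) over a 30-day window:
# 29 quiet days at 1% and a single month-end burst day at 64%.
daily_cpu = [1.0] * 29 + [64.0]

window_mean = mean(daily_cpu)
window_peak = max(daily_cpu)


def looks_idle_by_mean(samples, threshold=5.0):
    """Naive check: only the window average is compared to the threshold."""
    return mean(samples) < threshold


def looks_idle_by_peak(samples, threshold=5.0):
    """Stricter check: the busiest sample must also be below the threshold."""
    return max(samples) < threshold


print(f"mean={window_mean:.1f}% peak={window_peak:.1f}%")
print("idle by mean:", looks_idle_by_mean(daily_cpu))
print("idle by peak:", looks_idle_by_peak(daily_cpu))
```

The two checks disagree on the same data. That is the gap a reviewer can catch when the plan shows more than a single average.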

This is not a failure of AI capability. It is a failure of automation design. The agent had no approval gate, no blast radius check, and no plan for the operator to review. It acted on partial context and nobody caught it because nobody was shown anything to catch.

This failure mode has a name: autonomous infrastructure automation without review. And it is entirely preventable.


The Terraform Precedent: Ten Years of Plan Before Apply

HashiCorp shipped terraform plan alongside terraform apply from nearly the beginning. The design is intentional and opinionated: you do not apply changes without first seeing a plan. The plan tells you what will be created, what will be modified, and what will be destroyed. It counts every resource. It surfaces dependencies. It makes the blast radius visible before a single API call is made.

Infrastructure engineers do not skip terraform plan. Not because they are required to — but because the community learned from incidents. Before plan/apply discipline became standard, teams ran terraform apply and watched resources they did not intend to touch disappear. The plan step exists because production infrastructure is stateful, interconnected, and unforgiving of surprises.

Ten years later, AI-driven infrastructure automation is repeating the same lesson from scratch. Agents that write and execute kubectl commands, scale deployments, rotate credentials, or clean up resources without surfacing a plan to the operator are making the same category of mistake that motivated terraform plan in the first place.

Clanker Cloud applies the plan-before-apply model by default. Not as a configuration option — as the architecture.


What a Reviewed Plan Looks Like

When you ask Clanker Cloud to scale down a deployment, you do not get an immediate action. You get a plan:

PLAN: Scale billing-worker from 4 → 2 replicas
Resource: billing-worker (Deployment, namespace: production)
Current state: 4 replicas running
CPU average (30 days): 3.1%
Memory average (30 days): 11.8%
Downstream dependencies: none identified
Estimated cost change: -$62/month
Risk assessment: LOW
Last modified: 14 days ago

Approve this change? [yes/no]

Everything visible before anything happens. The current state is explicit. The cost impact is quantified. The downstream dependency check is run. The last-modified date is surfaced — because a deployment touched 14 days ago warrants more scrutiny than one touched 14 months ago.

This is what AI infrastructure plan review looks like in practice. The operator is not asked to trust the agent's judgment on the safety of the change. They are given the data to form their own judgment.
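As a rough illustration of the idea that a plan is data rendered for a human before anything runs, the sketch below models the plan above as an immutable record. The class and field names are hypothetical and are not Clanker Cloud's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangePlan:
    """Illustrative plan record; field names are assumptions, not a real schema."""
    resource: str
    kind: str
    namespace: str
    current_replicas: int
    proposed_replicas: int
    cpu_avg_30d: float
    mem_avg_30d: float
    downstream: tuple
    cost_delta_monthly: float
    risk: str
    last_modified_days: int

    def render(self) -> str:
        """Format the plan as operator-facing text; performs no writes."""
        deps = ", ".join(self.downstream) or "none identified"
        sign = "-" if self.cost_delta_monthly < 0 else "+"
        return "\n".join([
            f"PLAN: Scale {self.resource} from {self.current_replicas} → "
            f"{self.proposed_replicas} replicas",
            f"Resource: {self.resource} ({self.kind}, namespace: {self.namespace})",
            f"Current state: {self.current_replicas} replicas running",
            f"CPU average (30 days): {self.cpu_avg_30d}%",
            f"Memory average (30 days): {self.mem_avg_30d}%",
            f"Downstream dependencies: {deps}",
            f"Estimated cost change: {sign}${abs(self.cost_delta_monthly):.0f}/month",
            f"Risk assessment: {self.risk}",
            f"Last modified: {self.last_modified_days} days ago",
        ])


plan = ChangePlan("billing-worker", "Deployment", "production",
                  4, 2, 3.1, 11.8, (), -62.0, "LOW", 14)
print(plan.render())
```

Making the record frozen is a deliberate choice in this sketch: the plan the operator reads cannot be altered between review and execution.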


The Four-Step Model as Four Safety Layers

Clanker Cloud's workflow is structured around four steps, each of which functions as a safety layer in the review-before-apply infrastructure automation model.

Step 1: ASK

The operator or agent submits a natural language request: "scale billing-worker down." The request is parsed against live infrastructure data. No assumptions are made from stale cache — the query goes to the live environment.

Step 2: INSPECT

Clanker reads the current replica count, CPU and memory averages across the defined window, the dependency graph showing what calls this service, when the deployment was last modified, and whether any HPA configuration would conflict with the proposed change. This is the context layer. The agent cannot generate an accurate plan without it — and the billing-worker incident described above is exactly what happens when this step is skipped.

Step 3: PLAN

A change plan is generated. It specifies what will change, what the current state is, what the blast radius is across downstream callers, and what the estimated cost delta is. The plan is surfaced to the operator before any write operation is issued.

Step 4: APPLY

The operator explicitly approves. Maker Mode executes the change. The --maker flag is required. There are no implicit writes. Nothing happens without a deliberate approval decision.

Each step is a checkpoint. Steps 1 and 2 build context. Step 3 makes the consequences visible. Step 4 gates execution on human judgment. Remove any step and you are back to autonomous automation without review.
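The four steps can be sketched as four small functions. This is a simplified model, not Clanker's implementation: the in-memory LIVE_STATE dictionary stands in for a live cluster query, and the service names, HPA bounds, and function names are all hypothetical. The HPA check assumes the usual Kubernetes behavior that an autoscaler keeps replicas at or above its minReplicas.

```python
import re
from dataclasses import dataclass, field

# Hypothetical stand-in for live cluster state; a real tool would query the cluster.
LIVE_STATE = {
    "billing-worker": {"replicas": 4, "hpa_min": 3, "hpa_max": 8,
                       "callers": ["invoice-api"]},
}


@dataclass
class Plan:
    target: str
    current: int
    proposed: int
    blast_radius: list
    warnings: list = field(default_factory=list)


def ask(request: str):
    """Step 1: parse a request of the form 'scale <name> to <n>'."""
    match = re.fullmatch(r"scale (\S+) to (\d+)", request.strip())
    if match is None:
        raise ValueError(f"unsupported request: {request!r}")
    return match.group(1), int(match.group(2))


def inspect_target(target: str) -> dict:
    """Step 2: read a copy of the current state for the target."""
    return dict(LIVE_STATE[target])


def make_plan(target: str, proposed: int, state: dict) -> Plan:
    """Step 3: describe the change and flag conflicts; nothing is written."""
    plan = Plan(target, state["replicas"], proposed, list(state["callers"]))
    if state["hpa_min"] is not None and proposed < state["hpa_min"]:
        plan.warnings.append(
            f"HPA minReplicas is {state['hpa_min']}; scaling to {proposed} "
            "would likely be reverted by the autoscaler")
    return plan


def apply_plan(plan: Plan, approved: bool) -> bool:
    """Step 4: the only function that writes, and only on explicit approval."""
    if not approved:
        return False
    LIVE_STATE[plan.target]["replicas"] = plan.proposed
    return True


target, replicas = ask("scale billing-worker to 2")
plan = make_plan(target, replicas, inspect_target(target))
print(plan)
print("applied:", apply_plan(plan, approved=False))
```

Only apply_plan touches state, so removing the approval argument is the one change that would turn this back into autonomous automation.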


The Blast Radius Principle

Every plan in Clanker Cloud shows blast radius — who and what is affected by the proposed change.

"Scale down billing-worker" shows downstream callers of billing-worker before you approve. "Delete unused EBS volumes" shows which EC2 instances last used each volume before any deletion is issued. "Rotate IAM credentials" shows which services currently use those credentials before the rotation runs.

The blast radius check addresses the most common failure mode in infrastructure automation: the change that looks safe in isolation but has a non-obvious downstream effect. The billing-worker example is exactly this. Low CPU looks safe in isolation. The blast radius — invoice processing at month-end — is what changes the risk calculus.

This information exists in your infrastructure. Clanker Cloud surfaces it as part of the plan step rather than leaving the operator to go find it manually before approving.
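One way to compute a blast radius, assuming a call graph is already available, is a breadth-first walk over the reversed graph. The graph below is invented for illustration, and this sketch covers only service-to-service calls; it is not a description of how Clanker Cloud builds its dependency data.

```python
from collections import deque

# Hypothetical call graph: each key calls the services in its list.
CALLS = {
    "invoice-api": ["billing-worker"],
    "finance-dashboard": ["invoice-api"],
    "checkout-api": ["payments"],
    "billing-worker": [],
    "payments": [],
}


def blast_radius(target, calls):
    """Return every service that directly or transitively calls `target`."""
    # Invert the graph so we can walk from a service to its callers.
    callers = {}
    for src, dsts in calls.items():
        for dst in dsts:
            callers.setdefault(dst, []).append(src)

    seen = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    seen.discard(target)  # the target itself is not part of its own blast radius
    return sorted(seen)


print(blast_radius("billing-worker", CALLS))
```

Walking transitively matters here: finance-dashboard never calls billing-worker directly, yet it is affected through invoice-api.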


Maker Mode: The Implementation

Maker Mode is how review-before-apply infrastructure automation is enforced at the CLI level. Read-only queries are instant and require no flags. Write operations require --maker and explicit approval.

# Read-only query — instant, no approval required
clanker ask "show me all pods with high memory"

# Write operation — always requires --maker and explicit approval
clanker ask "scale checkout-api to 5 replicas" --maker
# PLAN shown → operator types 'yes' → executes

# Automation pipeline — pre-approved pattern
clanker ask "restart billing-worker" --maker --apply  # Only after plan review

The separation is structural. clanker ask without --maker cannot modify infrastructure. This means a misconfigured agent, a pipeline running against the wrong environment, or a typo in an automation script cannot accidentally trigger a change. The write path requires intent.

The --apply flag is available for automation pipelines where the plan has already been reviewed and approved as part of a defined workflow — for example, a post-deployment restart pattern that runs in CI after a canary passes. Even in that case, the pattern itself was reviewed before it was automated.
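The flag semantics described above can be summarized as a small decision function. This is a sketch of the logic only: the outcome names are hypothetical, and how the real CLI responds to a write request without --maker is an assumption here.

```python
from enum import Enum


class Outcome(Enum):
    READ = "read-only query executed"
    REFUSED = "write refused: --maker not set"
    AWAITING = "plan shown, awaiting approval"
    EXECUTED = "write executed"


def gate(is_write: bool, maker: bool = False, apply: bool = False,
         operator_said_yes: bool = False) -> Outcome:
    """Decide what happens to a request given its flags and the operator's answer."""
    if not is_write:
        return Outcome.READ       # reads never need flags or approval
    if not maker:
        return Outcome.REFUSED    # no write path without --maker (assumed behavior)
    if apply or operator_said_yes:
        return Outcome.EXECUTED   # pre-approved pattern or an interactive 'yes'
    return Outcome.AWAITING       # plan is displayed; nothing has been written


print(gate(is_write=False).value)
print(gate(is_write=True).value)
print(gate(is_write=True, maker=True).value)
print(gate(is_write=True, maker=True, operator_said_yes=True).value)
```

Note that in this sketch --apply alone does nothing: without --maker the request never reaches the write path.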

The open-source CLI (brew tap clankercloud/tap && brew install clanker) implements this model in full. The flags, the approval gate, and the plan output are all part of the core CLI — not a paid-tier feature.


Agent + Review Workflow: The Hermes Example

The review-before-apply model does not prevent agents from doing useful work. It prevents them from executing changes without operator oversight. Those are different constraints.

In a typical AI agent infrastructure automation workflow with Clanker Cloud, Hermes (running locally via hermes3:70b on Ollama) identifies a cost issue during its monitoring cycle. It reads the Deep Research findings — for example, "Idle worker pool burning compute — averages 3% CPU, 4 replicas running. Save $140/mo." — and generates a scale-down plan.

The plan is surfaced to the operator via Slack or email: here is what I found, here is what I propose to change, here is the blast radius, here is the cost delta, do you approve? The operator reviews and approves. Clanker Cloud executes in Maker Mode.

Hermes never applies changes autonomously. The agent's role is investigation, plan generation, and plan surfacing. The operator's role is review and approval. Maker Mode is the execution gate.

This is the correct division of labor for production infrastructure. Agents are faster and more thorough than humans at gathering context and generating plans. Humans are better at catching the edge case the agent missed — like a batch job that only runs at month-end.
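The division of labor can be expressed as a plan lifecycle in which each transition belongs to exactly one role. The role and status names below are illustrative assumptions, not part of Hermes or Clanker Cloud.

```python
from dataclasses import dataclass, field

# Which role may perform each status transition (hypothetical names).
ALLOWED = {
    ("proposed", "approved"): "operator",
    ("proposed", "rejected"): "operator",
    ("approved", "applied"): "executor",
}


@dataclass
class PendingPlan:
    summary: str
    status: str = "proposed"
    history: list = field(default_factory=list)

    def transition(self, new_status: str, actor_role: str) -> None:
        """Move the plan to `new_status` if the transition exists and the role fits."""
        required = ALLOWED.get((self.status, new_status))
        if required is None:
            raise ValueError(f"cannot go from {self.status} to {new_status}")
        if actor_role != required:
            raise PermissionError(f"{actor_role} may not set status {new_status}")
        self.history.append((self.status, new_status, actor_role))
        self.status = new_status


plan = PendingPlan("Scale idle worker pool from 4 to 2 replicas")
try:
    plan.transition("approved", "agent")   # the agent cannot approve its own plan
except PermissionError as exc:
    print("blocked:", exc)
plan.transition("approved", "operator")
plan.transition("applied", "executor")
print(plan.status, plan.history)
```

There is no transition from proposed straight to applied, so skipping the operator is not expressible in this model.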

Teams running agent-driven DevOps can point any MCP-compatible agent at the local Clanker MCP server (clanker mcp --transport http --listen 127.0.0.1:39393) and get this workflow out of the box.


Comparison: Plan-Before-Apply Across Tools

The review-before-apply philosophy is not unique to Clanker Cloud. It appears across the infrastructure tooling ecosystem wherever practitioners have learned from incidents.

Philosophy | Tool | Mechanism
Review before apply | Terraform | terraform plan → terraform apply
Review before apply | Clanker Cloud | PLAN step → Maker Mode approval
PR review before merge | GitOps / Atlantis | PR → atlantis plan → merge → apply
Admission control | Kubernetes | Admission controller validates before API server accepts

The common thread: every mature infrastructure tool that can make destructive changes has built in a review layer between intent and execution. Terraform learned this from early incidents. Kubernetes admission controllers exist because teams learned what happens when any workload can be scheduled without policy validation. GitOps with Atlantis emerged because direct terraform apply in CI created unreviewed production changes.

AI-driven infrastructure automation is newer than any of these tools, but the lesson is the same. The question is only whether teams learn it from their own incidents or from the design of the tools they choose.

The Clanker Cloud FAQ addresses this directly: the Maker Mode requirement is not a usability limitation. It is the architectural expression of the plan-before-apply principle.


Why This Matters for Fast-Shipping Teams

Teams taking vibe-coded projects to production ship infrastructure changes at a pace that was not possible before AI-assisted development. That speed is valuable. It is also the condition under which review-before-apply matters most.

Slow, deliberate infrastructure changes tend to be well-reviewed by default — there is time to think, check dependencies, and ask a colleague. Fast changes, shipped in bulk, in velocity windows driven by AI-assisted development cycles, are where the billing-worker incident pattern lives. The dependency check that did not happen because the change felt routine. The blast radius that was not checked because the CPU numbers looked conclusive.

Review-before-apply infrastructure automation is not a speed reducer. A plan surfaces in seconds. The approval decision takes about five seconds for a change with an obvious risk profile and about fifteen for one that warrants a closer read. The blast radius check happens automatically. The cost delta is computed automatically. The operator is not asked to do the work the agent can do; they are asked to make the decision only the operator can make.

The combination of AI-generated plans and human approval gates is faster than manual infrastructure management and safer than autonomous AI automation. That is the point of the model.

For teams running BYOK models, the plan generation step can use Gemma 4 via Ollama (gemma4:27b, free and local) for routine scale operations, and escalate to Claude Opus 4.6 or GPT-5.4 Thinking for complex cross-service dependency analysis. The approval gate is the same regardless of which model generated the plan.
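A routing rule of this kind might look like the sketch below. The escalation label, the set of routine operations, and the single-service cutoff are all assumptions made for illustration; only the gemma4:27b tag comes from the text above.

```python
LOCAL_MODEL = "gemma4:27b"               # Ollama tag mentioned in the text
ESCALATION_MODEL = "escalation-model"    # placeholder label, not a real model ID

ROUTINE_OPERATIONS = {"scale", "restart"}  # assumed definition of "routine"


def pick_model(operation: str, services_touched: int) -> str:
    """Use the local model for routine single-service changes, else escalate."""
    routine = operation in ROUTINE_OPERATIONS and services_touched <= 1
    return LOCAL_MODEL if routine else ESCALATION_MODEL


print(pick_model("scale", 1))
print(pick_model("rotate-credentials", 4))
```

Whichever branch is taken, the output is still a plan that has to pass the same approval gate.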


FAQ

What is review-before-apply infrastructure automation?

Review-before-apply infrastructure automation is a design pattern where an AI agent or automated tool generates a change plan (showing what will be created, modified, or destroyed, along with blast radius and cost impact) and presents it to the operator for approval before executing any changes. It is modeled on Terraform's plan-before-apply workflow and prevents the failure mode of autonomous changes made without operator oversight.

How does Clanker Cloud implement review before apply?

Clanker Cloud's four-step workflow (ASK → INSPECT → PLAN → APPLY) enforces the review-before-apply model at the architecture level. Write operations require the --maker flag on the CLI and explicit operator approval of the plan output. The Maker Mode approval gate cannot be bypassed for write operations — read-only queries run without approval, but nothing that modifies infrastructure executes without a deliberate yes.

What are infrastructure automation approval gates and why do they matter?

Infrastructure automation approval gates are checkpoints between a proposed change and its execution. They matter because infrastructure changes — scaling, deletion, credential rotation, policy modification — have blast radii that are not always obvious from the trigger condition alone. A pod with 3% average CPU looks idle; it may be a batch job. An approval gate forces the plan, including downstream dependencies and timing context, to be reviewed before any API call is made.

Can AI agents use review-before-apply workflows automatically?

Yes. In Clanker Cloud's agent workflow, the agent (Hermes, OpenClaw, Claude Code, or any MCP-compatible agent) generates and surfaces the plan but does not execute it. Execution requires operator approval via Maker Mode. The agent handles investigation and plan generation; the operator handles the approval decision. This is the correct division of labor for production infrastructure changes where edge cases — like month-end batch jobs — can invalidate an otherwise reasonable-looking plan.

Does requiring plan review slow down infrastructure operations?

No. Plan generation in Clanker Cloud takes seconds — the INSPECT step reads live state, the PLAN step generates the output, and the operator reviews it. For routine changes with clear risk profiles, the review adds five to fifteen seconds to the operation. For complex changes, the plan surfaces context the operator would otherwise need to gather manually across multiple tools and consoles. The net effect is faster and more informed decision-making, not slower operations.


Start With Review Before Apply

Autonomous infrastructure automation without review is a known failure mode with a known fix. The fix is what Terraform introduced ten years ago, what Kubernetes admission controllers enforce, and what GitOps with Atlantis brought to IaC in CI: show the operator what is about to happen before it happens.

Clanker Cloud applies this model to AI-driven infrastructure automation by default. The plan is always surfaced. The blast radius is always shown. The approval is always required.

If you are building with AI-assisted development and shipping infrastructure changes at speed, review-before-apply is the safety layer that keeps that speed from creating incidents. Download Clanker Cloud, connect your providers, and try a live demo — ask it to scale something, and see the plan before anything changes.

Full documentation at docs.clankercloud.ai.
