Your AWS bill jumped 40% this month. You open Cost Explorer. It shows: EC2 $2,400, RDS $1,800, data transfer $890, ElastiCache $670. None of these line items explain what changed or why. So you start cross-referencing: filter by tag, drill into EC2, check the usage type, open another tab for CloudWatch, try to remember which service was deployed three weeks ago. An hour later, you have a theory. Maybe.
This is cloud cost investigation — and it is the foundational work that cost optimization depends on. You cannot optimize what you do not understand. The optimization strategies covered in articles on rightsizing and reserved instance planning presuppose that you already know which resources are responsible for what spend. Most teams skip the investigation phase entirely, moving straight from "the bill is high" to "let's turn things off." That is how you turn off the wrong thing.
AI cloud cost investigation changes the starting point. Instead of clicking through dashboards to build a picture, you query your live infrastructure in plain English and get back structured answers — resource by resource, dollar by dollar.
How AI Cost Investigation Differs from Cost Explorer
AWS Cost Explorer, GCP Billing, and Azure Cost Analysis are billing dashboards. They show you aggregated spend organized by service, region, and tag. They are built to display what you were charged, not to help you understand why your costs changed or which specific resources are responsible.
The distinction matters in practice. Cost Explorer tells you EC2 cost $2,400. An AI cloud cost investigation tool tells you: "These 11 instances account for $2,400 in EC2 spend this month. Three of them — worker-1, worker-2, worker-3 — are running at under 4% average CPU and have no active traffic. They account for $380 combined. The other $2,020 is split across your production cluster and two RDS proxy instances."
That second answer gives you something to act on. The billing dashboard gives you a number.
Clanker Cloud connects directly to your cloud providers using credentials that stay on your local machine — no data sent to a vendor platform, no agent rollout required. From one workspace, you query live infrastructure across AWS, GCP, Azure, Kubernetes, Hetzner, DigitalOcean, and more. The AI DevOps for Teams page covers the team workflow in detail. The investigation queries below show what this looks like in practice.
Anomaly Detection Queries
The first job in a cost investigation is identifying what changed. Billing anomalies rarely announce themselves — you notice the total is higher, but the cause requires tracing.
These queries surface the change signal:
- "which resources had the biggest cost increase compared to last month?"
- "are there any services running that weren't running 30 days ago?"
- "show me all EC2 instances with no tags"
- "which Lambda functions have unusually high invocation counts this week?"
The tags query is particularly useful. Untagged resources are a signal that something was provisioned outside your normal workflow — a manual test environment, a one-off deployment, a misconfigured autoscaling event. Untagged resources also cannot be attributed to a team or project, which means they accumulate without accountability.
When you ask "which resources had the biggest cost increase compared to last month?", the response doesn't just return a sorted list — it flags the outliers. A Lambda function that ran 200,000 invocations last month and 4.2 million this month is a different kind of finding than a gradual RDS growth trend. Both matter. They require different follow-up queries.
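The split between a sudden spike and a gradual trend can be sketched in a few lines of Python. This is a minimal illustration, not how Clanker Cloud computes its answer: the `UsageChange` type, the `flag_outliers` helper, the 3x threshold, the resource names, and the prior-month RDS figure are all assumptions made for the example. Only the Lambda invocation counts come from the text above.

```python
from dataclasses import dataclass


@dataclass
class UsageChange:
    resource: str
    previous: float  # last month's usage or cost
    current: float   # this month's usage or cost

    @property
    def ratio(self) -> float:
        # Treat a brand-new resource (no prior usage) as an infinite increase.
        return float("inf") if self.previous == 0 else self.current / self.previous


def flag_outliers(changes, spike_ratio=3.0):
    """Split increases into sudden spikes and gradual growth, biggest ratio first."""
    ranked = sorted(changes, key=lambda c: c.ratio, reverse=True)
    spikes = [c for c in ranked if c.ratio >= spike_ratio]
    gradual = [c for c in ranked if 1.0 < c.ratio < spike_ratio]
    return spikes, gradual


# Hypothetical data; only the Lambda invocation counts come from the article.
changes = [
    UsageChange("lambda:image-resizer", 200_000, 4_200_000),
    UsageChange("rds:orders-postgres", 180.0, 198.0),
    UsageChange("ec2:worker-1", 126.0, 126.0),
]

spikes, gradual = flag_outliers(changes)
for c in spikes:
    print(f"SPIKE   {c.resource}: {c.ratio:.1f}x")
for c in gradual:
    print(f"GRADUAL {c.resource}: {c.ratio:.2f}x")
```

A spike usually calls for a "what changed?" follow-up, while gradual growth calls for a trend query over several months.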
Per-Resource Cost Visibility
Billing dashboards show cost by service. Cost investigation requires cost by resource — the specific instance, database, or queue that is responsible for a line item.
- "show me all RDS instances and their monthly cost"
- "list all load balancers and their monthly cost"
- "show me all EC2 instances in us-east-1 with their hourly rate and monthly estimate"
- "which S3 buckets are growing fastest by storage cost?"
A concrete example from the Clanker Cloud workspace: querying "show me all RDS instances and their monthly cost" returns a per-instance breakdown. orders-postgres costs $198/month and is handling 2,100 queries per second — that spend is justified by load. staging-db costs $34/month and shows near-zero traffic outside business hours. The staging database is a candidate for scheduled stop/start. You would not have seen that distinction in a billing dashboard showing "RDS: $232."
The per-service cost view in the Clanker Cloud UI makes this concrete by default. A live environment shows: PUBLIC-INGRESS $26/mo, PROD-CLUSTER $182/mo, CHECKOUT-API $44/mo, ORDERS-API $51/mo, BILLING-WORKER $31/mo, ORDERS-POSTGRES $198/mo, SESSION-CACHE $67/mo. Every resource has a dollar figure attached. This is what a cost-aware infrastructure view looks like — not aggregate service totals, but per-resource spend visible alongside operational state.
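To make the per-resource view concrete, here is a small Python sketch that ranks the figures from the example environment above by share of total spend. The helper names and the 730-hours-per-month approximation are assumptions for illustration; they are not part of the Clanker Cloud product.

```python
# Monthly figures copied from the example environment above (USD per month).
monthly_cost = {
    "PUBLIC-INGRESS": 26,
    "PROD-CLUSTER": 182,
    "CHECKOUT-API": 44,
    "ORDERS-API": 51,
    "BILLING-WORKER": 31,
    "ORDERS-POSTGRES": 198,
    "SESSION-CACHE": 67,
}

HOURS_PER_MONTH = 730  # common billing approximation: 8,760 hours / 12


def monthly_estimate(hourly_rate: float) -> float:
    """Convert an hourly rate into an approximate monthly cost."""
    return round(hourly_rate * HOURS_PER_MONTH, 2)


def spend_breakdown(costs):
    """Return (resource, cost, share of total) rows, most expensive first."""
    total = sum(costs.values())
    rows = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, cost, cost / total) for name, cost in rows]


total = sum(monthly_cost.values())
for name, cost, share in spend_breakdown(monthly_cost):
    print(f"{name:<16} ${cost:>4}/mo  {share:6.1%}")
print(f"{'TOTAL':<16} ${total:>4}/mo")
```

Sorting by share shows at a glance that the database and the production cluster dominate this environment, which is where operational context (load, traffic) matters most.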
Waste Identification Queries
Waste is not the same as high spend. A $200/month database handling your core transaction load is not waste. A $34/month load balancer receiving six requests per day is waste.
These queries identify idle and unused resources that accumulate in most cloud accounts over time:
- "show me all load balancers with less than 100 requests per day"
- "are there any Elastic IPs not attached to a running instance?"
- "list all EBS volumes that are not attached to an EC2 instance"
- "show me all EC2 instances with average CPU below 5% over the last 30 days"
- "which NAT Gateways are processing less than 1GB per day?"
Unattached EBS volumes and Elastic IPs are among the most common sources of background waste in AWS accounts. They accumulate silently — a developer spins up a test instance, terminates it, and the volume persists. The CPU utilization query catches over-provisioned compute. An EC2 instance averaging 4% CPU over 30 days is not doing the work it was sized for. This does not automatically mean you should downsize it — some workloads have burst profiles that don't show in averages — but it is the right starting point. Investigation surfaces the candidate; the team decides what to do with it.
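The two checks described here, unattached volumes and low average CPU, can be sketched over a plain inventory list. This is a hedged illustration: the `Volume` and `Instance` types, the thresholds, and the sample data are hypothetical, and a real check would read the inventory from your provider rather than from hard-coded records. The burst check reflects the caveat above that a low average does not always mean an oversized instance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Volume:
    volume_id: str
    attached_to: Optional[str]  # instance ID, or None when unattached
    monthly_cost: float


@dataclass
class Instance:
    instance_id: str
    avg_cpu_30d: float   # percent
    peak_cpu_30d: float  # percent, used to spot bursty workloads


def unattached_volumes(volumes):
    """Volumes with no instance attachment: storage billed but unused."""
    return [v for v in volumes if v.attached_to is None]


def low_cpu_candidates(instances, avg_threshold=5.0, burst_threshold=60.0):
    """Pair each low-average-CPU instance with a note on how to follow up."""
    candidates = []
    for inst in instances:
        if inst.avg_cpu_30d >= avg_threshold:
            continue
        if inst.peak_cpu_30d >= burst_threshold:
            note = "bursty: check peak load before resizing"
        else:
            note = "consistently idle: rightsizing candidate"
        candidates.append((inst.instance_id, note))
    return candidates


# Hypothetical inventory, not real account data.
volumes = [
    Volume("vol-a", "i-001", 8.0),
    Volume("vol-b", None, 12.0),
    Volume("vol-c", None, 16.0),
]
instances = [
    Instance("i-001", avg_cpu_30d=42.0, peak_cpu_30d=91.0),
    Instance("i-002", avg_cpu_30d=4.0, peak_cpu_30d=9.0),
    Instance("i-003", avg_cpu_30d=3.5, peak_cpu_30d=88.0),
]

orphaned = unattached_volumes(volumes)
print("Unattached volumes:", [v.volume_id for v in orphaned])
print("Monthly cost of unattached volumes:", sum(v.monthly_cost for v in orphaned))
for instance_id, note in low_cpu_candidates(instances):
    print(instance_id, "->", note)
```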
Data Transfer Cost Queries
Data transfer is among the least visible cost categories in cloud billing. It appears as a line item without attribution to specific services or traffic patterns, and for accounts with significant cross-AZ or cross-region traffic it can reach hundreds of dollars per month while remaining opaque.
These queries surface the attribution:
- "show me services with high data transfer costs"
- "which services are generating the most cross-AZ traffic?"
- "are there any services pushing large amounts of data to S3 unnecessarily?"
Cross-AZ traffic is a particularly common source of unexpected spend. If your application instances in us-east-1a are making synchronous calls to a database in us-east-1b, every request generates cross-AZ data transfer charges. Moving the database or co-locating the instances eliminates the charge — but you cannot make that decision without first knowing the traffic pattern.
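A rough estimate shows why this adds up. The sketch below assumes a per-GB rate charged in each direction; the default of $0.01 per GB is an assumption for illustration and should be checked against your provider's current pricing. The function name and the sample workload are hypothetical.

```python
HOURS_PER_MONTH = 730
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600
BYTES_PER_GB = 1024 ** 3


def cross_az_monthly_cost(
    requests_per_second: float,
    request_bytes: int,
    response_bytes: int,
    rate_per_gb_each_direction: float = 0.01,  # assumed rate; check current pricing
) -> float:
    """Estimate monthly cross-AZ transfer cost for a request/response workload."""
    bytes_per_month = (request_bytes + response_bytes) * requests_per_second * SECONDS_PER_MONTH
    gb_per_month = bytes_per_month / BYTES_PER_GB
    # Assumption: each GB is billed once leaving one AZ and once entering the other.
    return gb_per_month * rate_per_gb_each_direction * 2


# Hypothetical workload: 500 requests/s, 1 KiB queries, 4 KiB responses.
estimate = cross_az_monthly_cost(500, 1024, 4096)
print(f"Estimated cross-AZ cost: ${estimate:,.2f}/month")
```

The estimate scales linearly with request rate and payload size, so chatty services with large responses are the first place to look.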
Deep Research: One-Pass Cost Scan with Ranked Findings
The most powerful tool in an AI cloud cost investigation is not a query — it is a full-estate scan. The Deep Research feature fans out across every connected provider simultaneously, runs parallel analysis, and returns severity-ranked findings across cost, security, and reliability.
For a cost investigation, the entry point is: "scan all connected providers for cost waste."
A typical Deep Research cost scan returns findings like:
- HIGH: "Idle worker pool burning compute — worker-pool averages 3% CPU, 4 replicas running. Scale down or enable HPA. Save $140/mo."
- MEDIUM: "Uncompressed S3 backups growing fast — current growth rate projects $60/mo additional cost in 90 days."
- MEDIUM: "Three EBS volumes unattached for 45+ days — combined cost $28/mo."
Each finding includes the resource name, the dollar amount, and the recommended action. The findings are ranked by severity and estimated savings, so you can triage the investigation rather than working through an undifferentiated list.
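The ranking described above, severity first and then dollar impact, can be sketched as a simple sort. The `Finding` type and the `rank_findings` helper are hypothetical names, and this is not a description of how Deep Research is implemented. The three findings are the examples listed above.

```python
from dataclasses import dataclass

# Lower number sorts first.
SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


@dataclass
class Finding:
    severity: str
    resource: str
    summary: str
    monthly_impact: float  # estimated USD per month saved or avoided


def rank_findings(findings):
    """Order by severity, then by largest monthly impact within each severity."""
    return sorted(findings, key=lambda f: (SEVERITY_ORDER[f.severity], -f.monthly_impact))


findings = [
    Finding("MEDIUM", "ebs-volumes", "Three EBS volumes unattached for 45+ days", 28.0),
    Finding("HIGH", "worker-pool", "Idle worker pool at 3% CPU with 4 replicas", 140.0),
    Finding("MEDIUM", "s3-backups", "Uncompressed S3 backups growing fast", 60.0),
]

for f in rank_findings(findings):
    print(f"{f.severity:<6} ${f.monthly_impact:>6.2f}/mo  {f.resource}: {f.summary}")
```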
This is the difference between targeted queries — which answer specific questions you already know to ask — and Deep Research, which surfaces findings you did not know to look for. The idle worker pool finding above is a real pattern: a pool provisioned for a traffic spike that was never scaled back down. It runs for months without anyone noticing because the application continues to work correctly. Deep Research finds it because it compares actual resource utilization to provisioned capacity across every service simultaneously.
For teams operating across multiple cloud providers, Deep Research operates across all connected accounts in one pass. Full documentation for setting up provider connections is at docs.clankercloud.ai.
The Investigation to Optimization Workflow
Cost investigation and cost optimization are distinct phases. Investigation builds the picture. Optimization changes it. Conflating them leads to premature changes made without full context.
The workflow in Clanker Cloud follows a clear sequence:
1. Run Deep Research for cost findings: get a ranked list of waste and anomalies across all providers.
2. Query specific anomalies, such as "why did orders-postgres cost increase 30% this month?", and get back the specific cause: query volume spike, new index scan pattern, data growth, or changed instance class.
3. Identify the cause: the investigation answer tells you whether the increase is a genuine load increase (which may be fine) or an unexpected change (which may need action).
4. Generate a plan: for a request like "optimize the orders-postgres instance class", Clanker Cloud generates a reviewed plan showing the current instance type, the proposed change, the estimated cost delta, and the risk assessment.
5. Apply with Maker Mode approval: the operator reviews the plan and explicitly approves execution. Nothing changes until you say so.
This sequence matters because step 3 — identifying the cause — determines whether steps 4 and 5 are appropriate. If orders-postgres cost increased 30% because your transaction volume grew 35%, that is expected behavior. If it increased 30% because a new background job is running a full-table scan every hour, that is a bug. The investigation query surfaces which situation you are in before you make a change.
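The decision in that paragraph comes down to comparing cost growth with load growth. A minimal sketch, assuming a hypothetical `classify_cost_change` helper and an arbitrary 10-point tolerance:

```python
def classify_cost_change(cost_growth_pct, load_growth_pct, tolerance_pct=10.0):
    """Label a cost increase as load-driven or unexplained.

    tolerance_pct is an assumed allowance for cost growing slightly
    faster than load before the increase is treated as unexplained.
    """
    if cost_growth_pct <= 0:
        return "no increase"
    if cost_growth_pct <= load_growth_pct + tolerance_pct:
        return "expected: cost tracks load"
    return "investigate: cost outpaced load"


# Cost up 30% while transaction volume grew 35%: consistent with load.
print(classify_cost_change(30, 35))
# Cost up 30% with flat volume: something else changed, e.g. a new background job.
print(classify_cost_change(30, 0))
```

The label only tells you which follow-up question to ask; the root cause still comes from the investigation query.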
The full production deployment workflow is covered at /vibe-coding-to-production. The FAQ covers common questions about Maker Mode approval gates.
Multi-Cloud Cost Investigation
Single-cloud cost investigation is tractable with native billing tools. Multi-cloud cost investigation is not — each provider has a different billing format, different resource taxonomy, and a different API. Correlating spend across AWS, GCP, and Azure requires exporting from three systems and normalizing manually.
The Clanker Cloud query "show me total monthly spend across all connected cloud accounts" returns a unified view across AWS, GCP, Azure, Hetzner, and DigitalOcean simultaneously. One query, one response, one place to see the full picture.
This matters for teams running a realistic multi-cloud setup: AWS for core compute, GCP for ML workloads, Cloudflare for edge, Hetzner for cost-efficient EU VMs. A unified investigation surface means you can ask "which provider has the most idle compute this month?" and get a ranked answer across all of them.
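Answering "which provider has the most idle compute?" requires putting each provider's records into one shape first. The sketch below shows the idea with two adapters and one aggregation. The input field names (`ResourceId`, `CostCents`, `cpu_fraction`, and so on) are invented for illustration and do not match any real billing export; the sample rows and the 5% CPU threshold are also assumptions.

```python
from collections import defaultdict


def from_provider_a(row):
    """Adapter for a hypothetical export that reports cost in cents."""
    return {
        "provider": "aws",
        "resource": row["ResourceId"],
        "monthly_cost_usd": row["CostCents"] / 100,
        "avg_cpu_pct": row["AvgCpu"],
    }


def from_provider_b(row):
    """Adapter for a hypothetical export that reports CPU as a 0-1 fraction."""
    return {
        "provider": "gcp",
        "resource": row["name"],
        "monthly_cost_usd": row["cost_usd"],
        "avg_cpu_pct": row["cpu_fraction"] * 100,
    }


def idle_spend_by_provider(records, cpu_threshold=5.0):
    """Sum the monthly cost of low-CPU resources per provider, largest first."""
    totals = defaultdict(float)
    for r in records:
        if r["avg_cpu_pct"] < cpu_threshold:
            totals[r["provider"]] += r["monthly_cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical rows in two different (invented) provider shapes.
a_rows = [
    {"ResourceId": "worker-1", "CostCents": 12600, "AvgCpu": 3.0},
    {"ResourceId": "prod-cluster", "CostCents": 18200, "AvgCpu": 55.0},
]
b_rows = [
    {"name": "ml-train-1", "cost_usd": 300.0, "cpu_fraction": 0.02},
    {"name": "ml-serve-1", "cost_usd": 90.0, "cpu_fraction": 0.5},
]

records = [from_provider_a(r) for r in a_rows] + [from_provider_b(r) for r in b_rows]
ranking = idle_spend_by_provider(records)
for provider, idle_usd in ranking:
    print(f"{provider}: ${idle_usd:.2f}/mo in idle compute")
```

Once every record shares the same fields, the cross-provider question becomes a single aggregation instead of three separate exports.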
For AI agents running cost monitoring workflows, the /for-ai-agents.md page covers the MCP interface that lets agents query live cost data programmatically. The /demo shows the multi-cloud investigation workflow end to end.
BYOK for Cost Investigation
Clanker Cloud is a bring-your-own-keys tool: your AI model credentials stay on your machine and are billed directly by the provider, with no markup. For cost investigation specifically, this means you can match the model to the complexity of the query.
Routine investigation queries — "show me all EC2 instances with no tags", "list all load balancers and their monthly cost" — are well-handled by Gemma 4 via Ollama running locally. The gemma4:26b model runs on most developer machines with 32GB RAM and costs nothing per query. For teams running dozens of investigation queries per day, this is material.
Complex cross-account analysis (Deep Research across three cloud accounts, anomaly detection requiring multi-month cost trend analysis, or root-cause queries that require reasoning across provider APIs simultaneously) benefits from a model with stronger reasoning. GPT-5.4 Thinking handles cross-account analysis well; Claude Opus 4.6 performs similarly for investigation tasks requiring extended context.
The BYOK model means you pay OpenAI or Anthropic directly at listed rates, only for queries that need it. Hermes 3 (hermes3:70b via Ollama) is a strong option for agentic cost monitoring workflows: it runs locally with open weights and is well-suited to tool-use patterns where an agent runs periodic cost scans and routes findings.
FAQ
What is AI cloud cost investigation? AI cloud cost investigation is the practice of using plain-English queries against live cloud infrastructure to surface per-resource cost data, detect billing anomalies, and identify idle or wasted resources — before making any optimization changes. It differs from billing dashboards, which show aggregated service totals, by providing resource-level attribution with operational context.
How is AI cloud cost investigation different from cost optimization? Investigation is the phase where you build a clear picture of what is running and what it costs. Optimization is the phase where you make changes based on that picture. Investigation queries answer "what is happening and why?"; optimization actions answer "what should we change?" The workflow runs in that order: investigate first, generate a reviewed plan, then apply with explicit approval.
Can I investigate costs across multiple cloud providers with one query? Yes. Clanker Cloud connects to AWS, GCP, Azure, Hetzner, DigitalOcean, and Cloudflare simultaneously. A query like "show me total monthly spend across all connected cloud accounts" returns a unified view across all connected providers. Deep Research fans out across all providers in one pass and returns ranked findings.
Do I need to send my cloud credentials or billing data to a third party to use AI cost investigation? With Clanker Cloud, no. The desktop app reads your local credentials (~/.aws/credentials, ~/.kube/config, etc.) and queries your cloud providers directly from your machine. Credentials never leave your machine. AI queries go directly from your machine to your chosen AI provider (OpenAI, Anthropic, Google, or a local model via Ollama) with no intermediate vendor proxy.
Cost investigation is the work that makes every subsequent optimization decision credible. When you know that orders-postgres at $198/month is handling 2,100 queries per second and that staging-db at $34/month sees near-zero traffic outside business hours, you can make a specific, justified change. When you know your worker pool is running at 3% CPU with four replicas, you can size it correctly. That knowledge is not in your billing dashboard. It is in your live infrastructure, and plain-English queries are the fastest way to surface it.
Download Clanker Cloud and connect your first provider at clankercloud.ai/account. The first cost investigation query takes about 90 seconds from install.
