10 min read · Clanker Cloud Editorial Team

Right-Sizing and Idle Resource Detection: Stop Paying for Cloud You Don't Use

Learn how cloud right-sizing AI and idle resource detection eliminate cloud waste across AWS, Kubernetes, RDS, and staging environments.

Here is a common infrastructure story: you launch a service on an m5.2xlarge because the traffic estimate calls for it. The traffic never fully materializes. Six months later, that instance is sitting at 8% CPU, costing $278 per month. Nobody notices. Nobody fixes it. The instance just runs, and the bill climbs.

Multiply that by twenty instances across three environments, add a few overprovisioned RDS databases, a Kubernetes node pool twice the size it needs to be, and a staging stack running around the clock — and you have a cloud bill with 30–40% waste baked in.

Cloud right-sizing is the systematic process of detecting those gaps and closing them. This article covers how to find idle and overprovisioned resources across EC2, Kubernetes, RDS, and staging environments, and how Clanker Cloud turns that detection into a conversation rather than a manual exercise.

The Overprovisioning Problem

Teams overprovision for two reasons. First, fear: provision big before launch, because the cost of being under-resourced at 2am is worse than the cost of a larger instance. That's a reasonable call in the moment.

The second reason is inertia. After the launch, nobody goes back to right-size. The instance type that was chosen under uncertainty becomes permanent by default. Three years later, the RDS instance class that made sense for a data migration project is still running for a database with 400MB of data and twelve daily queries.

The concrete math: an m5.2xlarge at $0.384/hr runs $278/month. If the workload fits on a t3.medium at $0.0416/hr, it runs $30/month. That's $248/month saved on a single instance. At the organizational level, Gartner estimates that cloud overspending runs 20–40% of total cloud spend for most companies.
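The per-instance arithmetic can be sketched in a few lines of Python. This is a minimal illustration, assuming a 730-hour month (8,760 hours per year divided by 12), so the results land within a few dollars of the rounded figures above. The function names are illustrative, not part of any tool.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def monthly_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """On-demand cost of one instance that runs for the whole month."""
    return hourly_rate * hours


def monthly_savings(current_rate: float, target_rate: float) -> float:
    """Monthly difference between the current and the smaller instance type."""
    return monthly_cost(current_rate) - monthly_cost(target_rate)


# Hourly rates are the ones quoted in the text above.
print(f"m5.2xlarge: ${monthly_cost(0.384):.2f}/month")
print(f"t3.medium:  ${monthly_cost(0.0416):.2f}/month")
print(f"saved:      ${monthly_savings(0.384, 0.0416):.2f}/month")
```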

The fix is not discipline. It's detection. You cannot right-size what you cannot see.

The Detection Gap

AWS CloudWatch has all the data you need. It captures CPU utilization, memory usage, network throughput, disk I/O — everything required to identify idle and overprovisioned resources. The problem is that the data is not surfaced automatically. You have to know what to look for, write the right queries, and review them on a regular schedule.

Most teams do none of that consistently. Right-sizing audits happen when someone complains about the bill, not as a standing practice.

Clanker Cloud changes that by making utilization data conversational. Instead of building CloudWatch dashboards and writing metric filters, you ask in plain English: clanker ask "which EC2 instances are underutilized?" — and you get answers from live AWS data, not estimates or cached reports. For a deeper look at how this fits into a broader infrastructure workflow, see the AI DevOps for teams overview.

Idle EC2 and VM Detection

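The working definition of an idle EC2 instance: average CPU utilization below 5% for 14 or more consecutive days. That threshold excludes bursty workloads and catches genuine dead weight. A looser cut, such as 10% over 30 days, widens the net to include right-sizing candidates as well as fully idle instances.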
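As a rough sketch of that definition, the check below takes one average CPU value per day and looks for a long enough unbroken run under the threshold. It assumes you have already pulled daily averages from somewhere (for example, CloudWatch CPUUtilization at a one-day period). The function names and sample data are hypothetical.

```python
def longest_idle_run(daily_avg_cpu, threshold=5.0):
    """Length of the longest streak of consecutive days below the threshold."""
    longest = current = 0
    for cpu in daily_avg_cpu:
        current = current + 1 if cpu < threshold else 0
        longest = max(longest, current)
    return longest


def is_idle(daily_avg_cpu, threshold=5.0, min_days=14):
    """True when CPU stayed under the threshold for min_days days in a row."""
    return longest_idle_run(daily_avg_cpu, threshold) >= min_days


# Hypothetical daily averages (percent CPU), oldest first.
quiet = [3.1] * 20
bursty = [3.0] * 10 + [40.0] + [3.0] * 10  # one busy day breaks the streak

print(is_idle(quiet))
print(is_idle(bursty))
```

Requiring consecutive days, rather than a low average over the window, is what keeps an instance with an occasional heavy day off the idle list.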

To surface these with Clanker Cloud:

clanker ask "show me EC2 instances with average CPU below 10% in the last 30 days"

This queries CloudWatch CPUUtilization metrics across your account and returns a list of instances with their average utilization, instance type, and monthly cost. From there, the options are:

Stop: immediate, preserves the instance and EBS volume; compute billing stops, but EBS storage charges continue
Schedule: shut down at 7pm, start at 8am on weekdays (useful for non-prod instances)
Right-size: move to a smaller instance type
Terminate: for instances confirmed to be unused, with EBS volumes detached and snapshotted first

One category that gets missed: unattached EBS volumes. When an instance is terminated without deleting its volume, that volume keeps accruing charges, typically $0.10/GB-month for gp2 or $0.08/GB-month for gp3 (us-east-1 rates). A 500GB gp2 volume that's been sitting unattached for a year has cost $600 for nothing. Clanker Cloud surfaces these alongside idle instances:

clanker ask "show me EBS volumes with no attached instance"
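If you want to sanity-check the cost of unattached volumes yourself, a small filter over volume records is enough. The sketch below assumes records shaped like the entries in an EC2 DescribeVolumes response (VolumeId, Size in GB, Attachments) and uses the $0.10/GB-month rate quoted above as a default. The volume IDs are made up.

```python
def unattached_volumes(volumes, price_per_gb_month=0.10):
    """Return (volume_id, size_gb, monthly_cost) for volumes with no attachments."""
    result = []
    for vol in volumes:
        if vol.get("Attachments"):
            continue  # still attached to an instance
        monthly = round(vol["Size"] * price_per_gb_month, 2)
        result.append((vol["VolumeId"], vol["Size"], monthly))
    return result


# Hypothetical records; real ones would come from your own account.
volumes = [
    {"VolumeId": "vol-0example1", "Size": 500, "Attachments": []},
    {"VolumeId": "vol-0example2", "Size": 100,
     "Attachments": [{"InstanceId": "i-0example"}]},
]

for volume_id, size_gb, monthly in unattached_volumes(volumes):
    print(f"{volume_id}: {size_gb} GB, ${monthly:.2f}/month, ${monthly * 12:.2f}/year")
```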

Kubernetes Right-Sizing

Kubernetes adds a layer of complexity to idle resource detection because waste shows up at two levels: the node and the pod.

Node-level waste is the easier pattern to spot. If a node group is running ten nodes and aggregate CPU utilization is at 20%, the actual work fits on about two nodes. Consolidating to four nodes still leaves utilization around 50% and cuts the node pool bill by 60% without changing the workload.
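The node-count arithmetic is simple enough to sketch. The example assumes CPU is the binding constraint and that 50% utilization is the target after consolidation. Both are assumptions to adjust for your own cluster, and memory-bound workloads need the same check on memory.

```python
import math


def nodes_needed(current_nodes, avg_utilization, target_utilization=0.5):
    """Nodes required to carry the same load at the target utilization."""
    busy_nodes = current_nodes * avg_utilization  # load in "full node" units
    # round() guards against floating-point noise before taking the ceiling
    return max(1, math.ceil(round(busy_nodes / target_utilization, 9)))


def node_savings_fraction(current_nodes, new_nodes):
    """Share of the node pool bill removed by shrinking the pool."""
    return 1 - new_nodes / current_nodes


new_count = nodes_needed(current_nodes=10, avg_utilization=0.20)
print(new_count, f"{node_savings_fraction(10, new_count):.0%}")
```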

Pod-level waste is more subtle. Kubernetes pod resource requests determine scheduling and autoscaling behavior. If a pod has a CPU request of 2 cores but actual usage is 0.3 cores, the scheduler still reserves 2 cores on the node for it. A node packed with pods like that looks full but is running at 15% actual utilization. The fix is adjusting resource requests to match actual usage, which is exactly what the Vertical Pod Autoscaler (VPA) automates.
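One way to spot over-requested pods is to compare each pod's CPU request with its measured usage. The sketch below works on plain tuples, with hypothetical pod names and a 50% cutoff chosen only for illustration. Usage figures would come from whatever metrics source you already have.

```python
def request_utilization(requested_cores, used_cores):
    """Fraction of the CPU request that the pod actually uses."""
    return used_cores / requested_cores


def overrequested(pods, max_ratio=0.5):
    """pods: iterable of (name, requested_cores, used_cores).

    Returns (name, ratio) for pods using less than max_ratio of their request.
    """
    flagged = []
    for name, requested, used in pods:
        if requested <= 0:
            continue  # no request set, nothing to compare against
        ratio = request_utilization(requested, used)
        if ratio < max_ratio:
            flagged.append((name, round(ratio, 2)))
    return flagged


# Hypothetical pods: (name, CPU request in cores, average CPU used in cores)
pods = [("api", 2.0, 0.3), ("worker", 1.0, 0.8), ("cron", 0.0, 0.1)]
print(overrequested(pods))
```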

To see utilization for a node group, and from there where pod requests diverge from actual usage:

clanker ask "what's the average CPU utilization of my EKS node group?"

Clanker Cloud queries the Kubernetes metrics API directly and returns node-by-node and pod-by-pod utilization, alongside the configured requests and limits. From there, you can identify candidates for VPA recommendations or manual request tuning.

For overallocated node pools: an EKS node group with ten nodes when four would suffice is a common staging/dev environment pattern. Kubernetes right-sizing in these environments can yield 50–60% reductions in node group costs.

RDS and Database Right-Sizing

RDS instances are the most reliably forgotten category of overprovisioned resources. The pattern is predictable: someone launches a db.r5.2xlarge for a data migration or a high-traffic launch, the workload stabilizes, and the instance type stays unchanged indefinitely.

Candidates for right-sizing typically show:

Average CPU below 10% over a 30-day period
Storage utilization below 50%
Low connection counts relative to the instance's connection limit

The important caveat with databases: do not right-size on average metrics alone. Some databases have workloads with high p99 spikes — batch jobs, end-of-month reporting, or irregular query bursts — that would breach a smaller instance's limits even though the average looks idle. Before changing an RDS instance class, review the p99 CPU and IOPS metrics alongside the averages.
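To make the average-versus-p99 point concrete, here is a small sketch that flags a database as a downsizing candidate only when both numbers are low. The 50% p99 limit is an assumed cutoff rather than a recommendation, and the percentile uses the simple nearest-rank method on a non-empty list of samples.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def safe_to_downsize(cpu_samples, avg_limit=10.0, p99_limit=50.0):
    """True only if both the average and the p99 CPU sit under their limits."""
    average = sum(cpu_samples) / len(cpu_samples)
    return average < avg_limit and percentile(cpu_samples, 99) < p99_limit


# Hypothetical CPU samples (percent): both have a low average.
steady = [4.0] * 100
spiky = [3.0] * 97 + [95.0] * 3  # rare batch-job spikes

print(safe_to_downsize(steady))
print(safe_to_downsize(spiky))
```

The spiky series averages under 6% CPU, so an average-only check would wave it through. The p99 check is what catches it.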

Other RDS waste patterns:

Multi-AZ in dev/staging: Multi-AZ roughly doubles the cost of an RDS instance. For non-production environments with no SLA requirement, single-AZ is almost always appropriate.
Abandoned read replicas: Read replicas that haven't served meaningful traffic in months are pure overhead. Check DatabaseConnections in CloudWatch for each replica — a replica at zero connections for 90 days is a candidate for removal.
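The replica check reduces to "no connections on any day in the window." A minimal sketch, assuming you have one peak DatabaseConnections value per day for each replica, oldest first (the helper name is hypothetical):

```python
def is_abandoned_replica(daily_peak_connections, min_days=90):
    """True when the last min_days days all show zero connections."""
    if len(daily_peak_connections) < min_days:
        return False  # not enough history to decide
    recent = daily_peak_connections[-min_days:]
    return all(count == 0 for count in recent)


print(is_abandoned_replica([0] * 120))
print(is_abandoned_replica([0] * 60 + [3] + [0] * 59))
```

Using the daily peak rather than the daily average avoids missing a replica that serves a short burst of queries once a day.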

Staging Environment Scheduling

If there is one right-sizing win that delivers an immediate, visible return, it is staging environment scheduling. A full staging stack running 24/7 is billed around the clock, just like production, yet it is used for perhaps 8 hours a day, 5 days a week.

The math: a staging stack that costs $500/month running 24/7 runs about 730 hours/month. Scheduled to run only during business hours (8am–7pm Monday–Friday = 55 hours/week, or about 239 hours/month), the same stack costs about $164/month, assuming cost scales with running hours. That's a 67% reduction in staging infrastructure cost. Even a more conservative weekday schedule (start at 7am, shut down at 10pm) delivers around 55% savings.
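To check a schedule's savings for yourself, compare the hours it keeps the stack running against the 168 hours in a week. The sketch below assumes cost scales linearly with running hours, which ignores storage and other charges that continue while instances are stopped.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def running_fraction(hours_per_day, days_per_week):
    """Share of the week the stack is up under the schedule."""
    return hours_per_day * days_per_week / HOURS_PER_WEEK


def scheduled_cost(full_monthly_cost, hours_per_day, days_per_week):
    """Estimated monthly cost when the stack only runs on schedule."""
    return full_monthly_cost * running_fraction(hours_per_day, days_per_week)


def savings_pct(hours_per_day, days_per_week):
    """Percentage saved relative to running 24/7."""
    return 100 * (1 - running_fraction(hours_per_day, days_per_week))


# 8am-7pm on weekdays is 11 hours a day, 5 days a week.
print(f"${scheduled_cost(500, 11, 5):.2f}/month, {savings_pct(11, 5):.0f}% saved")
# 7am-10pm on weekdays is 15 hours a day, 5 days a week.
print(f"${scheduled_cost(500, 15, 5):.2f}/month, {savings_pct(15, 5):.0f}% saved")
```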

To generate this schedule in Clanker Cloud:

clanker ask --maker "create a schedule to stop my staging EC2 instances outside business hours"

Maker mode returns a plan — the specific instances, the proposed schedule, the estimated savings — for your review before anything is applied. Nothing changes until you approve it. That read-first/act-second pattern is consistent across all Clanker Cloud recommendations: it surfaces the opportunity, you decide.

Orphaned Resource Cleanup

Orphaned resources are infrastructure debris: things left behind when the primary resource they were attached to was deleted. They don't cause problems, so nobody notices them. They just accumulate cost.

The main categories:

Old EBS snapshots: Snapshots from instances that no longer exist are commonly left in accounts for years. At $0.05/GB-month for standard-tier snapshot storage, a few hundred unneeded 100GB snapshots add up quickly. Review the snapshot list and remove those where the source instance no longer exists.

Unused Elastic IPs: AWS charges $0.005/hr per Elastic IP that is allocated but not associated with a running instance. That's $3.65/month each. It sounds small, but accounts with a long history often have dozens of unused EIPs from past experiments, old VPNs, or decommissioned services.

Unused Load Balancers: An Application Load Balancer costs a base rate of about $0.0225/hr (us-east-1) regardless of traffic, roughly $16/month, with capacity-unit (LCU) charges on top. A load balancer with no healthy targets and no traffic for 60+ days is a safe candidate for removal.

Old AMIs: AMIs themselves don't cost money to store (the associated snapshots do). More importantly, an account with hundreds of stale AMIs is a hygiene problem: it creates confusion about which image is the current baseline and can slow provisioning workflows that scan the AMI list.

clanker ask "show me Elastic IPs not associated with a running instance"
clanker ask "show me load balancers with no healthy targets in the last 30 days"
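Individually these charges are small, so it helps to total them. The sketch below is a rough estimator using the Elastic IP and snapshot rates quoted above, with a 730-hour month as an assumption. Snapshot storage is incremental, so treating every snapshot as its full volume size gives an upper bound rather than an exact bill.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def hourly_resource_cost(count, hourly_rate):
    """Monthly cost of `count` resources billed at a flat hourly rate."""
    return count * hourly_rate * HOURS_PER_MONTH


def snapshot_storage_cost(total_gb, price_per_gb_month=0.05):
    """Upper-bound monthly cost for the given amount of snapshot data."""
    return total_gb * price_per_gb_month


# Hypothetical account: 24 idle Elastic IPs and 200 stale 100GB snapshots.
eip_cost = hourly_resource_cost(count=24, hourly_rate=0.005)
snapshot_cost = snapshot_storage_cost(total_gb=200 * 100)

print(f"Elastic IPs: ${eip_cost:.2f}/month")
print(f"Snapshots:   up to ${snapshot_cost:.2f}/month")
```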

Automating Detection with AI Agents

Manual right-sizing audits work, but they degrade over time. The sustainable approach is making detection automatic so that idle resources surface without anyone having to remember to look.

Two patterns work well here:

Weekly right-sizing report via MCP: Clanker Cloud exposes an MCP interface that lets agents trigger utilization checks and receive right-sizing recommendations programmatically. Configuring a weekly agent run — through OpenClaw's HEARTBEAT.md or any scheduler — produces a structured report of idle EC2 instances, overprovisioned node pools, and orphaned resources without manual intervention. See the AI agents integration page for setup details.

Pre-deploy idle check with Claude Code: Before scaling up capacity, an automated check can ask whether idle resources already exist that could serve the need. A Claude Code workflow that runs clanker ask "are there any stopped EC2 instances matching this workload profile?" before a scale-up event catches cases where a new instance would duplicate existing stopped capacity. It's a lightweight gate that prevents the overprovisioning pattern from repeating.

These integrations are documented in the Clanker Cloud docs and are available starting with the Pro plan.

FAQ

How do I find underutilized EC2 instances in AWS?

The most direct method is querying CloudWatch CPUUtilization metrics with a 14- or 30-day window. You're looking for instances where the average stays below 10% for the entire period. In Clanker Cloud, you can do this with a single natural language query: clanker ask "show me EC2 instances with average CPU below 10% in the last 30 days". In the native AWS console, you can use the Compute Optimizer service, which provides right-sizing recommendations based on CloudWatch data, though it requires enabling the service and waiting for it to collect sufficient data.

What is cloud right-sizing and how do I do it?

Cloud right-sizing is the process of matching the capacity you've provisioned to the capacity your workloads actually use. An overprovisioned instance has more CPU, memory, or storage than the workload ever consumes. Right-sizing means moving it to a smaller, cheaper resource class without degrading performance. The process involves collecting utilization data over a representative period (30 days minimum), identifying instances or services running well below capacity, reviewing p99 metrics to rule out bursty workloads, and then changing the instance type — typically during a maintenance window for production resources.

How much money can I save by right-sizing my cloud resources?

Industry benchmarks from AWS and third-party cloud cost management vendors consistently put cloud waste at 30–40% of total spend for organizations that haven't done systematic right-sizing. For a team spending $10,000/month on cloud infrastructure, that's $3,000–$4,000/month in recoverable spend. The largest single wins typically come from staging environment scheduling (typically more than half of staging costs), RDS instance right-sizing (often 40–60% reduction for forgotten databases), and EC2 instance downsizing for consistently low-utilization workloads.

How do I detect idle resources across AWS and Kubernetes?

For AWS, CloudWatch metrics are the primary data source — specifically CPUUtilization, NetworkIn/Out, and DatabaseConnections for RDS. For Kubernetes, the metrics API (via metrics-server or Prometheus) provides node and pod-level utilization data. Idle cloud resource detection requires querying both sources, correlating the results, and reviewing them against the configured resource requests. Clanker Cloud queries both AWS and Kubernetes metrics through a single interface, so you can ask about EC2 idle resources and EKS node utilization in the same session without switching between consoles or writing metric queries manually.

Start Finding Waste Today

Right-sizing is not a one-time project. Infrastructure provisioning decisions accumulate faster than they get reviewed. The teams that control cloud costs long-term are the ones that make idle resource detection a standing practice rather than a quarterly scramble.

Create a free Clanker Cloud account and run your first idle resource query in under five minutes. Or see a live demo of how Clanker Cloud queries your real infrastructure and surfaces right-sizing recommendations.

Next step

Run the cost check against your own infrastructure

Download the desktop app, keep credentials local, and ask Clanker Cloud to connect spend, topology, and recent changes across the providers you already use.

Clanker Cloud Editorial Team writes about local-first infrastructure, multi-cloud operations, AI-assisted incident response, and safer workflows for builders and infrastructure teams.