Debugging AWS Infrastructure: The Hard Way vs. Asking Your AI Workspace

Debug AWS infrastructure faster. See how CloudWatch, IAM, EC2, ECS, and cost issues are investigated the traditional way — then with plain English AI queries.

AWS has over 200 services. When something breaks in production, you're suddenly juggling CloudWatch Logs Insights queries you have to look up every time, AWS CLI flags buried in documentation, IAM policies three clicks deep in the console, and billing reports that tell you what happened last month. The tooling is powerful, and AWS gives you everything you need, but the cognitive surface area is enormous.

This article walks through five of the most common AWS troubleshooting scenarios. For each one, we'll show the standard investigation path: the real commands, the console navigation, the places you look. Then we'll show what the same investigation looks like when you can just ask.


Scenario 1: A Service Is Timing Out

The Hard Way — CloudWatch, ALB Logs, EC2 Metrics

A service starts returning 504s. You open CloudWatch, navigate to Logs Insights, and try to remember whether the query syntax uses filter or WHERE. You write something like:

fields @timestamp, @message
| filter @message like /error|timeout|5\d\d/
| sort @timestamp desc
| limit 50
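
The same query can also be run from a terminal instead of the console. The sketch below assumes the AWS CLI is installed and configured; the function name run_insights_query and its arguments are hypothetical, and the 30-attempt polling limit is an arbitrary choice.

```shell
# Run the error/timeout query against one log group and print the results as JSON.
# Usage: run_insights_query <log-group-name> [minutes-back]
run_insights_query() {
  local log_group="$1" minutes="${2:-60}"
  local end start query_id status attempt
  end=$(date +%s)
  start=$((end - minutes * 60))

  # start-query is asynchronous: it returns a query ID that has to be polled.
  query_id=$(aws logs start-query \
    --log-group-name "$log_group" \
    --start-time "$start" \
    --end-time "$end" \
    --query-string 'fields @timestamp, @message | filter @message like /error|timeout|5\d\d/ | sort @timestamp desc | limit 50' \
    --query 'queryId' --output text) || return 1

  # Poll up to 30 times, one second apart, until the query finishes.
  attempt=0
  while [ "$attempt" -lt 30 ]; do
    status=$(aws logs get-query-results --query-id "$query_id" \
      --query 'status' --output text) || return 1
    case "$status" in
      Complete)
        aws logs get-query-results --query-id "$query_id" \
          --query 'results' --output json
        return 0 ;;
      Failed|Cancelled|Timeout)
        echo "query ended with status: $status" >&2
        return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep 1
  done
  echo "query did not complete in time" >&2
  return 1
}
```

For example, run_insights_query /my/app/log-group 30 would cover the last 30 minutes of a log group with that (placeholder) name.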

You get results from the application log group. Then you switch tabs to check ALB access logs — stored in S3, so you either query them through Athena (another service, another schema to remember) or you download and grep manually.

Meanwhile, you open a second browser tab to check EC2 CPU utilization in CloudWatch Metrics. Different namespace. Different console section. You set the time range again because CloudWatch doesn't carry it across views. You check memory — wait, CloudWatch doesn't collect memory by default. You need the CloudWatch agent for that, which you may or may not have installed.
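
The CPU check can be done without leaving the terminal. This is a minimal sketch, assuming the AWS CLI is configured and the instance uses basic monitoring (5-minute datapoints); the function name cpu_last_hour is hypothetical. It does not cover memory, which still depends on the CloudWatch agent being installed.

```shell
# Print average and maximum CPU utilization for one instance over the last hour.
# Usage: cpu_last_hour <instance-id>
cpu_last_hour() {
  local instance_id="$1"
  local end start
  end=$(date +%s)
  start=$((end - 3600))

  # Timestamps are passed as Unix epoch seconds; --period 300 asks for
  # one datapoint per 5 minutes.
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=$instance_id" \
    --start-time "$start" \
    --end-time "$end" \
    --period 300 \
    --statistics Average Maximum \
    --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp,Average,Maximum]' \
    --output table
}
```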

Now check security groups:

aws ec2 describe-security-groups \
  --filters "Name=group-name,Values=my-service-sg" \
  --query "SecurityGroups[*].IpPermissions"

Cross-reference port 443 and 8080 rules. Check the target group health in the load balancer. Somewhere in all of this is the answer.
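
Target group health can be checked from the CLI as well. A rough sketch, assuming you already know the target group ARN; the function name unhealthy_targets is hypothetical.

```shell
# List targets in a target group whose health state is anything other than "healthy".
# Usage: unhealthy_targets <target-group-arn>
unhealthy_targets() {
  local target_group_arn="$1"
  aws elbv2 describe-target-health \
    --target-group-arn "$target_group_arn" \
    --query "TargetHealthDescriptions[?TargetHealth.State!='healthy'].[Target.Id,Target.Port,TargetHealth.State,TargetHealth.Reason]" \
    --output text
}
```

Empty output means every registered target is currently reported healthy, which points the investigation back toward the application or the instances themselves.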

Total time: 30–90 minutes, depending on how well you remember your way around the AWS console.

The Clanker Cloud Way

"Which EC2 instances in us-east-1 are showing high latency or errors in the last hour?"

Clanker Cloud queries CloudWatch, EC2, and the ALB simultaneously and returns a correlated answer: which instances are affected, what error patterns appear in the logs, whether CPU or memory is spiking, and whether any target group health checks are failing. One question, one response — no tab switching, no query syntax to reconstruct from memory.

If you want to drill deeper: "Show me the CloudWatch error logs for instance i-0abc1234 from the last 30 minutes." It fetches and summarizes them in plain English, with the raw log entries available if you want them.


Scenario 2: Permission Denied Errors

The Hard Way — IAM Policy Simulator and simulate-principal-policy

Your Lambda function is failing with AccessDenied on S3. Time to figure out why.

First, find the Lambda execution role:

aws lambda get-function-configuration \
  --function-name my-function \
  --query "Role"

Then list attached policies on that role:

aws iam list-attached-role-policies \
  --role-name my-function-role

Now get the inline policies:

aws iam list-role-policies \
  --role-name my-function-role
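
Listing policy names is only half the job, because the two list commands do not return the policy documents themselves. The sketch below fetches them; it assumes the AWS CLI is configured, and the function name show_role_policies is hypothetical.

```shell
# Print every inline and managed policy document attached to a role.
# Usage: show_role_policies <role-name>
show_role_policies() {
  local role="$1" name arn version

  # Inline policies are stored on the role itself.
  for name in $(aws iam list-role-policies --role-name "$role" \
      --query 'PolicyNames[]' --output text); do
    echo "== inline: $name"
    aws iam get-role-policy --role-name "$role" --policy-name "$name" \
      --query 'PolicyDocument' --output json
  done

  # Managed policies are versioned, so look up the default version first.
  for arn in $(aws iam list-attached-role-policies --role-name "$role" \
      --query 'AttachedPolicies[].PolicyArn' --output text); do
    echo "== managed: $arn"
    version=$(aws iam get-policy --policy-arn "$arn" \
      --query 'Policy.DefaultVersionId' --output text)
    aws iam get-policy-version --policy-arn "$arn" --version-id "$version" \
      --query 'PolicyVersion.Document' --output json
  done
}
```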

Check each one. You can use the IAM Policy Simulator in the console, or run simulate-principal-policy from the CLI:

aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/my-function-role \
  --action-names s3:GetObject s3:PutObject \
  --resource-arns arn:aws:s3:::my-bucket/*
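
The raw simulator output is verbose, so it can help to reduce it to one decision per action. A minimal sketch, assuming the AWS CLI is configured; simulate_s3_access and denied_actions are hypothetical helper names.

```shell
# Print "<action> <decision>" for the two S3 actions, one per line.
# Usage: simulate_s3_access <role-arn> <resource-arn>
simulate_s3_access() {
  local role_arn="$1" resource_arn="$2"
  aws iam simulate-principal-policy \
    --policy-source-arn "$role_arn" \
    --action-names s3:GetObject s3:PutObject \
    --resource-arns "$resource_arn" \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
    --output text
}

# Read "<action> <decision>" lines on stdin and keep only the actions
# whose decision is not "allowed" (that is, implicitDeny or explicitDeny).
denied_actions() {
  awk '$2 != "allowed" { print $1 }'
}
```

Piping one into the other, as in simulate_s3_access "$ROLE_ARN" "$RESOURCE_ARN" | denied_actions, leaves just the actions that need attention.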

If the result is implicitDeny, you check whether there's a resource-based policy on the S3 bucket itself blocking access. If you're in an AWS Organization, there may be an SCP limiting what IAM roles can do — and SCPs live in a completely separate part of the console.

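Permissions boundaries? Those cap what the role's identity-based policies can grant: an action is allowed only if both the policy and the boundary allow it, and an explicit deny anywhere still wins. Better check for those too.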
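
Two of these extra layers can be inspected with one call each. The sketch below assumes the AWS CLI is configured and that your credentials may read the role and the bucket policy; the function name check_extra_layers is hypothetical. SCPs are not covered here because they are managed through AWS Organizations.

```shell
# Show the role's permissions boundary (if any) and the bucket's resource policy (if any).
# Usage: check_extra_layers <role-name> <bucket-name>
check_extra_layers() {
  local role="$1" bucket="$2"

  # Prints "None" when the role has no permissions boundary.
  echo "permissions boundary:"
  aws iam get-role --role-name "$role" \
    --query 'Role.PermissionsBoundary.PermissionsBoundaryArn' --output text

  # get-bucket-policy returns an error when the bucket has no policy attached.
  echo "bucket policy:"
  aws s3api get-bucket-policy --bucket "$bucket" \
    --query 'Policy' --output text \
    || echo "(no bucket policy, or not permitted to read it)"
}
```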

This investigation can take 20 minutes for a straightforward case or two hours if there are multiple policy layers involved.

The Clanker Cloud Way

"Why is my Lambda function getting AccessDenied on S3?"

Clanker Cloud inspects the Lambda execution role, walks the full IAM policy chain — attached policies, inline policies, resource-based policies, permission boundaries — and surfaces the specific gap in plain English: "The execution role my-function-role has s3:GetObject but is missing s3:PutObject on arn:aws:s3:::my-bucket/*. The S3 bucket policy also restricts access to specific VPC endpoints."

Then: "What do I need to add to fix it?" gets you the exact policy statement to add, ready to review before any change is made.


Scenario 3: Cost Spike Investigation

The Hard Way — Cost Explorer, CloudTrail, Manual Cross-Reference

Your AWS bill this week is 40% higher than last week. You open Cost Explorer, set the date range, group by service, switch to daily granularity, and start looking for which service jumped. Found it — EC2 costs doubled.

Now filter Cost Explorer by EC2, group by usage type. Is it On-Demand hours? Data transfer? EBS volumes?
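
Both Cost Explorer views are also available through the CLI. This is a sketch under a few assumptions: Cost Explorer is enabled on the account, dates are given as YYYY-MM-DD with an exclusive end date, and the service name you pass matches the value Cost Explorer reports for that service. The function name cost_breakdown is hypothetical.

```shell
# Daily unblended cost, grouped by service, or by usage type within one service.
# Usage: cost_breakdown <start-date> <end-date> [service-name]
cost_breakdown() {
  local start="$1" end="$2" service="${3:-}"

  if [ -z "$service" ]; then
    # First pass: which service jumped?
    aws ce get-cost-and-usage \
      --time-period "Start=$start,End=$end" \
      --granularity DAILY \
      --metrics UnblendedCost \
      --group-by Type=DIMENSION,Key=SERVICE
  else
    # Second pass: which usage type inside that service?
    aws ce get-cost-and-usage \
      --time-period "Start=$start,End=$end" \
      --granularity DAILY \
      --metrics UnblendedCost \
      --filter "{\"Dimensions\":{\"Key\":\"SERVICE\",\"Values\":[\"$service\"]}}" \
      --group-by Type=DIMENSION,Key=USAGE_TYPE
  fi
}
```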

Once you've narrowed it to On-Demand instance hours in us-east-1, you need to figure out what changed. Switch to CloudTrail and search for RunInstances events in the same time window:

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
  --start-time 2025-07-07T00:00:00Z \
  --end-time 2025-07-14T00:00:00Z

Cross-reference the instance IDs from CloudTrail with currently running instances:

aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query "Reservations[*].Instances[*].[InstanceId,InstanceType,LaunchTime,Tags]" \
  --output table

Look for instances launched around the time the cost jumped. Check if they have Name tags. Check if they're part of an Auto Scaling group that wasn't supposed to scale. Check whether a load test spun up resources that never got cleaned up.
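
The Auto Scaling check can be scripted once you have a list of suspect instance IDs. A minimal sketch, assuming the AWS CLI is configured and no more than 50 IDs are passed per call; asg_for_instances and unmanaged_instances are hypothetical names.

```shell
# Print "<instance-id> <asg-name> <lifecycle-state>" for instances that belong to an ASG.
# Usage: asg_for_instances <instance-id>...
asg_for_instances() {
  aws autoscaling describe-auto-scaling-instances \
    --instance-ids "$@" \
    --query 'AutoScalingInstances[].[InstanceId,AutoScalingGroupName,LifecycleState]' \
    --output text
}

# Print the instance IDs from the arguments that are NOT in any Auto Scaling group.
# Usage: unmanaged_instances <instance-id>...
unmanaged_instances() {
  local managed id
  managed=$(asg_for_instances "$@" | awk '{ print $1 }')
  for id in "$@"; do
    echo "$managed" | grep -qxF "$id" || echo "$id"
  done
}
```

Instances reported by unmanaged_instances were launched by hand or by some other tool, which makes them the likelier leftovers from a test.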

This is doable. It's just slow — and if the root cause spans multiple services, each service needs its own investigation pass.

The Clanker Cloud Way

"Why did our AWS bill go up 40% this week?"

Clanker Cloud pulls cost data, identifies the services with the largest week-over-week increases, and cross-references recent CloudTrail events to surface what changed. The response: "EC2 costs in us-east-1 increased $340 this week. CloudTrail shows 12 new m5.xlarge instances launched on July 9th under the performance-testing Auto Scaling group. 8 of those instances are still running."

Follow-up: "Are those instances still needed?" — Clanker checks tags, last connection time, and whether any active deployments reference them.

For teams shipping fast, getting this kind of cost visibility in minutes instead of hours is the difference between catching a runaway resource the same day and finding it on next month's invoice.


Scenario 4: Deployment Failed

The Hard Way — CodeDeploy, ECS, ECR, CloudFormation

Your production ECS deployment failed. Where do you start?

Check CodeDeploy first:

aws deploy list-deployments \
  --application-name my-app \
  --deployment-group-name production \
  --include-only-statuses Failed \
  --query "deployments[0]"

Get the deployment details:

aws deploy get-deployment \
  --deployment-id d-ABCDEF123

Check ECS service events:

aws ecs describe-services \
  --cluster production \
  --services my-service \
  --query "services[0].events[:10]"
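
Service events tell you that tasks failed to start, but the reason usually sits on the stopped tasks themselves. The sketch below pulls it out; it assumes the AWS CLI is configured and that the tasks stopped recently enough to still be listed, and the function name stopped_task_reasons is hypothetical.

```shell
# Print the stop code and stop reason for a service's recently stopped tasks.
# Usage: stopped_task_reasons <cluster> <service>
stopped_task_reasons() {
  local cluster="$1" service="$2" tasks
  tasks=$(aws ecs list-tasks --cluster "$cluster" --service-name "$service" \
    --desired-status STOPPED --query 'taskArns[]' --output text) || return 1

  if [ -z "$tasks" ]; then
    echo "no stopped tasks found"
    return 0
  fi

  # $tasks is left unquoted on purpose so each task ARN becomes its own argument.
  aws ecs describe-tasks --cluster "$cluster" --tasks $tasks \
    --query 'tasks[].[taskArn,stopCode,stoppedReason]' --output text
}
```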

Look at the failed task definition — maybe the container image tag doesn't exist in ECR:

aws ecr describe-images \
  --repository-name my-app \
  --query "imageDetails[?imageTags!=null].[imageTags,imagePushedAt]" \
  --output table

Check the target group health to see if the new containers failed health checks before being deregistered:

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:...

If it's a CloudFormation-backed deployment, check the stack events for rollback reasons:

aws cloudformation describe-stack-events \
  --stack-name production-stack \
  --query "StackEvents[?contains(ResourceStatus, 'FAILED')].[Timestamp,LogicalResourceId,ResourceStatusReason]"

By the time you've connected all these threads, you understand what failed. But you've touched five services to get there.

The Clanker Cloud Way

"What happened to the last deployment in the production ECS cluster?"

Clanker Cloud correlates events across ECS, ECR, CodeDeploy, and the target group in one pass: "The deployment d-ABCDEF123 failed at 14:32 UTC. The ECS task failed to start — the container image my-app:v1.4.2 referenced in the task definition does not exist in ECR. The most recent image in the repository is my-app:v1.4.1, pushed July 13th. The target group shows no healthy targets as a result."

Now you know the exact fix without opening five AWS console tabs.


Scenario 5: Security Misconfiguration Found

The Hard Way — Security Hub, Security Groups, S3 Bucket Policies

AWS Security Hub gives you findings, but working through them is manual. You open a finding, navigate to the affected resource, and inspect it directly.

For security group misconfigurations, the CLI gives you what you need — but parsing the output is its own task:

aws ec2 describe-security-groups \
  --filters "Name=ip-permission.cidr,Values=0.0.0.0/0" \
  --query "SecurityGroups[*].[GroupId,GroupName,IpPermissions]" \
  --output json

A large AWS account can have hundreds of security groups across multiple regions. You'd need to loop through regions and parse the output to get a complete picture.
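
That region loop might look something like the following. It is a sketch, assuming the AWS CLI is configured and your credentials can call EC2 in every enabled region; the function name open_security_groups_all_regions is hypothetical.

```shell
# Print "<region> <group-id> <group-name>" for every security group that has
# a rule referencing 0.0.0.0/0, across all regions enabled for the account.
open_security_groups_all_regions() {
  local region
  for region in $(aws ec2 describe-regions \
      --query 'Regions[].RegionName' --output text); do
    aws ec2 describe-security-groups \
      --region "$region" \
      --filters "Name=ip-permission.cidr,Values=0.0.0.0/0" \
      --query 'SecurityGroups[].[GroupId,GroupName]' \
      --output text |
      # Prefix each non-empty result line with the region it came from.
      awk -v r="$region" 'NF { print r "\t" $0 }'
  done
}
```

The output still needs a second look at which ports each group exposes; this only narrows the list of groups to inspect.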

For S3 public access, you check bucket-level Block Public Access settings:

aws s3api get-public-access-block \
  --bucket my-bucket

And the bucket policy and ACLs separately. With dozens of buckets, that's a lot of commands.
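
The Block Public Access check can be looped over every bucket. The sketch below assumes the AWS CLI is configured and your credentials can list buckets and read their settings; the function name buckets_without_full_block is hypothetical. It looks at bucket-level settings only, so a flagged bucket may still be covered by account-level Block Public Access, and bucket policies and ACLs still need their own check.

```shell
# Print buckets where at least one Block Public Access flag is off,
# or where no bucket-level configuration exists at all.
buckets_without_full_block() {
  local bucket flags
  for bucket in $(aws s3api list-buckets --query 'Buckets[].Name' --output text); do
    # The call fails when the bucket has no Block Public Access configuration.
    flags=$(aws s3api get-public-access-block --bucket "$bucket" \
      --query 'PublicAccessBlockConfiguration.[BlockPublicAcls,IgnorePublicAcls,BlockPublicPolicy,RestrictPublicBuckets]' \
      --output text 2>/dev/null) || flags="not-configured"

    # Text output renders booleans as True/False, so any "False" means a gap.
    case "$flags" in
      *False*|not-configured) printf '%s\t%s\n' "$bucket" "$flags" ;;
    esac
  done
}
```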

VPC Flow Logs, if you've enabled them, tell you what traffic is actually reaching those open ports, but they're stored in CloudWatch Logs or S3 and require Logs Insights queries or Athena to analyze.

The Clanker Cloud Way

"Show me all security groups in us-east-1 with 0.0.0.0/0 inbound rules"

Response: a clean list of security group names, IDs, and which ports are exposed — sorted by risk (port 22 and 3389 first).

"Are any of our S3 buckets publicly accessible?"

Response: flags which buckets have Block Public Access disabled, which have bucket policies granting public access, and which have public ACLs, categorized so you can separate the actual risks from the buckets that are public intentionally (a static website bucket, for example).

This is also where Clanker Cloud's autonomous security agents add ongoing value: they continuously scan for new misconfigurations and surface findings before they appear in a Security Hub report or, worse, an incident.


How Clanker Cloud Works with AWS

Clanker Cloud is a local-first desktop application. Your AWS credentials stay on your machine — no credential handover to a hosted SaaS layer, no third-party copy of your access keys.

Credential support: Works with standard AWS credential profiles, IAM roles, SSO profiles configured via aws configure sso, and multi-account setups using role assumption chains. If your terminal can run aws commands, Clanker Cloud can use the same configuration.

BYOK model selection: Clanker Cloud uses your own AI model API keys with no token markup. For fully air-gapped workflows — nothing leaves your machine — run Gemma 4 locally. For agent workflows that need stronger reasoning, use Claude Code, OpenAI Codex, or Hermes. The choice is yours, and you can switch per-session.

Read-first by default: Every query Clanker Cloud runs is read-only unless you explicitly enable maker mode. In maker mode, Clanker generates a plan that a human reviews and approves before any change is applied. This is the same safety model that makes it useful for production environments: you get the full diagnostic picture without any accidental writes.

Multi-cloud and multi-tool: The same surface that queries your AWS infrastructure also works with GCP, Kubernetes, Cloudflare, GitHub, Hetzner, and DigitalOcean. Cross-cloud investigations — "which regions do we have resources in across AWS and GCP?" — are a single question, not a multi-console exercise.

If your team uses Claude Code or Codex for AI-assisted development, Clanker Cloud supports MCP-based integration so your coding agent can query live infrastructure context directly. Read more about AI DevOps for teams or check the full documentation.


AWS Resources Worth Bookmarking

These are the official AWS docs that come up repeatedly for the scenarios above:

  • CloudWatch Logs Insights query syntax — the complete reference for filter, stats, parse, and fields commands. Worth bookmarking because the syntax is not intuitive and the autocomplete in the console only helps so much.

  • IAM Policy Simulator — the console tool for testing whether a given principal can perform an action. Faster than simulate-principal-policy for one-off checks, but the CLI version is better for scripting.

  • AWS Cost Explorer API reference — useful when you need to pull cost data programmatically rather than clicking through the console. The GetCostAndUsage endpoint with GroupBy and Filter parameters covers most cost investigation use cases.

  • AWS Security Hub findings format — helps when you're writing automation to process Security Hub findings or integrating with a SIEM.

  • VPC Flow Logs query examples — CloudWatch Logs Insights example queries for analyzing traffic patterns, rejected connections, and bandwidth by source.


Conclusion

AWS debugging isn't hard because AWS is badly designed. It's hard because AWS is optimized for capability and flexibility, not for fast investigation. Every service has its own console section, its own CLI namespace, its own log format. The information you need to diagnose an incident is spread across six services and requires you to hold the mental model together while navigating between them.

The "hard way" commands in this article are real and they work. Engineers debug AWS this way every day. But the cognitive overhead (remembering the query syntax, the CLI flags, the console navigation paths) is a tax on your actual problem-solving.

Clanker Cloud doesn't replace AWS. It sits on top of it and answers questions. The same AWS credentials, the same data, the same underlying services — with a plain-English interface instead of a fragmented console.

Try Clanker Cloud free — one-minute setup, your credentials stay local. Or see a demo first.


Frequently Asked Questions

How do I debug AWS infrastructure issues?

Start by identifying which layer is failing: application (CloudWatch Logs), compute (EC2/ECS metrics), network (security groups, ALB target health), or permissions (IAM policies). Use CloudWatch Logs Insights for log analysis, aws ec2 describe-instances and aws ecs describe-services for resource state, and the IAM Policy Simulator or aws iam simulate-principal-policy for permission issues. The challenge is that each service requires separate console navigation or CLI commands, and connecting findings across services is manual.

What is the best tool for AWS troubleshooting?

For native AWS troubleshooting, CloudWatch Logs Insights handles log analysis, AWS X-Ray handles distributed tracing, and AWS Security Hub aggregates security findings. For faster investigation that correlates across services, Clanker Cloud lets you ask plain-English questions against your live AWS environment — covering CloudWatch, IAM, EC2, ECS, S3, and cost data simultaneously without switching between consoles.

How do I find the cause of an AWS cost spike?

Open AWS Cost Explorer, set daily granularity, and group by service to find which service increased. Then group by usage type within that service to pinpoint the specific usage category (On-Demand hours, data transfer, storage). Cross-reference with CloudTrail using aws cloudtrail lookup-events filtered to RunInstances, CreateBucket, or other creation events in the same time window. Check for Auto Scaling groups that scaled out unexpectedly or resources that were spun up for testing and not terminated.

Can AI debug AWS permission errors?

Yes — tools like Clanker Cloud can inspect an IAM role's full policy chain (attached policies, inline policies, resource-based policies, permission boundaries, and SCPs) and explain in plain English why a specific action is being denied and which policy needs updating. This is faster than manually walking through the IAM console or running simulate-principal-policy for each action/resource combination. Read more about AI DevOps for teams or explore the Clanker Cloud documentation.


Clanker Cloud is built by NovLabs.ai. The open-source CLI is available at github.com/bgdnvk/clanker.

Next step

Run a local security and drift review

Use Clanker Cloud to inspect live cloud and Kubernetes state with local credentials, then review findings before any infrastructure change runs.

Download Clanker Cloud

Watch demo