Detecting Cloud Misconfigurations Automatically: Database Optimization and Build Time Issues

How to automatically detect cloud misconfigurations, database performance problems, and build time bloat before they cause production incidents.

Three categories of problems account for a disproportionate share of production incidents: security misconfigurations, database performance degradation, and build time bloat. Each develops gradually. A security group rule gets widened "temporarily." An index never gets added to a foreign key column. A Docker image balloons from 400 MB to 3 GB over six months.

None of these announce themselves. You discover a misconfigured S3 bucket during a compliance audit — or after a breach. You notice the connection pool timing out under peak traffic. You realize CI has gone from four minutes to twenty-two minutes when someone finally complains.

All three are detectable automatically, before they become incidents. This article covers the detection patterns for each and how to wire them into a unified daily health check using Clanker Cloud and OpenClaw HEARTBEAT.md.


Part 1: Automating Cloud Misconfiguration Detection

Why Manual Scanning Fails

Most teams scan their infrastructure manually, infrequently, or not at all. The problems with manual scanning are structural:

Frequency. A quarterly audit catches a problem that has existed for three months. A misconfigured IAM policy that grants s3:* to an unintended principal may have been exploitable for 90 days by the time anyone notices.

Drift. Infrastructure-as-code gives you a verified baseline at deploy time. It does not track what someone changed in the AWS console at 11 PM during an incident. Config drift between your Terraform state and actual cloud state is the norm on active teams, not the exception.

Scope. Manual reviewers check the things they know to check. Automated scanners check everything against a policy library — including the things nobody thought to verify.

The Automation Pattern

Detecting cloud misconfigurations automatically follows a consistent pattern:

  1. Establish a known-good baseline. Your first scan produces the reference state.
  2. Schedule recurring scans. Every scan produces a new snapshot.
  3. Diff against the baseline. New findings since the last clean scan are flagged immediately.
  4. Alert on drift, not just absolute state. A long-standing open port may be intentional. A port that opened yesterday is almost certainly not.

What to Scan

Effective automated misconfiguration detection covers:

  • IAM policies — overly permissive inline policies, wildcard actions, cross-account trust relationships without conditions, unused roles with attached permissions
  • Security groups and firewall rules — 0.0.0.0/0 ingress on non-HTTP ports, SSH/RDP open to the internet, rules that have been widened but never tightened
  • S3 bucket ACLs and policies — public access blocks disabled, bucket policies granting s3:GetObject to *, server-side encryption not enforced
  • Kubernetes RBAC — cluster-admin bindings to service accounts, wildcard verb grants, namespaced roles with cross-namespace access
  • Network policies — pods with no egress restriction, services exposed as type: LoadBalancer unintentionally
  • Public endpoints — RDS instances with publicly_accessible = true, ElasticSearch clusters with no VPC placement, Redis with no auth token

OpenClaw HEARTBEAT.md + Clanker Cloud MCP

Clanker Cloud's MCP integration allows AI agents to trigger and retrieve scan results programmatically. With OpenClaw HEARTBEAT.md, you configure a recurring task that runs on a schedule:

## HEARTBEAT Tasks

### daily-infra-scan
schedule: every 24 hours
task: |
  Use the clanker_cloud MCP tool to run a full security scan.
  Compare results to yesterday's report.
  If new findings exist, generate a summary with affected resources and severity.
  Write the diff to /reports/security-scan-{date}.md

The agent triggers the scan, retrieves findings via MCP, diffs against the previous report, and surfaces only net-new issues. This avoids alert fatigue from re-reporting known findings and ensures nothing new slips through unnoticed.

Clanker Cloud supports AWS, GCP, Azure, Kubernetes, Cloudflare, Hetzner, DigitalOcean, and GitHub — one scan covers your full footprint. Scan results stay on your machine. See the AI agents integration page for setup details.


Part 2: Database Performance Misconfigurations

The DB Misconfigs That Kill Production Systems

Database performance problems rarely start as emergencies. They start as slowness. A query that takes 40 ms at 100 requests per second takes 4,000 ms at 10,000 requests per second if it's doing a sequential scan on a 50 million row table. The query was always wrong; traffic revealed it.

These are the most common database misconfigurations that cause production pain:

Missing indexes on foreign keys and filter columns. Every foreign key column without an index is a sequential scan waiting to happen. The same applies to any column that appears in a WHERE, JOIN ON, or ORDER BY clause in a frequently executed query.

-- Find tables with foreign keys that have no supporting index (PostgreSQL).
-- The LIKE match is a heuristic: it can misfire when one column name is a
-- substring of another (e.g. user_id vs. super_user_id).
SELECT DISTINCT
  tc.table_name,
  kcu.column_name,
  ccu.table_name AS foreign_table
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
  ON ccu.constraint_name = tc.constraint_name
LEFT JOIN pg_indexes pi
  ON pi.tablename = tc.table_name
  AND pi.indexdef LIKE '%' || kcu.column_name || '%'
WHERE tc.constraint_type = 'FOREIGN KEY'
  AND pi.indexname IS NULL;

N+1 queries not caught until production load. One query fetches 500 rows. The ORM then fires 500 individual queries to load related records. At low traffic, this is invisible. At scale, the sequential round-trips dominate response time and flood the database with redundant work.
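
One quick way to surface N+1 suspects, assuming the pg_stat_statements extension is installed: look for statements with enormous call counts and tiny per-call durations.

-- N+1 suspects: executed far more often than anything else,
-- each call cheap on its own (column names assume PostgreSQL 13+;
-- older versions use mean_time instead of mean_exec_time)
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;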

Connection pool too small. If your application pool is configured for 10 connections and a traffic spike brings 50 concurrent requests, 40 of them queue or fail. The opposite failure exists too: PostgreSQL's max_connections defaults to 100, and small RDS instances like db.t3.micro are commonly paired with application pools sized large enough to saturate that limit under any real load.
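
A quick saturation check, assuming PostgreSQL:

-- Current connections vs. the server-side ceiling
SELECT count(*) AS current_connections,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;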

Autovacuum disabled or poorly tuned. PostgreSQL's autovacuum reclaims dead tuple space and updates statistics for the query planner. Disabled or poorly tuned autovacuum causes table bloat and degraded query plans. The symptom is queries that progressively slow down over weeks until a manual VACUUM ANALYZE is run and performance jumps back to baseline.

Check autovacuum activity:

SELECT
  relname,
  last_autovacuum,
  last_autoanalyze,
  n_dead_tup,
  n_live_tup
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;

Slow query log not enabled. Without slow query logging, you have no visibility into what's actually slow. This is not a performance problem itself — it's a visibility problem that prevents you from detecting every other performance problem.

For PostgreSQL: log_min_duration_statement = 1000 (log queries over 1 second). For MySQL/RDS: slow_query_log = 1, long_query_time = 1.
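
On self-managed PostgreSQL this takes effect without a restart; on RDS, set the same parameter in the instance's parameter group instead:

-- Log every statement slower than 1 second (value is in milliseconds)
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();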

Read replicas not configured for read-heavy workloads. Running analytics, reporting, and background jobs against the primary is a common pattern on teams that scaled faster than their architecture. The fix is routing read traffic to a replica; the problem is that nobody checks whether the routing is actually happening.
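
One direct check, assuming PostgreSQL: run this through the application's read-path connection. It returns true on a standby and false on the primary, so a false here means your "replica" traffic never left the primary.

-- Is this connection actually talking to a replica?
SELECT pg_is_in_recovery();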

Detecting DB Issues with Clanker Cloud

Clanker Cloud connects to your cloud provider's database services and allows plain-English queries against metrics and configuration. From the Clanker Cloud workspace:

  • "What are the slowest queries on my production RDS instance over the last 24 hours?" — pulls from pg_stat_statements or CloudWatch slow query logs depending on your configuration
  • "Is autovacuum running on my PostgreSQL database?" — checks pg_stat_user_tables for tables with high dead tuple counts and recent autovacuum timestamps
  • "How many active connections is my database handling right now, and how does that compare to my connection pool limit?" — surfaces connection pool saturation before it causes timeouts
  • "What indexes are missing on my users table?" — analyzes query patterns against table schema to identify high-impact missing indexes

These queries can also be embedded in an OpenClaw HEARTBEAT.md task alongside the security scan, producing a unified daily database health report without manual intervention. The AI DevOps for teams guide covers agent task patterns in more depth.


Part 3: Build Time Issues

How Builds Silently Double Over Six Months

Build time bloat degrades gradually. A build that took 4 minutes in January takes 9 minutes in July. It wasn't one change — it was 40 small changes, each adding 8–15 seconds: a new dependency, a test suite that grew, an image layer that stopped caching because a frequently-changed file was added too early in the Dockerfile.

Nobody filed a ticket. The slowdown stayed below the threshold of complaint until it became a significant drag on every deployment.

Common Causes with Concrete Examples

Bloated Docker images from missing multi-stage builds. A Node.js application that includes devDependencies in the final image because nobody converted the Dockerfile to multi-stage. A Python image that includes the full build toolchain (gcc, python3-dev) needed only for compiling a single extension.

# Before: single stage, 2.4 GB image
FROM node:20
WORKDIR /app
COPY package*.json ./
RUN npm install   # includes all devDependencies
COPY . .
RUN npm run build

# After: multi-stage, 380 MB image
FROM node:20 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev   # production deps only; devDependencies stay in the builder
COPY --from=builder /app/dist ./dist

No build cache in CI. Every CI run reinstalls all npm packages from scratch because the cache key is set to ${{ github.sha }} instead of a hash of package-lock.json. This turns a 30-second dependency install into a 4-minute one, on every push.

# Correct cache key for npm in GitHub Actions
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-npm-

Sequential test runs that should be parallelized. A test suite that runs 3,000 tests sequentially in 18 minutes. Split across four runners with --shard=1/4 through --shard=4/4, total wall-clock time drops to under 5 minutes.
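
A minimal sketch of that split in GitHub Actions, assuming a Jest 28+ suite (Playwright's --shard flag takes the same form):

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]   # four parallel runners
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx jest --shard=${{ matrix.shard }}/4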

No layer caching in Kubernetes builds. Kaniko or BuildKit builds in Kubernetes pods that start from a clean state on every run because the cache registry isn't configured.
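
For Kaniko, enabling the registry-backed layer cache comes down to two executor flags; the registry URLs below are placeholders for illustration:

# Kaniko executor args in the build pod spec (URLs are placeholders)
args:
  - --destination=registry.example.com/app:latest
  - --cache=true
  - --cache-repo=registry.example.com/app/build-cache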

GitHub Actions minutes ballooning from large actions/checkout without fetch-depth limits. A repository with 5 years of history where every job does a full clone adds 45–90 seconds to every workflow. Setting fetch-depth: 1 for jobs that don't need full history eliminates this entirely.
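
The one-line fix (recent versions of actions/checkout default to a shallow clone, so this matters most where fetch-depth: 0 was set and forgotten):

- uses: actions/checkout@v4
  with:
    fetch-depth: 1   # latest commit only; no history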

Detecting Build Time Trends with Clanker Cloud

Clanker Cloud's GitHub integration pulls workflow run data and surfaces build time trends. From the workspace:

  • "Show me the build time trend for my main service over the last 30 days" — plots workflow duration over time, surfaces the inflection points where build time increased
  • "Which GitHub Actions workflows have the highest p95 duration this week?" — identifies the worst offenders across all repositories
  • "What changed in my Dockerfile in the last 60 days?" — correlates image size changes with specific commits

The build time trend query is most useful for finding the inflection point — the specific commits or workflow changes that degraded build performance — rather than auditing changelogs manually.


The Unified Detection Workflow

All three categories — security misconfigurations, database performance issues, and build time problems — can be automated into a single daily health check.

A HEARTBEAT.md configuration that covers all three:

## HEARTBEAT Tasks

### daily-infra-health
schedule: every 24 hours at 07:00
task: |
  1. Run clanker_cloud security scan across all connected providers.
     Diff against yesterday's report. Summarize net-new findings.

  2. Query database metrics:
     - Top 10 slowest queries by average duration (pg_stat_statements)
     - Tables with dead tuple count > 50,000 (autovacuum health)
     - Current connection pool utilization vs. max_connections

  3. Pull GitHub Actions build time trends:
     - P95 build time for production workflows over last 7 days
     - Any workflows that increased more than 20% week-over-week

  Write combined report to /reports/infra-health-{date}.md
  Alert on: new critical/high security findings, queries > 5s avg,
            connection pool > 80% utilized, build time increase > 20% WoW

The result is a single report, generated every morning, covering the three categories most likely to produce a production incident before the end of the week. No dashboard to check. No manual queries to run. One file that surfaces what changed and what needs attention.

This approach is detailed further in the AI DevOps for teams documentation. For questions about what's supported, see the FAQ.


Setting It Up

Step 1: Connect Clanker Cloud to your infrastructure.

From the Clanker Cloud workspace, connect your cloud providers (AWS, GCP, Azure, Kubernetes, Cloudflare, Hetzner, DigitalOcean) and GitHub. Each integration uses read-only credentials scoped to the minimum permissions required for scanning.

Step 2: Run an initial scan.

Trigger a manual scan from the security scanning mode to establish your baseline. Review findings and mark any intentional configurations as exceptions to avoid them surfacing as findings in every subsequent report.

Step 3: Configure OpenClaw HEARTBEAT.md.

Add a daily-infra-health task to your HEARTBEAT.md file using the template above. Adjust the alert thresholds to match your environment (a connection pool at 80% utilization may be normal for some workloads; adjust accordingly).

Step 4: Connect Claude Code or your preferred AI agent.

Clanker Cloud's MCP server exposes scan, query, and trend tools that Claude Code, Codex, or Hermes (via Ollama with Gemma 4) can call directly. The agent integration documentation covers the MCP configuration for each supported AI.

Step 5: Review daily reports and investigate flagged issues.

When a report flags something — a new open security group rule, a query that jumped from 200 ms to 3 s average, a build taking twice as long — open Clanker Cloud and ask the follow-up question in plain English. Scan data stays local; queries return fast.


FAQ

How do I automatically detect cloud misconfigurations?

Scheduled scans with diff-based alerting. Connect your cloud accounts to a scanning tool, run a baseline scan, then configure recurring scans that compare results against the previous report. Alert only on net-new findings to avoid fatigue from re-reporting known state. With OpenClaw HEARTBEAT.md + Clanker Cloud MCP, this runs fully autonomously — the agent scans, diffs, and writes a summary on a schedule you define.

What are common database performance misconfigurations in AWS RDS or PostgreSQL?

Missing indexes on foreign key and filter columns (sequential scans at scale), connection pool too small for peak concurrency (queuing and timeouts under load), autovacuum disabled or poorly tuned (table bloat and degraded query plans), slow query logging not enabled (no visibility into what's slow), and pg_stat_statements not installed. For RDS specifically: publicly_accessible = true on instances that don't need internet exposure, and no read replica configured when read traffic is hitting the primary.

Why are my CI/CD builds getting slower over time?

Gradual accumulation: new dependencies, Docker layer ordering that busts the cache on every change, test suites that grew without parallelization, and cache keys tied to commit hash instead of lock file hash. No single change is the culprit — it's 40 small changes over six months. Pull build time trend data from GitHub Actions and find the inflection points, then correlate with what changed. Clanker Cloud's GitHub integration surfaces this without manual log analysis.

How do I set up automated infrastructure health checks?

Connect cloud providers and GitHub to Clanker Cloud, run an initial scan to establish a baseline, then configure a HEARTBEAT.md task covering three areas: security misconfiguration scan (diff against previous report), database performance metrics (slow queries, connection pool utilization, autovacuum health), and build time trends (workflow duration week-over-week). Full setup documentation is at docs.clankercloud.ai.


Get Started

Clanker Cloud is in beta and free to try. Connect your first cloud provider at clankercloud.ai/account and run an initial scan. The full documentation covers MCP configuration, HEARTBEAT.md patterns, and supported cloud providers.

Lite plan starts at $5/month. Pro at $20/month. Enterprise pricing is available for teams that need custom retention, SSO, or dedicated support.