Most infrastructure teams are one pager away from a bad night. A pod crashes, memory climbs past its limit, a database primary goes down — and someone has to wake up, SSH in, and figure out what happened. AIOps self-healing systems close that gap: detect the fault, diagnose it, and take corrective action automatically or with minimal human intervention.
The range is wide. At the simple end: Kubernetes liveness probes that restart a stuck container. At the sophisticated end: an AI agent that detects an OOMKill loop, queries memory trends, drafts a patch to the Deployment manifest, and posts it to Slack for one-click approval. Most teams have level one. Getting to level four requires less engineering than most people assume.
This is a concrete walkthrough of all four levels — what each covers, how to configure it, where each breaks down, and how to layer them.
Level 1: Kubernetes Native Self-Healing
Kubernetes ships with self-healing mechanisms that cost nothing beyond correct configuration. These are the baseline.
Liveness Probes
A liveness probe tells Kubernetes whether a container is alive. If the probe fails consistently, Kubernetes restarts it. This handles failures where a process is running but stuck — a deadlock, an unresponsive HTTP handler, a goroutine that never returns.
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
```
Common mistake: Setting initialDelaySeconds too low. If the probe fires before the application is ready, it kills a healthy container during startup. Set it to at least your 95th-percentile startup time. Also avoid using the same endpoint for liveness and readiness — they serve different purposes.
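One way to pick that number is to compute it from observed startup times. A minimal sketch, assuming a nearest-rank percentile and 20% headroom (both are illustrative choices, not a Kubernetes recommendation):

```python
import math

def suggested_initial_delay(startup_seconds, percentile=0.95, headroom=1.2):
    """Pick initialDelaySeconds from a sample of observed startup times.

    Takes the nearest-rank percentile, then adds headroom so the probe
    never fires before a slow-but-healthy start completes.
    """
    ordered = sorted(startup_seconds)
    rank = max(0, math.ceil(percentile * len(ordered)) - 1)
    return math.ceil(ordered[rank] * headroom)

# Example: startup times (seconds) sampled from recent deploys
samples = [8, 9, 9, 10, 11, 11, 12, 12, 13, 22]
print(suggested_initial_delay(samples))  # 27
```

The outlier at 22 seconds dominates the result, which is the point: the probe has to tolerate the slow tail, not the median.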
Readiness Probes
A readiness probe tells Kubernetes whether a container is ready to receive traffic. When it fails, Kubernetes removes the pod from the Service endpoint — traffic stops routing to it, but the pod is not restarted. This is the right behavior during startup, during a dependency outage, or during temporary overload.
```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2
```
The readiness endpoint should verify actual dependencies — database connectivity, cache availability, any downstream service the app requires.
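The shape of such an endpoint can be sketched framework-agnostically; the check names and callables below are stand-ins for real dependency pings:

```python
def readiness(checks):
    """Aggregate dependency checks into a readiness verdict.

    `checks` maps a dependency name to a zero-argument callable that
    raises on failure. Returns (http_status, detail) for a /ready handler.
    """
    failures = {}
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:
            failures[name] = str(exc)
    if failures:
        return 503, failures           # not ready: pod leaves the Service endpoints
    return 200, {"status": "ready"}

def failing_cache():
    raise RuntimeError("connection timeout")

# Stubbed dependencies; real checks would ping the database, cache, etc.
status, detail = readiness({"database": lambda: None, "cache": failing_cache})
print(status, detail)  # 503 {'cache': 'connection timeout'}
```

Returning the failing dependency names in the body makes a failed readiness check diagnosable from the probe logs alone.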
HorizontalPodAutoscaler
HPA scales replica count up when CPU or memory utilization exceeds a target and scales down when load drops.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```
Common mistake: Setting averageUtilization to 80–90. By the time HPA reacts and new pods become ready, you are already saturated. Target 50–65%.
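The sizing math behind that advice is visible in the HPA scaling rule itself. A sketch of the core formula (it omits HPA's tolerance band and stabilization windows):

```python
import math

def hpa_desired_replicas(current_replicas, current_utilization,
                         target_utilization, min_replicas=2, max_replicas=10):
    """The core HPA rule: desired = ceil(current * current/target),
    clamped to the configured replica bounds."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# At a 60% target, a spike to 90% average CPU on 4 pods asks for 6 pods
print(hpa_desired_replicas(4, 90, 60))  # 6
# At an 85% target the same spike barely scales at all
print(hpa_desired_replicas(4, 90, 85))  # 5
```

A lower target buys headroom: the controller reacts while there is still capacity to absorb load during the pod-startup lag.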
PodDisruptionBudgets
PDBs prevent Kubernetes from evicting too many pods simultaneously during node maintenance or cluster upgrades.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```
Without one, a node drain during an upgrade can take down all replicas of a small Deployment simultaneously.
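The arithmetic the eviction API applies is simple. A sketch of the minAvailable case (real PDBs also accept maxUnavailable and percentages):

```python
def allowed_disruptions(healthy_pods, min_available):
    """With minAvailable, the eviction budget is whatever healthy pods
    exceed the floor; never negative."""
    return max(0, healthy_pods - min_available)

# 3 healthy replicas, minAvailable: 2 — a drain may evict one at a time
print(allowed_disruptions(3, 2))  # 1
# If a pod is already unhealthy, the drain blocks until it recovers
print(allowed_disruptions(2, 2))  # 0
```

The second case is why minAvailable equal to the replica count is a mistake: it makes every voluntary disruption, including routine upgrades, impossible.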
Cluster Autoscaler
Cluster Autoscaler watches for pending pods that cannot be scheduled due to insufficient node capacity. When it detects them, it provisions new nodes. When nodes are underutilized for a sustained period, it drains and terminates them. This handles the scenario where HPA tries to scale but there are no nodes available. It should be enabled in any cluster that handles variable load.
These five mechanisms are free, require no additional tooling, and should be table stakes for any production Kubernetes workload.
Level 2: Platform-Level Self-Healing
Cloud platforms include self-healing at the infrastructure layer, covering failures that K8s probes cannot handle — failed EC2 instances, database primary failures, network-level routing.
Auto Scaling Groups (AWS ASG / GCP MIG)
An ASG maintains a desired instance count. If an instance fails its EC2 health check or ELB health check, the ASG terminates it and launches a replacement. This happens automatically, without any runbook or operator action.
The critical configuration: enable ELB health checks on your ASG, not just EC2 health checks. EC2 health checks only catch catastrophic instance failure; ELB health checks catch application-level failures like a process returning 500s.
The gap: ASG replaces the instance but does not diagnose why it failed. If your service is OOMKilling repeatedly, ASG will cycle through replacements indefinitely. It is a recovery mechanism, not a diagnostic one.
RDS Multi-AZ Failover
With Multi-AZ enabled, RDS maintains a synchronous standby in a separate availability zone. If the primary fails, AWS promotes the standby automatically within 60–120 seconds and updates DNS. Applications that use the RDS endpoint and implement connection retry with backoff reconnect transparently. Without retries, the failover window produces errors.
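The retry itself is a few lines. A sketch with an injected sleep so the pattern is testable; the delay schedule is illustrative, chosen so the total wait spans roughly the failover window:

```python
import time

def connect_with_backoff(connect, max_attempts=8, base_delay=1.0, sleep=time.sleep):
    """Retry a DB connect with exponential backoff. With these defaults the
    cumulative wait (1+2+4+...+64 s, about two minutes) covers a 60-120 s
    Multi-AZ failover."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Simulate a failover: the first three attempts hit the dying primary
attempts = {"n": 0}
def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 4:
        raise ConnectionError("primary unreachable")
    return "connected"

print(connect_with_backoff(flaky_connect, sleep=lambda s: None))  # connected
```

In production the connect callable would open a real connection against the RDS endpoint, which DNS repoints to the promoted standby during failover.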
Load Balancer Health Checks
ALB and GCP Cloud Load Balancing continuously health-check registered targets and stop routing to unhealthy ones. This operates at the infrastructure layer, distinct from Kubernetes readiness probes, and applies to any compute target. Default AWS ALB settings (30-second interval, 5 consecutive failures) allow up to 2.5 minutes of unhealthy routing — tighten these for latency-sensitive services.
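The 2.5-minute figure is just the product of the two settings. A sketch of the worst-case arithmetic (it ignores in-flight requests and propagation delay):

```python
def worst_case_detection(interval_s, unhealthy_threshold):
    """An already-routed target can keep receiving traffic for up to
    interval * threshold seconds before it is marked unhealthy."""
    return interval_s * unhealthy_threshold

print(worst_case_detection(30, 5) / 60)  # 2.5 — AWS ALB defaults, in minutes
print(worst_case_detection(10, 3))       # 30 — a tighter setting, in seconds
```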
Level 3: Runbooks-as-Code
Levels 1 and 2 handle infrastructure-level recovery. They cannot take application-level corrective actions — flushing a stuck queue, restarting a specific ECS task, scaling a service in response to a business metric. That is where runbooks-as-code come in.
The pattern: a CloudWatch alarm or Datadog monitor fires, invokes a Lambda function or SSM Automation document, and the runbook takes a predefined corrective action.
Example: CloudWatch Alarm → Lambda → ECS Task Restart
```python
import boto3

def handler(event, context):
    ecs = boto3.client('ecs')
    # List tasks in the service
    tasks = ecs.list_tasks(
        cluster='production',
        serviceName='worker-service'
    )['taskArns']
    # Stop all tasks — ECS will replace them automatically
    for task_arn in tasks:
        ecs.stop_task(
            cluster='production',
            task=task_arn,
            reason='Automated remediation: high error rate alarm'
        )
    return {'stopped': len(tasks)}
```
Wire this to a CloudWatch alarm on your service's error rate. When it crosses the threshold, the Lambda triggers, stops the stuck tasks, and ECS launches fresh replacements within seconds.
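When the alarm reaches the Lambda via an SNS topic, the handler should parse the alarm payload and act only on the ALARM transition. A sketch, assuming the standard SNS-to-Lambda event envelope (the alarm name is illustrative):

```python
import json

def parse_alarm_event(event):
    """Extract the alarm name and new state from an SNS-delivered
    CloudWatch alarm notification."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["AlarmName"], message["NewStateValue"]

# A trimmed-down example payload
event = {"Records": [{"Sns": {"Message": json.dumps({
    "AlarmName": "worker-service-error-rate",
    "NewStateValue": "ALARM",
    "NewStateReason": "Threshold Crossed",
})}}]}

name, state = parse_alarm_event(event)
print(name, state)  # worker-service-error-rate ALARM
```

Guarding on `NewStateValue == "ALARM"` matters: the alarm also notifies on the OK transition, and restarting tasks a second time when the service recovers would cause a needless disruption.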
AWS Systems Manager Automation
SSM Automation documents are YAML-defined runbooks that run AWS API calls, shell commands, and scripts across EC2 instances. They suit multi-step remediation — taking a snapshot before modifying a resource, or running a diagnostic before attempting a fix. The pattern extends to any observable failure: high disk usage triggers a Lambda that prunes log files; a stuck Celery queue triggers a runbook that restarts workers; a certificate nearing expiry triggers renewal via ACM.
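A minimal Automation document for the snapshot-before-modify pattern might look like the following sketch. The step and parameter names are illustrative; the structure follows the 0.3 Automation schema:

```yaml
schemaVersion: '0.3'
description: Snapshot a volume before applying a remediation
parameters:
  VolumeId:
    type: String
mainSteps:
  - name: snapshotVolume
    action: aws:executeAwsApi
    inputs:
      Service: ec2
      Api: CreateSnapshot
      VolumeId: '{{ VolumeId }}'
      Description: Pre-remediation snapshot
```

Subsequent steps in `mainSteps` run in order, so the destructive action only executes after the snapshot step succeeds.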
The Limitation of Rule-Based Runbooks
Runbooks handle known failure modes reliably. The problem is novel failures — a memory leak that manifests as gradual degradation rather than an error spike, a deployment that broke one API endpoint but not the overall error rate, a database connection pool exhausted by a newly introduced bug. Rule-based automation pattern-matches. When the pattern is new, the runbook does not fire.
Level 4: AI-Augmented Self-Healing
AI agents handle the novel failure — the situation that does not match any predefined rule, where a human would need to inspect the system, understand the context, and decide what to do.
ClankerCloud.ai addresses this through its MCP-based agent architecture. Running OpenClaw with a HEARTBEAT.md task file, the agent has read access to your live infrastructure state: Kubernetes events, pod logs, metrics, deployment configurations, CloudWatch data. It can query, reason, and act.
How OpenClaw HEARTBEAT.md Works
HEARTBEAT.md is a persistent task file that OpenClaw executes on a schedule — every 30 minutes by default. It defines what to check, what thresholds trigger action, and what the agent can do autonomously versus what requires human escalation.
A minimal example:
```markdown
# HEARTBEAT Task

## Schedule
Every 30 minutes

## Checks
1. Query Kubernetes events for OOMKilled pods in the last 30 minutes
2. Check pod restart counts across all Deployments in production namespace
3. Verify HPA utilization — flag if any HPA is at max replicas for >15 minutes
4. Check RDS connection count — alert if >80% of max_connections

## Actions (auto-approve)
- Restart a Deployment with >10 restarts in 30 min if cause is transient (CrashLoopBackOff, not OOMKilled)

## Actions (escalate to Slack #incidents with proposed fix)
- Memory limit adjustments (generate patch, do not apply)
- Scaling configuration changes
- Any action touching production databases
```
The agent runs checks via Clanker Cloud's MCP integration, interprets the results, and decides which action category applies. For a CrashLoopBackOff with a known cause, it restarts the Deployment. For an OOMKill loop, it investigates memory trends, generates a proposed manifest patch, and escalates.
The Three Agent Modes
For any detected problem, the agent chooses one of three paths:
Auto-fix — Pre-authorized safe actions. Restarting a known-flaky service, flushing a stuck queue, scaling a Deployment within pre-defined bounds. These execute immediately without human approval.
Investigate and escalate — The agent gathers context (logs, metrics, recent deployments, related events), diagnoses root cause, proposes a specific fix, and posts to Slack with the full diagnosis and a one-click approval link. The engineer gets a complete picture, not just an alert.
Hold for approval — For risky changes (modifying production database config, replica counts above a threshold, any destructive action), the agent proposes but does not act until a human approves.
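The routing between the three modes reduces to a small decision function. Everything in this sketch, including the action names and flags, is an assumption for illustration, not Clanker Cloud's actual policy engine:

```python
# Hypothetical policy sets; a real system would load these from config
AUTO_SAFE = {"restart_deployment", "flush_queue"}
RISKY = {"db_config_change", "delete_resource"}

def choose_mode(action, within_bounds=True, diagnosis_confident=True):
    """Route a proposed action into one of the three agent modes."""
    if action in RISKY or not within_bounds:
        return "hold_for_approval"
    if action in AUTO_SAFE and diagnosis_confident:
        return "auto_fix"
    return "investigate_and_escalate"

print(choose_mode("restart_deployment"))   # auto_fix
print(choose_mode("db_config_change"))     # hold_for_approval
print(choose_mode("patch_memory_limit"))   # investigate_and_escalate
```

The important property is the default: anything not explicitly pre-authorized falls through to escalation, never to autonomous execution.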
This is a different experience from traditional on-call. The engineer is not paged to figure out what is wrong — they are paged with a specific proposed fix and can approve it in seconds.
The Read-First / Act-Second Safety Model
Automated remediation at level 4 raises a legitimate concern: what happens when the agent is wrong?
Clanker Cloud's answer is the read-first/act-second model. By default, agents operate in read mode — they query infrastructure state, inspect logs, pull metrics, and analyze configurations. They cannot modify anything until they have formulated a specific plan.
Even for pre-approved actions, the agent generates a plan first and logs it. For escalated actions, the plan is posted to your team before any API call is made. This is auditable by design — not a limitation, but the correct architecture for autonomous systems on production infrastructure.
You can explore this further in the Clanker Cloud demo or review the agent integration documentation.
Real Scenario: OOMKilled Pod Loop
Here is how the four levels interact with a concrete failure.
The situation: A Python API service is being OOMKilled repeatedly. Kubernetes restarts it each time, but within minutes it exhausts its memory limit again. The service is degraded — it responds during brief windows after restart but crashes before handling sustained load.
Level 1 response: The liveness probe detects an unresponsive container. Kubernetes restarts it. This is happening correctly, but it does not fix the root cause — it just keeps the service partially available.
Level 2 response: Nothing. This is not an instance failure. The EC2 instance is healthy. RDS is healthy. The ASG does not act.
Level 3 response: The CloudWatch alarm for high restart count fires, but there is no runbook mapped to OOMKill events. The alarm triggers a PagerDuty notification. An engineer is paged.
Level 4 response: OpenClaw HEARTBEAT detects OOMKilled events in the Kubernetes event stream via Clanker Cloud MCP. It queries memory usage over the last 24 hours and identifies a trend: memory consumption has grown 40% since the last deployment three days ago. It retrieves the current Deployment manifest and compares resources.limits.memory against the observed peak. It generates a patch:
```yaml
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"
```
It posts to #incidents in Slack: "OOMKill loop detected on api Deployment. Memory limit appears undersized relative to recent usage. Proposed fix: increase limit from 512Mi to 1Gi. [Approve] [Dismiss] [View full diagnosis]"
The engineer reviews the diagnosis in 30 seconds and clicks Approve. Clanker Cloud applies the patch. The loop stops. Total time from detection to resolution: under five minutes, with no middle-of-the-night debugging session.
What to Implement First
Self-healing infrastructure is not an all-or-nothing project. The right sequence:
Today: Add liveness and readiness probes to every production Deployment. Verify your HPA is configured with a reasonable target utilization (50–65%). Add PodDisruptionBudgets for any service with fewer than four replicas.
This week: Verify your platform-level failover is configured. Enable ELB health checks on your ASGs. Confirm RDS Multi-AZ is active on your primary database. Test failover — actually run a database failover drill and confirm your application reconnects cleanly.
This month: Build one runbook for your most common failure mode. What alert fires most often? What is the manual remediation step? Automate that specific step with a Lambda or SSM document. Start with the simplest case.
When you are ready: Layer AI-augmented monitoring via Clanker Cloud and OpenClaw. Start in read-only mode — let the agent observe and report before giving it any auto-fix authority. Build confidence in the diagnosis quality before enabling autonomous actions. The for-ai-agents documentation covers the MCP integration in detail.
For a broader look at AI-native DevOps workflows, see the AI DevOps for Teams guide.
FAQ
What is a self-healing infrastructure system?
A self-healing infrastructure system is any combination of tools and configurations that automatically detects and corrects infrastructure failures without requiring manual operator intervention. This ranges from Kubernetes automatically restarting a failed container (simple, native) to an AI agent diagnosing a novel failure, generating a remediation plan, and applying it with human approval (sophisticated, AI-augmented). Most production systems implement multiple layers, each handling a different class of failure.
How do Kubernetes liveness and readiness probes work?
Liveness probes check whether a container is alive. If the probe fails a configured number of times, Kubernetes restarts the container. Readiness probes check whether a container is ready to receive traffic. If the probe fails, Kubernetes removes the pod from the Service endpoint without restarting it. Both probes are configured in the Pod spec and can use HTTP GET, TCP socket, or exec checks. The key distinction: liveness failure triggers a restart; readiness failure triggers traffic removal but not a restart.
What is the difference between rule-based automation and AIOps self-healing?
Rule-based automation (runbooks, Lambda triggers, SSM documents) executes predefined actions in response to predefined conditions. It works reliably for known failure modes but cannot handle novel situations — if the failure pattern does not match a rule, the automation does not fire. AIOps self-healing uses AI agents that can reason about live system state, interpret logs and metrics, diagnose root causes without a predefined rule, and propose or execute context-appropriate fixes. The AI agent handles the long tail of failures that rule-based systems miss.
How do I set up automated remediation for Kubernetes?
Start with native mechanisms: liveness probes, readiness probes, HPA, and PodDisruptionBudgets. These are configured in your Deployment manifests and provide free self-healing for the most common failure classes. For application-level remediation, wire CloudWatch or Datadog alerts to Lambda functions that take corrective actions (restart service, scale replica count). For AI-augmented remediation, use a platform like Clanker Cloud with OpenClaw's HEARTBEAT.md to run continuous health checks and intelligent diagnosis. The Clanker Cloud FAQ covers setup steps in more detail.
Start Building Self-Healing Infrastructure
The Kubernetes probes are free and take 30 minutes to configure correctly. The platform-level failover requires a configuration review. The first runbook takes a day. The AI-augmented layer takes an afternoon to connect.
None of this requires a dedicated reliability engineering team. It requires prioritizing the work and adding layers incrementally.
Start with Clanker Cloud — Beta is free, or review the agent integration documentation to understand how OpenClaw connects to your infrastructure.
Need the product-level answer?
Use the DevOps page for the stable product answer on incident investigation, plan review, and local-first infrastructure operations.
