
Monitoring Your AI Agents: How Clanker Cloud Keeps Your Agent Stack Healthy

Learn how to monitor AI agents like OpenClaw and Hermes with Clanker Cloud—covering process health, infra observability, and MCP connection checks.

AI agents are no longer experiments. Teams are running OpenClaw on DigitalOcean droplets to monitor cloud spend 24/7. Hermes-based cost audit agents parse billing data on a schedule. Claude Code sessions pull live infrastructure context through MCP before modifying a Terraform plan.

When any of these agents silently fail, you do not get an error message. You get silence — and eventually, an incident nobody caught.

Monitoring AI agents is an infrastructure problem. The same discipline you apply to uptime, latency, and pod restarts applies here. This article covers how Clanker Cloud gives you a unified view of agent health across the infrastructure your agents live on — and how to build a monitoring loop that catches failures before they become incidents.


The AI Agent Reliability Problem

Most teams discover agent failure the same way: something downstream breaks, someone investigates, and they find that the agent responsible stopped running hours ago.

OpenClaw's HEARTBEAT runs every 30 minutes. If the heartbeat stops, OpenClaw has stopped, but nothing external alerts on that by default. A Hermes-based cost visibility agent may fail silently when Ollama crashes under memory pressure: no crash report, no alert, just stale data.

Agents are production infrastructure. They need health checks, alerting, runbooks, and escalation paths. The difference from traditional services is that agents have three distinct health layers that standard infra monitoring does not capture out of the box.


What "Monitoring AI Agents" Actually Means

Agent health splits into three layers:

1. Process health — Is the agent process running at all? For a persistent agent like OpenClaw, this means checking whether the process is alive on its host. For an Ollama-based agent like Hermes, it means checking whether the Ollama server is responding. This is the most basic check and the first one to implement.

2. Infrastructure health — Is the machine the agent runs on healthy? A DigitalOcean droplet with 95% memory utilization will degrade or kill any process on it. A K8s pod in CrashLoopBackOff is not running your agent. The infrastructure layer is where Clanker Cloud provides the most direct value — it manages and queries the hosts, droplets, and clusters your agents live on.

3. Functional health — Is the agent actually doing its job? For OpenClaw, functional health means: Did HEARTBEAT execute in the last 30 minutes? Are MCP connections active? Is Slack connectivity live? A process can be running while functionally stalled — this layer catches the cases that process checks miss.

All three layers matter. A complete monitoring strategy covers all three.
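
As a minimal sketch of how the three layers map to concrete checks on an agent host: the process name, disk threshold, and heartbeat log path below are assumptions for illustration, not Clanker Cloud defaults.

#!/usr/bin/env bash
# One check per health layer. Names, thresholds, and paths are
# illustrative placeholders.

# 1. Process health: is the agent process alive on this host?
pgrep -f openclaw > /dev/null || echo "process check failed"

# 2. Infrastructure health: is the host running out of disk?
disk_used=$(df / --output=pcent | tail -1 | tr -dc '0-9')
[ "$disk_used" -gt 90 ] && echo "infra check failed: disk ${disk_used}% used"

# 3. Functional health: has HEARTBEAT run in the last 30 minutes?
last_run=$(stat -c %Y /var/log/openclaw/heartbeat.log 2>/dev/null || echo 0)
[ $(( $(date +%s) - last_run )) -gt 1800 ] && echo "functional check failed: heartbeat stale"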


Monitoring OpenClaw with Clanker Cloud

OpenClaw runs as a persistent process on a server — typically a DigitalOcean droplet or EC2 instance. That host is a Clanker Cloud managed resource, which means querying it is straightforward:

clanker ask "what's the CPU and memory on my ops-agent droplet?"

This surfaces the infrastructure health layer immediately. If the droplet is under memory pressure, that is your first signal before the process fails.

For process health, Clanker Cloud can query the droplet directly:

clanker ask "is my OpenClaw process running on the ops-agent droplet?"

Clanker Cloud connects to your DigitalOcean or EC2 provider, queries the instance, and returns a plain-language answer with the underlying data. No SSH required, no manual log tailing.

For functional health, OpenClaw's HEARTBEAT.md is the key signal. HEARTBEAT runs every 30 minutes. If the last execution timestamp in the HEARTBEAT log is more than 30 minutes old, OpenClaw has stalled. You can configure OpenClaw's HEARTBEAT.md to post its execution timestamp to a Slack channel — giving you a passive functional health signal without any external monitoring infrastructure.
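
If you want an active check rather than a passive Slack post, a staleness probe run from cron on a separate host is a few lines. A sketch, assuming the heartbeat log path and Slack webhook URL below, both of which are placeholders for your own setup:

#!/usr/bin/env bash
# Alert if OpenClaw's HEARTBEAT has not run in over 30 minutes.
# The log path and webhook URL are illustrative placeholders.
LOG=/var/log/openclaw/heartbeat.log
WEBHOOK=https://hooks.slack.com/services/XXX/YYY/ZZZ

last_run=$(stat -c %Y "$LOG" 2>/dev/null || echo 0)
age=$(( $(date +%s) - last_run ))

if [ "$age" -gt 1800 ]; then
  curl -s -X POST -H 'Content-type: application/json' \
    --data "{\"text\":\"OpenClaw HEARTBEAT stale: last run ${age}s ago\"}" \
    "$WEBHOOK"
fi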

There is also a recursive pattern worth building: OpenClaw monitors your infrastructure via Clanker Cloud's MCP server, and Clanker Cloud monitors the infrastructure that OpenClaw runs on. Each layer watches the other, which defends against the failure mode where the monitoring tool itself is unavailable.

For more on how OpenClaw and other agents connect to Clanker Cloud, see the for-ai-agents page.


Monitoring Ollama and Hermes Agents

Hermes runs as an Ollama process. Ollama exposes a health endpoint that gives you a direct process check:

curl http://localhost:11434/api/version

If Ollama is responding, the server is alive. If this returns an error or times out, the process is down. This is the baseline health probe for any Ollama-based agent.
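
Wrapped with a timeout and an exit code, the same probe works as a cron job or liveness check. A sketch; the five-second timeout is an arbitrary choice to tune for your host:

#!/usr/bin/env bash
# Baseline liveness probe for an Ollama-based agent.
# Exits non-zero if the API does not answer within 5 seconds.
if curl -sf --max-time 5 http://localhost:11434/api/version > /dev/null; then
  echo "ollama: up"
else
  echo "ollama: down or unresponsive" >&2
  exit 1
fi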

The more interesting failure mode is resource exhaustion. Hermes loads model weights into memory — a 70B model can consume 40–45 GB of VRAM or system RAM depending on quantization. If the host is under memory pressure, Ollama may appear to be running while inference requests time out or fail silently.

For Hermes running on a K8s cluster, Clanker Cloud queries pod-level health directly:

clanker ask "what's the status of the hermes-inference pod in the ai-agents namespace?"

This returns pod phase, restart count, and recent events — the three signals that tell you whether a pod is healthy, degrading, or in a crash loop. A pod with a rising restart count and OOMKill events is a pod that will fail again. Catching that pattern before the next restart is the difference between proactive and reactive operations.
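
If you want the same three signals without the natural-language layer, the raw kubectl equivalents look roughly like this (assuming you have kubectl access to the cluster; the pod and namespace names come from the example above):

# Pod phase
kubectl get pod hermes-inference -n ai-agents -o jsonpath='{.status.phase}'

# Restart count for the first container
kubectl get pod hermes-inference -n ai-agents \
  -o jsonpath='{.status.containerStatuses[0].restartCount}'

# Recent events, including OOMKill and crash-loop signals
kubectl get events -n ai-agents \
  --field-selector involvedObject.name=hermes-inference \
  --sort-by=.lastTimestamp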

For Hermes running on a bare-metal or cloud VM, Clanker Cloud queries the host. Memory headroom is the primary signal — if available RAM is below the model's working set size, the agent is at risk.
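
The host-level check is small enough to run from cron. A sketch; the 45 GiB threshold reflects the 70B working set discussed above and should be tuned per model:

# Fail if available memory is below the model's working set.
avail_gib=$(free -g | awk '/^Mem:/ {print $7}')
if [ "$avail_gib" -lt 45 ]; then
  echo "memory headroom too low: ${avail_gib} GiB available" >&2
  exit 1
fi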

The full monitoring pattern for Ollama agents (a sketch of step 3 follows the list):

  1. Check the Ollama API endpoint (process health)
  2. Check host memory headroom against model weight size (resource health)
  3. Check inference response times over a rolling window (functional health)
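
Step 3 is the least standard of the three. One hedged approach is to time a trivial generation request and flag latency drift. Ollama's /api/generate endpoint is real, but the model name, prompt, and 30-second threshold below are assumptions to adapt to your deployment:

#!/usr/bin/env bash
# Functional probe: time a minimal inference request.
start=$(date +%s)
curl -sf --max-time 60 http://localhost:11434/api/generate \
  -d '{"model": "hermes3", "prompt": "ok", "stream": false}' > /dev/null || exit 1
elapsed=$(( $(date +%s) - start ))
# Threshold is illustrative; track this over a rolling window in practice.
[ "$elapsed" -gt 30 ] && echo "inference slow: ${elapsed}s" >&2
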

Monitoring MCP Connections

The MCP layer is the nervous system connecting agents to live infrastructure data. When Claude Code or Codex sessions pull cloud context before modifying infrastructure, they are doing so through the Clanker Cloud MCP server. If that server is unavailable, those agents lose their infrastructure context entirely — and may operate on stale or missing data.

This is a serious operational risk. A Claude Code session that cannot reach the MCP server will not know the current state of your K8s cluster, your droplet utilization, or your recent cost anomalies; it falls back to training data and conversation context. That is not a safe state for infrastructure work.

MCP connection health is a first-class monitoring concern. The check is simple:

clanker ask "is the local MCP server reachable?"

Run this as a periodic health check — every few minutes in a CI/CD pipeline health job, or as a step in OpenClaw's HEARTBEAT.md. If the MCP server is unreachable, escalate before any agent sessions begin.
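
A periodic version of the check could look like the sketch below. It assumes clanker exits non-zero when the MCP server is unreachable, which you should verify for your installed version; the webhook URL is a placeholder.

#!/usr/bin/env bash
# Periodic MCP reachability check.
# Assumes clanker's exit code reflects the answer; verify this.
WEBHOOK=https://hooks.slack.com/services/XXX/YYY/ZZZ

if ! clanker ask "is the local MCP server reachable?" > /dev/null; then
  curl -s -X POST -H 'Content-type: application/json' \
    --data '{"text":"MCP server unreachable: escalate before agent sessions start"}' \
    "$WEBHOOK"
fi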

For Claude Code and Codex: because these are session-based rather than persistent processes, the monitoring focus shifts entirely to the infrastructure they connect to. The Clanker Cloud MCP server and connected providers (DigitalOcean, AWS, GKE) are what need to be healthy before a session starts. Pre-flight MCP checks are a straightforward reliability pattern.
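
A pre-flight gate is then a short wrapper you run before opening a session. A sketch, with the same caveat that clanker's exit-code behavior is an assumption, and the second query's wording is illustrative:

#!/usr/bin/env bash
# Pre-flight checks before a Claude Code or Codex session.
# Assumes clanker exits non-zero when a check fails.
set -e
clanker ask "is the local MCP server reachable?"
clanker ask "are my connected providers (DigitalOcean, AWS, GKE) healthy?"
echo "pre-flight passed: safe to start the agent session"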

Documentation for MCP configuration and endpoint options is at docs.clankercloud.ai.


The Meta-Loop: Agents Monitoring Agents

The most resilient monitoring architecture is recursive: your agents monitor each other's dependencies, and your infrastructure monitoring covers the agents themselves.

A practical implementation of this loop:

OpenClaw HEARTBEAT.md includes an MCP connectivity check. Every 30 minutes, HEARTBEAT probes the Clanker Cloud MCP server and logs the response. If it is unreachable for two consecutive cycles, OpenClaw posts an alert to Slack.
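
The two-consecutive-cycle rule needs a small amount of state. A sketch of what the HEARTBEAT step could run; the state file path and webhook are placeholders, and clanker's exit-code behavior is again an assumption:

#!/usr/bin/env bash
# Runs inside each HEARTBEAT cycle: probe MCP, alert on the second
# consecutive failure.
STATE=/var/lib/openclaw/mcp_failures
WEBHOOK=https://hooks.slack.com/services/XXX/YYY/ZZZ

if clanker ask "is the local MCP server reachable?" > /dev/null; then
  echo 0 > "$STATE"
else
  fails=$(( $(cat "$STATE" 2>/dev/null || echo 0) + 1 ))
  echo "$fails" > "$STATE"
  if [ "$fails" -ge 2 ]; then
    curl -s -X POST -H 'Content-type: application/json' \
      --data '{"text":"MCP unreachable for 2 consecutive HEARTBEAT cycles"}' \
      "$WEBHOOK"
  fi
fi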

Clanker Cloud monitors the OpenClaw host. The droplet or EC2 instance running OpenClaw is a managed resource. Its health metrics — CPU, memory, disk, network — are available through Clanker Cloud without any additional agent configuration. If the host degrades, you know before OpenClaw fails.

A secondary agent holds the fallback. If OpenClaw is down, it cannot post its own alerts. Configure a secondary Clanker Cloud query — run from a different host or a scheduled CI job — that checks whether the OpenClaw process is alive. If it is not, this secondary check posts to Slack. The monitor needs its own monitor. No single point of failure in the observability layer.
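
The secondary check itself can be a scheduled CI job. A sketch, assuming as before that clanker's exit code reflects the answer and using a placeholder webhook:

#!/usr/bin/env bash
# Watchdog for the watchdog: runs from CI or a separate host.
if ! clanker ask "is my OpenClaw process running on the ops-agent droplet?"; then
  curl -s -X POST -H 'Content-type: application/json' \
    --data '{"text":"OpenClaw appears down on ops-agent; it cannot alert for itself"}' \
    "https://hooks.slack.com/services/XXX/YYY/ZZZ"
fi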


Alerting and Escalation

When agent health checks fail, the escalation path needs to be defined before the incident, not during it.

Tier 1 — Agent self-reports: OpenClaw posts to a dedicated Slack channel when HEARTBEAT detects anomalies. This works as long as OpenClaw is running and Slack connectivity is live.

Tier 2 — Infrastructure-level alert: Clanker Cloud detects that the OpenClaw host is unhealthy (memory exhaustion, disk full, instance unreachable) and a secondary agent or CI job posts to Slack. This catches cases where OpenClaw is too degraded to report its own failure.

Tier 3 — Human on-call: If Tier 1 and Tier 2 alerts both fail, a human needs to be paged. Design your agent monitoring with the assumption that any layer of it can fail.

A single #agent-health channel with all alerts is easier to monitor than scattered mentions across team channels. Include enough context in each alert for an on-call engineer to triage without additional queries: which agent, which host, what the health check returned, and a link to Clanker Cloud documentation.
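
A payload carrying that context might look like the following; the field values are examples and the webhook URL is a placeholder:

# Post a triage-ready alert to #agent-health. Values are examples.
curl -s -X POST -H 'Content-type: application/json' \
  --data '{"text":"agent: openclaw | host: ops-agent droplet | check: HEARTBEAT stale (47m) | docs: https://docs.clankercloud.ai"}' \
  https://hooks.slack.com/services/XXX/YYY/ZZZ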


Agent Observability Beyond Health

Process and infrastructure health tell you whether the agent is running. Observability tells you what the agent is doing and whether it is useful.

Query logs are the most direct signal. Which cloud queries did the agent make in the last 24 hours? An OpenClaw session with zero queries in the last 6 hours is either idle or stalled — and you want to know which.
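
If the agent's queries are logged, the idle-or-stalled question reduces to counting recent entries. A sketch, assuming a hypothetical log where each line starts with a UTC ISO-8601 timestamp; adapt the path and format to your setup:

#!/usr/bin/env bash
# Count agent queries in the last 6 hours. ISO-8601 timestamps
# compare correctly as strings, so no per-line date parsing needed.
LOG=/var/log/openclaw/queries.log
cutoff=$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%S)
recent=$(awk -v c="$cutoff" '$1 >= c' "$LOG" | wc -l)
echo "queries in last 6h: $recent"
[ "$recent" -eq 0 ] && echo "agent idle or stalled: investigate" >&2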

Token and API cost tracking gives you the operational cost of running your agent stack. A well-functioning agent stack has predictable token consumption. Spikes indicate either increased workload or runaway query loops. Clanker Cloud's demo environment shows how cost queries surface in agent context.

Accuracy signals are harder to collect but worth instrumenting. When OpenClaw surfaces a cost anomaly, was it acted on? Was the recommendation correct? A lightweight feedback loop — even a Slack reaction on the alert — gives you signal about agent effectiveness over time.

Common questions about agent setup are addressed in the FAQ.


FAQ

How do I monitor an OpenClaw agent running on a server?

Start with the infrastructure layer: use Clanker Cloud to query the host that OpenClaw runs on. clanker ask "is my OpenClaw process running on the ops-agent droplet?" gives you process status. For functional health, check the HEARTBEAT execution timestamp — if it has not run in more than 30 minutes, OpenClaw has stalled. Configure HEARTBEAT.md to post its execution timestamp to Slack so you have a passive functional health signal without additional tooling.

What should I monitor for AI agent reliability?

Monitor three layers: process health (is the agent process running?), infrastructure health (is the host machine healthy?), and functional health (is the agent doing its job?). For most production agents, functional health is the hardest to instrument and the most important — a process can be running while the agent is stalled, disconnected from its data sources, or returning incorrect results.

How do I know if my Ollama/Hermes agent is running correctly?

Check the Ollama API endpoint first: curl http://localhost:11434/api/version. If Ollama responds, the server is alive. Then check host memory headroom — if available RAM is below the model's working set size, inference will fail under load. For K8s deployments, query the pod status with Clanker Cloud to surface restart counts and recent events. Rising restart counts and OOMKill events are the key failure signals.

What happens if my AI agent's MCP connection fails?

Session-based agents like Claude Code and Codex lose their infrastructure context. They cannot query live cloud data, which means they operate on training knowledge and conversation context rather than actual system state. This is a significant operational risk for infrastructure work. Run pre-flight MCP connectivity checks before starting agent sessions, and configure alerts if the MCP server becomes unreachable. OpenClaw's HEARTBEAT.md can include an MCP probe to catch connectivity failures proactively.


Start Monitoring Your Agent Stack

Agent reliability is infrastructure reliability. The tools and patterns that make your cloud infrastructure observable apply directly to the agents running on that infrastructure — with one additional layer for functional health that is specific to agents.

Clanker Cloud gives you a unified view of the infrastructure your agents live on, with natural-language queries that work across DigitalOcean, AWS, GKE, and K8s without switching between provider consoles.

Create a free account at clankercloud.ai/account to connect your first provider and start querying your agent infrastructure. The for-ai-agents overview covers how OpenClaw, Hermes, and other agents connect to Clanker Cloud through the MCP layer.

Beta access is free. Lite is $5/month. Pro is $20/month for teams with multiple providers and agents.

Next step

Give your agent live infrastructure context

Download Clanker Cloud, expose the local MCP surface, and let coding agents work from current cloud, Kubernetes, GitHub, and cost state instead of guesses.
