agent-health-diagnostics

Diagnose and fix the 4 most common OpenClaw agent failures — heartbeat spam, API rate limit cascades, channel death loops, and memory/embedding errors. Battle-tested across a 6-agent multi-host deployment.

3,891 stars

by openclaw

Best use case

agent-health-diagnostics is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using agent-health-diagnostics should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/agent-health-diagnostics/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/agenthyjack/agent-health-diagnostics/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/agent-health-diagnostics/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How agent-health-diagnostics Compares

Feature / Agent	agent-health-diagnostics	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

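What does this skill do?

It provides a diagnostic playbook for the 4 most common OpenClaw agent failures: heartbeat spam, API rate limit cascades, channel death loops, and memory/embedding errors, with commands to diagnose and fix each one.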

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Agent Health Diagnostics

**Scripts available in the [Collective Skills repo](https://github.com/Bobalouie44/collective-skills/tree/main/references)**

## Overview

When an OpenClaw agent misbehaves (spamming messages, going dark, burning API credits, or looping on dead channels), this skill provides the diagnostic playbook. It covers the 4 most common failure modes with exact commands to diagnose and fix each one.

Battle-tested across a 6-agent deployment spanning 3 hosts (Windows + Linux + Proxmox).

## When to Use This Skill

Use when you observe any of these symptoms:
- Agent sending repeated heartbeat/status messages to Telegram/Discord/etc.
- Agent goes silent or doesn't respond to messages even though the gateway shows "active"
- Logs show `429 Too many tokens` or `rate_limit` errors
- Channel connection loops: `auto-restart attempt 1/10`, `2/10`, etc.
- Memory search errors: `input length exceeds context length`

## The 4 Failure Modes

### 1. Heartbeat Spam
**Symptom:** Agent sends repeated messages every N minutes.
**Root cause:** The heartbeat interval is too low (10m = 144 messages/day), combined with a verbose prompt that always generates output instead of replying HEARTBEAT_OK.
**Quick fix:**
```bash
# Check interval
grep -A5 heartbeat ~/.openclaw/openclaw.json

# Fix: set to 30m minimum, simplify prompt to checklist + HEARTBEAT_OK default
# Then restart gateway
openclaw gateway restart
```
**Prevention:** Never set the heartbeat interval below 20 minutes; 30 minutes or more is the recommended setting. Heartbeat prompts should CHECK things, not CREATE things.
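
The arithmetic behind the 144 messages/day figure can be sketched as a small shell function. `heartbeats_per_day` is a hypothetical helper name, not part of the OpenClaw CLI, and it assumes the worst case where every heartbeat produces a message.

```bash
# Messages per day if every heartbeat produces output.
# $1 = heartbeat interval in minutes (assumed to be a positive integer).
heartbeats_per_day() {
  local interval_minutes=$1
  echo $(( 24 * 60 / $interval_minutes ))
}

heartbeats_per_day 10   # prints 144
heartbeats_per_day 30   # prints 48
```

Moving from a 10-minute to a 30-minute interval cuts the worst-case message count to a third.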

### 2. API Rate Limit Cascade
**Symptom:** All models fail, agent goes dark.
**Root cause:** A heartbeat plus N crons means up to (N+1) API calls per interval. When that exceeds the provider's tokens-per-minute (TPM) limit, all fallbacks are exhausted simultaneously.
**Quick fix:**
```bash
# Check for rate limits
journalctl -u <service> --since '1h ago' | grep '429\|rate_limit'

# Count your crons (each burns tokens)
openclaw cron list

# Fix: reduce heartbeat to 30-60m, disable non-essential crons, stagger schedules
```
**Prevention:** Calculate token budget before adding crons. Each run ≈ 2K-10K tokens. Route heartbeats to cheap/local models.
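
To make the (N+1) reasoning concrete, here is a minimal sketch that compares a worst-case burst against a TPM limit. The function names are hypothetical helpers, and the 40,000 TPM figure in the comments is an illustrative assumption, not a real provider limit. It assumes the heartbeat and every cron fire within the same minute.

```bash
# Tokens requested in one burst if the heartbeat and all crons fire together.
# $1 = number of crons, $2 = estimated tokens per run.
burst_tokens() {
  local crons=$1 tokens_per_run=$2
  echo $(( ($crons + 1) * $tokens_per_run ))
}

# Exits 0 when the burst fits under the TPM limit, non-zero otherwise.
# $1 = number of crons, $2 = tokens per run, $3 = provider TPM limit.
fits_tpm_limit() {
  local crons=$1 tokens_per_run=$2 tpm_limit=$3
  [ "$(burst_tokens "$crons" "$tokens_per_run")" -le "$tpm_limit" ]
}

# Example: 5 crons + 1 heartbeat at 10K tokens each, hypothetical 40K TPM limit.
#   burst_tokens 5 10000           prints 60000
#   fits_tpm_limit 5 10000 40000   exits non-zero (burst exceeds the limit)
```

Using the upper end of the 2K-10K range gives a conservative estimate; staggering schedules lowers the effective burst because fewer runs land in the same minute.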

### 3. Channel Death Loop
**Symptom:** Logs show repeated `auto-restart attempt N/10` for IRC/Discord/etc.
**Root cause:** Target server unreachable → health monitor restarts → fails again → loop. Each restart may trigger model calls, burning API tokens.
**Quick fix:**
```bash
# Check for loops
journalctl -u <service> --since '1h ago' | grep 'auto-restart\|timed out'

# Test connectivity
nc -zv -w 5 <target-ip> <target-port>

# Fix: disable the broken channel in openclaw.json
# channels.<name>.enabled = false
openclaw gateway restart
```
**Prevention:** Test connectivity BEFORE enabling channels. Disable channels you can't reach.
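
A quick way to see how far a loop has progressed is to pull the highest attempt number out of the logs. This sketch assumes log lines contain the literal text `auto-restart attempt N/10` as quoted above; `max_restart_attempt` is a hypothetical helper, not an OpenClaw command.

```bash
# Reads log lines on stdin and prints the highest N found in
# "auto-restart attempt N/...". Prints nothing if no line matches.
max_restart_attempt() {
  grep -o 'auto-restart attempt [0-9]*/' \
    | sed 's/[^0-9]//g' \
    | sort -n \
    | tail -n 1
}

# Usage (service name is a placeholder, as in the commands above):
#   journalctl -u <service> --since '1h ago' --no-pager | max_restart_attempt
```

A result close to 10 suggests the channel is about to exhaust its restart attempts and is a good candidate for disabling.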

### 4. Memory/Embedding Overflow
**Symptom:** `memory sync failed` or `input length exceeds context length` errors.
**Root cause:** File too large for embedding model's context window (mxbai-embed-large = 8K tokens).
**Quick fix:** Archive old sections of large files (MEMORY.md → memory/archive/). Keep active files under 8K tokens.
**Prevention:** Don't let MEMORY.md grow unbounded. Archive quarterly.
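
To spot files approaching the limit before a sync fails, a rough size check can help. This sketch estimates tokens as bytes divided by 4, which is a common rule of thumb and not the embedding model's actual tokenizer, so treat the result as approximate. Both function names are hypothetical, and the default limit of 8000 simply mirrors the 8K figure above; verify it against the embedding model you actually run.

```bash
# Rough token estimate for a file: size in bytes / 4 (heuristic only).
# $1 = path to the file.
estimate_tokens() {
  local bytes
  bytes=$(wc -c < "$1")
  echo $(( $bytes / 4 ))
}

# Exits 0 when the estimate is under the limit, non-zero otherwise.
# $1 = path to the file, $2 = token limit (defaults to 8000).
under_embed_limit() {
  local limit=${2:-8000}
  [ "$(estimate_tokens "$1")" -lt "$limit" ]
}

# Usage:
#   under_embed_limit MEMORY.md || echo "MEMORY.md is likely too large; archive old sections"
```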

## Remote Diagnostic Quick Reference

| What | Command |
|------|---------|
| Service status | `systemctl is-active <service>` |
| Recent logs | `journalctl -u <service> --since '1h ago' --no-pager \| tail -40` |
| Live tail | `journalctl -u <service> -f` |
| Rate limits | `journalctl -u <service> --since '1h ago' \| grep '429'` |
| Cron list | `openclaw cron list` |
| Port test | `nc -zv -w 5 <ip> <port>` |
| Config backup | `cp ~/.openclaw/openclaw.json ~/.openclaw/openclaw.json.bak` |

## Golden Rules

1. **Always back up config before editing.** `cp openclaw.json openclaw.json.bak`
2. **Always restart gateway after config changes.** Hot reload doesn't catch everything.
3. **Check logs before guessing.** `journalctl` tells you what's wrong 90% of the time.
4. **Calculate your API budget.** Heartbeat freq × (crons + 1) × avg tokens = burn rate.
5. **Disable what you can't reach.** Dead channels create loops that waste resources.
6. **"Configured" ≠ "working."** Verify with actual output after every change.
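
The burn-rate formula in rule 4 can be worked through with a short function. This is a sketch under the same simplification the formula makes, namely that every cron runs once per heartbeat interval; `daily_token_burn` is a hypothetical name, and the 5,000-token average in the example is an assumed midpoint of the 2K-10K range.

```bash
# Daily token burn: heartbeat runs per day x (crons + 1) x average tokens per run.
# $1 = heartbeat interval in minutes, $2 = number of crons, $3 = average tokens per run.
daily_token_burn() {
  local interval_minutes=$1 crons=$2 avg_tokens=$3
  local runs_per_day=$(( 24 * 60 / $interval_minutes ))
  echo $(( $runs_per_day * ($crons + 1) * $avg_tokens ))
}

daily_token_burn 30 3 5000   # prints 960000
daily_token_burn 10 3 5000   # prints 2880000
```

With 3 crons, dropping the interval from 30 to 10 minutes triples the estimated daily burn.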
