agent-health-diagnostics

Diagnose and fix the 4 most common OpenClaw agent failures — heartbeat spam, API rate limit cascades, channel death loops, and memory/embedding errors. Battle-tested across a 6-agent multi-host deployment.

3,891 stars

Best use case

agent-health-diagnostics is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using agent-health-diagnostics should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/agent-health-diagnostics/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/agenthyjack/agent-health-diagnostics/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/agent-health-diagnostics/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How agent-health-diagnostics Compares

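| Feature / Agent | agent-health-diagnostics | Standard Approach |
|-----------------|--------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |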

Frequently Asked Questions

What does this skill do?

Diagnose and fix the 4 most common OpenClaw agent failures — heartbeat spam, API rate limit cascades, channel death loops, and memory/embedding errors. Battle-tested across a 6-agent multi-host deployment.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Agent Health Diagnostics

**Scripts available in the [Collective Skills repo](https://github.com/Bobalouie44/collective-skills/tree/main/references)**

## Overview

When an OpenClaw agent misbehaves (spamming messages, going dark, burning API credits, or looping on dead channels), this skill provides the diagnostic playbook. It covers the 4 most common failure modes, with exact commands to diagnose and fix each one.

Battle-tested across a 6-agent deployment spanning 3 hosts (Windows + Linux + Proxmox).

## When to Use This Skill

Use when you observe any of these symptoms:
- Agent sends repeated heartbeat/status messages to Telegram/Discord/etc.
- Gateway shows "active" but the agent is silent or does not respond to messages
- Logs show `429 Too many tokens` or `rate_limit` errors
- Channel connection loops: `auto-restart attempt 1/10`, `2/10`, etc.
- Memory search errors: `input length exceeds context length`

## The 4 Failure Modes

### 1. Heartbeat Spam
**Symptom:** Agent sends repeated messages every N minutes.
**Root cause:** The heartbeat interval is too low (10m = 144 messages/day), combined with a verbose prompt that always generates output instead of returning HEARTBEAT_OK.
**Quick fix:**
```bash
# Check interval
grep -A5 heartbeat ~/.openclaw/openclaw.json

# Fix: set to 30m minimum, simplify prompt to checklist + HEARTBEAT_OK default
# Then restart gateway
openclaw gateway restart
```
**Prevention:** Never set the heartbeat interval below 20 minutes; 30 minutes or more is the safer default. Heartbeat prompts should CHECK things, not CREATE things.
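
To make the interval arithmetic concrete, here is a minimal sketch that converts a heartbeat interval in minutes into the worst-case number of messages per day, assuming every heartbeat run produces a message. The function name `messages_per_day` is hypothetical and is not part of the OpenClaw CLI.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not an OpenClaw command): worst-case heartbeat
# messages per day, assuming every heartbeat run produces a message.
messages_per_day() {
  local interval_minutes="$1"
  # 1440 minutes in a day; integer division rounds down.
  echo $(( 1440 / interval_minutes ))
}

messages_per_day 10   # prints 144
messages_per_day 30   # prints 48
```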

### 2. API Rate Limit Cascade
**Symptom:** All models fail, agent goes dark.
**Root cause:** A heartbeat plus N crons means (N+1) API calls per interval. When those calls fire together, the burst exceeds the provider's tokens-per-minute (TPM) limit, and all fallbacks are exhausted simultaneously.
**Quick fix:**
```bash
# Check for rate limits
journalctl -u <service> --since '1h ago' | grep '429\|rate_limit'

# Count your crons (each burns tokens)
openclaw cron list

# Fix: reduce heartbeat to 30-60m, disable non-essential crons, stagger schedules
```
**Prevention:** Calculate token budget before adding crons. Each run ≈ 2K-10K tokens. Route heartbeats to cheap/local models.
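
The budget calculation can be sketched as a small shell function. This is an illustration under stated assumptions, not an OpenClaw feature: the name `token_budget` is hypothetical, it assumes the worst case where the heartbeat and all N crons fire together on the same interval, and the example numbers (tokens per run, TPM limit) are placeholders to replace with your provider's real figures.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimates token burn under the worst case described
# above, where the heartbeat and all N crons fire together every interval.
# All inputs are illustrative; substitute your provider's real numbers.
token_budget() {
  local crons="$1" tokens_per_run="$2" interval_minutes="$3" tpm_limit="$4"
  local calls=$(( crons + 1 ))                 # heartbeat + N crons
  local burst=$(( calls * tokens_per_run ))    # tokens requested at once
  local daily=$(( burst * 1440 / interval_minutes ))
  echo "burst=${burst} daily=${daily}"
  if (( burst > tpm_limit )); then
    echo "WARNING: burst exceeds TPM limit (${tpm_limit})"
  fi
}

# Example: 5 crons, 10K tokens per run, 30m interval, 40K TPM limit
token_budget 5 10000 30 40000
```

With these example inputs the burst is 60,000 tokens, which is over the assumed 40K limit, so staggering the schedules or disabling crons would be the next step.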

### 3. Channel Death Loop
**Symptom:** Logs show repeated `auto-restart attempt N/10` for IRC/Discord/etc.
**Root cause:** Target server unreachable → health monitor restarts → fails again → loop. Each restart may trigger model calls, burning API tokens.
**Quick fix:**
```bash
# Check for loops
journalctl -u <service> --since '1h ago' | grep 'auto-restart\|timed out'

# Test connectivity
nc -zv -w 5 <target-ip> <target-port>

# Fix: disable the broken channel in openclaw.json
# channels.<name>.enabled = false
openclaw gateway restart
```
**Prevention:** Test connectivity BEFORE enabling channels. Disable channels you can't reach.
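
To judge how far a loop has progressed, the restart attempts in the logs can be summarized. The sketch below is a hypothetical helper, `restart_summary`, that reads log lines on stdin. Only the `auto-restart attempt N/10` fragment comes from the symptoms listed here; the rest of the log line format is unknown, so the function ignores it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads log lines on stdin and reports how many
# auto-restart attempts were logged and the highest attempt number seen.
# Only the "auto-restart attempt N/M" fragment is assumed to be present.
restart_summary() {
  local attempts
  attempts="$(grep -oE 'auto-restart attempt [0-9]+/[0-9]+' \
    | sed -E 's/.*attempt ([0-9]+)\/[0-9]+/\1/')"
  if [[ -z "$attempts" ]]; then
    echo "count=0 max=0"
    return 0
  fi
  local count max
  count="$(wc -l <<< "$attempts" | tr -d ' ')"
  max="$(sort -n <<< "$attempts" | tail -n 1)"
  echo "count=${count} max=${max}"
}

# Usage (service name is a placeholder, as in the commands above):
# journalctl -u <service> --since '1h ago' --no-pager | restart_summary
```

A high count together with a max near the retry ceiling suggests the channel should be disabled rather than left to keep retrying.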

### 4. Memory/Embedding Overflow
**Symptom:** `memory sync failed` or `input length exceeds context length` errors.
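**Root cause:** A file is too large for the embedding model's context window. The 8K-token figure given for mxbai-embed-large should be verified against your model and runtime, since some embedding setups accept far fewer tokens.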
**Quick fix:** Archive old sections of large files (MEMORY.md → memory/archive/). Keep active files under 8K tokens.
**Prevention:** Don't let MEMORY.md grow unbounded. Archive quarterly.
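
A rough size check can flag files before they trip the embedding limit. The sketch below is hypothetical (neither `estimate_tokens` nor `check_memory_file` is an OpenClaw command) and rests on a loud assumption: about 4 bytes per token, a common rule of thumb for English text. Real tokenizers differ, so treat the result as an approximation and leave headroom.

```bash
#!/usr/bin/env bash
# Hypothetical helper: rough token estimate for a file.
# ASSUMPTION: ~4 bytes per token (rule of thumb for English text).
estimate_tokens() {
  local file="$1"
  local bytes
  bytes="$(wc -c < "$file" | tr -d ' ')"
  echo $(( bytes / 4 ))
}

# Report whether a file's estimate exceeds a limit.
# The default of 8000 mirrors the 8K figure used in this section.
check_memory_file() {
  local file="$1" limit="${2:-8000}"
  local tokens
  tokens="$(estimate_tokens "$file")"
  if (( tokens > limit )); then
    echo "OVER: ${file} ~${tokens} tokens (limit ${limit})"
  else
    echo "OK: ${file} ~${tokens} tokens (limit ${limit})"
  fi
}

# Usage: check_memory_file MEMORY.md
#        check_memory_file MEMORY.md 4000   # stricter, custom limit
```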

## Remote Diagnostic Quick Reference

| What | Command |
|------|---------|
| Service status | `systemctl is-active <service>` |
| Recent logs | `journalctl -u <service> --since '1h ago' --no-pager \| tail -40` |
| Live tail | `journalctl -u <service> -f` |
| Rate limits | `journalctl -u <service> --since '1h ago' \| grep '429'` |
| Cron list | `openclaw cron list` |
| Port test | `nc -zv -w 5 <ip> <port>` |
| Config backup | `cp ~/.openclaw/openclaw.json ~/.openclaw/openclaw.json.bak` |

## Golden Rules

1. **Always back up config before editing.** `cp openclaw.json openclaw.json.bak`
2. **Always restart gateway after config changes.** Hot reload doesn't catch everything.
3. **Check logs before guessing.** `journalctl` usually tells you what's wrong.
4. **Calculate your API budget.** Heartbeat freq × (crons + 1) × avg tokens = burn rate.
5. **Disable what you can't reach.** Dead channels create loops that waste resources.
6. **"Configured" ≠ "working."** Verify with actual output after every change.
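
Rule 1 can be made harder to skip with a timestamped backup, so repeated edits do not overwrite the previous `.bak` file. This is a minimal sketch: `backup_config` is a hypothetical helper, and the default path is the config location used in the commands in this playbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: timestamped backup of the config file.
# Prints the backup path on success; returns non-zero if the copy fails.
backup_config() {
  local config="${1:-$HOME/.openclaw/openclaw.json}"
  local backup
  backup="${config}.$(date +%Y%m%d-%H%M%S).bak"
  cp -p -- "$config" "$backup" && echo "$backup"
}

# Usage: backup_config                    # backs up ~/.openclaw/openclaw.json
#        backup_config ./openclaw.json    # backs up a specific file
```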

Related Skills

botlearn-healthcheck

3891
from openclaw/skills

botlearn-healthcheck — BotLearn autonomous health inspector for OpenClaw instances across 5 domains (hardware, config, security, skills, autonomy); triggers on system check, health report, diagnostics, or scheduled heartbeat inspection.

DevOps & Infrastructure

doctorbot-healthcheck-free

3891
from openclaw/skills

🩺 Free Security & Health Audit. Your OpenClaw deserves a check-up. This skill performs a non-invasive scan to detect security risks, outdated software, and misconfigurations.

healthy-meal-reminder

3891
from openclaw/skills

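Healthy eating reminder skill. Sends timed daily reminders for three meals plus afternoon tea, each offering three options (A, B, C) to choose from, then follows up automatically 30 minutes after the meal to log what was eaten and calculate calories. Recommends seasonal low-calorie recipes, supports three modes (weight loss, maintenance, muscle gain), and includes exercise pairing, weekend cheat meals, interactive Q&A, and weekly check-in reports. Activates when the user mentions diet reminders, three-meal reminders, what to eat, weight-loss recipes, healthy eating, meal reminder, what they ate, or weight check-ins.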

health-check

3891
from openclaw/skills

Daily security check. Checks system health, including the OpenClaw Gateway, disk space, and memory usage. Triggers: scheduled cron job or manual invocation.

session-health-monitor

3891
from openclaw/skills

Context window health monitoring for OpenClaw agents — threshold warnings via Telegram, pre-compaction snapshots, and memory rotation.

Healthcheck Readiness Starter Skill

3891
from openclaw/skills

Performs a quick risk posture check on the host and reports basic security/posture status.

healthkit-code-review

3891
from openclaw/skills

Reviews HealthKit code for authorization patterns, query usage, background delivery, and data type handling. Use when reviewing code with import HealthKit, HKHealthStore, HKSampleQuery, HKObserverQuery, or HKQuantityType.

bluebubbles-healthcheck

3891
from openclaw/skills

Diagnoses and auto-heals BlueBubbles ↔ OpenClaw iMessage connectivity. Use when: iMessages stop arriving after a gateway restart, webhook connection is broken, or user reports messages not coming through. Runs a 4-step diagnostic and auto-fixes webhook backoff, stale registrations, and gateway issues.

Huangdi Health Timer

3891
from openclaw/skills

12 two-hour energy cycles, 3 unique tips daily.

org-health-diagnostic

3891
from openclaw/skills

Cross-functional organizational health check combining signals from all C-suite roles. Scores 8 dimensions on a traffic-light scale with drill-down recommendations. Use when assessing overall company health, preparing for board reviews, identifying at-risk functions, or when user mentions org health, health check, or health dashboard.

healthie

3891
from openclaw/skills

Healthie — manage patients, appointments, goals, and documents via GraphQL API

article-factory-wechat

3891
from openclaw/skills

Content & Documentation