infra-guardian

OpenClaw Agent Infrastructure Guardian: keeps your agent's infrastructure alive. It provides process lifecycle management with detached execution and auto-restart on failure, cron scheduler health monitoring (per-job detection, auto-recovery), direct Telegram/messaging alerts that do not depend on OpenClaw, and a system-level watchdog that runs from the system crontab rather than OpenClaw cron. Use it when launching background processes, monitoring cron job health, or when things keep dying silently.

16 stars

by diegosouzapw

Best use case

infra-guardian is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using infra-guardian should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/infra-guardian/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/devops/infra-guardian/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/infra-guardian/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

Frequently Asked Questions

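What does this skill do?

It keeps an OpenClaw agent's background infrastructure running: it launches and restarts detached processes, checks the health of OpenClaw cron jobs per job, and sends Telegram alerts directly, without going through OpenClaw.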

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Infra Guardian

Keep your OpenClaw agent's infrastructure alive. Processes, cron jobs, the works.

## What It Does

| Layer | What | How |
|-------|------|-----|
| **Process Management** | Launch, track, auto-restart background processes | `setsid` + `nohup` + registry + healthcheck |
| **Cron Health** | Detect stalled OpenClaw cron jobs per-job | Reads `jobs.json` from disk, checks `nextRunAtMs` vs interval |
| **Auto-Recovery** | Restart gateway when cron scheduler stalls | Sends SIGUSR1 when >50% of jobs are overdue |
| **Alerting** | Telegram alerts independent of OpenClaw | Direct Bot API calls — works even if OpenClaw is down |

**Core principle:** Monitoring components must not depend on the system they monitor.

## Quick Start

```bash
# Process management
bash scripts/managed-process.sh register my-bot "python3 /path/to/bot.py" 480
bash scripts/managed-process.sh start my-bot
bash scripts/managed-process.sh status

# Infrastructure watchdog (add to system crontab, NOT OpenClaw cron)
*/10 * * * * /path/to/scripts/managed-process.sh watchdog
```

## Commands

| Command | Usage | Description |
|---------|-------|-------------|
| `register` | `register <name> <command> [duration_min]` | Define a managed process. Duration 0 = indefinite. |
| `start` | `start <name>` | Launch registered process (fully detached). |
| `stop` | `stop <name>` | Graceful shutdown via SIGTERM. |
| `restart` | `restart <name>` | Stop then start. |
| `status` | `status [name]` | Show all processes or one specific. |
| `healthcheck` | `healthcheck` | Check all registered processes, restart dead ones. |
| `watchdog` | `watchdog` | **Unified check:** cron health + process health (for system crontab). |
| `cron-health` | `cron-health` | Check OpenClaw cron scheduler per-job health. |
| `proc-health` | `proc-health` | Check key process liveness (configurable patterns). |
| `deregister` | `deregister <name>` | Remove process from registry + clean up files. |

## Infrastructure Watchdog

The `watchdog` command is the unified health check. Run it from **system crontab** — never from OpenClaw cron (you can't monitor the scheduler using the scheduler).

```bash
# Unified (recommended)
bash scripts/managed-process.sh watchdog

# Individual checks
bash scripts/managed-process.sh cron-health
bash scripts/managed-process.sh proc-health
```

### Cron Health (`cron-health`)

Reads OpenClaw's cron state directly from disk (`~/.openclaw/cron/jobs.json`). No API dependency.

**Per-job detection:**
- Each enabled job checked independently
- `nextRunAtMs` overdue by >2× its interval → STALE
- Jobs with `kind: "every"` use their `everyMs` as interval
- Jobs with `kind: "cron"` use 24h as max expected interval
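
The per-job rule above can be sketched in Python as follows. This is a minimal illustration, not the script's actual code: it assumes each job is a flat mapping with `enabled`, `kind`, `everyMs`, and `nextRunAtMs` keys, whereas the real `jobs.json` layout may nest these fields differently. The helper names `expected_interval_ms` and `is_stale` are hypothetical.

```python
from typing import Any, Mapping, Optional

DAY_MS = 24 * 60 * 60 * 1000  # max expected interval used for kind == "cron"


def expected_interval_ms(job: Mapping[str, Any]) -> Optional[int]:
    """Return the interval used for the staleness check, or None if unknown."""
    kind = job.get("kind")
    if kind == "every":
        return job.get("everyMs")
    if kind == "cron":
        return DAY_MS
    return None


def is_stale(job: Mapping[str, Any], now_ms: int) -> bool:
    """True when an enabled job's next run is overdue by more than 2x its interval."""
    if not job.get("enabled", True):  # assumption: a missing flag means enabled
        return False
    interval = expected_interval_ms(job)
    next_run = job.get("nextRunAtMs")
    if not interval or next_run is None:
        return False  # not enough data to judge this job
    return (now_ms - next_run) > 2 * interval
```

Each job is judged on its own, so one slow job never hides or triggers a verdict for another.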

**Auto-recovery:**
- If >50% of jobs are stale → sends SIGUSR1 to restart gateway
- Alerts via Telegram Bot API directly (reads bot token from OpenClaw config)
- 30-minute cooldown between alerts (no spam)
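
The two decisions in this list (recover when more than half the jobs are stale, and alert at most once per cooldown window) can be sketched as below. The function names are hypothetical, and how the last alert time is persisted between watchdog runs is left out because the text does not say.

```python
from typing import Optional, Sequence

COOLDOWN_S = 30 * 60  # 30-minute alert cooldown, as stated above


def should_recover(stale_flags: Sequence[bool]) -> bool:
    """True when strictly more than 50% of the checked jobs are stale."""
    if not stale_flags:
        return False
    return sum(stale_flags) * 2 > len(stale_flags)


def should_alert(now_s: float, last_alert_s: Optional[float]) -> bool:
    """True when no alert was sent yet or the cooldown has elapsed."""
    return last_alert_s is None or (now_s - last_alert_s) >= COOLDOWN_S
```

Using integer arithmetic for the threshold avoids floating-point edge cases: exactly half stale does not trigger a restart.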

**Why this matters:** OpenClaw has a known cron bug ([#8424](https://github.com/openclaw/openclaw/issues/8424)) where `kind: "cron"` jobs permanently stall after missing a run. This watchdog catches it within 20 minutes instead of discovering it 8 hours later.

### Process Health (`proc-health`)

Checks known background processes via `pgrep`. Configurable patterns in the script.

- Alerts via Telegram if any monitored process is down
- 30-minute cooldown
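
A direct alert of this kind can be sent with nothing but the Python standard library. The sketch below uses the Telegram Bot API `sendMessage` method; the function names are hypothetical, and the token and chat ID are placeholders that would come from your own configuration (the script itself reads the bot token from the OpenClaw config).

```python
import json
import urllib.request


def build_send_message(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a POST request for the Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(token: str, chat_id: str, text: str, timeout: float = 10.0) -> bool:
    """Send the alert; return the API's "ok" flag. Network errors propagate."""
    request = build_send_message(token, chat_id, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return bool(json.load(response).get("ok", False))
```

Because the call goes straight to `api.telegram.org`, it still works when the OpenClaw gateway is down.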

### Setup

```bash
# Add to system crontab (not OpenClaw cron!)
crontab -e
# Add this line (adjust the path to where the script lives on your machine):
*/10 * * * * /home/clawdbot/clawd/scripts/managed-process.sh watchdog
```

## Process Management

### The Problem

When AI agents launch background processes via `exec &` or `nohup`, those processes stay tied to the parent session (`nohup` only makes them ignore SIGHUP). Session ends → child processes get SIGTERM → die silently → nobody knows.

### The Rule

**ALL long-running processes MUST go through this framework.** No exceptions.

- ❌ `python script.py &`
- ❌ `nohup python script.py &`
- ❌ `exec` with `background: true`
- ✅ `bash scripts/managed-process.sh register <name> <cmd>` then `start <name>`

### Detached Execution

Processes launch via `setsid` + `nohup` + `disown`, giving them:
- Own session ID (SID) — not tied to any terminal
- PPID=1 (reparented to init once the launcher exits): survives parent death
- Immune to SIGHUP/session cleanup
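
For comparison, the same detachment can be approximated from Python. This is a sketch of the idea rather than what `managed-process.sh` does internally: `start_new_session=True` makes the child call `setsid()` before it executes the command, which corresponds to the `setsid` step above. The function name `launch_detached` and the log-file handling are assumptions.

```python
import subprocess
from typing import Sequence


def launch_detached(argv: Sequence[str], log_path: str) -> int:
    """Start argv in its own session and return the child's PID."""
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(
            list(argv),
            stdin=subprocess.DEVNULL,   # no terminal input
            stdout=log,                 # output goes to the log file
            stderr=subprocess.STDOUT,
            start_new_session=True,     # child calls setsid() before exec
        )
    return proc.pid
```

The child becomes the leader of a new session, so its session ID equals its own PID and a hangup on the launcher's terminal is not delivered to it.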

### Health Monitoring

The `healthcheck` command (can run from cron every 5 min):
1. Checks each registered process is alive (PID exists)
2. If dead: checks if it completed normally (ran ≥80% of expected duration)
3. If premature death: auto-restarts and writes alert flag
4. Alert cooldown: 15 min between alerts (no spam)
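
Steps 1 to 3 can be sketched as below. The names `pid_alive` and `classify` are hypothetical, and treating a duration of 0 (indefinite) as "any death is premature" is an assumption, since the text does not spell that case out.

```python
import os


def pid_alive(pid: int) -> bool:
    """Check whether a PID exists without affecting the process."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def classify(alive: bool, ran_s: float, duration_min: int) -> str:
    """Return "running", "completed", or "premature" for one process."""
    if alive:
        return "running"
    # Normal completion: the process ran at least 80% of its expected duration.
    if duration_min > 0 and ran_s >= 0.8 * duration_min * 60:
        return "completed"
    return "premature"  # candidate for auto-restart and an alert
```

A process registered for 480 minutes counts as completed only if it ran at least 384 minutes (23,040 seconds) before exiting.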

### Alert Integration

When a process dies, the healthcheck writes:
- `/tmp/process_monitor_alert.flag` — trigger file
- `/tmp/process_monitor_alert.txt` — alert message

Configure your agent's heartbeat to check these files and forward alerts.
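
A heartbeat hook for these files might look like the sketch below. The function name `consume_alert` is hypothetical, and deleting the flag after reading it is an assumption about how "handled" should be recorded; adapt it if your heartbeat expects the flag to persist.

```python
from pathlib import Path
from typing import Optional

FLAG = Path("/tmp/process_monitor_alert.flag")     # trigger file
MESSAGE = Path("/tmp/process_monitor_alert.txt")   # alert message


def consume_alert(flag: Path = FLAG, message: Path = MESSAGE) -> Optional[str]:
    """Return the pending alert text and clear the flag, or None if no alert."""
    if not flag.exists():
        return None
    text = message.read_text(encoding="utf-8").strip() if message.exists() else ""
    flag.unlink()  # assumption: removing the flag marks the alert as handled
    return text
```

The heartbeat would call `consume_alert()` on each tick and forward any non-`None` result to its usual notification channel.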

### Process Registry

All processes are tracked in `.process-registry.json`:

```json
{
  "my-bot": {
    "command": "python3 /path/to/bot.py",
    "duration_min": 480,
    "auto_restart": true,
    "max_restarts": 5,
    "restart_cooldown": 30
  }
}
```

## Signal Handling Best Practice

Your scripts should log signals, not silently exit:

```python
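import signal
from datetime import datetime

# Polled by the main loop; initialised here so it exists before any signal arrives.
shutdown = False

def handler(signum, frame):
    # Log the signal and request shutdown; the main loop performs the actual exit.
    global shutdown
    sig_name = signal.Signals(signum).name
    print(f"⚠️ SIGNAL: {sig_name} at {datetime.now()}", flush=True)
    shutdown = True

signal.signal(signal.SIGTERM, handler)
signal.signal(signal.SIGINT, handler)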
```

Never call `sys.exit(0)` in signal handlers — it makes crashes look like clean exits, preventing watchdogs from restarting.
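
One way to keep a signal-driven stop distinguishable from a clean exit is to let the main loop return a non-zero status. The sketch below follows the common shell convention of `128 + signal number`; that convention, and the names `received`, `exit_code`, and `run`, are assumptions for illustration and are not something the watchdog is documented to inspect.

```python
import signal
import time

received = None  # number of the first signal seen, or None


def handler(signum, frame):
    """Record the first signal; the main loop decides when to stop."""
    global received
    if received is None:
        received = signum


def exit_code() -> int:
    """0 for a normal finish, 128 + signal number after a signal."""
    return 0 if received is None else 128 + received


def run(step, poll_s: float = 1.0) -> int:
    """Call step() repeatedly until a signal has been recorded."""
    while received is None:
        step()
        time.sleep(poll_s)
    return exit_code()
```

A script would install the handler with `signal.signal(signal.SIGTERM, handler)` and finish with `sys.exit(run(do_work))` from its main body, where `do_work` stands for your own work function, so the exit happens outside the handler and carries the signal in its status.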

## Architecture

```
System Crontab (every 10 min)
  └─ managed-process.sh watchdog
       ├─ cron-health: reads ~/.openclaw/cron/jobs.json
       │    ├─ per-job: nextRunAtMs overdue > 2× interval?
       │    ├─ if >50% stale → SIGUSR1 gateway restart
       │    └─ alert → Telegram Bot API (direct, no OpenClaw)
       └─ proc-health: pgrep known patterns
            └─ alert → Telegram Bot API (direct, no OpenClaw)
```

**Independence chain:** System crontab → bash script → disk read → direct Telegram API. Zero OpenClaw dependencies in the monitoring path.
