service-health-check
Service health monitoring: Discover, Check, Report in 3 phases.
Best use case
service-health-check is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Service health monitoring: Discover, Check, Report in 3 phases.
Teams using service-health-check should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/service-health-check/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How service-health-check Compares
| Feature / Agent | service-health-check | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Service health monitoring: Discover, Check, Report in 3 phases.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Service Health Check Skill
## Overview
This skill provides deterministic service health monitoring using the **Discover-Check-Report** pattern. It finds services, gathers health signals from multiple sources (process table, health files, port binding), and produces actionable reports identifying degraded or failed services.
**Core principle**: Health assessment is evidence-based. Never report a service healthy without verifying process status independently of health file content. Never assume a running process is functional — always cross-check against health files and port binding.
---
## Instructions
### Phase 1: DISCOVER
**Goal**: Identify all services to check before running any health probes.
**Step 1: Locate service definitions**
Search for service configuration in this order:
1. `services.json` in project root
2. Docker/docker-compose files for service definitions
3. systemd unit files or process manager configs
4. User-provided service specification
**Step 2: Build service manifest**
For each service, establish:
```markdown
## Service Manifest
| Service | Process Pattern | Health File | Port | Stale Threshold |
|---------|----------------|-------------|------|-----------------|
| api-server | gunicorn.*app:app | /tmp/api_health.json | 8000 | 300s |
| worker | celery.*worker | /tmp/worker_health.json | - | 300s |
| cache | redis-server | - | 6379 | - |
```
**Validation constraints**:
- Each process pattern must be specific enough to avoid false matches (e.g., "python" matches all Python processes—use full paths or arguments instead)
- Health file paths must be absolute
- Port numbers must be valid (1-65535)
- Pattern specificity matters: narrow patterns with full command paths, distinguishing arguments, or specific binary names
**Step 3: Validate manifest**
Confirm each entry passes the constraints above. If a pattern is too broad, use `ps aux | grep` to identify distinguishing arguments, then update the pattern.
**Gate**: Service manifest complete with at least one service. Proceed only when gate passes.
### Phase 2: CHECK
**Goal**: Gather health signals for every service in the manifest. Always check process status independently of health file content—a running process and a healthy health file are separate signals.
**Step 1: Check process status**
For each service, run process check:
```bash
pgrep -f "<process_pattern>"
```
Record: running (true/false), PIDs, process count.
**Rationale**: Process existence is the primary signal. A missing process always means the service is DOWN. A running process alone is insufficient—the service may have crashed or failed to bind to its port.
**Step 2: Parse health files (if configured)**
Read and parse JSON health files. Evaluate:
- Does the file exist?
- Does it parse as valid JSON?
- How old is the timestamp (staleness)? Default stale threshold is 300 seconds.
- What status does the service self-report?
- What is the connection state?
**Critical constraint**: Never trust health file content alone. The file could be stale from before a process crash. Always verify:
1. Process is still running
2. Health file timestamp is fresh (within configured threshold)
3. Status field matches evidence (e.g., "error" requires restart)
**Step 3: Probe ports (if configured)**
Check if expected ports are listening:
```bash
ss -tlnp "sport = :<port>"
```
**Rationale**: Verify ports are actually bound. A process can start but fail to bind to its configured port—that is effectively a DOWN state, not HEALTHY.
**Step 4: Evaluate health per service**
Apply this decision tree (constraints embedded in logic):
1. **Process not running** → **DOWN** (definitive)
2. **Process running + health file missing** → **WARNING** (limited visibility, but process is alive)
3. **Process running + health file stale** (> threshold) → **WARNING** (file hasn't updated in configured time, suggests no activity or crash recovery in progress)
4. **Process running + status=error** → **ERROR** (restart recommended immediately)
5. **Process running + disconnected > 30 minutes** → **WARNING** (long disconnect suggests stuck state, restart recommended)
6. **Process running + disconnected < 30 minutes** → **DEGRADED** (allow reconnection window, monitor)
7. **Process running + port not listening** (when port is configured) → **ERROR** (process running but failed to bind port)
8. **Process running + healthy** → **HEALTHY** (all checks pass)
9. **Process running + no health file configured** → **RUNNING** (limited visibility, process verified only)
**Gate**: All services evaluated with evidence-based status. No status is determined without concrete signal (process check, health file, or port probe). Proceed only when gate passes.
### Phase 3: REPORT
**Goal**: Produce structured, actionable health report with specific remediation commands.
**Step 1: Generate summary**
```
SERVICE HEALTH REPORT
=====================
Checked: N services
Healthy: X/N
RESULTS:
service-name [OK ] HEALTHY PID 12345, uptime 2d 4h
background-worker [WARN] WARNING Health file stale (15 min)
cache-service [DOWN] DOWN Process not found
RECOMMENDATIONS:
background-worker: Restart recommended - health file not updated in 900s
cache-service: Start service - process not running
SUGGESTED ACTIONS:
systemctl restart background-worker
systemctl start cache-service
```
**Step 2: Set exit status**
- All HEALTHY/RUNNING → exit 0
- Any WARNING/DEGRADED/ERROR/DOWN → exit 1
**Step 3: Present to user**
- Lead with the summary line (X/N healthy)
- Highlight any services needing action
- Provide copy-pasteable commands for remediation
- Never auto-restart without explicit user flag. Always report findings first, let user decide.
**Gate**: Report delivered with actionable recommendations for all non-healthy services.
---
## Examples
### Example 1: Routine Health Check
User says: "Are all services up?"
Actions:
1. Locate services.json, build manifest (DISCOVER)
2. Check each process, parse health files, probe ports (CHECK)
3. Output structured report showing 3/3 healthy (REPORT)
Result: Clean report, no action needed
### Example 2: Stale Worker Detection
User says: "The background worker seems stuck"
Actions:
1. Identify worker service from config (DISCOVER)
2. Find process running but health file 20 minutes stale (CHECK) — triggers WARNING decision in tree
3. Report WARNING with restart recommendation (REPORT)
Result: Specific diagnosis with actionable command
---
## Error Handling
### Error: "No Service Configuration Found"
Cause: No services.json, docker-compose, or systemd units discovered
Solution:
1. Ask user for service name and process pattern
2. Build minimal manifest from user input
3. Proceed with manual configuration
### Error: "Process Pattern Matches Too Many PIDs"
Cause: Pattern too broad (e.g., "python" matches all Python processes)
Solution:
1. Narrow pattern with full command path or arguments
2. Use `ps aux | grep` to identify distinguishing arguments
3. Update manifest with more specific pattern
4. Rationale: False positives hide real failures. Specificity is required to avoid misdiagnosis.
### Error: "Health File Exists But Cannot Parse"
Cause: Malformed JSON, permissions issue, or file being written during read
Solution:
1. Check file permissions with `ls -la`
2. Attempt raw read to inspect content
3. If mid-write, retry after 2-second delay
4. Report as WARNING with parse error details
---
## References
### Health File Format Reference
Services should write health files as:
```json
{
"timestamp": "ISO8601, updated every 30-60s",
"status": "healthy|degraded|error",
"connection": "connected|disconnected|reconnecting",
"last_activity": "ISO8601 of last meaningful action",
"running": true,
"uptime_seconds": 12345,
"metrics": {}
}
```
### Key Constraints Summary
| Constraint | Rationale | Application |
|-----------|-----------|-------------|
| Process status verified independently of health file | Running process ≠ functional service | Always check process before trusting health file |
| Health file staleness detected by timestamp freshness | File could be stale from before crash | Check timestamp against 300s (configurable) threshold |
| Port binding verified when configured | Process running doesn't mean port is bound | Always verify expected port listening when port specified |
| No auto-restart without explicit flag | Restart masks root cause | Report findings first; only execute restart if user flags it |
| Narrow process patterns required | "python" matches all processes, giving false matches | Use full paths or specific args; validate with `ps aux \| grep` |
| Evidence-based status only | Status must have supporting signal | No status without concrete evidence (process, health file, or port) |Related Skills
typescript-check
TypeScript type checking via tsc --noEmit with actionable error output.
pre-publish-checker
Pre-publication validation for Hugo posts: front matter, SEO, links, images.
plan-checker
Validate plans against 10 dimensions: PASS/BLOCK verdict before execution.
joy-check
Validate content framing on joy-grievance spectrum.
integration-checker
Verify cross-component wiring and data flow.
github-actions-check
Check GitHub Actions CI status and report failures.
docs-sync-checker
Detect documentation drift against filesystem state.
x-api
Post tweets, build threads, upload media via the X API.
worktree-agent
Mandatory rules for agents in git worktree isolation.
workflow
Structured multi-phase workflows: review, debug, refactor, deploy, create, research, and more.
workflow-help
Interactive guide to workflow system: agents, skills, routing, execution patterns.
wordpress-uploader
WordPress REST API integration for posts and media uploads.