Agent Observability & Monitoring

Score, monitor, and troubleshoot AI agent fleets in production. Built for ops teams running 1-100+ agents.

3,891 stars

Best use case

Agent Observability & Monitoring is best used when you need a repeatable AI agent workflow instead of a one-off prompt. It is especially useful for ops teams running multi-agent fleets who need to score, monitor, and troubleshoot AI agents in production.

Users should expect a more consistent workflow output, faster repeated execution, and less time spent rewriting prompts from scratch.

Practical example

Example input

Use the "Agent Observability & Monitoring" skill to help with this workflow task. Context: Score, monitor, and troubleshoot AI agent fleets in production. Built for ops teams running 1-100+ agents.

Example output

A structured workflow result with clearer steps, more consistent formatting, and an output that is easier to reuse in the next run.

When to use this skill

  • Use this skill when you want a reusable workflow rather than writing the same prompt again and again.

When not to use this skill

  • Do not use this when you only need a one-off answer and do not need a reusable workflow.
  • Do not use it if you cannot install or maintain the related files, repository context, or supporting tools.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/afrexai-agent-observability/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/1kalin/afrexai-agent-observability/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/afrexai-agent-observability/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How Agent Observability & Monitoring Compares

| Feature / Agent | Agent Observability & Monitoring | Standard Approach |
|-----------------|----------------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Score, monitor, and troubleshoot AI agent fleets in production. Built for ops teams running 1-100+ agents.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Agent Observability & Monitoring

Score, monitor, and troubleshoot AI agent fleets in production. Built for ops teams running 1-100+ agents.

## What This Does

Evaluates your agent deployment across 6 dimensions and returns a 0-100 health score with specific fixes.

## 6-Dimension Assessment

### 1. Execution Visibility (0-20 pts)
- Can you see what every agent is doing right now?
- Task queue depth, active/idle ratio, error rates
- **Benchmark**: Top quartile tracks 95%+ of agent actions in real-time

### 2. Cost Attribution (0-20 pts)
- Do you know exactly what each agent costs per task?
- Token spend, API calls, compute time, tool invocations
- **Benchmark**: Unmonitored agents waste 30-55% on retries and hallucination loops

### 3. Output Quality (0-15 pts)
- Are agent outputs validated before reaching users or systems?
- Accuracy sampling, hallucination detection, regression tracking
- **Benchmark**: 1 in 12 agent outputs contains a material error without monitoring

### 4. Failure Recovery (0-15 pts)
- What happens when an agent fails mid-task?
- Retry logic, graceful degradation, human escalation paths
- **Benchmark**: Mean time to detect agent failure without monitoring: 4.2 hours

### 5. Security & Boundaries (0-15 pts)
- Are agents staying within authorized scope?
- Tool access auditing, data exfiltration checks, permission drift
- **Benchmark**: 23% of production agents access tools outside their intended scope

### 6. Fleet Coordination (0-15 pts)
- Do multi-agent workflows hand off cleanly?
- Message passing reliability, deadlock detection, duplicate work
- **Benchmark**: Uncoordinated fleets duplicate 18-25% of work

## Scoring

| Score | Rating | Action |
|-------|--------|--------|
| 80-100 | Production-grade | Optimize and scale |
| 60-79 | Operational | Fix gaps before scaling |
| 40-59 | Risky | Immediate remediation needed |
| 0-39 | Blind | Stop scaling, instrument first |
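The rubric above sums six capped dimensions into a 0-100 total and maps it to a rating. As a minimal sketch (the dimension keys and `health_score` helper are illustrative, not part of the skill):

```python
# Illustrative scorer for the 6-dimension assessment (not part of the skill).
# Each dimension is clamped to its maximum from the rubric above.
DIMENSION_CAPS = {
    "execution_visibility": 20,
    "cost_attribution": 20,
    "output_quality": 15,
    "failure_recovery": 15,
    "security_boundaries": 15,
    "fleet_coordination": 15,
}

RATINGS = [  # (minimum score, rating, action)
    (80, "Production-grade", "Optimize and scale"),
    (60, "Operational", "Fix gaps before scaling"),
    (40, "Risky", "Immediate remediation needed"),
    (0, "Blind", "Stop scaling, instrument first"),
]

def health_score(scores: dict) -> tuple:
    """Clamp each dimension to its cap, sum to 0-100, and map to a rating."""
    total = sum(min(scores.get(dim, 0), cap) for dim, cap in DIMENSION_CAPS.items())
    for floor, rating, action in RATINGS:
        if total >= floor:
            return total, rating, action

print(health_score({"execution_visibility": 18, "cost_attribution": 12,
                    "output_quality": 10, "failure_recovery": 8,
                    "security_boundaries": 14, "fleet_coordination": 9}))
# (71, 'Operational', 'Fix gaps before scaling')
```

A missing dimension scores zero, which matches the rubric's intent: unmeasured capability counts as absent.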

## Quick Assessment Prompt

Ask the agent to evaluate your setup:

```
Run the agent observability assessment against our current deployment:
- How many agents are running?
- What monitoring exists today?
- What broke in the last 30 days?
- What's our monthly agent spend?
- Who gets alerted when an agent fails?
```

## Cost Framework

| Company Size | Unmonitored Waste | Monitoring Investment | Net Savings |
|-------------|-------------------|----------------------|-------------|
| 1-5 agents | $2K-$8K/mo | $500-$1K/mo | $1.5K-$7K/mo |
| 5-20 agents | $8K-$45K/mo | $2K-$5K/mo | $6K-$40K/mo |
| 20-100 agents | $45K-$200K/mo | $8K-$20K/mo | $37K-$180K/mo |
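The "Net Savings" column is simply the waste range minus the investment range, low end against low end and high end against high end. A small sketch of that arithmetic (the tier figures are copied from the table; the function name is mine):

```python
# Net savings = unmonitored waste minus monitoring investment, per the
# cost framework table above. All figures are monthly dollar ranges.
def net_savings(waste: tuple, investment: tuple) -> tuple:
    """Pair low waste with low investment, and high waste with high investment."""
    return waste[0] - investment[0], waste[1] - investment[1]

tiers = {
    "1-5 agents":    ((2_000, 8_000),    (500, 1_000)),
    "5-20 agents":   ((8_000, 45_000),   (2_000, 5_000)),
    "20-100 agents": ((45_000, 200_000), (8_000, 20_000)),
}

for tier, (waste, investment) in tiers.items():
    low, high = net_savings(waste, investment)
    print(f"{tier}: ${low:,}-${high:,}/mo")
# 1-5 agents: $1,500-$7,000/mo
```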

## 90-Day Monitoring Roadmap

**Week 1-2**: Inventory all agents, document intended scope, tag cost centers
**Week 3-4**: Deploy execution logging (every tool call, every output)
**Month 2**: Build dashboards — cost per task, error rate, latency P95
**Month 3**: Automated alerting — failure detection <5 min, cost anomaly flags, scope violations
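One way to implement the Month 3 cost anomaly flag is a z-score check against a trailing baseline. This is a minimal sketch under that assumption; the roadmap does not prescribe a specific detection method, and the threshold and window are illustrative:

```python
import statistics

def cost_anomaly(history: list, today: float, z_threshold: float = 3.0) -> bool:
    """Flag today's spend if it sits more than z_threshold standard
    deviations above the trailing baseline (Month 3 of the roadmap).
    `history` is a list of recent daily spends for one agent."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1e-9  # avoid divide-by-zero on flat spend
    z = (today - mean) / stdev
    return z > z_threshold

baseline = [102, 98, 101, 99, 100, 103, 97]
print(cost_anomaly(baseline, 180))  # spike well above baseline -> True
print(cost_anomaly(baseline, 104))  # within normal variation -> False
```

The same shape works for error-rate and latency regressions; only the metric fed into `history` changes.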

## 7 Monitoring Mistakes

1. Logging only errors (miss the slow degradation)
2. No cost attribution (agents burn budget invisibly)
3. Monitoring agents like servers (they need task-level observability)
4. Manual review of agent outputs (doesn't scale past 3 agents)
5. No baseline metrics (can't detect regression without a baseline)
6. Alerting on everything (alert fatigue kills response time)
7. Skipping agent-to-agent handoff monitoring (where most fleet failures happen)

## Industry Adjustments

| Industry | Critical Dimension | Why |
|----------|-------------------|-----|
| Financial Services | Security & Boundaries | Regulatory audit trails mandatory |
| Healthcare | Output Quality | Clinical accuracy non-negotiable |
| Legal | Execution Visibility | Billing requires task-level tracking |
| Ecommerce | Cost Attribution | Margin-sensitive, waste kills profit |
| SaaS | Fleet Coordination | Multi-tenant agent isolation |
| Manufacturing | Failure Recovery | Downtime = production line stops |
| Construction | Security & Boundaries | Safety-critical document handling |
| Real Estate | Output Quality | Valuation errors = liability |
| Recruitment | Fleet Coordination | Candidate pipeline handoffs |
| Professional Services | Cost Attribution | Client billing accuracy |

---

## Go Deeper

- **AI Agent Context Packs** — industry-specific decision frameworks: https://afrexai-cto.github.io/context-packs/
- **AI Revenue Leak Calculator** — find where your business loses money to manual processes: https://afrexai-cto.github.io/ai-revenue-calculator/
- **Agent Setup Wizard** — configure your agent stack in 5 minutes: https://afrexai-cto.github.io/agent-setup/

Built by AfrexAI — we help businesses run AI agents that actually make money.

Related Skills

afrexai-observability-engine (from openclaw/skills)

Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, building alerting systems, creating SLO/SLI frameworks, running incident response, conducting post-mortems, or auditing system reliability. Covers all three pillars (logs/metrics/traces), alert design, dashboard architecture, on-call operations, chaos engineering, and cost optimization.

observability-designer (from openclaw/skills)

Observability Designer (POWERFUL)

📡 Langfuse Observability (from openclaw/skills)

Complete Langfuse v3 observability toolkit for OpenClaw agents — automatic tracing for LLM calls, API calls, tool executions, and custom events. Cost tracking per model, session grouping, evaluation scoring, dashboard queries, and health monitoring. The central nervous system for agent observability.

article-factory-wechat (Content & Documentation, from openclaw/skills)

humanizer (Content & Documentation, from openclaw/skills)

Remove signs of AI-generated writing from text. Use when editing or reviewing text to make it sound more natural and human-written. Based on Wikipedia's comprehensive "Signs of AI writing" guide. Detects and fixes patterns including: inflated symbolism, promotional language, superficial -ing analyses, vague attributions, em dash overuse, rule of three, AI vocabulary words, negative parallelisms, and excessive conjunctive phrases.

find-skills (General Utilities, from openclaw/skills)

Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.

tavily-search (Data & Research, from openclaw/skills)

Use the Tavily API for real-time web search and content extraction. Use when the user needs real-time web search results, research, or current information from the web. Requires a Tavily API key.

baidu-search (Data & Research, from openclaw/skills)

Search the web using Baidu AI Search Engine (BDSE). Use for live information, documentation, or research topics.

agent-autonomy-kit (Workflow & Productivity, from openclaw/skills)

Stop waiting for prompts. Keep working.

Meeting Prep (Workflow & Productivity, from openclaw/skills)

Never walk into a meeting unprepared again. Your agent researches all attendees before calendar events — pulling LinkedIn profiles, recent company news, mutual connections, and conversation starters. Generates a briefing doc with talking points, icebreakers, and context so you show up informed and confident. Triggered automatically before meetings or on demand. Configure research depth, advance timing, and output format. Walking into meetings blind is amateur hour — missed connections, generic small talk, zero leverage. Use when setting up meeting intelligence, researching specific attendees, generating pre-meeting briefs, or automating your prep workflow.

self-improvement (Agent Intelligence & Learning, from openclaw/skills)

Captures learnings, errors, and corrections to enable continuous improvement. Use when: (1) a command or operation fails unexpectedly, (2) the user corrects Claude ("No, that's wrong...", "Actually..."), (3) the user requests a capability that doesn't exist, (4) an external API or tool fails, (5) Claude realizes its knowledge is outdated or incorrect, (6) a better approach is discovered for a recurring task. Also review learnings before major tasks.

botlearn-healthcheck (DevOps & Infrastructure, from openclaw/skills)

BotLearn autonomous health inspector for OpenClaw instances across 5 domains (hardware, config, security, skills, autonomy); triggers on system check, health report, diagnostics, or scheduled heartbeat inspection.