Agent Observability & Monitoring
Score, monitor, and troubleshoot AI agent fleets in production. Built for ops teams running 1-100+ agents.
Best use case
Agent Observability & Monitoring is best used when you need a repeatable AI agent workflow instead of a one-off prompt. It is especially useful for ops teams running multi-agent fleets in production (1-100+ agents) who need to score, monitor, and troubleshoot those agents on an ongoing basis.
Users should expect more consistent workflow output, faster repeated execution, and less time spent rewriting prompts from scratch.
Practical example
Example input
Use the "Agent Observability & Monitoring" skill to help with this workflow task. Context: Score, monitor, and troubleshoot AI agent fleets in production. Built for ops teams running 1-100+ agents.
Example output
A structured assessment: a 0-100 health score across six dimensions, specific fixes for the weakest areas, and a format that is easy to reuse in the next run.
When to use this skill
- Use this skill when you want a reusable workflow rather than writing the same prompt again and again.
When not to use this skill
- Do not use this when you only need a one-off answer and do not need a reusable workflow.
- Do not use it if you cannot install or maintain the related files, repository context, or supporting tools.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/afrexai-agent-observability/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
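If you prefer to script the manual install, here is a minimal sketch in Python. The raw-file URL is a placeholder, not the real location; substitute the GitHub link from the top of this page.

```python
# Minimal install sketch. SKILL_URL is a placeholder, not the real location.
from pathlib import Path
from urllib.request import urlopen

SKILL_URL = "https://raw.githubusercontent.com/<org>/<repo>/main/SKILL.md"  # placeholder
dest = Path(".claude/skills/afrexai-agent-observability/SKILL.md")

dest.parent.mkdir(parents=True, exist_ok=True)  # create .claude/skills/... if missing
dest.write_bytes(urlopen(SKILL_URL).read())     # download and place the skill file
print(f"Installed {dest} -- restart your agent to auto-discover it.")
```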
How Agent Observability & Monitoring Compares
| Feature | Agent Observability & Monitoring | Standard Approach |
|---|---|---|
| Platform Support | Claude Code / Cursor / Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Low (single SKILL.md file) | N/A |
Frequently Asked Questions
What does this skill do?
It scores, monitors, and troubleshoots AI agent fleets in production. The skill evaluates your deployment across six dimensions and returns a 0-100 health score with specific fixes; it is built for ops teams running 1-100+ agents.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
AI Agents for Startups
Explore AI agent skills for startup validation, product research, growth experiments, documentation, and fast execution with small teams.
AI Agent for Product Research
Browse AI agent skills for product research, competitive analysis, customer discovery, and structured product decision support.
SKILL.md Source
# Agent Observability & Monitoring

Score, monitor, and troubleshoot AI agent fleets in production. Built for ops teams running 1-100+ agents.

## What This Does

Evaluates your agent deployment across 6 dimensions and returns a 0-100 health score with specific fixes.

## 6-Dimension Assessment

### 1. Execution Visibility (0-20 pts)

- Can you see what every agent is doing right now?
- Task queue depth, active/idle ratio, error rates
- **Benchmark**: Top quartile tracks 95%+ of agent actions in real-time

### 2. Cost Attribution (0-20 pts)

- Do you know exactly what each agent costs per task?
- Token spend, API calls, compute time, tool invocations
- **Benchmark**: Unmonitored agents waste 30-55% on retries and hallucination loops

### 3. Output Quality (0-15 pts)

- Are agent outputs validated before reaching users or systems?
- Accuracy sampling, hallucination detection, regression tracking
- **Benchmark**: 1 in 12 agent outputs contains a material error without monitoring

### 4. Failure Recovery (0-15 pts)

- What happens when an agent fails mid-task?
- Retry logic, graceful degradation, human escalation paths
- **Benchmark**: Mean time to detect agent failure without monitoring: 4.2 hours

### 5. Security & Boundaries (0-15 pts)

- Are agents staying within authorized scope?
- Tool access auditing, data exfiltration checks, permission drift
- **Benchmark**: 23% of production agents access tools outside their intended scope

### 6. Fleet Coordination (0-15 pts)

- Do multi-agent workflows hand off cleanly?
- Message passing reliability, deadlock detection, duplicate work
- **Benchmark**: Uncoordinated fleets duplicate 18-25% of work

## Scoring

| Score | Rating | Action |
|-------|--------|--------|
| 80-100 | Production-grade | Optimize and scale |
| 60-79 | Operational | Fix gaps before scaling |
| 40-59 | Risky | Immediate remediation needed |
| 0-39 | Blind | Stop scaling, instrument first |

## Quick Assessment Prompt

Ask the agent to evaluate your setup:

```
Run the agent observability assessment against our current deployment:
- How many agents are running?
- What monitoring exists today?
- What broke in the last 30 days?
- What's our monthly agent spend?
- Who gets alerted when an agent fails?
```
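To make the rubric concrete, here is a minimal sketch of how the six dimension scores roll up into the 0-100 rating. The dimension caps and rating bands come from the tables above; the example scores and the clamping behavior are assumptions for illustration, not part of the skill itself.

```python
# Sketch of the 0-100 rollup. Caps and bands match the rubric above;
# the example scores are hypothetical.
CAPS = {
    "execution_visibility": 20,
    "cost_attribution": 20,
    "output_quality": 15,
    "failure_recovery": 15,
    "security_boundaries": 15,
    "fleet_coordination": 15,
}

def health_score(scores: dict[str, int]) -> tuple[int, str]:
    # Clamp each dimension to its cap so a mistyped score can't inflate the total.
    total = sum(min(scores.get(dim, 0), cap) for dim, cap in CAPS.items())
    if total >= 80:
        rating = "Production-grade: optimize and scale"
    elif total >= 60:
        rating = "Operational: fix gaps before scaling"
    elif total >= 40:
        rating = "Risky: immediate remediation needed"
    else:
        rating = "Blind: stop scaling, instrument first"
    return total, rating

# A hypothetical small fleet, strong on visibility but weak on cost tracking:
print(health_score({
    "execution_visibility": 12,
    "cost_attribution": 8,
    "output_quality": 10,
    "failure_recovery": 9,
    "security_boundaries": 11,
    "fleet_coordination": 7,
}))  # -> (57, 'Risky: immediate remediation needed')
```

Clamping each dimension to its cap keeps a bad input from pushing a fleet above the band it has actually earned.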
## Cost Framework

| Company Size | Unmonitored Waste | Monitoring Investment | Net Savings |
|-------------|-------------------|----------------------|-------------|
| 1-5 agents | $2K-$8K/mo | $500-$1K/mo | $1.5K-$7K/mo |
| 5-20 agents | $8K-$45K/mo | $2K-$5K/mo | $6K-$40K/mo |
| 20-100 agents | $45K-$200K/mo | $8K-$20K/mo | $37K-$180K/mo |

## 90-Day Monitoring Roadmap

- **Week 1-2**: Inventory all agents, document intended scope, tag cost centers
- **Week 3-4**: Deploy execution logging (every tool call, every output)
- **Month 2**: Build dashboards for cost per task, error rate, and latency P95
- **Month 3**: Automate alerting: failure detection in under 5 minutes, cost anomaly flags, scope violations

## 7 Monitoring Mistakes

1. Logging only errors (miss the slow degradation)
2. No cost attribution (agents burn budget invisibly)
3. Monitoring agents like servers (they need task-level observability)
4. Manual review of agent outputs (doesn't scale past 3 agents)
5. No baseline metrics (can't detect regression without a baseline)
6. Alerting on everything (alert fatigue kills response time)
7. Skipping agent-to-agent handoff monitoring (where most fleet failures happen)

## Industry Adjustments

| Industry | Critical Dimension | Why |
|----------|-------------------|-----|
| Financial Services | Security & Boundaries | Regulatory audit trails mandatory |
| Healthcare | Output Quality | Clinical accuracy non-negotiable |
| Legal | Execution Visibility | Billing requires task-level tracking |
| Ecommerce | Cost Attribution | Margin-sensitive; waste kills profit |
| SaaS | Fleet Coordination | Multi-tenant agent isolation |
| Manufacturing | Failure Recovery | Downtime stops the production line |
| Construction | Security & Boundaries | Safety-critical document handling |
| Real Estate | Output Quality | Valuation errors create liability |
| Recruitment | Fleet Coordination | Candidate pipeline handoffs |
| Professional Services | Cost Attribution | Client billing accuracy |

---

## Go Deeper

- **AI Agent Context Packs** (industry-specific decision frameworks): https://afrexai-cto.github.io/context-packs/
- **AI Revenue Leak Calculator** (find where your business loses money to manual processes): https://afrexai-cto.github.io/ai-revenue-calculator/
- **Agent Setup Wizard** (configure your agent stack in 5 minutes): https://afrexai-cto.github.io/agent-setup/

Built by AfrexAI. We help businesses run AI agents that actually make money.
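As a companion to the Month 3 "cost anomaly flags" milestone in the roadmap above, here is a minimal, hypothetical sketch of a per-agent cost anomaly check. The window size, threshold multiplier, warm-up count, and agent ID are all illustrative assumptions; the skill itself does not prescribe an implementation.

```python
# Hypothetical cost-anomaly flag: alert when one task's token spend runs far
# past the agent's rolling baseline. WINDOW, THRESHOLD, and the warm-up of 10
# tasks are illustrative choices, not part of the skill.
from collections import defaultdict, deque
from statistics import mean

WINDOW = 50       # tasks kept per agent for the rolling baseline
THRESHOLD = 3.0   # alert when spend exceeds 3x the baseline mean

history: dict[str, deque] = defaultdict(lambda: deque(maxlen=WINDOW))

def record_task_cost(agent_id: str, tokens: int) -> bool:
    """Record one task's token spend; return True if it should raise an alert."""
    past = history[agent_id]
    anomalous = len(past) >= 10 and tokens > THRESHOLD * mean(past)
    past.append(tokens)
    return anomalous

# Example: a retry loop suddenly burns roughly 10x the usual tokens.
for spent in [900, 1100, 950, 1000, 1050, 980, 1020, 990, 1010, 1005, 9800]:
    if record_task_cost("agent-7", spent):
        print(f"cost anomaly: agent-7 spent {spent} tokens on one task")
```

Comparing against a rolling baseline rather than a fixed budget also addresses mistake 5 above: without baseline metrics, regressions go undetected.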
Related Skills
afrexai-observability-engine
Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, building alerting systems, creating SLO/SLI frameworks, running incident response, conducting post-mortems, or auditing system reliability. Covers all three pillars (logs/metrics/traces), alert design, dashboard architecture, on-call operations, chaos engineering, and cost optimization.
observability-designer
Observability Designer (POWERFUL)
📡 Langfuse Observability
Complete Langfuse v3 observability toolkit for OpenClaw agents — automatic tracing for LLM calls, API calls, tool executions, and custom events. Cost tracking per model, session grouping, evaluation scoring, dashboard queries, and health monitoring. The central nervous system for agent observability.
article-factory-wechat
humanizer
Remove signs of AI-generated writing from text. Use when editing or reviewing text to make it sound more natural and human-written. Based on Wikipedia's comprehensive "Signs of AI writing" guide. Detects and fixes patterns including: inflated symbolism, promotional language, superficial -ing analyses, vague attributions, em dash overuse, rule of three, AI vocabulary words, negative parallelisms, and excessive conjunctive phrases.
find-skills
Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.
tavily-search
Use Tavily API for real-time web search and content extraction. Use when: user needs real-time web search results, research, or current information from the web. Requires Tavily API key.
baidu-search
Search the web using Baidu AI Search Engine (BDSE). Use for live information, documentation, or research topics.
agent-autonomy-kit
Stop waiting for prompts. Keep working.
Meeting Prep
Never walk into a meeting unprepared again. Your agent researches all attendees before calendar events—pulling LinkedIn profiles, recent company news, mutual connections, and conversation starters. Generates a briefing doc with talking points, icebreakers, and context so you show up informed and confident. Triggered automatically before meetings or on-demand. Configure research depth, advance timing, and output format. Walking into meetings blind is amateur hour—missed connections, generic small talk, zero leverage. Use when setting up meeting intelligence, researching specific attendees, generating pre-meeting briefs, or automating your prep workflow.
self-improvement
Captures learnings, errors, and corrections to enable continuous improvement. Use when: (1) A command or operation fails unexpectedly, (2) User corrects Claude ('No, that's wrong...', 'Actually...'), (3) User requests a capability that doesn't exist, (4) An external API or tool fails, (5) Claude realizes its knowledge is outdated or incorrect, (6) A better approach is discovered for a recurring task. Also review learnings before major tasks.
botlearn-healthcheck
botlearn-healthcheck — BotLearn autonomous health inspector for OpenClaw instances across 5 domains (hardware, config, security, skills, autonomy); triggers on system check, health report, diagnostics, or scheduled heartbeat inspection.