agent-evaluation
Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.
Best use case
agent-evaluation is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using agent-evaluation should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/agent-evaluation/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How agent-evaluation Compares
| Feature / Agent | agent-evaluation | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It covers testing and benchmarking of LLM agents: behavioral testing, capability assessment, reliability metrics, and production monitoring, in a setting where even top agents achieve less than 50% on real-world benchmarks. Use it for agent testing, agent evaluation, agent benchmarking, and agent reliability work.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software: the same input can produce different outputs, and "correct" often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate; it

## Capabilities

- agent-testing
- benchmark-design
- capability-assessment
- reliability-metrics
- regression-testing

## Requirements

- testing-fundamentals
- llm-fundamentals

## Patterns

### Statistical Test Evaluation

Run tests multiple times and analyze result distributions

### Behavioral Contract Testing

Define and test agent behavioral invariants

### Adversarial Testing

Actively try to break agent behavior

## Anti-Patterns

### ❌ Single-Run Testing

### ❌ Only Happy Path Tests

### ❌ Output String Matching

## ⚠️ Sharp Edges

| Issue | Severity | Solution |
|-------|----------|----------|
| Agent scores well on benchmarks but fails in production | high | Bridge benchmark and production evaluation |
| Same test passes sometimes, fails other times | high | Handle flaky tests in LLM agent evaluation |
| Agent optimized for metric, not actual task | medium | Multi-dimensional evaluation to prevent gaming |
| Test data accidentally used in training or prompts | critical | Prevent data leakage in agent evaluation |

## Related Skills

Works well with: `multi-agent-orchestration`, `agent-communication`, `autonomous-agents`
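The Statistical Test Evaluation and Behavioral Contract Testing patterns named in the SKILL.md source can be combined in a small harness. The sketch below is illustrative only and is not taken from the skill itself: `evaluate`, `wilson_interval`, `make_toy_agent`, and the example contracts are hypothetical names, and the toy agent is a seeded random stand-in for a real LLM call. It runs the same prompt many times, checks invariants on each output instead of matching an exact string, and reports the pass rate with a Wilson score interval.

```python
import math
import random
from typing import Callable, Dict, Sequence, Tuple

Agent = Callable[[str], str]       # hypothetical agent interface: prompt in, text out
Contract = Callable[[str], bool]   # behavioral invariant checked on each output


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def evaluate(agent: Agent, prompt: str, contracts: Sequence[Contract],
             runs: int = 20) -> Dict[str, float]:
    """Run the agent `runs` times; a run passes only if every contract holds."""
    passes = 0
    for _ in range(runs):
        output = agent(prompt)
        if all(check(output) for check in contracts):
            passes += 1
    low, high = wilson_interval(passes, runs)
    return {"passes": passes, "runs": runs, "pass_rate": passes / runs,
            "ci_low": low, "ci_high": high}


def make_toy_agent(seed: int, failure_rate: float) -> Agent:
    """Assumed stand-in for a nondeterministic agent; replace with a real call."""
    rng = random.Random(seed)

    def agent(prompt: str) -> str:
        if rng.random() < failure_rate:
            return "I cannot help with that."
        # Wording varies between runs, so exact string matching would be brittle.
        return f"Refund approved for order 1234. Reference: {rng.randint(1000, 9999)}"

    return agent


# Example invariants (assumptions for this sketch, not rules from the skill).
CONTRACTS: Sequence[Contract] = (
    lambda out: "1234" in out,                 # mentions the order in question
    lambda out: "password" not in out.lower(), # never echoes sensitive terms
    lambda out: len(out) < 500,                # stays within a length budget
)

if __name__ == "__main__":
    report = evaluate(make_toy_agent(seed=0, failure_rate=0.2),
                      "Refund order 1234", CONTRACTS, runs=50)
    print(report)
```

Gating a release on the lower bound of the interval (`ci_low`) rather than on a single run is one way to avoid the Single-Run Testing anti-pattern. The threshold itself is a project decision that this sketch does not prescribe.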
Related Skills
model-evaluation-metrics
Model Evaluation Metrics: auto-activating skill for ML Training. Triggers on: model evaluation metrics. Part of the ML Training skill category.
promptfoo-evaluation
Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
llm-evaluation
Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
advanced-evaluation
This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.
evaluation
Build evaluation frameworks for agent systems. Use when testing agent performance, validating context engineering choices, or measuring improvements over time.
content-evaluation-framework
This skill should be used when evaluating the quality of book chapters, lessons, or educational content. It provides a systematic 6-category rubric with weighted scoring (Technical Accuracy 30%, Pedagogical Effectiveness 25%, Writing Quality 20%, Structure & Organization 15%, AI-First Teaching 10%, Constitution Compliance Pass/Fail) and multi-tier assessment (Excellent/Good/Needs Work/Insufficient). Use this during iterative drafting, after content completion, on-demand review requests, or before validation phases.
Ragas — RAG Evaluation Framework
DeepEval — LLM Testing & Evaluation Framework
Braintrust — AI Evaluation and Observability
You are an expert in Braintrust, the evaluation and observability platform for AI applications. You help developers run systematic evaluations, compare model versions, track experiments, log production traces, and measure quality metrics — with a focus on making AI development as rigorous as traditional software testing.
lm-evaluation-harness - LLM Benchmarking
BigCode Evaluation Harness - Code Model Benchmarking
Agentic Evaluation Patterns
Patterns for self-improvement through iterative evaluation and refinement.