eval-harness
Evaluation harness for testing agent and skill quality through structured benchmarks, regression tests, and quality scoring.
Best use case
eval-harness is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using eval-harness should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/eval-harness/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How eval-harness Compares
| Feature / Agent | eval-harness | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Evaluation harness for testing agent and skill quality through structured benchmarks, regression tests, and quality scoring.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Eval Harness

## Overview

Evaluation harness methodology adapted from the Everything Claude Code project. Provides structured frameworks for benchmarking agent performance, testing skill quality, and running regression suites.

## Evaluation Types

### 1. Agent Performance Benchmark

- Define test cases with known-correct outputs
- Run agent against each test case
- Score: accuracy, completeness, relevance
- Compare against baseline performance
- Track performance over time

### 2. Skill Quality Testing

- Verify skill instructions produce expected outcomes
- Test edge cases and boundary conditions
- Measure consistency across multiple runs
- Check for harmful or incorrect outputs
- Validate against ground truth

### 3. Regression Suite

- Collection of previously-passing test cases
- Run after any agent/skill modification
- Flag regressions with before/after comparison
- Maintain pass rate threshold (>= 95%)

### 4. Process Verification

- End-to-end process execution with known inputs
- Verify each phase produces expected outputs
- Check task ordering and dependency satisfaction
- Measure total execution time

## Quality Scoring

### Accuracy Score (0-100)

- Correctness of output vs expected
- Partial credit for partially correct outputs
- Penalty for hallucinated or fabricated content

### Completeness Score (0-100)

- Coverage of required output elements
- Missing sections flagged and scored
- Bonus for useful additional context

### Consistency Score (0-100)

- Run same input 3 times
- Compare outputs for semantic similarity
- Flag inconsistencies

### Composite Score

- (accuracy * 0.4 + completeness * 0.3 + consistency * 0.3)
- Threshold: 80 to pass

## When to Use

- After creating new agents or skills
- After modifying existing agents or skills
- Periodic quality audits
- Before promoting skills to production

## Agents Used

- Used by process-level evaluation orchestrators
- No specific agent dependency (evaluates other agents)
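The composite score and the regression pass-rate check described in the SKILL.md above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own implementation: the weights (0.4 / 0.3 / 0.3), the pass threshold of 80, and the 95% regression pass rate come from the text, while the names `Scores`, `composite_score`, `passes`, and `regression_suite_ok` are hypothetical, and the range check on inputs is an added assumption.

```python
from dataclasses import dataclass
from typing import Sequence

# Values stated in the SKILL.md text.
WEIGHTS = {"accuracy": 0.4, "completeness": 0.3, "consistency": 0.3}
PASS_THRESHOLD = 80.0          # composite score needed to pass
REGRESSION_PASS_PERCENT = 95   # minimum pass rate for the regression suite


@dataclass
class Scores:
    """Hypothetical container for the three 0-100 component scores."""
    accuracy: float
    completeness: float
    consistency: float


def composite_score(scores: Scores) -> float:
    """Weighted sum: accuracy * 0.4 + completeness * 0.3 + consistency * 0.3."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = getattr(scores, name)
        # Assumption: each component is already on the documented 0-100 scale.
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
        total += value * weight
    return total


def passes(scores: Scores) -> bool:
    """True when the composite score meets the threshold of 80."""
    return composite_score(scores) >= PASS_THRESHOLD


def regression_suite_ok(results: Sequence[bool]) -> bool:
    """True when at least 95% of previously-passing cases still pass."""
    if not results:
        raise ValueError("regression suite has no test cases")
    passed = sum(1 for ok in results if ok)
    # Integer comparison avoids floating-point rounding at the boundary.
    return passed * 100 >= REGRESSION_PASS_PERCENT * len(results)
```

For example, component scores of 90, 80, and 70 give 36 + 24 + 21 = 81, which clears the threshold of 80, and a regression suite with 19 of 20 cases passing sits exactly at the 95% pass rate.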
Related Skills
program-evaluation
Design and implement formative, summative, and developmental evaluations using logic models and mixed methods
primary-source-evaluation
Authenticate, date, and critically assess historical documents for provenance, reliability, and bias with systematic source criticism methodology
pqc-evaluator
Post-quantum cryptography evaluation skill for quantum-safe migration
green-synthesis-evaluator
Sustainability assessment skill for evaluating and designing environmentally friendly nanomaterial synthesis routes
iso10993-evaluator
Biological evaluation planning skill implementing ISO 10993-1 for biocompatibility testing strategy
job-evaluation
Analyze and evaluate jobs for internal equity and leveling using point-factor methods
promethee-evaluator
PROMETHEE (Preference Ranking Organization Method for Enrichment Evaluation) skill for outranking-based multi-criteria analysis
cli-e2e-test-harness
Set up E2E test harness for CLI applications with process spawning and assertions.
process-builder
Scaffold new babysitter process definitions following SDK patterns, proper structure, and best practices. Guides the 3-phase workflow from research to implementation.
babysitter
Orchestrate via @babysitter. Use this skill when asked to babysit a run, orchestrate a process or whenever it is called explicitly. (babysit, babysitter, orchestrate, orchestrate a run, workflow, etc.)
yolo
Run Babysitter autonomously with minimal manual interruption.
user-install
Install the user-level Babysitter Codex setup.