eval-harness

Evaluation harness for testing agent and skill quality through structured benchmarks, regression tests, and quality scoring.

509 stars

Best use case

eval-harness is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using eval-harness should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/eval-harness/SKILL.md --create-dirs "https://raw.githubusercontent.com/a5c-ai/babysitter/main/library/methodologies/everything-claude-code/skills/eval-harness/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/eval-harness/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How eval-harness Compares

Feature / Agent          | eval-harness  | Standard Approach
Platform Support         | Not specified | Limited / Varies
Context Awareness        | High          | Baseline
Installation Complexity  | Unknown       | N/A

Frequently Asked Questions

What does this skill do?

Evaluation harness for testing agent and skill quality through structured benchmarks, regression tests, and quality scoring.

Where can I find the source code?

The source code is in the a5c-ai/babysitter repository on GitHub. The URL in the installation command above points at the raw SKILL.md file.

SKILL.md Source

# Eval Harness

## Overview

Evaluation harness methodology adapted from the Everything Claude Code project. Provides structured frameworks for benchmarking agent performance, testing skill quality, and running regression suites.

## Evaluation Types

### 1. Agent Performance Benchmark
- Define test cases with known-correct outputs
- Run agent against each test case
- Score: accuracy, completeness, relevance
- Compare against baseline performance
- Track performance over time
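
The benchmark loop above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual implementation: `BenchmarkCase`, `run_benchmark`, `pass_rate`, the stand-in `echo_upper` agent, and the baseline value are all hypothetical, and exact string matching is assumed as the correctness check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkCase:
    """One test case with a known-correct output (hypothetical structure)."""
    name: str
    prompt: str
    expected: str


def run_benchmark(agent: Callable[[str], str], cases: list[BenchmarkCase]) -> dict[str, bool]:
    """Run the agent on every case and record pass/fail by exact match."""
    return {c.name: agent(c.prompt).strip() == c.expected.strip() for c in cases}


def pass_rate(results: dict[str, bool]) -> float:
    """Percentage of passing cases; 0.0 for an empty result set."""
    return 100.0 * sum(results.values()) / len(results) if results else 0.0


def echo_upper(prompt: str) -> str:
    """Stand-in agent used only for this example."""
    return prompt.upper()


cases = [
    BenchmarkCase("greeting", "hello", "HELLO"),
    BenchmarkCase("mixed", "Abc", "ABC"),
    BenchmarkCase("wrong", "x", "y"),  # deliberately failing case
]
results = run_benchmark(echo_upper, cases)
current = pass_rate(results)
baseline = 50.0  # assumed pass rate from an earlier run
delta = current - baseline  # positive means improvement over the baseline
```

Storing `current` alongside a timestamp after each run is one simple way to track performance over time.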

### 2. Skill Quality Testing
- Verify skill instructions produce expected outcomes
- Test edge cases and boundary conditions
- Measure consistency across multiple runs
- Check for harmful or incorrect outputs
- Validate against ground truth

### 3. Regression Suite
- Collection of previously-passing test cases
- Run after any agent/skill modification
- Flag regressions with before/after comparison
- Maintain pass rate threshold (>= 95%)
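
A before/after comparison with the 95% threshold might look like the sketch below. The function names and the result format (case name mapped to pass/fail) are assumptions made for illustration.

```python
def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Cases that passed before the change but fail (or are missing) after it."""
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )


def suite_passes(after: dict[str, bool], threshold: float = 95.0) -> bool:
    """True when the pass rate (in percent) meets the threshold."""
    if not after:
        return False
    rate = 100.0 * sum(after.values()) / len(after)
    return rate >= threshold


# Example: 20 previously-passing cases, one of which breaks after a change.
before = {f"case_{i:02d}": True for i in range(20)}
after = dict(before, case_07=False)

regressions = find_regressions(before, after)
ok = suite_passes(after)  # 19 of 20 passing is exactly 95%
```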

### 4. Process Verification
- End-to-end process execution with known inputs
- Verify each phase produces expected outputs
- Check task ordering and dependency satisfaction
- Measure total execution time
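
One way to check task ordering and dependency satisfaction is to compare the observed execution order against a dependency map. The sketch below assumes a hypothetical format (task name mapped to the tasks it requires) and hypothetical helper names; total execution time is measured with `time.perf_counter`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def check_ordering(executed: list[str], deps: dict[str, list[str]]) -> list[str]:
    """Return a violation message for each task that ran before, or without, a dependency."""
    position = {task: i for i, task in enumerate(executed)}
    violations = []
    for task in executed:
        for dep in deps.get(task, []):
            if dep not in position:
                violations.append(f"{task} ran without {dep}")
            elif position[dep] > position[task]:
                violations.append(f"{task} ran before {dep}")
    return violations


def timed(fn: Callable[[], T]) -> tuple[T, float]:
    """Run fn and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Hypothetical three-phase process.
deps = {"implement": ["research"], "review": ["implement"]}
good = check_ordering(["research", "implement", "review"], deps)
bad = check_ordering(["implement", "research", "review"], deps)
_, elapsed = timed(lambda: check_ordering(["research", "implement", "review"], deps))
```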

## Quality Scoring

### Accuracy Score (0-100)
- Correctness of output vs expected
- Partial credit for partially correct outputs
- Penalty for hallucinated or fabricated content

### Completeness Score (0-100)
- Coverage of required output elements
- Missing sections flagged and scored
- Bonus for useful additional context
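
As a rough illustration, completeness can be approximated by checking which required elements appear in the output. The case-insensitive substring check, the `bonus` parameter, and the cap at 100 are assumptions; the source does not specify how the bonus is computed.

```python
def completeness_score(
    output: str, required: list[str], bonus: float = 0.0
) -> tuple[float, list[str]]:
    """Score coverage of required elements and list the missing ones.

    bonus is an assumed, caller-supplied adjustment for useful extra context.
    """
    lowered = output.lower()
    missing = [section for section in required if section.lower() not in lowered]
    covered = len(required) - len(missing)
    base = 100.0 * covered / len(required) if required else 100.0
    return min(100.0, base + bonus), missing


required = ["Summary", "Risks", "Next steps"]
output = "Summary: all checks ran.\nRisks: none identified."
score, missing = completeness_score(output, required)
```

Here two of three required elements are present, so `missing` flags "Next steps" and the score reflects two-thirds coverage.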

### Consistency Score (0-100)
- Run the same input 3 times
- Compare outputs for semantic similarity
- Flag inconsistencies
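
The sketch below shows the shape of a consistency check over repeated runs. It substitutes character-level similarity from the standard library's `difflib.SequenceMatcher` for true semantic similarity, which would normally need an embedding model or a judge; the function name and the 0.8 flagging cutoff are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def consistency_score(
    outputs: list[str], flag_below: float = 0.8
) -> tuple[float, list[tuple[int, int]]]:
    """Mean pairwise similarity (0-100) and the index pairs that fall below the cutoff."""
    if len(outputs) < 2:
        raise ValueError("need at least two outputs to compare")
    ratios = {
        (i, j): SequenceMatcher(None, outputs[i], outputs[j]).ratio()
        for i, j in combinations(range(len(outputs)), 2)
    }
    score = 100.0 * sum(ratios.values()) / len(ratios)
    flagged = [pair for pair, ratio in ratios.items() if ratio < flag_below]
    return score, flagged


# Three runs of the same input; the third run diverges completely.
stable_score, stable_flags = consistency_score(["abc", "abc", "abc"])
mixed_score, mixed_flags = consistency_score(["abc", "abc", "xyz"])
```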

### Composite Score
- Composite = accuracy * 0.4 + completeness * 0.3 + consistency * 0.3 (the weights sum to 1.0, so the result stays on the 0-100 scale)
- Threshold: 80 to pass
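
The weighted formula can be written out directly. In this sketch the function names are hypothetical, and treating the threshold as inclusive (a score of exactly 80 passes) is an assumption.

```python
PASS_THRESHOLD = 80.0  # from the skill definition; inclusive comparison is assumed


def composite_score(accuracy: float, completeness: float, consistency: float) -> float:
    """Weighted composite of three 0-100 scores, using the weights given above."""
    for value in (accuracy, completeness, consistency):
        if not 0.0 <= value <= 100.0:
            raise ValueError("scores must be in the range 0-100")
    return accuracy * 0.4 + completeness * 0.3 + consistency * 0.3


def passes(score: float) -> bool:
    """True when the composite meets the pass threshold."""
    return score >= PASS_THRESHOLD


# Worked example: 90 * 0.4 + 80 * 0.3 + 70 * 0.3 = 36 + 24 + 21 = 81
example = composite_score(90, 80, 70)
example_passes = passes(example)
```

Because accuracy carries the largest weight, a 10-point drop in accuracy costs 4 composite points, while the same drop in completeness or consistency costs 3.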

## When to Use

- After creating new agents or skills
- After modifying existing agents or skills
- Periodic quality audits
- Before promoting skills to production

## Agents Used

- Used by process-level evaluation orchestrators
- No specific agent dependency (evaluates other agents)

Related Skills

program-evaluation

509
from a5c-ai/babysitter

Design and implement formative, summative, and developmental evaluations using logic models and mixed methods

primary-source-evaluation

509
from a5c-ai/babysitter

Authenticate, date, and critically assess historical documents for provenance, reliability, and bias with systematic source criticism methodology

pqc-evaluator

509
from a5c-ai/babysitter

Post-quantum cryptography evaluation skill for quantum-safe migration

green-synthesis-evaluator

509
from a5c-ai/babysitter

Sustainability assessment skill for evaluating and designing environmentally friendly nanomaterial synthesis routes

iso10993-evaluator

509
from a5c-ai/babysitter

Biological evaluation planning skill implementing ISO 10993-1 for biocompatibility testing strategy

job-evaluation

509
from a5c-ai/babysitter

Analyze and evaluate jobs for internal equity and leveling using point-factor methods

promethee-evaluator

509
from a5c-ai/babysitter

PROMETHEE (Preference Ranking Organization Method for Enrichment Evaluation) skill for outranking-based multi-criteria analysis

cli-e2e-test-harness

509
from a5c-ai/babysitter

Set up E2E test harness for CLI applications with process spawning and assertions.

process-builder

509
from a5c-ai/babysitter

Scaffold new babysitter process definitions following SDK patterns, proper structure, and best practices. Guides the 3-phase workflow from research to implementation.

Workflow & Productivity

babysitter

509
from a5c-ai/babysitter

Orchestrate via @babysitter. Use this skill when asked to babysit a run, orchestrate a process or whenever it is called explicitly. (babysit, babysitter, orchestrate, orchestrate a run, workflow, etc.)

yolo

509
from a5c-ai/babysitter

Run Babysitter autonomously with minimal manual interruption.

user-install

509
from a5c-ai/babysitter

Install the user-level Babysitter Codex setup.