eval-creator

[Beta] Creates permanent eval cases from promoted learnings and runs regression checks against them. Turns failures into test cases that prevent silent regression. This is the outer loop's regress-test step. Use when a learning is promoted and has a clear pass/fail condition, or on cadence to verify promoted rules still hold.

6 stars

by pskoett


Best use case

eval-creator is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using eval-creator can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/eval-creator/SKILL.md --create-dirs "https://raw.githubusercontent.com/pskoett/measuring-ai-proficiency/main/.claude/skills/eval-creator/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/eval-creator/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How eval-creator Compares

| Feature / Agent | eval-creator | Standard Approach |
|-----------------|--------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

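What does this skill do?

It creates permanent eval cases from promoted learnings and runs regression checks against them, so a failure that taught something becomes a test that catches its recurrence.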

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Eval Creator

Turns promoted learnings into permanent eval cases. Runs regression checks to verify promoted rules hold. This is the outer loop's **regress-test** step.

The blog says: "If a failure taught you something important, it should become a permanent test case. Otherwise the knowledge is still fragile."

## When to Use

- **After harness-updater promotes a pattern** — create an eval for it
- **On cadence** — run all evals to check for regression
- **Before major releases** — verify the harness is holding
- **When a promoted rule seems to have stopped working** — diagnose with targeted eval run

## Eval Directory Structure

```
.evals/
  EVAL_INDEX.md          # Index of all eval cases with status
  cases/
    eval-YYYYMMDD-001.md # Individual eval case
    eval-YYYYMMDD-002.md
    ...
```
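The naming scheme above implies a per-day sequence number. A minimal sketch of how the next free ID might be derived from the existing file names follows; the function name `next_eval_id` and the example dates are hypothetical, and the skill itself does not prescribe this logic.

```python
import re
from datetime import date


def next_eval_id(existing_names, today):
    """Return the next free eval-YYYYMMDD-NNN id for the given day."""
    stamp = today.strftime("%Y%m%d")
    # Accept names with or without the .md suffix.
    pattern = re.compile(rf"^eval-{stamp}-(\d{{3}})(?:\.md)?$")
    used = [int(m.group(1)) for name in existing_names if (m := pattern.match(name))]
    return f"eval-{stamp}-{max(used, default=0) + 1:03d}"


# Hypothetical contents of .evals/cases/
existing = ["eval-20250101-001.md", "eval-20250101-002.md", "eval-20241231-007.md"]
new_id = next_eval_id(existing, date(2025, 1, 1))
print(new_id)  # eval-20250101-003
```

Only cases created on the same day are counted, so the sequence restarts at 001 each day.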

## Creating an Eval Case

### Input

From harness-updater or manually:
- Pattern-Key of the promoted learning
- The rule that was added to the project instruction files (CLAUDE.md, AGENTS.md, .github/copilot-instructions.md)
- What to test (the assertion)
- Verification method

### Eval Case Format

````markdown
---
id: eval-YYYYMMDD-NNN
pattern-key: [from learning]
source: [LRN-YYYYMMDD-001, ERR-YYYYMMDD-003]
promoted-rule: "[the rule text in project instruction files]"
promoted-to: CLAUDE.md  # or AGENTS.md, .github/copilot-instructions.md, or equivalent
created: YYYY-MM-DD
last-run: YYYY-MM-DD
last-result: pass | fail | skip
---

## What This Tests

[One sentence: what failure this eval prevents from recurring]

## Precondition

[What must be true for this eval to be runnable]
- File X exists
- Project uses framework Y
- etc.

## Verification Method

[One of: grep-check, command-check, file-check, rule-check, skill-check, script-check]

### grep-check
Search for a pattern that should (or should not) exist:
```
target: src/**/*.ts
pattern: "hardcoded-secret-pattern"
expect: not_found
```

### command-check
Run a command and check the exit code or output:
```
command: npm run typecheck
expect_exit: 0
```

### file-check
Verify a file or section exists:
```
target: CLAUDE.md  # or AGENTS.md, .github/copilot-instructions.md
section: "## Verification"
expect: exists
```

### rule-check
Verify a rule exists in an instruction file:
```
target: CLAUDE.md  # or AGENTS.md, .github/copilot-instructions.md
contains: "[the promoted rule text or key phrase]"
expect: found
```

## Expected Result

**Pass:** [What "good" looks like]
**Fail:** [What regression looks like]

## Recovery Action

If this eval fails:
1. [Specific step to diagnose]
2. [Specific step to fix]
3. Re-run this eval to verify
````
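To read `last-run` and `last-result` back out of a case file, the front matter has to be separated from the body. The sketch below assumes flat `key: value` lines only and uses no YAML library; `parse_front_matter` and the sample values are hypothetical, and a real YAML parser would be the safer choice for anything beyond this simple shape.

```python
def parse_front_matter(text):
    """Split an eval case into (metadata dict, body). Flat `key: value` lines only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter: everything after it is the Markdown body.
            return meta, "\n".join(lines[i + 1:])
        if ":" in line:
            key, _, value = line.partition(":")
            # Naive comment stripping; would misread a quoted value containing " #".
            value = value.split(" #", 1)[0].strip()
            meta[key.strip()] = value.strip('"')
    return {}, text  # no closing delimiter: treat as having no front matter


case_text = """---
id: eval-20250101-001
pattern-key: pnpm-not-npm
promoted-to: CLAUDE.md  # or AGENTS.md
last-result: pass
---

## What This Tests
Prevents npm from being used in a pnpm repo.
"""

meta, body = parse_front_matter(case_text)
print(meta["id"], meta["promoted-to"], meta["last-result"])
```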

## Running Evals

### Run All
Read `.evals/EVAL_INDEX.md`, iterate through all cases, execute each verification method.

### Run by Pattern-Key
Filter to evals matching a specific pattern.

### Run by Area
Filter to evals whose source files match an area (frontend, backend, etc.).
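How a filter matches is not spelled out here. As one possible reading, the sketch below treats a pattern-key as matching when it is equal to the requested key or sits under it as a dotted prefix (as in `skill-quality.verify-gate`). The function name and sample cases are hypothetical.

```python
def filter_by_pattern_key(cases, key):
    """Keep cases whose pattern-key equals `key` or is namespaced under it."""
    return [
        c for c in cases
        if c.get("pattern-key", "") == key
        or c.get("pattern-key", "").startswith(key + ".")
    ]


# Hypothetical parsed front matter from three case files
cases = [
    {"id": "eval-20250101-001", "pattern-key": "pnpm-not-npm"},
    {"id": "eval-20250101-002", "pattern-key": "skill-quality.verify-gate"},
    {"id": "eval-20250101-003", "pattern-key": "skill-quality.pre-flight-check"},
]
selected = [c["id"] for c in filter_by_pattern_key(cases, "skill-quality")]
print(selected)  # ['eval-20250101-002', 'eval-20250101-003']
```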

### Execution

For each eval case:

1. **Check precondition** — if not met, mark as `skip`
2. **Execute verification method:**
   - `grep-check`: Use Grep tool to search target files for the pattern
   - `command-check`: Run the command via Bash, check exit code and/or output
   - `file-check`: Use Read/Glob to verify file/section existence
   - `rule-check`: Read the target file, search for the expected content
   - `skill-check`: Run `quick_validate.py` on a skill directory (see Skill Validation below)
   - `script-check`: Run a custom mcp-script by name (see Custom Verification Methods)
3. **Compare result** to expected
4. **Update `last-run` and `last-result`** in the eval case file
5. **Update `EVAL_INDEX.md`** with the result
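In the skill, an agent performs these steps with its Grep, Bash, and Read tools. For readers who want to see the control flow in one place, here is a rough Python stand-in for steps 1 to 3 covering the four built-in methods. The dict keys `method` and `requires`, and the list-form `command`, are assumptions made for this sketch rather than part of the eval case format.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path


def run_case(root, case):
    """Return 'pass', 'fail', or 'skip' for one eval case."""
    root = Path(root)
    # Step 1: precondition, modeled here as paths that must exist (hypothetical key).
    if not all((root / p).exists() for p in case.get("requires", [])):
        return "skip"

    method = case["method"]
    if method == "command-check":
        proc = subprocess.run(case["command"], cwd=root, capture_output=True)
        return "pass" if proc.returncode == case.get("expect_exit", 0) else "fail"

    if method == "grep-check":
        regex = re.compile(case["pattern"])
        found = any(
            regex.search(p.read_text(errors="ignore"))
            for p in root.glob(case["target"]) if p.is_file()
        )
        return "pass" if found == (case["expect"] == "found") else "fail"

    if method in ("file-check", "rule-check"):
        target = root / case["target"]
        if not target.is_file():
            return "fail"
        needle = case["section"] if method == "file-check" else case["contains"]
        return "pass" if needle in target.read_text() else "fail"

    raise ValueError(f"unknown verification method: {method}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "CLAUDE.md").write_text("## Verification\nUse pnpm in this repo.\n")
    (root / "src").mkdir()
    (root / "src" / "app.ts").write_text("const port = 3000;\n")

    cases = [
        {"method": "rule-check", "target": "CLAUDE.md", "contains": "Use pnpm"},
        {"method": "grep-check", "target": "src/**/*.ts",
         "pattern": r"API_KEY\s*=", "expect": "not_found"},
        {"method": "command-check",
         "command": [sys.executable, "-c", "raise SystemExit(0)"], "expect_exit": 0},
        {"method": "file-check", "target": "AGENTS.md",
         "section": "## Verification", "requires": ["AGENTS.md"]},
    ]
    results = [run_case(root, c) for c in cases]

print(results)  # ['pass', 'pass', 'pass', 'skip']
```

The last case is skipped rather than failed because its precondition (the file `AGENTS.md`) is not met, which keeps "not applicable" distinct from "regressed".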

### Regression Report

```markdown
## Eval Run: YYYY-MM-DD

**Total:** N evals
**Passed:** N
**Failed:** N
**Skipped:** N

### Failures

#### eval-YYYYMMDD-001 — [pattern-key]
- **What regressed:** [description]
- **Expected:** [X]
- **Got:** [Y]
- **Recovery action:** [from eval case]

### Summary
[All green / N regressions need attention]
```
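A report in this shape can be assembled mechanically from per-case results. The sketch below is one way to do it; the result dict keys (`what`, `expected`, `got`, `recovery`) and the sample data are hypothetical.

```python
def render_report(run_date, results):
    """Build the Markdown regression report from a list of result dicts."""
    counts = {s: sum(r["result"] == s for r in results) for s in ("pass", "fail", "skip")}
    lines = [
        f"## Eval Run: {run_date}",
        "",
        f"**Total:** {len(results)} evals",
        f"**Passed:** {counts['pass']}",
        f"**Failed:** {counts['fail']}",
        f"**Skipped:** {counts['skip']}",
        "",
    ]
    failures = [r for r in results if r["result"] == "fail"]
    if failures:
        lines += ["### Failures", ""]
    for r in failures:
        lines += [
            # \u2014 reproduces the dash used in the report template.
            f"#### {r['id']} \u2014 {r['pattern_key']}",
            f"- **What regressed:** {r['what']}",
            f"- **Expected:** {r['expected']}",
            f"- **Got:** {r['got']}",
            f"- **Recovery action:** {r['recovery']}",
            "",
        ]
    summary = "All green" if not failures else f"{len(failures)} regressions need attention"
    lines += ["### Summary", summary]
    return "\n".join(lines) + "\n"


results = [
    {"id": "eval-20250101-001", "pattern_key": "auth-middleware-lock", "result": "pass"},
    {"id": "eval-20250101-002", "pattern_key": "pnpm-not-npm", "result": "fail",
     "what": "rule missing from CLAUDE.md", "expected": "found", "got": "not_found",
     "recovery": "restore the promoted rule and re-run"},
    {"id": "eval-20250101-003", "pattern_key": "skill-quality.verify-gate", "result": "skip"},
]
report = render_report("2025-01-08", results)
print(report)
```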

## Eval Index Format

`.evals/EVAL_INDEX.md`:

```markdown
# Eval Index

| ID | Pattern-Key | Rule Summary | Last Run | Result | Created |
|----|-------------|-------------|----------|--------|---------|
| eval-YYYYMMDD-001 | auth-middleware-lock | Run migrations on test DB first | YYYY-MM-DD | pass | YYYY-MM-DD |
| eval-YYYYMMDD-002 | pnpm-not-npm | Use pnpm in this repo | YYYY-MM-DD | fail | YYYY-MM-DD |
```
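Keeping the index in sync after a run means rewriting two cells in one table row. A minimal sketch, assuming the six-column layout shown above and a hypothetical helper name `update_index_row`:

```python
def update_index_row(index_text, eval_id, run_date, result):
    """Rewrite the Last Run and Result cells of the row whose ID matches eval_id."""
    out = []
    for line in index_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 6 and cells[0] == eval_id:
            cells[3], cells[4] = run_date, result  # Last Run, Result
            line = "| " + " | ".join(cells) + " |"
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical index contents
index = (
    "# Eval Index\n"
    "\n"
    "| ID | Pattern-Key | Rule Summary | Last Run | Result | Created |\n"
    "|----|-------------|-------------|----------|--------|---------|\n"
    "| eval-20250101-001 | pnpm-not-npm | Use pnpm in this repo | 2025-01-01 | fail | 2025-01-01 |\n"
)
updated = update_index_row(index, "eval-20250101-001", "2025-01-08", "pass")
print(updated)
```

Rows that do not match, including the header and separator rows, pass through unchanged. A rule summary containing a literal `|` would break this simple split.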

## Integration

### Upstream
- **harness-updater** flags eval candidates after promoting a pattern
- **learning-aggregator** identifies patterns with clear pass/fail conditions

### Downstream
- Regression failures feed back into **self-improvement** as new error entries
- Persistent failures may indicate the promoted rule needs refinement → feed back to **harness-updater**

### Scheduled Use
For projects with a CI pipeline, eval-creator can run as a scheduled check:
- Weekly: run all evals
- Per-PR: run evals related to changed files
- Post-promotion: run the newly created eval immediately
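How an eval is judged "related" to changed files is left open. One possible interpretation, sketched below, is to select evals whose `target` glob matches any changed path. This is an assumption, not the skill's defined behavior, and `fnmatchcase` treats `**` like `*`, so `src/**/*.ts` matches `src/api/user.ts` but not a file directly under `src/`.

```python
from fnmatch import fnmatchcase


def evals_for_changed_files(cases, changed_paths):
    """Select cases whose target glob matches at least one changed path."""
    return [
        c for c in cases
        if any(fnmatchcase(path, c["target"]) for path in changed_paths)
    ]


# Hypothetical cases and a hypothetical list of files changed in a PR
cases = [
    {"id": "eval-20250101-001", "target": "src/**/*.ts"},
    {"id": "eval-20250101-002", "target": "docs/*.md"},
    {"id": "eval-20250101-003", "target": "CLAUDE.md"},
]
changed = ["src/api/user.ts", "CLAUDE.md"]
selected = [c["id"] for c in evals_for_changed_files(cases, changed)]
print(selected)  # ['eval-20250101-001', 'eval-20250101-003']
```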

## Custom Verification Methods (mcp-scripts)

Beyond the four built-in methods (grep-check, command-check, file-check, rule-check), projects can define custom verification tools as mcp-scripts for complex assertions that the built-ins can't express.

Example — an eval that verifies a promoted auth rule is enforced:

```yaml
# In gh-aw workflow config
mcp-scripts:
  check-auth-middleware:
    lang: javascript
    description: "Verify all /admin routes have auth middleware"
    run: |
      const routes = require('./src/routes/admin');
      const unprotected = routes.filter(r => !r.auth);
      if (unprotected.length) {
        console.error('Unprotected admin routes:', unprotected.map(r => r.path));
        process.exit(1);
      }
```

Reference the script in an eval case as `verification_method: script-check` with the mcp-script name. This is an extension point — the built-in methods cover most cases, but mcp-scripts handle project-specific behavioral assertions.

## Persistence

Eval cases live in `.evals/` in the working directory. The skill does not integrate with external memory backends in interactive sessions. For CI-side durable storage, see `eval-creator-ci`, which can optionally back its run history with gh-aw's `repo-memory`.

## Skill Validation (skill-check)

The Anthropic `/skill-creator` skill includes two validation systems that eval-creator can use:

### Structural validation via `quick_validate.py`

The `skill-check` verification method runs the skill-creator's `quick_validate.py` script on a skill directory. It checks:

- SKILL.md exists with valid YAML frontmatter
- Only allowed frontmatter keys (`name`, `description`, `license`, `allowed-tools`, `metadata`, `compatibility`)
- Name is kebab-case, max 64 chars, no leading/trailing/consecutive hyphens
- Description has no angle brackets, max 1024 chars
- Compatibility field max 500 chars if present

Eval case example:

```markdown
---
id: eval-YYYYMMDD-NNN
pattern-key: skill-quality.verify-gate
verification_method: skill-check
target: skills/verify-gate
expect: valid
---

## What This Tests
Verify that the verify-gate skill passes structural validation after harness updates.
```

Execution: `python .claude/skills/skill-creator/scripts/quick_validate.py <target>`. Exit 0 = pass, exit 1 = fail.
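To make the name and description rules concrete, here is a small re-statement of them in Python. It is written from the bullet list above and is not the source of `quick_validate.py`, which may differ in detail. The function name is hypothetical, and allowing digits in names is an assumption.

```python
import re


def check_skill_metadata(name, description):
    """Return a list of problems; an empty list means the listed rules are satisfied."""
    problems = []
    # Lowercase alphanumeric segments joined by single hyphens.
    if not re.fullmatch(r"[a-z0-9]+(?:-[a-z0-9]+)*", name):
        problems.append("name must be kebab-case with no leading, trailing, or consecutive hyphens")
    if len(name) > 64:
        problems.append("name exceeds 64 characters")
    if "<" in description or ">" in description:
        problems.append("description must not contain angle brackets")
    if len(description) > 1024:
        problems.append("description exceeds 1024 characters")
    return problems


ok = check_skill_metadata("verify-gate", "Runs compile, test, and lint checks.")
bad = check_skill_metadata("verify--gate", "Runs <checks>.")
print(ok)        # []
print(len(bad))  # 2
```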

### Behavioral validation via `run_eval.py`

For deeper validation, the skill-creator's `run_eval.py` tests whether a skill's description causes Claude to invoke it for given queries. This is useful when harness-updater modifies a skill's description or the outer loop creates a new skill — the eval verifies the skill still triggers correctly.

This requires Claude CLI access and is expensive. Use it for high-value skills only, not as a routine CI check.

### When to create skill-check evals

Two scenarios connect the outer loop to skill validation:

1. **Harness-updater modifies a skill**: When a promoted rule is inserted into a SKILL.md (rather than a project instruction file), create a `skill-check` eval to verify the skill remains structurally valid after the edit.

2. **Self-improvement identifies a skill gap**: When learning-aggregator classifies a pattern as `skill_gap` and recommends "create a new skill", the new skill should pass `quick_validate.py` before being committed. Create a `skill-check` eval for it that persists as a regression test.

This closes the loop: failure → learning → new/updated skill → eval verifies skill quality → regression prevents quality drift.

## What This Skill Does NOT Do

- Does not fix regressions (reports them for the agent or human to fix)
- Does not promote learnings (that's harness-updater)
- Does not analyze patterns (that's learning-aggregator)
- Does not replace project test suites — evals test the harness, not the code

Related Skills

Agentic Workflow Creator

from pskoett/measuring-ai-proficiency

Create natural language GitHub Actions workflows using the agentic workflows pattern from GitHub Next.

verify-gate

from pskoett/measuring-ai-proficiency

Runs project compile, test, and lint commands between implementation and quality review. Gates simplify-and-harden behind machine verification. If checks fail, routes back to implementation with diagnostics for a fix loop. If checks pass, signals ready for the quality pass. Use after any implementation work completes and before simplify-and-harden. Essential for the inner loop's verify step.

use-agent-factory

from pskoett/measuring-ai-proficiency

How to drive the 14-workflow agent factory in this repo from a Claude session. Covers: when to use the factory vs. direct edits, how to start the chain, where the human gates are, how to pick an implementer, how to recover from stuck PRs, and all the failure modes learned to date. Use this skill when the user asks you to ship a feature, fix, or refactor through the factory; when they reference an existing issue or PR in the factory chain; when a workflow is stuck or misbehaving; or when you need to file issues or plan files that the factory will pick up. Do NOT use this skill for: single-file scratch edits on an untracked branch, research questions, one-shot script runs, or any work that does not produce a PR to main.

simplify-and-harden

from pskoett/measuring-ai-proficiency

Post-completion self-review for coding agents that runs simplify, harden, and micro-documentation passes on non-trivial code changes. Use when: a coding task is complete in a general agent session and you want a bounded quality and security sweep before signaling done. For CI pipeline execution, use simplify-and-harden-ci.

pre-flight-check

from pskoett/measuring-ai-proficiency

[Beta] Session-start scan that surfaces relevant learnings, recent errors, and eval status before work begins. Bridges the outer loop back into the inner loop by making accumulated knowledge visible at task start. Activated via SessionStart hook or manually before major tasks.

plan-interview

from pskoett/measuring-ai-proficiency

Ensures alignment between user and Claude during feature/spec planning through a structured interview process. Use this skill when the user invokes /plan-interview before implementing a new feature, refactoring, or any non-trivial implementation task. The skill runs an upfront interview to gather requirements across technical constraints, scope boundaries, risk tolerance, and success criteria before any codebase exploration. Do NOT use this skill for: pure research/exploration tasks, simple bug fixes, or when the user just wants standard planning without the interview process.

measure-ai-proficiency

from pskoett/measuring-ai-proficiency

Assess and improve repository AI coding proficiency and context engineering maturity. Use when users ask about: (1) AI readiness or AI maturity assessment, (2) context engineering quality or improvement, (3) CLAUDE.md, .cursorrules, or copilot-instructions files, (4) measuring how well a repo is prepared for AI coding assistants, (5) recommendations for improving AI collaboration, (6) what context files to add, or (7) comparing their repo to AI proficiency best practices.

learning-aggregator

from pskoett/measuring-ai-proficiency

[Beta] Cross-session analysis of accumulated .learnings/ files. Reads all entries, groups by pattern_key, computes recurrence across sessions, and outputs ranked promotion candidates. This is the outer loop's inspect step — it turns raw learning data into actionable gap reports. Use on a regular cadence (weekly, before major tasks, or at session start for critical projects). Can be invoked manually or scheduled.

intent-framed-agent

from pskoett/measuring-ai-proficiency

Frames coding-agent work sessions with explicit intent capture and drift monitoring. Use when a session transitions from planning/Q&A to implementation for coding tasks, refactors, feature builds, bug fixes, or other multi-step execution where scope drift is a risk.

customize-measurement

from pskoett/measuring-ai-proficiency

Customize AI proficiency measurement for your specific repository through a guided interview. Use when: setting up measure-ai-proficiency for a new repo, adjusting thresholds for your team's size, hiding irrelevant recommendations, or mapping custom file names to standard patterns.

context-surfing

from pskoett/measuring-ai-proficiency

Monitors context window health throughout a session and rides peak context quality for maximum output fidelity. Activates automatically after plan-interview and intent-framed-agent. Stays active through execution and hands off cleanly to simplify-and-harden and self-improvement when the wave completes naturally or exits via handoff. Use this skill whenever a multi-step agent task is underway and session continuity or context drift is a concern. Especially important for long-running tasks, complex refactors, or any work where degraded context would silently corrupt the output. Trigger even if the user doesn't say "context surfing" — if an agent task is running across multiple steps with intent and a plan already established, this skill is live.

healthcare-eval-harness

from affaan-m/everything-claude-code

Patient safety evaluation harness for healthcare application deployments. Automated test suites for CDSS accuracy, PHI exposure, clinical workflow integrity, and integration compliance. Blocks deployments on safety failures.
