eval-audit

Audit an LLM evaluation pipeline for correctness, coverage, and reliability. 6 diagnostic areas with structured Check/Finding output. Produces prioritized findings by severity and recommends next skills to run. Catches common eval pitfalls before they corrupt your metrics. Triggers on: "eval audit", "audit evals", "evaluation audit", "check eval pipeline", "eval health"

170 stars

by Miosa-osa

Best use case

eval-audit is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using eval-audit can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/eval-audit/SKILL.md --create-dirs "https://raw.githubusercontent.com/Miosa-osa/canopy/main/library/skills/analysis/eval-audit/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/eval-audit/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How eval-audit Compares

Feature / Agent	eval-audit	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

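What does this skill do?

It audits an LLM evaluation pipeline across six diagnostic areas (data quality, metric validity, judge reliability, coverage gaps, statistical rigor, and pipeline integrity) and reports prioritized findings by severity, with recommended follow-up skills.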

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# /eval-audit

> Audit an LLM evaluation pipeline for correctness and reliability.

## Purpose

Systematically audit an LLM evaluation pipeline across 6 diagnostic areas: data quality, metric validity, judge reliability, coverage gaps, statistical rigor, and pipeline integrity. Each area produces structured Check/Finding pairs with severity ratings. The audit catches common eval pitfalls — label leakage, metric gaming, distribution mismatch, underpowered samples, and judge bias — before they corrupt decision-making. Outputs prioritized recommendations and suggests follow-up skills.

## Usage

```bash
# Full audit of an eval pipeline
/eval-audit --pipeline evals/

# Audit a specific diagnostic area (values are listed under Arguments)
/eval-audit --pipeline evals/ --area judges

# Audit with custom severity threshold
/eval-audit --pipeline evals/ --min-severity major

# Audit from eval config file
/eval-audit --config evals/config.yaml

# Quick scan (skip deep statistical checks)
/eval-audit --pipeline evals/ --quick
```

## Arguments

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--pipeline` | string | required unless `--config` is given | Path to evaluation pipeline directory |
| `--config` | string | — | Path to eval config file (alternative to `--pipeline`) |
| `--area` | enum | `all` | Diagnostic area: `data`, `metrics`, `judges`, `coverage`, `statistics`, `integrity`, `all` |
| `--min-severity` | enum | `minor` | Minimum severity to report: `critical`, `major`, `minor` |
| `--quick` | flag | false | Quick scan (skip computationally expensive checks) |
| `--output` | string | stdout | Write report to file |
| `--format` | enum | `markdown` | Output format: `markdown`, `json` |
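
The `--min-severity` flag implies an ordering of severities. The following is a minimal sketch of how such a threshold filter might behave, assuming `critical` outranks `major`, which outranks `minor`. The `Finding` record, the `SEVERITY_RANK` table, and `filter_by_min_severity` are hypothetical names used for illustration; they do not come from the skill's source.

```python
from dataclasses import dataclass

# Assumed ordering (not from the skill's source): higher rank = more severe.
SEVERITY_RANK = {"minor": 0, "major": 1, "critical": 2}


@dataclass(frozen=True)
class Finding:
    """Hypothetical record for one Check/Finding pair."""
    area: str
    check: str
    finding: str
    severity: str  # "critical", "major", or "minor"


def filter_by_min_severity(findings, min_severity="minor"):
    """Keep findings at or above min_severity, most severe first."""
    threshold = SEVERITY_RANK[min_severity]
    kept = [f for f in findings if SEVERITY_RANK[f.severity] >= threshold]
    # sorted() is stable, so findings of equal severity keep their input order.
    return sorted(kept, key=lambda f: SEVERITY_RANK[f.severity], reverse=True)
```

Under these assumptions, `--min-severity major` would drop `minor` findings and list `critical` ones first.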

## Workflow

1. **Inventory** — Scan the pipeline directory. Identify eval datasets, metric definitions, judge prompts, scoring scripts, and result files. Build a pipeline map.
2. **Data Quality** — Check eval datasets for: label balance, duplicate entries, data leakage from training, stale or outdated examples, ambiguous ground truth, insufficient diversity.
3. **Metric Validity** — Check metrics for: alignment with actual goals (Goodhart's Law), gaming susceptibility, sensitivity to edge cases, proper aggregation (micro vs macro), confidence intervals.
4. **Judge Reliability** — Check LLM judges for: position bias, verbosity bias, self-preference, prompt sensitivity, inter-rater agreement, calibration against human labels.
5. **Coverage Gaps** — Check for: untested capabilities, missing edge cases, distribution mismatch between eval and production, capability dimensions not represented.
6. **Statistical Rigor** — Check for: sufficient sample sizes, proper significance testing, multiple comparison correction, effect size reporting, reproducibility.
7. **Pipeline Integrity** — Check for: deterministic execution, version pinning, result caching correctness, data pipeline bugs, reporting accuracy.
8. **Prioritize** — Rank all findings by severity and impact. Group related findings. Recommend next skills to address issues.
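
Two of the Data Quality checks in step 2, label balance and duplicate entries, can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and treating entries as duplicates after lowercasing and collapsing whitespace is an assumption, not the skill's documented behavior.

```python
from collections import Counter


def label_balance(labels):
    """Return each label's share of the dataset as a fraction."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def duplicate_rate(texts):
    """Fraction of entries that repeat an earlier entry after normalization."""
    seen = set()
    duplicates = 0
    for text in texts:
        # Assumed normalization: case-insensitive, whitespace-collapsed.
        key = " ".join(text.lower().split())
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(texts) if texts else 0.0
```

A real audit would likely also need near-duplicate detection (for example, fuzzy or embedding-based matching), which exact-match normalization does not cover.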

## Examples

### Full pipeline audit
```
/eval-audit --pipeline evals/summarization/

## Eval Audit — evals/summarization/

### Area 1: Data Quality
| Check | Finding | Severity |
|-------|---------|----------|
| Label balance | 73% positive, 27% negative — skewed | MAJOR |
| Duplicates | 12 duplicate entries found (4.8%) | MINOR |
| Staleness | 40% of examples from pre-2024 data | MAJOR |
| Leakage | No leakage detected | PASS |

### Area 3: Judge Reliability
| Check | Finding | Severity |
|-------|---------|----------|
| Position bias | Judge prefers response A 61% of the time | CRITICAL |
| Verbosity bias | Longer responses scored 0.8 points higher on average | MAJOR |
| Human agreement | Cohen's kappa = 0.42 (moderate) — below 0.6 threshold | MAJOR |

### Recommendations
1. [CRITICAL] Fix position bias — use `/validate-evaluator` to calibrate
2. [MAJOR] Rebalance dataset — use `/synthetic-data` to generate minority class
3. [MAJOR] Update stale examples — 40% of eval data predates current model behavior
```
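
The Judge Reliability numbers in the example above (a position preference rate and Cohen's kappa) could be computed along the following lines. This is a sketch under stated assumptions: `first_position_rate` assumes the judge saw every pair in both orders, so an unbiased judge would pick slot "A" about half the time, and both function names are hypothetical. The kappa formula itself is the standard two-rater definition.

```python
from collections import Counter


def first_position_rate(picks):
    """Share of judgments that chose slot "A" (assumed to be the first position)."""
    picks = list(picks)
    if not picks:
        raise ValueError("picks must not be empty")
    return sum(p == "A" for p in picks) / len(picks)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa between two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For instance, `cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])` has observed agreement 0.75 and chance agreement 0.5, giving a kappa of 0.5.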

## Output

```markdown
## Eval Audit Report

### Pipeline: <path>
### Date: <date>
### Overall Health: RED | YELLOW | GREEN

### Findings Summary
- Critical: N
- Major: N
- Minor: N
- Passing checks: N

### [Diagnostic Area Sections with Check/Finding tables...]

### Recommended Next Skills
1. `/validate-evaluator` — Calibrate judge against human labels
2. `/synthetic-data` — Generate balanced eval data
3. `/error-analysis` — Deep-dive on failure patterns

### Priority Actions
1. [Action + owner + deadline suggestion]
2. ...
```
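
The template does not say how `Overall Health` is derived from the finding counts. The sketch below fills the summary section under an assumed rule (RED if any critical finding, YELLOW if any major finding, otherwise GREEN). That rule and the `summarize` helper are hypothetical and not taken from the skill's source.

```python
from collections import Counter


def summarize(severities):
    """Render the health line and Findings Summary from severity labels.

    severities: iterable of "critical", "major", "minor", or "pass"
    (case-insensitive).
    """
    counts = Counter(s.lower() for s in severities)
    # Assumed mapping from counts to health color; not defined by the skill.
    if counts["critical"]:
        health = "RED"
    elif counts["major"]:
        health = "YELLOW"
    else:
        health = "GREEN"
    lines = [
        f"### Overall Health: {health}",
        "",
        "### Findings Summary",
        f"- Critical: {counts['critical']}",
        f"- Major: {counts['major']}",
        f"- Minor: {counts['minor']}",
        f"- Passing checks: {counts['pass']}",
    ]
    return "\n".join(lines)
```

Applied to the seven checks in the summarization example (one critical, four major, one minor, one pass), this rule would report RED.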

## Dependencies

- Eval pipeline files (datasets, prompts, scripts, results)
- `/validate-evaluator` — Recommended follow-up for judge issues
- `/synthetic-data` — Recommended follow-up for data gaps
- `/error-analysis` — Recommended follow-up for failure patterns
- Statistical libraries for significance testing (when not in `--quick` mode)
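
As an illustration of the kind of significance check the Statistical Rigor step refers to, here is a minimal percentile-bootstrap confidence interval for the mean score difference between two systems evaluated on the same examples. It uses only the Python standard library. The function name, defaults, and choice of a paired bootstrap are assumptions for this sketch; the skill does not specify which test or library it uses.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a - scores_b) on paired examples."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample per-example differences with replacement and record each mean.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = int((1 - alpha / 2) * n_resamples) - 1
    return resampled_means[lower_idx], resampled_means[upper_idx]
```

If the returned interval excludes zero, the difference is unlikely to be resampling noise at the chosen `alpha`. With very small eval sets the interval will be wide, which is itself a signal that the sample is underpowered.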

Related Skills

security-auditor

from Miosa-osa/canopy

Comprehensive security analysis and vulnerability detection

ads audit

from Miosa-osa/canopy

Full multi-platform paid advertising audit with parallel subagent delegation. Analyzes Google Ads, Meta Ads, LinkedIn Ads, TikTok Ads, and Microsoft Ads accounts. Generates health score per platform and aggregate score. Triggers on: "audit", "full ad check", "analyze my ads", "account health check", "PPC audit", "ad account audit"

audit

from Miosa-osa/canopy

Multi-domain audit with weighted scoring. Spawns parallel subagents per audit domain. Each check has severity weight and category weight. Produces a quantified health score (0-100) with prioritized findings. Supports security, code quality, performance, compliance, and custom domains. Triggers on: "audit", "assess", "evaluate quality", "score"

validate-evaluator

from Miosa-osa/canopy

Calibrate LLM-as-Judge evaluators against human labels. Computes TPR, TNR, precision, recall, F1, and Cohen's kappa. Detects systematic biases and recommends prompt corrections. Produces a calibration report with confidence intervals. Triggers on: "validate evaluator", "calibrate judge", "judge accuracy", "evaluator validation", "judge metrics"

eval-rag

from Miosa-osa/canopy

Evaluate retrieval and generation quality in RAG pipelines. Separate scoring for retrieval (recall, precision, MRR) and generation (faithfulness, relevance, completeness). End-to-end pipeline assessment with bottleneck identification. Triggers on: "eval rag", "rag evaluation", "retrieval evaluation", "rag quality", "rag metrics"