eval-skills

Audit all skills in the current project for frontmatter completeness, effort level appropriateness, allowed-tools scoping, and content quality. Produces a scored report with effort-level recommendations for each skill. Use when onboarding to a new project, reviewing skill quality before shipping, or adding effort fields to an existing skill library.

3,046 stars

Best use case

eval-skills is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using eval-skills should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/eval-skills/SKILL.md --create-dirs "https://raw.githubusercontent.com/FlorianBruniaux/claude-code-ultimate-guide/main/examples/skills/eval-skills/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/eval-skills/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How eval-skills Compares

| Feature / Agent | eval-skills | Standard Approach |
|-----------------|-------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

It audits every skill in the current project for frontmatter completeness, effort level appropriateness, allowed-tools scoping, and content quality, then produces a scored report with an effort-level recommendation for each skill.

Where can I find the source code?

The source is on GitHub in the FlorianBruniaux/claude-code-ultimate-guide repository, under examples/skills/eval-skills/SKILL.md.

SKILL.md Source

# Skill Evaluator

Discover all skills in the project, score them across 6 criteria, and infer the appropriate `effort` level based on content analysis.

## When to Use

- New project: run once to establish baseline quality
- Before committing a skill to a team repo
- After bulk-importing skills from another project
- When adding `effort` fields for the first time (v2.1.80+)

## What Gets Audited

All `SKILL.md` files and flat `.md` files found in:
- `.claude/skills/**`
- `~/.claude/skills/**` (if requested)
- Any path passed as argument: `/eval-skills ./my-skills-dir`

---

## Scoring Criteria (14 pts per skill: 12 base + 2 bonus)

| # | Criterion | Max | What is checked |
|---|-----------|-----|-----------------|
| 1 | **name** | 1 | Present, lowercase, hyphens only, matches directory name |
| 2 | **description** | 2 | Present + has "Use when" / "when to" / trigger phrasing |
| 3 | **allowed-tools** | 2 | Present + not overly broad (e.g. unscoped Bash on a read-only skill) |
| 4 | **effort** | 3 | Present (1pt) + appropriate for content (2pt based on inference) |
| 5 | **content structure** | 4 | Has Purpose/When section (1), has examples/usage (1), has clear workflow (1), no placeholder text (1) |
| 6 | **bonus** | +2 | argument-hint present (1), version/author metadata (1) |

> **Note**: `tags` is NOT an officially supported frontmatter field in Claude Code. It is ignored by the runtime. Do not include it or score it as a quality criterion.
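
As an illustration of criterion 1 in the table above, the `name` check could be sketched as follows. This is a minimal sketch, not the skill's actual implementation: `score_name` and `NAME_PATTERN` are hypothetical names, allowing digits in a name is an assumption, and using the file stem for flat skill files is also an assumption.

```python
import re
from pathlib import Path
from typing import Optional

# Lowercase words separated by single hyphens. Allowing digits is an
# assumption; the criterion only says "lowercase, hyphens only".
NAME_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def score_name(name: Optional[str], skill_path: Path) -> int:
    """Return 1 if the name passes every check in criterion 1, else 0."""
    if not name or not NAME_PATTERN.fullmatch(name):
        return 0
    # For <dir>/SKILL.md the directory name is the parent folder; for a flat
    # skill file the file stem stands in for it (an assumption).
    if skill_path.name == "SKILL.md":
        expected = skill_path.parent.name
    else:
        expected = skill_path.stem
    return 1 if name == expected else 0
```

For example, `score_name("eval-skills", Path(".claude/skills/eval-skills/SKILL.md"))` earns the point, while the same name under a directory called `other` does not.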

**Thresholds** (percentages are approximate):
- ✅ Good: ≥11/14 (≥80%)
- ⚠️ Needs work: 8–10/14 (60–79%)
- ❌ Fix: <8/14 (<60%)
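
The point cut-offs above can be expressed as a small classifier. This is a sketch under the assumption that the point values, not the percentages, are authoritative; `classify` and `total_score` are hypothetical helper names.

```python
MAX_SCORE = 14  # 12 base points plus 2 bonus points


def total_score(parts: dict) -> int:
    """Sum the per-criterion points (criterion name -> points earned)."""
    return sum(parts.values())


def classify(score: int) -> str:
    """Map a total score to the status labels listed above."""
    if score >= 11:
        return "Good"
    if score >= 8:
        return "Needs work"
    return "Fix"


# Points taken from the sample report row values in this document.
sample = {"name": 1, "description": 1, "allowed-tools": 2, "effort": 0, "content structure": 2}
status = classify(total_score(sample))
```

With the sample values the total is 6, which falls in the `Fix` band.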

---

## Effort Level Inference Engine

For each skill, analyze description + content and classify using these signals:

### `low` — Mechanical execution, no design decisions

Signals:
- Verbs: commit, push, sync, scaffold, generate (template-based), format, rename, bump, wrap, convert
- No reasoning required: sequential steps, template instantiation, data fetching
- allowed-tools: Bash only, or Read-only
- No sub-agents spawned
- Short workflow (<5 steps)

Examples: `/commit`, `/release-notes`, `/scaffold`, `/sync`, `/format`

### `medium` — Analysis with bounded scope, categorization

Signals:
- Verbs: review, triage, analyze, categorize, suggest, evaluate (single file or bounded scope)
- Requires pattern recognition but not architectural reasoning
- allowed-tools: Read + Grep + Bash combination
- May spawn 1-2 sub-agents but with predefined scope
- Produces structured output (tables, categorized lists)

Examples: `/code-review` (single PR), `/issue-triage`, `/dependency-audit`, `/test-coverage`

### `high` — Design decisions, adversarial reasoning, cross-system analysis

Signals:
- Verbs: architect, redesign, threat-model, audit (security), orchestrate (multi-agent), score, assess trade-offs
- Requires reasoning about edge cases, attack vectors, or system-wide implications
- allowed-tools: broad access (Read + Write + Bash + external tools)
- Spawns multiple sub-agents or uses parallel execution
- Produces analysis with explicit uncertainty or trade-off sections
- Keywords in content: "security", "architecture", "adversarial", "pipeline", "threat", "design decision"

Examples: `/security-audit`, `/architecture-review`, `/cyber-defense`, `/eval-agents`

### Mismatch flag

If a skill has `effort:` already set but the inferred level differs, flag it:
> ⚠️ Effort mismatch: declared `low`, inferred `high` — skill spawns 4 sub-agents and performs security analysis
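
The verb and keyword signals above are meant to be weighed with judgment; the following is only a crude keyword-count stand-in to show the shape of the inference and the mismatch check. Several things here are assumptions: the function names, exact-token matching, breaking ties toward the higher level, returning `None` when no signal is found, and leaving out the qualified verbs (generate, evaluate, audit, orchestrate) and multi-word phrases because their meaning depends on context.

```python
import re
from collections import Counter
from typing import Optional

# Unqualified signal words taken from the three level descriptions above.
SIGNALS = {
    "low": {"commit", "push", "sync", "scaffold", "format", "rename",
            "bump", "wrap", "convert"},
    "medium": {"review", "triage", "analyze", "categorize", "suggest"},
    "high": {"architect", "redesign", "threat-model", "score", "security",
             "architecture", "adversarial", "pipeline", "threat"},
}
ORDER = ["low", "medium", "high"]


def infer_effort(text: str) -> Optional[str]:
    """Pick the level with the most signal hits; None if nothing matches."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    hits = Counter()
    for token in tokens:
        for level, words in SIGNALS.items():
            if token in words:
                hits[level] += 1
    if not hits:
        return None
    # Most hits wins; ties go to the higher level (an assumption).
    return max(ORDER, key=lambda level: (hits[level], ORDER.index(level)))


def effort_mismatch(declared: Optional[str], inferred: Optional[str]) -> Optional[str]:
    """Return the mismatch flag text, or None when there is nothing to flag."""
    if not declared or inferred is None or declared == inferred:
        return None
    return f"⚠️ Effort mismatch: declared `{declared}`, inferred `{inferred}`"
```

A real audit would append the reason after the flag, as in the example above, and would also weigh tool scope, sub-agent count, and workflow length, which this sketch ignores.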

---

## Execution Instructions

### Step 1 — Discovery

```bash
# Root to scan: the project default, or the path passed as argument if one was provided
SKILLS_DIR=".claude/skills"

# Find all SKILL.md files
find "$SKILLS_DIR" -name "SKILL.md" 2>/dev/null

# Find flat skill files
find "$SKILLS_DIR" -maxdepth 1 -name "*.md" ! -name "README*" 2>/dev/null
```

### Step 2 — Parse each skill

For each skill file found:
1. Read the full file
2. Extract YAML frontmatter (between first `---` and second `---`)
3. Parse: name, description, allowed-tools, effort, argument-hint, version
4. Note presence/absence of each field
5. Read the body content for structure analysis
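
The parsing steps above can be sketched with the standard library alone. This is a simplification, assuming flat `key: value` frontmatter; nested or multi-line YAML values would need a real YAML parser. `split_frontmatter` and `field_presence` are hypothetical names.

```python
from typing import Dict, Tuple

FIELDS = ("name", "description", "allowed-tools", "effort", "argument-hint", "version")


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a skill file into (frontmatter fields, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at all
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        return {}, text  # unterminated frontmatter: treat as missing
    fields: Dict[str, str] = {}
    for line in lines[1:end]:
        # Only top-level "key: value" lines; skip indented lines and comments.
        key, sep, value = line.partition(":")
        if sep and not line.startswith((" ", "\t", "#")):
            fields[key.strip()] = value.strip().strip("\"'")
    body = "\n".join(lines[end + 1:])
    return fields, body


def field_presence(fields: Dict[str, str]) -> Dict[str, bool]:
    """Note presence/absence of each field the audit cares about."""
    return {name: bool(fields.get(name)) for name in FIELDS}
```

Calling `field_presence(split_frontmatter(text)[0])` gives the presence map that step 4 asks for, and the returned body feeds the structure analysis in step 5.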

### Step 3 — Score and infer

Apply the scoring criteria above to each skill:
- Check frontmatter fields
- Evaluate description quality (does it answer "when to use"? is it under 1024 chars?)
- Evaluate allowed-tools scope (is Bash used when only Read would suffice? are tools scoped with wildcards when possible?)
- Infer effort level from content analysis
- Compare inferred vs declared effort (if set)
- Evaluate content structure (scan for "When to Use", "Purpose", "Example", "Workflow" sections)
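
The content-structure check (criterion 5, worth 4 points) could be approximated by scanning markdown headings. This is a rough sketch: `score_structure` is a hypothetical name, the placeholder markers (TODO, TBD, FIXME, lorem ipsum) are assumptions, and lines starting with `#` inside fenced code blocks are not excluded, so shell comments may be counted as headings.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$", re.MULTILINE)
# Assumed placeholder markers; the source only says "no placeholder text".
PLACEHOLDER = re.compile(r"\b(TODO|TBD|FIXME|lorem ipsum)\b", re.IGNORECASE)


def score_structure(body: str) -> int:
    """Score the body from 0 to 4, one point per structural check."""
    if not body.strip():
        return 0  # an empty body earns nothing, not even the placeholder point
    headings = [h.lower() for h in HEADING.findall(body)]

    def has(*needles: str) -> bool:
        return any(n in h for h in headings for n in needles)

    score = 0
    score += has("purpose", "when to use")       # Purpose/When section
    score += has("example", "usage")             # examples or usage
    score += has("workflow")                     # clear workflow
    score += PLACEHOLDER.search(body) is None    # no placeholder text
    return score
```

Heading matching alone is a weak proxy for a "clear workflow"; a reviewer would still need to read the steps.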

### Step 4 — Output

Produce a structured report:

```
# Skills Audit — [project name or path]
Date: [today] | Scanned: N skills

## Summary
| Status | Count |
|--------|-------|
| ✅ Good (≥80%) | N |
| ⚠️ Needs work (60–79%) | N |
| ❌ Fix (<60%) | N |

**Effort coverage**: N/N skills have effort field set

---

## Per-Skill Results

### [skill-name] — [score]/14 [✅/⚠️/❌]

| Criterion | Score | Notes |
|-----------|-------|-------|
| name | ✅ 1/1 | — |
| description | ⚠️ 1/2 | Missing "Use when" phrasing |
| allowed-tools | ✅ 2/2 | Well-scoped |
| effort | ❌ 0/3 | Missing — Recommended: high |
| content structure | ⚠️ 2/4 | No examples section |

**Effort inference**: `high` — skill performs security analysis with adversarial reasoning
  Signals: "threat", "attack surface", "vulnerability scoring" in content; spawns 4 agents

**Priority fixes** (ordered by impact):
1. Add `effort: high` to frontmatter
2. Add "Use when" to description
3. Add a concrete usage example section

---
```

After all skills, print a **Fix Summary**: all missing or mismatched effort fields with recommended values, ready to copy-paste.

---

## Fix Summary Format

At the end, print a ready-to-use patch block for all missing/mismatched effort fields:

```
## Recommended effort fields (copy-paste ready)

skill-name-1: effort: low     # mechanical scaffold
skill-name-2: effort: high    # security analysis, spawns agents
skill-name-3: effort: medium  # code review, bounded scope
```

And a 1-line count: `N skills need effort field · N mismatches · N missing allowed-tools`
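
Assembling the patch block and the count line might look like this. It is a sketch only: `fix_summary` and its parameters are hypothetical, and each entry is assumed to be a `(skill name, recommended effort, reason)` tuple.

```python
def fix_summary(missing, mismatched, missing_tools: int) -> str:
    """Build the copy-paste block plus the one-line count.

    missing / mismatched: lists of (skill name, recommended effort, reason).
    missing_tools: number of skills without an allowed-tools field.
    """
    lines = ["## Recommended effort fields (copy-paste ready)", ""]
    for name, effort, reason in [*missing, *mismatched]:
        # Pad the effort value so the trailing comments line up.
        lines.append(f"{name}: effort: {effort:<6}  # {reason}")
    lines.append("")
    lines.append(
        f"{len(missing)} skills need effort field · "
        f"{len(mismatched)} mismatches · "
        f"{missing_tools} missing allowed-tools"
    )
    return "\n".join(lines)


report = fix_summary(
    missing=[("skill-name-1", "low", "mechanical scaffold")],
    mismatched=[("skill-name-2", "high", "security analysis, spawns agents")],
    missing_tools=0,
)
```

Printing `report` yields the header, one aligned line per skill, and the count line in the format shown above.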

Related Skills

All from FlorianBruniaux/claude-code-ultimate-guide (3,046 stars):

  • audit-agents-skills: Audit Claude Code agents, skills, and commands for quality and production readiness. Use when evaluating skill quality, checking production readiness scores, or comparing agents against best-practice templates.
  • voice-refine: Transform verbose voice input into structured, token-efficient Claude prompts. Use when cleaning up voice memos, dictation output, or speech-to-text transcriptions that contain filler words, repetitions, and unstructured thoughts.
  • talk-stage6-revision: Produces revision sheets with quick navigation by act, a master concept-to-URL table, Q&A cheat-sheet with 6-10 anticipated questions, glossary, and external resources list. Use when preparing for a talk with Q&A, creating shareable reference material for attendees, or building a safety-net glossary for live delivery.
  • talk-stage5-script: Produces a complete 5-act pitch with speaker notes, a slide-by-slide specification, and a ready-to-paste Kimi prompt for AI slide generation. Requires validated angle and title from Stage 4. Use when you have a confirmed talk angle and need the full script, slide spec, and AI-generated presentation prompt.
  • talk-stage4-position: Generates 3-4 strategic talk angles with strength/weakness analysis, title options, CFP descriptions, and a peer feedback draft, then enforces a mandatory CHECKPOINT for user confirmation before scripting. Use when deciding how to frame a talk, preparing a CFP submission, or choosing between multiple narrative angles.
  • talk-stage3-concepts: Builds a numbered, categorized concept catalogue from the talk summary and timeline, scoring each concept HIGH / MEDIUM / LOW for talk potential with optional repo enrichment. Use when you need a structured inventory of concepts before choosing a talk angle, or when assessing which ideas have the strongest presentation potential.
  • talk-stage2-research: Performs git archaeology, changelog analysis, and builds a verified factual timeline by cross-referencing git history with source material. REX mode only; skipped automatically in Concept mode. Use when building a REX talk and you need verified commit metrics, release timelines, and contributor data from a git repository.
  • talk-stage1-extract: Extracts and structures source material (articles, transcripts, notes) into a talk summary with narrative arc, themes, metrics, and gaps. Auto-detects REX vs Concept type. Use when starting a new talk from any source material or auditing existing material before committing to a talk.
  • talk-pipeline: Orchestrates the complete talk preparation pipeline from raw material to revision sheets, running 6 stages in sequence with human-in-the-loop checkpoints for REX or Concept mode talks. Use when starting a new talk pipeline, resuming a pipeline from a specific stage, or running the full end-to-end preparation workflow.
  • skill-creator: Scaffold a new Claude Code skill with SKILL.md, frontmatter, and bundled resources. Use when creating a custom skill, standardizing skill structure across a team, or packaging a skill for distribution.
  • rtk-optimizer: Wrap high-verbosity shell commands with RTK to reduce token consumption. Use when running git log, git diff, cargo test, pytest, or other verbose CLI output that wastes context window tokens.
  • release-notes-generator: Generate release notes in 3 formats (CHANGELOG.md, PR body, Slack announcement) from git commits. Automatically categorizes changes and converts technical language to user-friendly messaging. Use for releases, changelogs, version notes, what's new summaries, or ship announcements.