eval-skills

Audit all skills in the current project for frontmatter completeness, effort level appropriateness, allowed-tools scoping, and content quality. Produces a scored report with effort-level recommendations for each skill. Use when onboarding to a new project, reviewing skill quality before shipping, or adding effort fields to an existing skill library.

3,046 stars

byFlorianBruniaux

View on GitHub Installation ↓

Best use case

eval-skills is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using eval-skills should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/eval-skills/SKILL.md --create-dirs "https://raw.githubusercontent.com/FlorianBruniaux/claude-code-ultimate-guide/main/examples/skills/eval-skills/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/eval-skills/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How eval-skills Compares

Feature / Agent	eval-skills	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

AI Agents for Coding

Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.

Best AI Skills for Claude

Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.

ChatGPT vs Claude for Agent Skills

Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.

SKILL.md Source

# Skill Evaluator

Discover all skills in the project, score them across 6 criteria, and infer the appropriate `effort` level based on content analysis.

## When to Use

- New project: run once to establish baseline quality
- Before committing a skill to a team repo
- After bulk-importing skills from another project
- When adding `effort` fields for the first time (v2.1.80+)

## What Gets Audited

All `SKILL.md` files and flat `.md` files found in:
- `.claude/skills/**`
- `~/.claude/skills/**` (if requested)
- Any path passed as argument: `/eval-skills ./my-skills-dir`

---

## Scoring Criteria (14 pts per skill)

| # | Criterion | Max | What is checked |
|---|-----------|-----|-----------------|
| 1 | **name** | 1 | Present, lowercase, hyphens only, matches directory name |
| 2 | **description** | 2 | Present + has "Use when" / "when to" / trigger phrasing |
| 3 | **allowed-tools** | 2 | Present + not overly broad (Bash without scoping when read-only) |
| 4 | **effort** | 3 | Present (1pt) + appropriate for content (2pt based on inference) |
| 5 | **content structure** | 4 | Has Purpose/When section (1), has examples/usage (1), has clear workflow (1), no placeholder text (1) |
| 6 | **bonus** | +2 | argument-hint present (1), version/author metadata (1) |

> **Note**: `tags` is NOT an officially supported frontmatter field in Claude Code. It is ignored by the runtime. Do not include it or score it as a quality criterion.

**Thresholds:**
- ✅ Good: ≥11/14 (≥80%)
- ⚠️ Needs work: 8–10/14 (60–79%)
- ❌ Fix: <8/14 (<60%)

---

## Effort Level Inference Engine

For each skill, analyze description + content and classify using these signals:

### `low` — Mechanical execution, no design decisions

Signals:
- Verbs: commit, push, sync, scaffold, generate (template-based), format, rename, bump, wrap, convert
- No reasoning required: sequential steps, template instantiation, data fetching
- allowed-tools: Bash only, or Read-only
- No sub-agents spawned
- Short workflow (<5 steps)

Examples: `/commit`, `/release-notes`, `/scaffold`, `/sync`, `/format`

### `medium` — Analysis with bounded scope, categorization

Signals:
- Verbs: review, triage, analyze, categorize, suggest, evaluate (single file or bounded scope)
- Requires pattern recognition but not architectural reasoning
- allowed-tools: Read + Grep + Bash combination
- May spawn 1-2 sub-agents but with predefined scope
- Produces structured output (tables, categorized lists)

Examples: `/code-review` (single PR), `/issue-triage`, `/dependency-audit`, `/test-coverage`

### `high` — Design decisions, adversarial reasoning, cross-system analysis

Signals:
- Verbs: architect, redesign, threat-model, audit (security), orchestrate (multi-agent), score, assess trade-offs
- Requires reasoning about edge cases, attack vectors, or system-wide implications
- allowed-tools: broad access (Read + Write + Bash + external tools)
- Spawns multiple sub-agents or uses parallel execution
- Produces analysis with explicit uncertainty or trade-off sections
- Keywords in content: "security", "architecture", "adversarial", "pipeline", "threat", "design decision"

Examples: `/security-audit`, `/architecture-review`, `/cyber-defense`, `/eval-agents`

### Mismatch flag

If a skill has `effort:` already set but the inferred level differs, flag it:
> ⚠️ Effort mismatch: declared `low`, inferred `high` — skill spawns 4 sub-agents and performs security analysis

---

## Execution Instructions

### Step 1 — Discovery

```bash
# Find all SKILL.md files
find .claude/skills -name "SKILL.md" 2>/dev/null

# Find flat skill files
find .claude/skills -maxdepth 1 -name "*.md" ! -name "README*" 2>/dev/null

# If argument provided, use that path instead
```

### Step 2 — Parse each skill

For each skill file found:
1. Read the full file
2. Extract YAML frontmatter (between first `---` and second `---`)
3. Parse: name, description, allowed-tools, effort, argument-hint, version
4. Note presence/absence of each field
5. Read the body content for structure analysis

### Step 3 — Score and infer

Apply the scoring criteria above to each skill:
- Check frontmatter fields
- Evaluate description quality (does it answer "when to use"? is it under 1024 chars?)
- Evaluate allowed-tools scope (is Bash used when only Read would suffice? are tools scoped with wildcards when possible?)
- Infer effort level from content analysis
- Compare inferred vs declared effort (if set)
- Evaluate content structure (scan for "When to Use", "Purpose", "Example", "Workflow" sections)

### Step 4 — Output

Produce a structured report:

```
# Skills Audit — [project name or path]
Date: [today] | Scanned: N skills

## Summary
| Status | Count |
|--------|-------|
| ✅ Good (≥80%) | N |
| ⚠️ Needs work (60–79%) | N |
| ❌ Fix (<60%) | N |

**Effort coverage**: N/N skills have effort field set

---

## Per-Skill Results

### [skill-name] — [score]/15 [✅/⚠️/❌]

| Criterion | Score | Notes |
|-----------|-------|-------|
| name | ✅ 1/1 | — |
| description | ⚠️ 1/2 | Missing "Use when" phrasing |
| allowed-tools | ✅ 2/2 | Well-scoped |
| effort | ❌ 0/3 | Missing — Recommended: high |
| content structure | ⚠️ 2/4 | No examples section |

**Effort inference**: `high` — skill performs security analysis with adversarial reasoning
  Signals: "threat", "attack surface", "vulnerability scoring" in content; spawns 4 agents

**Priority fixes** (ordered by impact):
1. Add `effort: high` to frontmatter
2. Add "Use when" to description
3. Add a concrete usage example section

---
```

After all skills: print a **Fix Summary** — all missing effort fields with recommended values, ready to copy-paste.

---

## Fix Summary Format

At the end, print a ready-to-use patch block for all missing/mismatched effort fields:

```
## Recommended effort fields (copy-paste ready)

skill-name-1: effort: low     # mechanical scaffold
skill-name-2: effort: high    # security analysis, spawns agents
skill-name-3: effort: medium  # code review, bounded scope
```

And a 1-line count: `N skills need effort field · N mismatches · N missing allowed-tools`

Related Skills

audit-agents-skills

3046

from FlorianBruniaux/claude-code-ultimate-guide

Audit Claude Code agents, skills, and commands for quality and production readiness. Use when evaluating skill quality, checking production readiness scores, or comparing agents against best-practice templates.

voice-refine

3046

from FlorianBruniaux/claude-code-ultimate-guide

Transform verbose voice input into structured, token-efficient Claude prompts. Use when cleaning up voice memos, dictation output, or speech-to-text transcriptions that contain filler words, repetitions, and unstructured thoughts.

talk-stage6-revision

3046

from FlorianBruniaux/claude-code-ultimate-guide

Produces revision sheets with quick navigation by act, a master concept-to-URL table, Q&A cheat-sheet with 6-10 anticipated questions, glossary, and external resources list. Use when preparing for a talk with Q&A, creating shareable reference material for attendees, or building a safety-net glossary for live delivery.

talk-stage5-script

3046

from FlorianBruniaux/claude-code-ultimate-guide

Produces a complete 5-act pitch with speaker notes, a slide-by-slide specification, and a ready-to-paste Kimi prompt for AI slide generation. Requires validated angle and title from Stage 4. Use when you have a confirmed talk angle and need the full script, slide spec, and AI-generated presentation prompt.

talk-stage4-position

3046

from FlorianBruniaux/claude-code-ultimate-guide

Generates 3-4 strategic talk angles with strength/weakness analysis, title options, CFP descriptions, and a peer feedback draft, then enforces a mandatory CHECKPOINT for user confirmation before scripting. Use when deciding how to frame a talk, preparing a CFP submission, or choosing between multiple narrative angles.

talk-stage3-concepts

3046

from FlorianBruniaux/claude-code-ultimate-guide

Builds a numbered, categorized concept catalogue from the talk summary and timeline, scoring each concept HIGH / MEDIUM / LOW for talk potential with optional repo enrichment. Use when you need a structured inventory of concepts before choosing a talk angle, or when assessing which ideas have the strongest presentation potential.

talk-stage2-research

3046

from FlorianBruniaux/claude-code-ultimate-guide

Performs git archaeology, changelog analysis, and builds a verified factual timeline by cross-referencing git history with source material. REX mode only — skipped automatically in Concept mode. Use when building a REX talk and you need verified commit metrics, release timelines, and contributor data from a git repository.

talk-stage1-extract

3046

from FlorianBruniaux/claude-code-ultimate-guide

Extracts and structures source material (articles, transcripts, notes) into a talk summary with narrative arc, themes, metrics, and gaps. Auto-detects REX vs Concept type. Use when starting a new talk from any source material or auditing existing material before committing to a talk.

talk-pipeline

3046

from FlorianBruniaux/claude-code-ultimate-guide

Orchestrates the complete talk preparation pipeline from raw material to revision sheets, running 6 stages in sequence with human-in-the-loop checkpoints for REX or Concept mode talks. Use when starting a new talk pipeline, resuming a pipeline from a specific stage, or running the full end-to-end preparation workflow.

skill-creator

3046

from FlorianBruniaux/claude-code-ultimate-guide

Scaffold a new Claude Code skill with SKILL.md, frontmatter, and bundled resources. Use when creating a custom skill, standardizing skill structure across a team, or packaging a skill for distribution.

rtk-optimizer

3046

from FlorianBruniaux/claude-code-ultimate-guide

Wrap high-verbosity shell commands with RTK to reduce token consumption. Use when running git log, git diff, cargo test, pytest, or other verbose CLI output that wastes context window tokens.

release-notes-generator

3046

from FlorianBruniaux/claude-code-ultimate-guide

Generate release notes in 3 formats (CHANGELOG.md, PR body, Slack announcement) from git commits. Automatically categorizes changes and converts technical language to user-friendly messaging. Use for releases, changelogs, version notes, what's new summaries, or ship announcements.