skill-judge
Evaluate Agent Skill quality against official specifications. Use when reviewing SKILL.md files, auditing skill packages, improving skill design, or checking if a skill follows best practices. Provides 8-dimension scoring (120 points) with actionable improvements. Triggers on review skill, evaluate skill, audit skill, improve skill, skill quality, SKILL.md review.
Best use case
skill-judge is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using skill-judge should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/skill-judge/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
SKILL.md Source
# Skill Judge
Evaluate Agent Skills against official specifications and patterns derived from 17+ official examples.
## WHAT This Skill Does
Scores skills across 8 dimensions (120 points total) and provides specific, actionable improvement suggestions.
## WHEN To Use
- Reviewing/auditing a SKILL.md file
- Improving an existing skill's design
- Checking if a skill follows best practices
- Before publishing a skill to the ecosystem
**KEYWORDS**: review skill, evaluate skill, audit skill, skill quality, SKILL.md
## Installation
### OpenClaw / Moltbot / Clawbot
```bash
npx clawhub@latest install skill-judge
```
---
## Core Philosophy
### The Core Formula
> **Good Skill = Expert-only Knowledge − What Claude Already Knows**
A Skill's value = its **knowledge delta** — the gap between what it provides and what the model already knows.
| Type | Definition | Treatment |
|------|------------|-----------|
| **Expert** | Claude genuinely doesn't know this | Must keep — this is the Skill's value |
| **Activation** | Claude knows but may not think of | Keep if brief — serves as reminder |
| **Redundant** | Claude definitely knows this | Delete — wastes tokens |
**Good Skill ratio:** >70% Expert, <20% Activation, <10% Redundant
---
## Evaluation Dimensions (120 points)
### D1: Knowledge Delta (20 pts) — THE CORE DIMENSION
Does the Skill add genuine expert knowledge?
| Score | Criteria |
|-------|----------|
| 0-5 | Explains basics Claude knows (tutorials, standard library usage) |
| 6-10 | Mixed: some expert knowledge diluted by obvious content |
| 11-15 | Mostly expert knowledge with minimal redundancy |
| 16-20 | Pure knowledge delta — every paragraph earns its tokens |
**Red flags** (instant ≤5): "What is [basic concept]", step-by-step tutorials, generic best practices
**Green flags** (high delta): Decision trees, non-obvious trade-offs, edge cases from experience, "NEVER do X because [non-obvious reason]"
---
### D2: Mindset + Procedures (15 pts)
Does the Skill transfer expert thinking patterns AND domain-specific procedures?
| Score | Criteria |
|-------|----------|
| 0-3 | Only generic procedures Claude already knows |
| 4-7 | Has domain procedures but lacks thinking frameworks |
| 8-11 | Good balance: thinking patterns + domain-specific workflows |
| 12-15 | Expert-level: shapes thinking AND provides procedures Claude wouldn't know |
**Valuable thinking patterns:** "Before [action], ask yourself: Purpose? Constraints? Differentiation?"
**Valuable procedures:** Domain-specific sequences, non-obvious ordering, critical steps easy to miss
**Redundant procedures:** Generic file operations, standard programming patterns
---
### D3: Anti-Pattern Quality (15 pts)
Does the Skill have effective NEVER lists?
| Score | Criteria |
|-------|----------|
| 0-3 | No anti-patterns mentioned |
| 4-7 | Generic warnings ("avoid errors", "be careful") |
| 8-11 | Specific NEVER list with some reasoning |
| 12-15 | Expert-grade anti-patterns with WHY — things only experience teaches |
**Test:** Would an expert read the anti-pattern list and say "yes, I learned this the hard way"?
---
### D4: Specification Compliance — Especially Description (15 pts)
**The description is THE MOST IMPORTANT field.** It's the only thing the agent sees before deciding to load the skill.
| Score | Criteria |
|-------|----------|
| 0-5 | Missing frontmatter or invalid format |
| 6-10 | Has frontmatter but description is vague or incomplete |
| 11-13 | Valid frontmatter, description has WHAT but weak on WHEN |
| 14-15 | Perfect: comprehensive description with WHAT, WHEN, and trigger keywords |
**Description must answer:**
1. **WHAT**: What does this Skill do?
2. **WHEN**: In what situations should it be used?
3. **KEYWORDS**: What terms should trigger this Skill?
**Poor:** "Helps with document tasks"
**Good:** "Create, edit, and analyze .docx files. Use when working with Word documents, tracked changes, or professional document formatting."
---
### D5: Progressive Disclosure (15 pts)
Does the Skill implement proper content layering?
| Layer | Content | Size |
|-------|---------|------|
| 1: Metadata | name + description | ~100 tokens |
| 2: SKILL.md | Guidelines, decision trees | < 500 lines ideal |
| 3: Resources | scripts/, references/, assets/ | No limit |
| Score | Criteria |
|-------|----------|
| 0-5 | Everything dumped in SKILL.md (>500 lines, no structure) |
| 6-10 | Has references but unclear when to load them |
| 11-13 | Good layering with MANDATORY triggers present |
| 14-15 | Perfect: decision trees + explicit triggers + "Do NOT Load" guidance |
**Good trigger:** "**MANDATORY - READ ENTIRE FILE**: Before proceeding, you MUST read [`docx-js.md`](docx-js.md)"
**Bad trigger:** Just listing references at the end without loading guidance
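The difference between a good and a bad trigger can be checked mechanically, at least roughly. The sketch below is one possible heuristic, not part of the official specification: it labels each reference file by whether SKILL.md mentions it on a line that also carries loading guidance. The function name `classify_references`, the trigger phrases it looks for, and the file names `ooxml.md` and `redlining.md` are assumptions made for illustration.

```python
def classify_references(skill_md: str, reference_files: list[str]) -> dict[str, str]:
    """Label each reference file as 'triggered', 'listed', or 'orphan'.

    triggered: mentioned on a line that also contains loading guidance
    listed:    mentioned, but with no loading guidance on that line
    orphan:    never mentioned in SKILL.md at all
    """
    lines = skill_md.splitlines()
    result = {}
    for ref in reference_files:
        mentions = [line for line in lines if ref in line]
        if not mentions:
            result[ref] = "orphan"
        elif any(
            "MANDATORY" in line.upper() or "MUST READ" in line.upper()
            for line in mentions
        ):
            # Assumed heuristic: these phrases stand in for "explicit trigger".
            result[ref] = "triggered"
        else:
            result[ref] = "listed"
    return result


text = (
    "**MANDATORY - READ ENTIRE FILE**: Before proceeding, "
    "you MUST read [`docx-js.md`](docx-js.md)\n"
    "\n"
    "See also ooxml.md for details.\n"
)
labels = classify_references(text, ["docx-js.md", "ooxml.md", "redlining.md"])
print(labels)
```

A file labeled `listed` or `orphan` is a candidate for a deduction in this dimension; a reviewer should still read the surrounding text, since a keyword match cannot tell whether the guidance is actually clear.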
---
### D6: Freedom Calibration (15 pts)
Is specificity appropriate for the task's fragility?
| Task Type | Should Have | Why |
|-----------|-------------|-----|
| Creative/Design | High freedom | Multiple valid approaches |
| Code review | Medium freedom | Principles exist but judgment required |
| File format operations | Low freedom | One wrong byte corrupts file |
| Score | Criteria |
|-------|----------|
| 0-5 | Severely mismatched (rigid scripts for creative, vague for fragile) |
| 6-10 | Partially appropriate |
| 11-13 | Good calibration for most scenarios |
| 14-15 | Perfect freedom calibration throughout |
**Test:** "If the Agent makes a mistake, what's the consequence?" High consequence → Low freedom
---
### D7: Pattern Recognition (10 pts)
Does the Skill follow an established pattern?
| Pattern | ~Lines | When to Use |
|---------|--------|-------------|
| **Mindset** | ~50 | Creative tasks requiring taste |
| **Navigation** | ~30 | Multiple distinct scenarios (routes to sub-files) |
| **Philosophy** | ~150 | Art/creation requiring originality |
| **Process** | ~200 | Complex multi-step projects |
| **Tool** | ~300 | Precise operations on specific formats |
| Score | Criteria |
|-------|----------|
| 0-3 | No recognizable pattern, chaotic structure |
| 4-6 | Partially follows a pattern with significant deviations |
| 7-8 | Clear pattern with minor deviations |
| 9-10 | Masterful application of appropriate pattern |
---
### D8: Practical Usability (15 pts)
Can an Agent actually use this Skill effectively?
| Score | Criteria |
|-------|----------|
| 0-5 | Confusing, incomplete, or untested guidance |
| 6-10 | Usable but with noticeable gaps |
| 11-13 | Clear guidance for common cases |
| 14-15 | Comprehensive: edge cases, error handling, decision trees |
**Check for:** Decision trees for multi-path scenarios, working code examples, error handling/fallbacks, edge cases covered
---
## NEVER Do When Evaluating
- Give high scores just because it "looks professional"
- Ignore token waste — every redundant paragraph = deduction
- Let length impress you: a 43-line Skill can outperform a 500-line Skill
- Skip mentally testing the decision trees
- Forgive explaining basics with "provides helpful context"
- Overlook missing anti-patterns
- Undervalue the description field — poor description = skill never gets used
- Put "when to use" info only in the body (agent only sees description before loading)
---
## Evaluation Protocol
### Step 1: Knowledge Delta Scan
Read SKILL.md and mark each section:
- **[E] Expert**: Claude doesn't know this — value-add
- **[A] Activation**: Claude knows but reminder useful — acceptable
- **[R] Redundant**: Claude knows this — should delete
Calculate the E:A:R ratio (target: >70% Expert, <20% Activation, <10% Redundant)
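As a rough illustration of the ratio calculation, the sketch below turns a list of per-section tags into percentages and compares them with the targets given in this document. It assumes every section counts equally; weighting by tokens or lines would be just as reasonable, and the function names are made up for this example.

```python
from collections import Counter

# Thresholds from the target ratio: >70% Expert, <20% Activation, <10% Redundant.
TARGETS = {"E": 70.0, "A": 20.0, "R": 10.0}


def knowledge_ratio(tags: list[str]) -> dict[str, float]:
    """Return the percentage of sections tagged E, A, and R.

    Assumption: each section counts once, regardless of its length.
    """
    counts = Counter(tags)
    total = sum(counts[key] for key in TARGETS)
    if total == 0:
        raise ValueError("no tagged sections")
    return {key: 100.0 * counts[key] / total for key in TARGETS}


def meets_target(ratio: dict[str, float]) -> bool:
    """True when Expert is above its target and the other two are below theirs."""
    return (
        ratio["E"] > TARGETS["E"]
        and ratio["A"] < TARGETS["A"]
        and ratio["R"] < TARGETS["R"]
    )


# A hypothetical scan of 20 sections: 16 Expert, 3 Activation, 1 Redundant.
tags = ["E"] * 16 + ["A"] * 3 + ["R"] * 1
ratio = knowledge_ratio(tags)
print(ratio, meets_target(ratio))
```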
### Step 2: Structure Analysis
```
[ ] Valid frontmatter (name ≤64 chars, comprehensive description)
[ ] Total lines in SKILL.md
[ ] Reference files and sizes
[ ] Pattern identification (Mindset/Navigation/Philosophy/Process/Tool)
[ ] Loading triggers present (if references exist)
```
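Most of this checklist can be gathered automatically before the manual review. The following is a minimal sketch, assuming the frontmatter is a leading block fenced by `---` lines with simple single-line `key: value` pairs, and that reference files appear as Markdown links to local `.md` files. The function name `structure_report` and the sample skill text are hypothetical; pattern identification is left to the reviewer.

```python
import re


def structure_report(skill_md: str) -> dict:
    """Collect checklist facts for one SKILL.md text."""
    lines = skill_md.splitlines()
    fields = {}
    if lines and lines[0].strip() == "---":
        # Find the closing '---'; without one, treat the frontmatter as missing.
        end = next(
            (i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"),
            0,
        )
        for line in lines[1:end]:
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
    name = fields.get("name", "")
    # Assumed convention: references are Markdown links ending in .md.
    references = re.findall(r"\]\(([^)\s]+\.md)\)", skill_md)
    return {
        "has_frontmatter": bool(fields),
        "name_ok": 0 < len(name) <= 64,
        "has_description": bool(fields.get("description", "")),
        "total_lines": len(lines),
        "references": sorted(set(references)),
        "has_loading_trigger": "MANDATORY" in skill_md,
    }


sample = """---
name: docx
description: Create, edit, and analyze .docx files. Use when working with Word documents.
---
# DOCX

**MANDATORY - READ ENTIRE FILE**: Before proceeding, you MUST read [`docx-js.md`](docx-js.md)
"""
report = structure_report(sample)
print(report)
```

The report only establishes that a description exists. Whether it is comprehensive (WHAT, WHEN, keywords) still needs a human or model judgment.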
### Step 3: Score Each Dimension
For each dimension, find evidence, assign a score, and note improvements wherever the score is below the maximum.
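One way to keep the scoring honest is to validate each score against its dimension maximum before totaling. The sketch below uses the maxima listed in this document; the short keys `D1` to `D8`, the function names, and the example scores are illustrative assumptions.

```python
# Maximum points per dimension, as listed in the evaluation dimensions.
DIMENSION_MAX = {
    "D1": 20,  # Knowledge Delta
    "D2": 15,  # Mindset + Procedures
    "D3": 15,  # Anti-Pattern Quality
    "D4": 15,  # Specification Compliance
    "D5": 15,  # Progressive Disclosure
    "D6": 15,  # Freedom Calibration
    "D7": 10,  # Pattern Recognition
    "D8": 15,  # Practical Usability
}


def total_score(scores: dict[str, int]) -> int:
    """Validate every dimension score, then return the total out of 120."""
    if set(scores) != set(DIMENSION_MAX):
        raise ValueError("scores must cover exactly D1 through D8")
    for dim, value in scores.items():
        if not 0 <= value <= DIMENSION_MAX[dim]:
            raise ValueError(f"{dim} must be between 0 and {DIMENSION_MAX[dim]}")
    return sum(scores.values())


def needs_improvement(scores: dict[str, int]) -> list[str]:
    """Return the dimensions scored below their maximum, in order."""
    return [dim for dim in DIMENSION_MAX if scores[dim] < DIMENSION_MAX[dim]]


# Hypothetical scores for one evaluated skill.
scores = {"D1": 16, "D2": 12, "D3": 15, "D4": 13, "D5": 11, "D6": 15, "D7": 8, "D8": 12}
print(total_score(scores), needs_improvement(scores))
```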
### Step 4: Calculate Total & Grade
| Grade | Percentage | Meaning |
|-------|------------|---------|
| A | 90%+ (108+) | Excellent — production-ready |
| B | 80-89% (96-107) | Good — minor improvements needed |
| C | 70-79% (84-95) | Adequate — clear improvement path |
| D | 60-69% (72-83) | Below Average — significant issues |
| F | <60% (<72) | Poor — needs fundamental redesign |
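The grade boundaries in the table follow directly from the percentages applied to 120 points. A short sketch of that mapping, with `grade` as a hypothetical helper name:

```python
MAX_SCORE = 120


def grade(total: int) -> str:
    """Map a total score out of 120 to a letter grade using the table's cutoffs."""
    if not 0 <= total <= MAX_SCORE:
        raise ValueError(f"total must be between 0 and {MAX_SCORE}")
    percentage = 100 * total / MAX_SCORE
    # Cutoffs are checked from highest to lowest; anything under 60% is an F.
    for letter, floor in (("A", 90), ("B", 80), ("C", 70), ("D", 60)):
        if percentage >= floor:
            return letter
    return "F"


# Boundary scores from the table, each paired with the score just below it.
for total in (108, 107, 96, 95, 84, 83, 72, 71):
    print(total, grade(total))
```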
### Step 5: Generate Report
```markdown
# Skill Evaluation Report: [Skill Name]
## Summary
- **Total Score**: X/120 (X%)
- **Grade**: [A/B/C/D/F]
- **Pattern**: [Mindset/Navigation/Philosophy/Process/Tool]
- **Knowledge Ratio**: E:A:R = X:Y:Z
- **Verdict**: [One sentence]
## Dimension Scores
| Dimension | Score | Max | Notes |
|-----------|-------|-----|-------|
| D1: Knowledge Delta | X | 20 | |
| D2: Mindset + Procedures | X | 15 | |
| D3: Anti-Pattern Quality | X | 15 | |
| D4: Specification Compliance | X | 15 | |
| D5: Progressive Disclosure | X | 15 | |
| D6: Freedom Calibration | X | 15 | |
| D7: Pattern Recognition | X | 10 | |
| D8: Practical Usability | X | 15 | |
## Critical Issues
[Must-fix problems]
## Top 3 Improvements
1. [Highest impact with specific guidance]
2. [Second priority]
3. [Third priority]
```
---
## Common Failure Patterns
| Pattern | Symptom | Fix |
|---------|---------|-----|
| **Tutorial** | Explains what X is, basic library usage | Delete basics. Focus on expert decisions. |
| **Dump** | 800+ lines, everything included | Core in SKILL.md (<300 lines), details in references/ |
| **Orphan References** | References exist but never loaded | Add "MANDATORY - READ" at decision points |
| **Checkbox Procedure** | Step 1, Step 2... mechanical | Transform to "Before doing X, ask yourself..." |
| **Vague Warning** | "Be careful", "avoid errors" | Specific NEVER list with concrete examples |
| **Invisible Skill** | Great content, rarely activated | Fix description: WHAT + WHEN + KEYWORDS |
| **Wrong Location** | "When to use" in body, not description | Move triggers to description field |
| **Over-Engineered** | README, CHANGELOG, CONTRIBUTING | Delete. Only what Agent needs for the task. |
---
## The Meta-Question
> **"Would an expert in this domain say: 'Yes, this captures knowledge that took me years to learn'?"**
If yes → genuine value. If no → compressing what Claude already knows.
The best Skills are **compressed expert brains**: 10 years of experience in 50 lines.
Related Skills
schema-markup
Add, fix, or optimize schema markup and structured data. Use when the user mentions schema markup, structured data, JSON-LD, rich snippets, schema.org, FAQ schema, product schema, review schema, or breadcrumb schema.
prompt-engineering
Master advanced prompt engineering techniques to maximize LLM performance, reliability, and controllability in production. Use when optimizing prompts, improving LLM outputs, designing production prompt templates, or building AI-powered features.
professional-communication
Write effective professional messages for software teams. Use when drafting emails, Slack/Teams messages, meeting agendas, status updates, or translating technical concepts for non-technical audiences. Triggers on email, slack, teams, message, meeting agenda, status update, stakeholder communication, escalation, jargon translation.
persona-docs
Create persona documentation for a product or codebase. Use when asked to create persona docs, document target users, define user journeys, document onboarding flows, or when starting a new product and needing to define its audience. Persona docs should be the first documentation created for any product.
mermaid-diagrams
Create software diagrams using Mermaid syntax. Use when users need to create, visualize, or document software through diagrams including class diagrams, sequence diagrams, flowcharts, ERDs, C4 architecture diagrams, state diagrams, git graphs, and other diagram types. Triggers include requests to diagram, visualize, model, map out, or show the flow of a system.
game-changing-features
Find 10x product opportunities and high-leverage improvements. Use when the user wants strategic product thinking, mentions 10x, wants to find high-impact features, or asks what would make a product dramatically more valuable.
clear-writing
Write clear, concise prose for humans — documentation, READMEs, API docs, commit messages, error messages, UI text, reports, and explanations. Combines Strunk's rules for clearer prose with technical documentation patterns, structure templates, and review checklists.
brainstorming
Explore ideas before implementation through collaborative dialogue. Use before any creative work — creating features, building components, adding functionality, or modifying behavior. Turns ideas into fully formed designs and specs through structured conversation.
Article Illustrator
When the user wants to add illustrations to an article or blog post. Triggers on: "illustrate article", "add images to article", "generate illustrations", "article images", or requests to visually enhance written content. Analyzes article structure, identifies positions for visual aids, and generates illustrations using a Type x Style two-dimension approach.
subagent-driven-development
Execute implementation plans by dispatching a fresh subagent per task with two-stage review (spec compliance then code quality). Use when you have an implementation plan with mostly independent tasks and want high-quality, fast iteration within a single session.
skill-creator
WHAT: Guide for creating effective AI agent skills - modular packages that extend Claude's capabilities with specialized knowledge, workflows, and tools. WHEN: User wants to create, write, author, or update a skill. User asks about skill structure, SKILL.md format, or how to package domain knowledge for AI agents. KEYWORDS: "create a skill", "make a skill", "new skill", "skill template", "SKILL.md", "agent skill", "write a skill", "skill structure", "package a skill"
session-handoff
WHAT: Create comprehensive handoff documents that enable fresh AI agents to seamlessly continue work with zero ambiguity. Solves long-running agent context exhaustion problem. WHEN: (1) User requests handoff/memory/context save, (2) Context window approaches capacity, (3) Major task milestone completed, (4) Work session ending, (5) Resuming work with existing handoff. KEYWORDS: "save state", "create handoff", "context is full", "I need to pause", "resume from", "continue where we left off", "load handoff", "save progress", "session transfer", "hand off"