skill-evaluator

Evaluates agent skills against Anthropic's best practices. Use when asked to review, evaluate, assess, or audit a skill for quality. Analyzes SKILL.md structure, naming conventions, description quality, content organization, and identifies anti-patterns. Produces actionable improvement recommendations.

25 stars

Best use case

skill-evaluator is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Evaluates agent skills against Anthropic's best practices. Use when asked to review, evaluate, assess, or audit a skill for quality. Analyzes SKILL.md structure, naming conventions, description quality, content organization, and identifies anti-patterns. Produces actionable improvement recommendations.

Teams using skill-evaluator should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/skill-evaluator/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/gotalab/skillport/skill-evaluator/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/skill-evaluator/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How skill-evaluator Compares

Feature / Agentskill-evaluatorStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Evaluates agent skills against Anthropic's best practices. Use when asked to review, evaluate, assess, or audit a skill for quality. Analyzes SKILL.md structure, naming conventions, description quality, content organization, and identifies anti-patterns. Produces actionable improvement recommendations.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Skill Evaluator (WIP)

Evaluates skills against Anthropic's official best practices for agent skill authoring. Produces structured evaluation reports with scores and actionable recommendations.

## Quick Start

1. Read the skill's SKILL.md and understand its purpose
2. Run automated validation: `scripts/validate_skill.py <skill-path>`
3. Perform manual evaluation against criteria below
4. Generate evaluation report with scores and recommendations

## Evaluation Workflow

### Step 1: Automated Validation

Run the validation script first:

```bash
scripts/validate_skill.py <path/to/skill>
```

This checks:
- SKILL.md exists with valid YAML frontmatter
- Name follows conventions (lowercase, hyphens, max 64 chars)
- Description is present and under 1024 chars
- Body is under 500 lines
- File references are one-level deep

### Step 2: Manual Evaluation

Evaluate each dimension and assign a score (1-5):

#### A. Naming (Weight: 10%)

| Score | Criteria |
|-------|----------|
| 5 | Gerund form (-ing), clear purpose, memorable |
| 4 | Descriptive, follows conventions |
| 3 | Acceptable but could be clearer |
| 2 | Vague or misleading |
| 1 | Violates naming rules |

**Rules**: Max 64 chars, lowercase + numbers + hyphens only, no reserved words (anthropic, claude), no XML tags.

**Good**: `processing-pdfs`, `analyzing-spreadsheets`, `building-dashboards`
**Bad**: `pdf`, `my-skill`, `ClaudeHelper`, `anthropic-tools`

#### B. Description (Weight: 20%)

| Score | Criteria |
|-------|----------|
| 5 | Clear functionality + specific activation triggers + third person |
| 4 | Good description with some triggers |
| 3 | Adequate but missing triggers or vague |
| 2 | Too brief or unclear purpose |
| 1 | Missing or unhelpful |

**Must include**: What the skill does AND when to use it.
**Good**: "Extracts text from PDFs. Use when working with PDF documents for text extraction, form parsing, or content analysis."
**Bad**: "A skill for PDFs." or "Helps with documents."

#### C. Content Quality (Weight: 30%)

| Score | Criteria |
|-------|----------|
| 5 | Concise, assumes Claude intelligence, actionable instructions |
| 4 | Generally good, minor verbosity |
| 3 | Some unnecessary explanations or redundancy |
| 2 | Overly verbose or confusing |
| 1 | Bloated, explains obvious concepts |

**Ask**: "Does Claude really need this explanation?" Remove anything Claude already knows.

#### D. Structure & Organization (Weight: 25%)

| Score | Criteria |
|-------|----------|
| 5 | Excellent progressive disclosure, clear navigation, optimal length |
| 4 | Good organization, appropriate file splits |
| 3 | Acceptable but could be better organized |
| 2 | Poor organization, missing references, or bloated SKILL.md |
| 1 | No structure, everything dumped in SKILL.md |

**Check**:
- SKILL.md under 500 lines
- References are one-level deep (no nested chains)
- Long reference files (>100 lines) have table of contents
- Uses forward slashes in all paths

#### E. Degrees of Freedom (Weight: 10%)

| Score | Criteria |
|-------|----------|
| 5 | Perfect match: high freedom for flexible tasks, low for fragile operations |
| 4 | Generally appropriate freedom levels |
| 3 | Acceptable but could be better calibrated |
| 2 | Mismatched: too rigid or too loose |
| 1 | Completely wrong freedom level for the task type |

**Guideline**:
- High freedom (text): Multiple valid approaches, context-dependent
- Medium freedom (parameterized): Preferred pattern exists, some variation OK
- Low freedom (specific scripts): Fragile operations, exact sequence required

#### F. Anti-Pattern Check (Weight: 5%)

Deduct points for each anti-pattern found:

- [ ] Too many options without clear recommendation (-1)
- [ ] Time-sensitive information with date conditionals (-1)
- [ ] Inconsistent terminology (-1)
- [ ] Windows-style paths (backslashes) (-1)
- [ ] Deeply nested references (more than one level) (-2)
- [ ] Scripts that punt error handling to Claude (-1)
- [ ] Magic numbers without justification (-1)

### Step 3: Generate Report

Use this template:

```markdown
# Skill Evaluation Report: [skill-name]

## Summary
- **Overall Score**: X.X/5.0
- **Recommendation**: [Ready for publication / Needs minor improvements / Needs major revision]

## Dimension Scores
| Dimension | Score | Weight | Weighted |
|-----------|-------|--------|----------|
| Naming | X/5 | 10% | X.XX |
| Description | X/5 | 20% | X.XX |
| Content Quality | X/5 | 30% | X.XX |
| Structure | X/5 | 25% | X.XX |
| Degrees of Freedom | X/5 | 10% | X.XX |
| Anti-Patterns | X/5 | 5% | X.XX |
| **Total** | | 100% | **X.XX** |

## Strengths
- [List 2-3 things done well]

## Areas for Improvement
- [List specific issues with actionable fixes]

## Anti-Patterns Found
- [List any anti-patterns detected]

## Recommendations
1. [Priority 1 fix]
2. [Priority 2 fix]
3. [Priority 3 fix]

## Pre-Publication Checklist
- [ ] Description is specific with activation triggers
- [ ] SKILL.md under 500 lines
- [ ] One-level-deep file references
- [ ] Forward slashes in all paths
- [ ] No time-sensitive information
- [ ] Consistent terminology
- [ ] Concrete examples provided
- [ ] Scripts handle errors explicitly
- [ ] All configuration values justified
- [ ] Required packages listed
- [ ] Tested with Haiku, Sonnet, Opus
```

## Score Interpretation

| Score Range | Rating | Action |
|-------------|--------|--------|
| 4.5 - 5.0 | Excellent | Ready for publication |
| 4.0 - 4.4 | Good | Minor improvements recommended |
| 3.0 - 3.9 | Acceptable | Several improvements needed |
| 2.0 - 2.9 | Needs Work | Major revision required |
| 1.0 - 1.9 | Poor | Fundamental redesign needed |

## References

- [references/evaluation-criteria.md](references/evaluation-criteria.md) - Detailed evaluation criteria with examples
- [references/scoring-rubric.md](references/scoring-rubric.md) - Complete scoring rubric and edge cases

## Examples

See [evaluations/](evaluations/) for example evaluation scenarios.

Related Skills

tech-stack-evaluator

25
from ComeOnOliver/skillshub

Comprehensive technology stack evaluation and comparison tool with TCO analysis, security assessment, and intelligent recommendations for engineering teams

ux-evaluator

25
from ComeOnOliver/skillshub

This skill should be used when evaluating UI components against UX best practices. Use for reviewing buttons, navigation elements, spacing, visual hierarchy, or any interface element. Provides a systematic 3-dimension framework (Position, Visual Weight, Spacing) aligned with industry standards (Balsamiq, Nielsen heuristics). Invoke when user asks to "review UX", "check button design", "evaluate layout", or references design guidelines.

NeMo Evaluator SDK - Enterprise LLM Benchmarking

25
from ComeOnOliver/skillshub

## Quick Start

langsmith-evaluator

25
from ComeOnOliver/skillshub

INVOKE THIS SKILL when building evaluation pipelines for LangSmith. Covers three core components: (1) Creating Evaluators - LLM-as-Judge, custom code; (2) Defining Run Functions - how to capture outputs and trajectories from your agent; (3) Running Evaluations - locally with evaluate() or auto-run via LangSmith. Uses the langsmith CLI tool.

Daily Logs

25
from ComeOnOliver/skillshub

Record the user's daily activities, progress, decisions, and learnings in a structured, chronological format.

Socratic Method: The Dialectic Engine

25
from ComeOnOliver/skillshub

This skill transforms Claude into a Socratic agent — a cognitive partner who guides

Sokratische Methode: Die Dialektik-Maschine

25
from ComeOnOliver/skillshub

Dieser Skill verwandelt Claude in einen sokratischen Agenten — einen kognitiven Partner, der Nutzende durch systematisches Fragen zur Wissensentdeckung führt, anstatt direkt zu instruieren.

College Football Data (CFB)

25
from ComeOnOliver/skillshub

Before writing queries, consult `references/api-reference.md` for endpoints, conference IDs, team IDs, and data shapes.

College Basketball Data (CBB)

25
from ComeOnOliver/skillshub

Before writing queries, consult `references/api-reference.md` for endpoints, conference IDs, team IDs, and data shapes.

Betting Analysis

25
from ComeOnOliver/skillshub

Before writing queries, consult `references/api-reference.md` for odds formats, command parameters, and key concepts.

Research Proposal Generator

25
from ComeOnOliver/skillshub

Generate high-quality academic research proposals for PhD applications following Nature Reviews-style academic writing conventions.

Paper Slide Deck Generator

25
from ComeOnOliver/skillshub

Transform academic papers and content into professional slide deck images with automatic figure extraction.