skills-eval

Evaluate and improve Claude skill quality through auditing

3,891 stars

Best use case

skills-eval is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using skills-eval should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/nm-abstract-skills-eval/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/athola/nm-abstract-skills-eval/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/nm-abstract-skills-eval/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How skills-eval Compares

| Feature / Agent | skills-eval | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Evaluate and improve Claude skill quality through auditing

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

> **Night Market Skill** — ported from [claude-night-market/abstract](https://github.com/athola/claude-night-market/tree/master/plugins/abstract). For the full experience with agents, hooks, and commands, install the Claude Code plugin.


# Skills Evaluation and Improvement

## Table of Contents

1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Evaluation Workflow](#evaluation-workflow)
4. [Evaluation and Optimization](#evaluation-and-optimization)
5. [Resources](#resources)

## Overview

This framework audits Claude skills against quality standards to improve performance and reduce token consumption. Automated tools analyze skill structure, measure context usage, and identify specific technical improvements. Run verification commands after each audit to confirm fixes work correctly.

The `skills-auditor` provides structural analysis, while the `improvement-suggester` ranks fixes by impact. Compliance is verified through the `compliance-checker`. Runtime efficiency is monitored by `tool-performance-analyzer` and `token-usage-tracker`.

## Quick Start

### Basic Audit
Run a full audit of all skills or target a specific file to identify structural issues.
```bash
# Audit all skills
make audit-all

# Audit specific skill
make audit-skill TARGET=path/to/skill/SKILL.md
```

### Analysis and Optimization
Use `skill_analyzer.py` for complexity checks and `token_estimator.py` to verify the context budget.
```bash
make analyze-skill TARGET=path/to/skill/SKILL.md
make estimate-tokens TARGET=path/to/skill/SKILL.md
```
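
To give a feel for what a context-budget check involves, here is a minimal sketch. It is not the logic of `token_estimator.py`, which is not shown here. It assumes a rough rule of thumb of about four characters per token for English text, and the names `estimate_tokens` and `within_budget` are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough estimate: character count divided by an ASSUMED chars-per-token ratio."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


def within_budget(text: str, budget: int, chars_per_token: float = 4.0) -> bool:
    """Return True when the estimated token count fits the given budget."""
    return estimate_tokens(text, chars_per_token) <= budget
```

A real tokenizer will give different counts, so treat the result as a coarse screen and rely on `make estimate-tokens` for the project's own figure.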

### Improvements
Generate a prioritized plan and verify standards compliance using `improvement_suggester.py` and `compliance_checker.py`.
```bash
make improve-skill TARGET=path/to/skill/SKILL.md
make check-compliance TARGET=path/to/skill/SKILL.md
```

## Evaluation Workflow

Start with `make audit-all` to inventory skills and identify high-priority targets. For each skill requiring attention, run analysis with `analyze-skill` to map complexity. Generate an improvement plan, apply fixes, and run `check-compliance` to verify the skill meets project standards. Finalize by checking the token budget for efficiency.
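
The per-skill part of this workflow could be scripted roughly as follows. This is a sketch that only strings together the `make` targets named above. The function names and the `dry_run` flag are hypothetical, and the manual "apply fixes" step is left to the author between `improve-skill` and `check-compliance`.

```python
import shlex
import subprocess
from typing import List


def workflow_commands(target: str) -> List[List[str]]:
    """Build the per-skill make invocations in the order the workflow describes."""
    return [
        ["make", "analyze-skill", f"TARGET={target}"],
        ["make", "improve-skill", f"TARGET={target}"],
        # Fixes are applied by hand here, before compliance is re-checked.
        ["make", "check-compliance", f"TARGET={target}"],
        ["make", "estimate-tokens", f"TARGET={target}"],
    ]


def run_workflow(target: str, dry_run: bool = True) -> List[str]:
    """Return the command lines; execute them only when dry_run is False."""
    lines = []
    for cmd in workflow_commands(target):
        lines.append(shlex.join(cmd))
        if not dry_run:
            # check=True stops the workflow at the first failing step.
            subprocess.run(cmd, check=True)
    return lines
```

Calling `run_workflow("path/to/skill/SKILL.md")` with the default `dry_run=True` only lists the commands, which is a safe way to review the sequence before running it.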

## Evaluation and Optimization

Quality assessments use the `skills-auditor` and `improvement-suggester` to generate detailed reports. Performance analysis focuses on token efficiency through the `token-usage-tracker` and tool performance via `tool-performance-analyzer`. For standards compliance, the `compliance-checker` automates common fixes for structural issues.

### Scoring and Prioritization

We evaluate skills across five dimensions: structure compliance, content quality, token efficiency, activation reliability, and tool integration. Scores above 90 represent production-ready skills, while scores below 50 indicate critical issues requiring immediate attention.
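
The thresholds above can be illustrated with a small sketch. Several things here are assumptions rather than the framework's documented behavior: a 0 to 100 scale, equal weighting of the five dimensions, the dictionary keys, and the `needs-improvement` label for the middle band.

```python
from statistics import mean
from typing import Dict

# The five dimensions named in the text; the key spellings are hypothetical.
DIMENSIONS = (
    "structure_compliance",
    "content_quality",
    "token_efficiency",
    "activation_reliability",
    "tool_integration",
)


def overall_score(scores: Dict[str, float]) -> float:
    """Average the five dimension scores (equal weights are an ASSUMPTION)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return mean(scores[d] for d in DIMENSIONS)


def classify(score: float) -> str:
    """Map a score to a band using the thresholds stated in the text."""
    if score > 90:
        return "production-ready"
    if score < 50:
        return "critical"
    return "needs-improvement"  # hypothetical label for the 50-90 band
```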

Improvements are prioritized by impact. Critical issues include security vulnerabilities or broken functionality. High-priority items cover structural flaws that hinder discoverability. Medium and low priorities focus on best practices and minor optimizations.
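
One way to picture this ordering is a stable sort over the four priority levels. The `Issue` type and `prioritize` function below are hypothetical illustrations, not part of `improvement_suggester.py`.

```python
from dataclasses import dataclass
from typing import List

# Lower rank sorts first; the level names follow the text above.
PRIORITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Issue:
    description: str
    priority: str


def prioritize(issues: List[Issue]) -> List[Issue]:
    """Order issues from critical to low, keeping input order within a level."""
    unknown = [i.priority for i in issues if i.priority not in PRIORITY_ORDER]
    if unknown:
        raise ValueError(f"unknown priority levels: {unknown}")
    return sorted(issues, key=lambda i: PRIORITY_ORDER[i.priority])
```

For example, a list containing a low-priority wording fix, a critical broken command, and a high-priority missing trigger description would come back with the broken command first and the wording fix last.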

### Structural Patterns

**Deprecated**: `skills/shared/modules/` directories. Shared modules must be relocated into the consuming skill's own `modules/` directory. The evaluator flags any remaining `skills/shared/` as a structural warning.

**Current**: Each skill owns its modules at `skills/<skill-name>/modules/`. Cross-skill references use relative paths (e.g., `../skill-authoring/modules/anti-rationalization.md`).
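
A check along these lines could be sketched as below. It is an illustration of the rule, not the evaluator's actual code. The function names and the warning text are hypothetical, and the repository layout (`skills/` under a root directory) is assumed from the paths above.

```python
from pathlib import Path
from typing import List


def find_deprecated_shared(repo_root: str) -> List[str]:
    """Return a warning if a deprecated skills/shared/ directory still exists."""
    root = Path(repo_root)
    shared = root / "skills" / "shared"
    if not shared.is_dir():
        return []
    rel = shared.relative_to(root).as_posix()
    return [f"structural warning: deprecated directory {rel}/"]


def reference_exists(skill_dir: str, relative_ref: str) -> bool:
    """Check that a cross-skill relative path resolves to an existing file."""
    return (Path(skill_dir) / relative_ref).resolve().is_file()
```

With this layout, `reference_exists("skills/skills-eval", "../skill-authoring/modules/anti-rationalization.md")` would confirm that the cross-skill reference used as the example above points at a real file.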

## Resources

### Shared Modules: Cross-Skill Patterns
- **Anti-Rationalization Patterns**: See [anti-rationalization.md](../skill-authoring/modules/anti-rationalization.md)
- **Enforcement Language**: See [enforcement-language.md](../shared-patterns/modules/workflow-patterns.md)
- **Trigger Patterns**: See [trigger-patterns.md](modules/evaluation-criteria.md)

### Skill-Specific Modules
- **Trigger Isolation Analysis**: See `modules/trigger-isolation-analysis.md`
- **Skill Authoring Best Practices**: See `modules/skill-authoring-best-practices.md`
- **Authoring Checklist**: See `modules/authoring-checklist.md`
- **Evaluation Workflows**: See `modules/evaluation-workflows.md`
- **Quality Metrics**: See `modules/quality-metrics.md`
- **Advanced Tool Use Analysis**: See `modules/advanced-tool-use-analysis.md`
- **Evaluation Framework**: See `modules/evaluation-framework.md`
- **Integration Patterns**: See `modules/integration.md`
- **Troubleshooting**: See `modules/troubleshooting.md`
- **Pressure Testing**: See `modules/pressure-testing.md`
- **Integration Testing**: See `modules/integration-testing.md`
- **Multi-Metric Evaluation**: See `modules/multi-metric-evaluation-methodology.md`
- **Performance Benchmarking**: See `modules/performance-benchmarking.md`

### Tools and Automation
- **Tools**: Executable analysis utilities in `scripts/` directory.
- **Automation**: Setup and validation scripts in `scripts/automation/`.

Related Skills

find-skills

3891
from openclaw/skills

Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.

General Utilities

ht-skills

3891
from openclaw/skills

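Manage collections and documents in 灏天文库 (Haotian Wenku): create collections, create documents, query collections and documents, update documents, change which collection a document belongs to, and manage the document hierarchy. Suited to scenarios such as OpenClaw writing and uploading articles on its own, creating collections, adding documents to the library, and moving documents.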

Content & Documentation

web-skills-protocol

3891
from openclaw/skills

Auto-discover and use Web Skills Protocol (WSP) skills when interacting with websites. Use this skill whenever the user asks you to interact with, use, or perform actions on a website or web service — such as searching a site, placing an order, deploying an app, or calling a web API. Before scraping HTML or guessing at interfaces, check if the site publishes a skills.txt or agents.txt file that teaches you how to use it properly. If a website has complex elements (e.g., heavy JavaScript, interactive UIs), activating this skill can also help you understand the site's purpose and capabilities. Do NOT use for local file operations or non-web tasks.

Workflow & Productivity

clawdtm-skills

3891
from openclaw/skills

Review and rate Claude Code skills. See what humans and AI agents recommend.

General Utilities

micropython-skills/sensor

3891
from openclaw/skills

MicroPython sensor reading — DHT11/22, BME280, MPU6050, ADC, ultrasonic HC-SR04, photoresistor, generic I2C sensors.

Coding & Development

micropython-skills/network

3891
from openclaw/skills

MicroPython networking — WiFi STA/AP, HTTP requests, MQTT pub/sub, BLE, NTP time sync, WebSocket.

Coding & Development

micropython-skills/diagnostic

3891
from openclaw/skills

MicroPython device diagnostics — system info, I2C/SPI bus scan, pin state, filesystem, memory, performance benchmarks.

Embedded Systems & IoT

micropython-skills/algorithm

3891
from openclaw/skills

MicroPython on-device algorithms — PID controller, moving average, Kalman filter, state machine, task scheduler, data logger.

Coding & Development

micropython-skills/actuator

3891
from openclaw/skills

MicroPython actuator control — GPIO output, PWM (LED/servo/motor), stepper motor, WS2812 NeoPixel, buzzer.

Internet of Things

micropython-skills

3891
from openclaw/skills

Program and interact with embedded development boards (ESP32, ESP32-S3, ESP32-C3, ESP8266, NodeMCU, Raspberry Pi Pico, RP2040, STM32) through real-time REPL. This skill turns microcontroller hardware into an AI-programmable co-processor — read sensors, control actuators, flash firmware, diagnose devices, and deploy algorithms. Trigger when the user mentions any dev board or hardware interaction: ESP32, ESP8266, NodeMCU, Pico, 开发板, 板子, 单片机, 嵌入式, microcontroller, development board, sensor reading, GPIO, LED, motor, relay, I2C, SPI, UART, ADC, PWM, servo, DHT, BME280, temperature sensor, 传感器, 读传感器, 控制电机, 继电器, flash firmware, 烧录, 刷固件, 刷机, mpremote, MicroPython, IoT, MQTT, WiFi on board, 设备没反应, device not responding, or any task involving programming or controlling a physical microcontroller board.

Embedded Development

ml-model-eval-benchmark

3891
from openclaw/skills

Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.

Machine Learning

rules-eval

3891
from openclaw/skills

Evaluate and validate Claude Code rules in .claude/rules/ directories. Use for frontmatter, glob patterns, and quality audits.