eval-skills
Run structured evaluations on skills to measure quality and track improvements.
Best use case
eval-skills is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using eval-skills should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/eval-skills/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How eval-skills Compares
| Feature / Agent | eval-skills | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Run structured evaluations on skills to measure quality and track improvements.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Eval Skills
## Steps
### 1. Determine skills to evaluate
If names provided via `$ARGUMENTS`, evaluate those. Otherwise list skills
with `evals/evals.json` files and ask user to pick (accept "all").
### 2. Create iteration directory
```bash
mkdir -p .claude/evals-workspace/iteration-<N>
```
Use next sequential number.
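One way to pick the next sequential number is to scan the workspace for existing `iteration-<N>` directories. The sketch below is illustrative only: the helper name `next_iteration_dir` is hypothetical, and starting at 1 for an empty workspace is an assumption the skill does not state.

```python
import re
from pathlib import Path


def next_iteration_dir(workspace: Path) -> Path:
    """Create and return iteration-<N>, one past the highest existing N.

    Assumption: numbering starts at 1 when no iteration directory exists.
    """
    workspace.mkdir(parents=True, exist_ok=True)
    pattern = re.compile(r"iteration-(\d+)")
    numbers = [
        int(match.group(1))
        for child in workspace.iterdir()
        if child.is_dir() and (match := pattern.fullmatch(child.name))
    ]
    target = workspace / f"iteration-{max(numbers, default=0) + 1}"
    target.mkdir()
    return target
```

Calling `next_iteration_dir(Path(".claude/evals-workspace"))` would then create the directory that the `mkdir -p` command above targets. Taking the maximum instead of counting directories keeps the number increasing even if an earlier iteration was deleted.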
### 3. Run eval cases
For each test case in `evals.json`, run twice:
- **With skill**: subagent with skill loaded, save to `iteration-<N>/<skill>-<id>/with_skill/outputs/`
- **Without skill**: subagent without skill, save to `iteration-<N>/<skill>-<id>/without_skill/outputs/`
Each run starts with clean context.
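The output layout for the two runs can be sketched as follows. This only prepares the directories named above; how the subagents are launched is outside this example, and the helper name `output_dirs` is hypothetical.

```python
from pathlib import Path

# The two variants every test case is run under.
VARIANTS = ("with_skill", "without_skill")


def output_dirs(iteration_dir: Path, skill: str, case_id: str) -> dict[str, Path]:
    """Create iteration-<N>/<skill>-<id>/<variant>/outputs/ for both variants."""
    case_dir = iteration_dir / f"{skill}-{case_id}"
    dirs = {}
    for variant in VARIANTS:
        out = case_dir / variant / "outputs"
        out.mkdir(parents=True, exist_ok=True)
        dirs[variant] = out
    return dirs
```

Keeping the two variants as siblings under one `<skill>-<id>` directory makes it straightforward to compare them per test case later.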
### 4. Grade
Evaluate assertions against output. Save `grading.json`:
```json
{
  "assertion_results": [{"text": "...", "passed": true, "evidence": "..."}],
  "summary": {"passed": 3, "failed": 1, "total": 4, "pass_rate": 0.75}
}
```
Require concrete evidence for every PASS.
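As a rough sketch, the `summary` object can be derived from `assertion_results`, with the evidence rule enforced at the same time. The function name `summarize` is hypothetical, and rejecting a PASS with empty evidence by raising an error is one possible reading of the rule, not something the skill prescribes.

```python
def summarize(assertion_results: list[dict]) -> dict:
    """Build the `summary` object from a list of assertion results.

    Assumption: a PASS whose evidence is missing or blank is rejected.
    """
    for result in assertion_results:
        evidence = (result.get("evidence") or "").strip()
        if result["passed"] and not evidence:
            raise ValueError(f"PASS without evidence: {result['text']!r}")

    total = len(assertion_results)
    passed = sum(1 for result in assertion_results if result["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Avoid dividing by zero when a case has no assertions.
        "pass_rate": passed / total if total else 0.0,
    }
```

With three passing assertions out of four, this yields the same numbers as the `summary` in the JSON example above.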
### 5. Aggregate
Save `iteration-<N>/benchmark.json` with mean pass rates (with/without skill) and delta.
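The aggregation could look roughly like this. Two things here are assumptions, because the skill does not specify them: that each `grading.json` sits at `<skill>-<id>/<variant>/grading.json`, and the key names used in `benchmark.json`.

```python
import json
from pathlib import Path
from statistics import mean


def aggregate(iteration_dir: Path) -> dict:
    """Average pass rates per variant and write benchmark.json.

    Assumed layout: <iteration_dir>/<skill>-<id>/<variant>/grading.json
    """
    rates = {"with_skill": [], "without_skill": []}
    for variant, values in rates.items():
        for grading in sorted(iteration_dir.glob(f"*/{variant}/grading.json")):
            summary = json.loads(grading.read_text())["summary"]
            values.append(summary["pass_rate"])

    with_mean = mean(rates["with_skill"]) if rates["with_skill"] else 0.0
    without_mean = mean(rates["without_skill"]) if rates["without_skill"] else 0.0
    benchmark = {
        # Key names are illustrative, not a documented schema.
        "with_skill_mean_pass_rate": with_mean,
        "without_skill_mean_pass_rate": without_mean,
        "delta": with_mean - without_mean,
    }
    (iteration_dir / "benchmark.json").write_text(json.dumps(benchmark, indent=2))
    return benchmark
```

A positive `delta` means the runs with the skill loaded passed more assertions on average than the baseline runs.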
### 6. Present results
Show per-eval pass rates, overall delta, always-pass candidates (remove?),
always-fail candidates (revise?). Save feedback to `feedback.json`.
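One possible way to find the always-pass and always-fail candidates is to compare the two runs of a test case assertion by assertion. The sketch assumes "always" means the same outcome both with and without the skill, and that assertions are matched by their `text` field; both are interpretations, and `classify_assertions` is a hypothetical name.

```python
def classify_assertions(
    with_skill: list[dict], without_skill: list[dict]
) -> dict[str, list[str]]:
    """Split assertions into always-pass and always-fail candidates.

    Assumption: an assertion is "always" X when it has outcome X in both
    the with-skill and without-skill runs of the same test case.
    """
    baseline = {result["text"]: result["passed"] for result in without_skill}
    always_pass, always_fail = [], []
    for result in with_skill:
        text = result["text"]
        if text not in baseline:
            continue  # no counterpart in the baseline run, cannot compare
        if result["passed"] and baseline[text]:
            always_pass.append(text)  # does not discriminate: remove?
        elif not result["passed"] and not baseline[text]:
            always_fail.append(text)  # never satisfied: revise?
    return {"always_pass": always_pass, "always_fail": always_fail}
```

Assertions that pass only with the skill loaded fall into neither list; those are the ones that show the skill making a difference.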
## Iteration loop
Update SKILL.md based on findings, run new iteration, compare benchmarks,
stop when pass rates plateau.
Related Skills
write-tests
Write or update unit tests for changed code, following project conventions and ensuring coverage meets the 80% threshold.
update-pr
Stage, commit, and push follow-up changes to an existing feature branch or PR. Use for quick iterations.
setup-environment
Set up or verify the local development environment. Use when starting work in a fresh clone or new machine, when commands fail with missing dependencies or broken imports, or before running `make check`/`make test` for the first time in a session.
review-architecture
Review architecture quality, maintainability, and developer experience.
make-yt-issue
Ensure a YouTrack issue exists before starting work. Validates existing tickets or creates new ones.
local-code-review
Review local code changes for correctness, regressions, missing tests, and Databao-specific risks.
create-pr
Stage, commit, push, and open a GitHub PR following project conventions. Use when code is ready to ship.
create-branch
Create a feature branch following project naming conventions. Use when starting work on a ticket, after understanding the scope, or when the agent needs to branch off main for new work.
check-pr-comments
Fetch unresolved PR review threads, triage them, implement fixes, validate, reply in-thread, and resolve.
check-coverage
Run test coverage measurement, analyze results, and fix gaps when coverage falls below the 80% threshold.
autosteer
Run the full development pipeline autonomously without pausing between phases. Stops only on quality-gate failures.
swe-cli-skills
Senior engineer CLI expertise for AI agents — workflows, safety guardrails, gotchas, and anti-patterns across cloud, IaC, containers, databases, dev tools, and platforms