eval-skills

Run structured evaluations on skills to measure quality and track improvements.

16 stars

Best use case

eval-skills is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using eval-skills can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/eval-skills/SKILL.md --create-dirs "https://raw.githubusercontent.com/JetBrains/databao-cli/main/.claude/skills/eval-skills/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/eval-skills/SKILL.md inside your project
  3. Restart your AI agent; it will auto-discover the skill

How eval-skills Compares

Feature / Agent         | eval-skills   | Standard Approach
Platform Support        | Not specified | Limited / Varies
Context Awareness       | High          | Baseline
Installation Complexity | Unknown       | N/A

Frequently Asked Questions

What does this skill do?

Run structured evaluations on skills to measure quality and track improvements.

Where can I find the source code?

The source code is on GitHub in the JetBrains/databao-cli repository, under `.claude/skills/eval-skills/`.

SKILL.md Source

# Eval Skills

## Steps

### 1. Determine skills to evaluate

If skill names are provided via `$ARGUMENTS`, evaluate those. Otherwise, list
the skills that have `evals/evals.json` files and ask the user to pick (accept "all").
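
The discovery step above might look like the following minimal sketch. It assumes each skill lives in its own directory under a common root (for example `.claude/skills/`) and keeps its cases in `evals/evals.json`. The function names `skills_with_evals` and `select_skills` are hypothetical and not part of the skill.

```python
from pathlib import Path


def skills_with_evals(skills_root: Path) -> list[str]:
    """Return names of skill directories that contain evals/evals.json."""
    if not skills_root.is_dir():
        return []
    return sorted(
        d.name
        for d in skills_root.iterdir()
        if d.is_dir() and (d / "evals" / "evals.json").is_file()
    )


def select_skills(requested: list[str], available: list[str]) -> list[str]:
    """Resolve requested names (or the single word "all") against available skills."""
    if requested == ["all"]:
        return list(available)
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError(f"no evals.json found for: {', '.join(unknown)}")
    return requested
```

Raising on unknown names is a design choice in this sketch: it surfaces a typo in `$ARGUMENTS` before any subagent runs are started.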

### 2. Create iteration directory

```bash
mkdir -p .claude/evals-workspace/iteration-<N>
```

Use the next sequential number for `<N>`.
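
One way to pick the next number is to scan the workspace for existing `iteration-<N>` directories. The sketch below assumes numbering starts at 1 when the workspace is empty, which the skill does not specify; `next_iteration_dir` is a hypothetical helper name.

```python
import re
from pathlib import Path


def next_iteration_dir(workspace: Path) -> Path:
    """Create and return iteration-<N>, where N is one past the highest existing N."""
    workspace.mkdir(parents=True, exist_ok=True)
    numbers = [
        int(m.group(1))
        for p in workspace.iterdir()
        if p.is_dir() and (m := re.fullmatch(r"iteration-(\d+)", p.name))
    ]
    # Assumption: start at 1 when no earlier iteration exists.
    path = workspace / f"iteration-{max(numbers, default=0) + 1}"
    path.mkdir()
    return path
```

Taking the maximum rather than counting directories keeps the numbering monotonic even if an earlier iteration was deleted.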

### 3. Run eval cases

For each test case in `evals.json`, run twice:

- **With skill**: subagent with skill loaded, save to `iteration-<N>/<skill>-<id>/with_skill/outputs/`
- **Without skill**: subagent without skill, save to `iteration-<N>/<skill>-<id>/without_skill/outputs/`

Each run starts with clean context.
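
The directory layout above can be enumerated up front, so every case gets both runs. This is a sketch only: it takes the case ids as a plain list because the schema of `evals.json` is not shown here, and `output_dirs` is a hypothetical name.

```python
from pathlib import Path

# The two configurations each test case is run under.
CONFIGS = ("with_skill", "without_skill")


def output_dirs(iteration_dir: Path, skill: str, case_ids: list[str]) -> list[Path]:
    """List the outputs/ directory for every (case, configuration) pair."""
    return [
        iteration_dir / f"{skill}-{case_id}" / config / "outputs"
        for case_id in case_ids
        for config in CONFIGS
    ]
```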

### 4. Grade

Evaluate each assertion against the run's output, then save `grading.json`:

```json
{
  "assertion_results": [{"text": "...", "passed": true, "evidence": "..."}],
  "summary": {"passed": 3, "failed": 1, "total": 4, "pass_rate": 0.75}
}
```

Require concrete evidence for every PASS.
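
The `summary` block can be derived from `assertion_results` rather than written by hand, and the evidence rule can be checked at the same time. A minimal sketch, using the field names from the JSON above; `build_grading` is a hypothetical helper, and treating whitespace-only evidence as missing is an assumption.

```python
def build_grading(assertion_results: list[dict]) -> dict:
    """Build the grading.json payload, rejecting any PASS without evidence."""
    for result in assertion_results:
        # Assumption: empty or whitespace-only evidence counts as missing.
        if result["passed"] and not result.get("evidence", "").strip():
            raise ValueError(f"PASS without evidence: {result['text']!r}")
    total = len(assertion_results)
    passed = sum(1 for r in assertion_results if r["passed"])
    return {
        "assertion_results": assertion_results,
        "summary": {
            "passed": passed,
            "failed": total - passed,
            "total": total,
            "pass_rate": passed / total if total else 0.0,
        },
    }
```

With three passing assertions and one failing, this reproduces the summary in the example above (`pass_rate` of 0.75).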

### 5. Aggregate

Save `iteration-<N>/benchmark.json` with mean pass rates (with/without skill) and delta.
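
As a sketch of the aggregation, the mean pass rates and their difference can be computed from the per-case `pass_rate` values. The key names in the returned dictionary are assumptions, since the layout of `benchmark.json` is not specified here.

```python
from statistics import mean


def build_benchmark(with_rates: list[float], without_rates: list[float]) -> dict:
    """Summarize pass rates for both configurations and the gain from the skill."""
    with_mean = mean(with_rates)
    without_mean = mean(without_rates)
    return {
        # Hypothetical key names; adjust to the benchmark.json layout in use.
        "with_skill_mean": with_mean,
        "without_skill_mean": without_mean,
        "delta": with_mean - without_mean,
    }
```

A positive `delta` means the skill improved the mean pass rate. `statistics.mean` raises `StatisticsError` on an empty list, so an iteration with no graded runs fails loudly instead of reporting a misleading zero.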

### 6. Present results

Show per-eval pass rates, the overall delta, assertions that always pass
(candidates for removal), and assertions that always fail (candidates for
revision). Save feedback to `feedback.json`.
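
One possible reading of "always-pass" and "always-fail" is an assertion whose outcome is the same with and without the skill, so it does not measure the skill's contribution. The sketch below uses that interpretation, which is an assumption; `classify_assertions` is a hypothetical name, and each argument maps assertion text to its pass or fail result.

```python
def classify_assertions(
    with_skill: dict[str, bool], without_skill: dict[str, bool]
) -> dict[str, list[str]]:
    """Find assertions whose outcome is identical in both configurations."""
    # Only assertions graded in both configurations can be compared.
    shared = sorted(set(with_skill) & set(without_skill))
    return {
        "always_pass": [t for t in shared if with_skill[t] and without_skill[t]],
        "always_fail": [t for t in shared if not with_skill[t] and not without_skill[t]],
    }
```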

## Iteration loop

Update SKILL.md based on the findings, run a new iteration, and compare
benchmarks. Stop when pass rates plateau.
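
The skill does not define "plateau". One hedged way to make the stopping rule concrete is to stop once each of the last few iterations improved the with-skill mean pass rate by less than a small threshold. The function name, the `min_gain` default of 0.02, and the `window` default of 2 are all illustrative assumptions.

```python
def has_plateaued(pass_rates: list[float], min_gain: float = 0.02, window: int = 2) -> bool:
    """True if each of the last `window` iterations gained less than `min_gain`."""
    if len(pass_rates) <= window:
        return False  # not enough history to judge
    recent = pass_rates[-(window + 1):]
    return all(later - earlier < min_gain for earlier, later in zip(recent, recent[1:]))
```

For example, `has_plateaued([0.5, 0.7, 0.71, 0.71])` returns `True`, while `has_plateaued([0.5, 0.7, 0.9])` returns `False` because the rate is still climbing.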

Related Skills

  • write-tests (JetBrains/databao-cli, 16 stars): Write or update unit tests for changed code, following project conventions and ensuring coverage meets the 80% threshold.
  • update-pr (JetBrains/databao-cli, 16 stars): Stage, commit, and push follow-up changes to an existing feature branch or PR. Use for quick iterations.
  • setup-environment (JetBrains/databao-cli, 16 stars): Set up or verify the local development environment. Use when starting work in a fresh clone or new machine, when commands fail with missing dependencies or broken imports, or before running `make check`/`make test` for the first time in a session.
  • review-architecture (JetBrains/databao-cli, 16 stars): Review architecture quality, maintainability, and developer experience.
  • make-yt-issue (JetBrains/databao-cli, 16 stars): Ensure a YouTrack issue exists before starting work. Validates existing tickets or creates new ones.
  • local-code-review (JetBrains/databao-cli, 16 stars): Review local code changes for correctness, regressions, missing tests, and Databao-specific risks.
  • create-pr (JetBrains/databao-cli, 16 stars): Stage, commit, push, and open a GitHub PR following project conventions. Use when code is ready to ship.
  • create-branch (JetBrains/databao-cli, 16 stars): Create a feature branch following project naming conventions. Use when starting work on a ticket, after understanding the scope, or when the agent needs to branch off main for new work.
  • check-pr-comments (JetBrains/databao-cli, 16 stars): Fetch unresolved PR review threads, triage them, implement fixes, validate, reply in-thread, and resolve.
  • check-coverage (JetBrains/databao-cli, 16 stars): Run test coverage measurement, analyze results, and fix gaps when coverage falls below the 80% threshold.
  • autosteer (JetBrains/databao-cli, 16 stars): Run the full development pipeline autonomously without pausing between phases. Stops only on quality-gate failures.
  • swe-cli-skills (SylphAI-Inc/skills, 12 stars): Senior engineer CLI expertise for AI agents, covering workflows, safety guardrails, gotchas, and anti-patterns across cloud, IaC, containers, databases, dev tools, and platforms.