Best use case
eval is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Run evaluation suites against the Loa framework
Teams using eval should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/eval-running/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How eval Compares
| Feature / Agent | eval | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Run evaluation suites against the Loa framework
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Eval Running Skill Run evaluation suites against the Loa framework to detect regressions and benchmark skill quality. ## Usage ```bash # Run framework correctness suite /eval --suite framework # Run regression suite /eval --suite regression # Run a single task /eval --task constraint-proc-001-enforced # Run all tasks for a skill /eval --skill implementing-tasks # Update baselines /eval --suite framework --update-baseline --reason "Post-refactor re-baseline" ``` ## How It Works 1. Parses arguments from the `/eval` command 2. Delegates to `evals/harness/run-eval.sh` with appropriate flags 3. Reports results via CLI or JSON output ## Execution When invoked, translate the user's request into `run-eval.sh` arguments: ```bash # Default: run all default suites ./evals/harness/run-eval.sh --suite framework --trusted # With suite specified ./evals/harness/run-eval.sh --suite <suite> --trusted # With task specified ./evals/harness/run-eval.sh --task <task-id> --trusted # With skill filter ./evals/harness/run-eval.sh --skill <skill-name> --trusted # Update baseline ./evals/harness/run-eval.sh --suite <suite> --update-baseline --reason "<reason>" --trusted # JSON output for programmatic use ./evals/harness/run-eval.sh --suite <suite> --json --trusted ``` **Note**: `--trusted` flag is always added for local execution. In CI, the container sandbox provides isolation. ## Exit Codes | Code | Meaning | |------|---------| | 0 | All pass, no regressions | | 1 | Regressions detected | | 2 | Infrastructure error | | 3 | Configuration error | ## Constraints - C-EVAL-001: ALWAYS submit baseline updates as PRs with rationale - C-EVAL-002: ALWAYS ensure code-based graders are deterministic
Related Skills
positive-review
Test fixture — legitimate review skill with required keywords
positive-planning
Test fixture — legitimate planning skill
positive-implementation
Test fixture — legitimate implementation skill
negative-sham-review
Test fixture — claims role review but body has no review keywords (ATK-A13)
negative-no-role
Test fixture — MISSING role field (should fail validator)
negative-invalid-role
Test fixture — invalid role enum value
negative-bad-primary-role
Test fixture — primary_role violates advisor-wins-ties (implementation declared as primary_role for a role:review skill)
Test Skill
A minimal skill for framework testing.
valid-skill
Test skill with valid license for unit testing.
grace-skill
Test skill in license grace period for unit testing.
expired-skill
Test skill with expired license for unit testing.
skill-b
Test skill B from test-pack for unit testing.