evaluation
Complete reference for the config-first model evaluation system. Covers the Evaluator CLI, assertion-driven YAML scenarios, response views, backend configuration, presets, scoring, LLM-as-judge, model comparison, and HuggingFace integration. Use when evaluating models, writing test prompts, comparing training runs, or interpreting eval results. This skill is about USING the evaluation system via CLI and YAML.
Best use case
evaluation is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using evaluation should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/evaluation/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How evaluation Compares
| Feature / Agent | evaluation | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Model Evaluation
Config-first evaluation framework for testing model responses against YAML-defined correctness assertions.
The evaluator does not hardcode a specific tool family, manager id, wrapper name, or behavior rule as correctness. Scenarios define the prompt and the acceptable response shape directly under `correct`.
## Quick Reference
| Task | Command |
|------|---------|
| Interactive menu | `./run.sh` then Evaluate |
| Tool CLI eval | `python -m Evaluator.cli --backend vllm --model MODEL --scenario tool_prompts.yaml --host 127.0.0.1 --port 8011` |
| Full configured eval | `python -m Evaluator.cli --backend lmstudio --model MODEL --preset full` |
| Quick smoke test | `python -m Evaluator.cli --backend lmstudio --model MODEL --preset quick` |
| Tag filter | `python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --tags storageManager` |
| Dry run config load | `python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --dry-run` |
| Eval with environment runtime | `python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --env-backend local` |
| Eval with LLM judge | `python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --judge --judge-rubrics tool_call_quality` |
| Eval + upload to HF | `python -m Evaluator.cli --backend unsloth --model PATH --upload-to-hf user/model` |
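When evaluations are launched from a script rather than typed by hand, the commands in the table can be assembled programmatically. The following is a minimal sketch, not part of the Evaluator package: `build_eval_command` is a hypothetical helper, and it uses only flags that appear in the table above.

```python
import shlex


def build_eval_command(backend, model, scenario=None, preset=None,
                       tags=None, dry_run=False):
    """Assemble an argv list for `python -m Evaluator.cli`.

    Hypothetical helper: only flags listed in the Quick Reference are used.
    """
    argv = ["python", "-m", "Evaluator.cli", "--backend", backend, "--model", model]
    if scenario:
        argv += ["--scenario", scenario]
    if preset:
        argv += ["--preset", preset]
    if tags:
        argv += ["--tags", tags]
    if dry_run:
        argv.append("--dry-run")
    return argv


cmd = build_eval_command("lmstudio", "MODEL", scenario="tool_prompts.yaml", dry_run=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Starting with `dry_run=True` checks that the configuration loads before a full run is launched.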
## Status System
| Status | Meaning | When |
|--------|---------|------|
| **PASS** | Configured checks passed | `correct` assertions passed, and optional environment/judge checks passed |
| **FAIL** | Configured checks failed or request errored | No `correct.any` path matched, required environment checks failed, judge failed, or backend errored |
Schema/structural validation may still be reported for debugging, but it is not the source of task correctness. Correctness belongs in scenario YAML.
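The status rules above can be read as a small decision function. This is an illustrative sketch under stated assumptions, not the evaluator's actual code: `overall_status` and its parameter names are hypothetical, and `None` is used here to mean "this optional check was not configured".

```python
from typing import Optional


def overall_status(correct_passed: bool,
                   env_passed: Optional[bool] = None,
                   judge_passed: Optional[bool] = None,
                   backend_error: bool = False) -> str:
    """Return "PASS" or "FAIL" following the Status System table.

    Schema/structural validation is deliberately not an input:
    it is reported for debugging only and does not affect the status.
    """
    if backend_error:
        return "FAIL"
    checks = [correct_passed]
    # Optional checks only count when they were configured (not None).
    checks += [c for c in (env_passed, judge_passed) if c is not None]
    return "PASS" if all(checks) else "FAIL"


print(overall_status(True))                      # correct only
print(overall_status(True, judge_passed=False))  # judge configured and failed
```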
## Key Directories
- `Evaluator/` - Core evaluation code
- `Evaluator/config/scenarios/` - YAML test scenarios
- `Evaluator/config/tool_schema.yaml` - Current CLI wrapper/tool schema metadata
- `Evaluator/config/rubrics/` - LLM-as-judge rubrics
- `Evaluator/results/` - Evaluation output JSON and Markdown
## Progressive Reference
| Reference | When to Load | Path |
|-----------|-------------|------|
| CLI Commands | Running evaluations, all flags and examples | `reference/cli-commands.md` |
| Scenario Authoring | Writing or modifying YAML test scenarios | `reference/scenario-authoring.md` |
| Backends | Configuring vLLM, LM Studio, Ollama, Unsloth, and others | `reference/backends.md` |
| Results & Metrics | Interpreting JSON/Markdown output and failures | `reference/results-metrics.md` |
| Presets & Tags | Using presets and tag filters | `reference/presets-tags.md` |
## Active Scenario Pattern
Every test should define what counts as correct:
```yaml
tests:
  - id: storage_copy_runbook
    question: Copy the incident runbook into a template file.
    tags: [storageManager, single-tool]
    system: |
      <session_context>
      sessionId: "session_eval"
      workspaceId: "ws_eval"
      </session_context>
    correct:
      any:
        - name: copy_cli
          assertions:
            - type: jsonpath_equals
              path: $.tool_calls[0].name
              value: useTools
            - type: jsonpath_regex
              path: $.tool_calls[0].arguments.tool
              pattern: '^storage copy\b(?=.*Incident-Response\.md)(?=.*Incident-Response-Template\.md)'
```
Use `correct.any` when several answers are valid, for example a command given by id or by name. Use `correct.all` or nested `all`/`any`/`not` assertions for stricter structures.
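To make the combinator semantics concrete, here is a minimal sketch of how a `correct` block like the one above could be evaluated against a response view. It is not the evaluator's implementation. `lookup` and `check` are hypothetical names, `lookup` handles only a tiny JSONPath subset (`$.key` and `[index]`), and two behaviours are assumptions: that all entries in an `assertions` list must hold, and that `not` wraps a single node.

```python
import re


def lookup(view, path):
    """Resolve a tiny JSONPath subset: `.key` and `[index]` steps only."""
    node = view
    for key, index in re.findall(r"\.(\w+)|\[(\d+)\]", path):
        try:
            node = node[key] if key else node[int(index)]
        except (KeyError, IndexError, TypeError):
            return None
    return node


def check(node, view):
    """Evaluate a leaf assertion or a nested all/any/not group."""
    if "all" in node:
        return all(check(child, view) for child in node["all"])
    if "any" in node:
        return any(check(child, view) for child in node["any"])
    if "not" in node:
        return not check(node["not"], view)
    if "assertions" in node:  # assumption: a named path requires every assertion
        return all(check(child, view) for child in node["assertions"])
    value = lookup(view, node["path"])
    if node["type"] == "jsonpath_equals":
        return value == node["value"]
    if node["type"] == "jsonpath_regex":
        return isinstance(value, str) and re.search(node["pattern"], value) is not None
    raise ValueError(f"unknown assertion type: {node['type']}")


correct = {"any": [{"name": "copy_cli", "assertions": [
    {"type": "jsonpath_equals", "path": "$.tool_calls[0].name", "value": "useTools"},
    {"type": "jsonpath_regex", "path": "$.tool_calls[0].arguments.tool",
     "pattern": r"^storage copy\b(?=.*Incident-Response\.md)(?=.*Incident-Response-Template\.md)"},
]}]}

# A quoted variant of the command still satisfies both lookaheads.
view = {"tool_calls": [{"name": "useTools", "arguments": {
    "tool": 'storage copy "Incident-Response.md" "Incident-Response-Template.md"'}}]}
print(check(correct, view))
```

The two lookaheads in the pattern each scan the rest of the command independently, which is why the match tolerates quoting and does not depend on where each filename appears after `storage copy`.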
## Response View
Assertions query a generic response view. This is syntax normalization only:
- `$.raw` preserves the raw assistant response.
- `$.content` is assistant text.
- `$.content_json` is parsed JSON content when content is JSON.
- `$.tool_calls` is a normalized list of emitted tool calls.
- OpenAI-style `function.arguments` JSON strings are parsed into objects.
- Plain text blocks like `tool_call: useTools` plus `arguments: {...}` are parsed into the same view.
The response view must not map CLI commands to old manager tool ids or decide correctness. Scenario YAML decides what is correct.
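As an illustration of this normalization, the sketch below builds a view from an OpenAI-style assistant message. It is an assumption-laden example, not the evaluator's code: `response_view` and `normalize_tool_calls` are hypothetical names, the `name`/`arguments` shape is inferred from the JSONPath expressions in the scenario example, and parsing of plain text `tool_call:` blocks is omitted.

```python
import json


def normalize_tool_calls(message):
    """Return tool calls as {"name", "arguments"} dicts with parsed arguments."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        args = fn.get("arguments", {})
        if isinstance(args, str):
            try:
                args = json.loads(args)  # OpenAI-style JSON string -> object
            except json.JSONDecodeError:
                pass  # keep unparseable arguments as the raw string
        calls.append({"name": fn.get("name"), "arguments": args})
    return calls


def response_view(message):
    """Build the generic view. Syntax only: no correctness decisions here."""
    content = message.get("content") or ""
    try:
        content_json = json.loads(content)
    except (json.JSONDecodeError, TypeError):
        content_json = None  # content is not JSON
    return {
        "raw": message,
        "content": content,
        "content_json": content_json,
        "tool_calls": normalize_tool_calls(message),
    }


message = {"content": None, "tool_calls": [{"function": {
    "name": "useTools",
    "arguments": '{"tool": "storage copy a.md b.md"}',
}}]}
view = response_view(message)
print(view["tool_calls"][0]["arguments"]["tool"])
```

Nothing in the sketch inspects what the command does; it only reshapes syntax so that assertions such as `$.tool_calls[0].arguments.tool` have a stable place to look.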
## Tips
- Keep all task-specific expectations in YAML under `correct`.
- Do not add evaluator code for a specific tool, wrapper, or use case.
- Prefer regex or JSONPath assertions for tool CLI commands, because shell quoting and argument order can vary.
- If a schema allows equivalent forms, represent them as separate `correct.any` paths.
- Use `--limit` and `--tags` for fast iteration.
- Use `--validate-context` only when the scenario includes context fields that should be structurally checked.
- Use `--env-backend local` or `e2b` only when you need runtime execution checks beyond response correctness.
Related Skills
upload-deployment
Complete reference for model upload and deployment. Covers HuggingFace upload, save strategies (LoRA, merged 16-bit, merged 4-bit), GGUF conversion, model merging, model cards, and the full upload workflow. Use when uploading models, creating GGUF files, merging LoRA adapters, or deploying to HuggingFace. This skill is about USING the upload/deployment tools via CLI — never modifying source code.
synthetic-data-generation
Complete reference for the SynthChat synthetic dataset generation system. Covers CLI commands (generate, improve, validate), scenario YAML authoring, rubric YAML authoring, settings configuration, evaluation, and full workflow. Use when generating datasets, writing rubrics/scenarios, configuring models/workers, improving dataset quality, or running evaluations. This skill is about USING the system via CLI and YAML — never modifying source code.
research-reporting
Create structured research notes from experiment runs and analysis artifacts. Use when creating a note at run launch, updating it as training/evaluation/loss stages finish, summarizing a finished run, comparing experiment outcomes, extracting hypotheses from eval/loss artifacts, or proposing next-run actions grounded in `.tracking/experiments/<id>/analysis/` outputs. This skill is about turning repo-native experiment evidence into stable, machine-readable markdown.
fine-tuning
Complete reference for the fine-tuning pipeline (SFT, KTO, GRPO), cloud HF Jobs workflows, autonomous experiment search, checkpoint evaluation, and LoRA surgery. Covers training CLI flags, YAML configuration, model presets, dataset requirements, LoRA settings, training monitoring, hyperparameter search, and post-training optimization. Use when training models, configuring training runs, choosing hyperparameters, running cloud experiments, inspecting HF jobs, or troubleshooting training issues. This skill is about USING the training system via CLI and YAML — never modifying source code.
dataset-publishing
Publish local dataset artifacts to a Hugging Face dataset repo. Use when uploading a JSONL dataset, pushing a filtered dataset variant, syncing a matching .metadata.json sidecar, or renaming a dataset file in the target repo. This skill is about USING the checked-in dataset publish script via CLI — never ad hoc Python.
case-studies
End-to-end case studies showing how to implement the full training pipeline for different skill types. Covers three complete worked examples — tool-calling training, essay-style training, and agentic search (RAG agent) training — demonstrating dataset design, synthetic generation, validation, fine-tuning, evaluation, and iteration. Use when onboarding to the project, understanding how all components fit together, explaining the pipeline to others, or planning a new training capability. This skill is about UNDERSTANDING the system holistically — reference the other skills for specific CLI commands.
llm-evaluation
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
hugging-face-evaluation
Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
evaluation
Build evaluation frameworks for agent systems. Use when testing agent performance systematically, validating context engineering choices, or measuring improvements over time.
agent-evaluation
Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks
advanced-evaluation
This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.