agentv-eval-builder

Create and maintain AgentV YAML evaluation files for testing AI agent performance. Use this skill when creating new eval files, adding eval cases, or configuring custom evaluators (code validators or LLM judges) for agent testing workflows.

16 stars

by diegosouzapw


Best use case

agentv-eval-builder is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using agentv-eval-builder should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/agentv-eval-builder/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/ai-agents/agentv-eval-builder/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/agentv-eval-builder/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How agentv-eval-builder Compares

| Feature / Agent | agentv-eval-builder | Standard Approach |
|-----------------|---------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

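What does this skill do?

It creates and maintains AgentV YAML evaluation files for testing AI agents: new eval files, additional eval cases, and custom evaluators (code validators or LLM judges).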

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# AgentV Eval Builder

## Schema Reference
- Schema: `references/eval-schema.json` (JSON Schema for validation and tooling)
- Format: YAML or JSONL (see below)
- Examples: `references/example-evals.md`

## Feature Reference
- Rubrics: `references/rubric-evaluator.md` - Structured criteria-based evaluation
- Composite Evaluators: `references/composite-evaluator.md` - Combine multiple evaluators
- Tool Trajectory: `references/tool-trajectory-evaluator.md` - Validate agent tool usage
- Structured Data + Metrics: `references/structured-data-evaluators.md` - `field_accuracy`, `latency`, `cost`
- Custom Evaluators: `references/custom-evaluators.md` - Code and LLM judge templates
- Batch CLI: `references/batch-cli-evaluator.md` - Evaluate batch runner output (JSONL)
- Compare: `references/compare-command.md` - Compare evaluation results between runs

## Structure Requirements
- Root level: `description` (optional), `execution` (with `target`), `evalcases` (required)
- Eval case fields: `id` (required), `expected_outcome` (required), `input_messages` or `input` (required)
- Optional fields: `expected_messages` (or `expected_output`), `conversation_id`, `rubrics`, `execution`
- `expected_messages` is optional - omit it for outcome-only evaluation, where the LLM judge scores the response against the `expected_outcome` criteria alone
- Message fields: `role` (required), `content` (required)
- Message roles: `system`, `user`, `assistant`, `tool`

## Input/Output Shorthand (Aliases)

For simpler eval cases, use shorthand aliases instead of the verbose `input_messages` and `expected_messages`:

| Alias | Canonical | Description |
|-------|-----------|-------------|
| `input` | `input_messages` | String expands to single user message |
| `expected_output` | `expected_messages` | String/object expands to single assistant message |

**String shorthand:**
```yaml
evalcases:
  - id: simple-test
    expected_outcome: Correct answer
    input: "What is 2+2?"                    # Expands to [{role: user, content: "..."}]
    expected_output: "The answer is 4"       # Expands to [{role: assistant, content: "..."}]
```

**Object shorthand** (for structured output validation):
```yaml
evalcases:
  - id: structured-output
    expected_outcome: Risk assessment
    input: "Analyze this transaction"
    expected_output:                          # Expands to assistant message with object content
      riskLevel: High
      confidence: 0.95
```

**Array syntax** still works for multi-message conversations:
```yaml
input:
  - role: system
    content: "You are a calculator"
  - role: user
    content: "What is 2+2?"
```

**Precedence:** Canonical names (`input_messages`, `expected_messages`) take precedence when both are specified.
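
The expansion and precedence rules above can be sketched in a few lines of Python. This is an illustration only, not AgentV's actual implementation: `expand_messages` and `normalize_case` are hypothetical helper names, and the sketch assumes an eval case has already been parsed into a plain dict.

```python
from typing import Any


def expand_messages(value: Any, role: str) -> list[dict[str, Any]]:
    """Expand a shorthand value into a list of message dicts."""
    if isinstance(value, list):
        # Array syntax is already a list of messages; pass it through.
        return value
    # A string (or an object, for expected_output) becomes a single message.
    return [{"role": role, "content": value}]


def normalize_case(case: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the eval case that uses only canonical field names."""
    out = {k: v for k, v in case.items() if k not in ("input", "expected_output")}
    # Canonical names win when both forms are present.
    if "input_messages" not in out and "input" in case:
        out["input_messages"] = expand_messages(case["input"], "user")
    if "expected_messages" not in out and "expected_output" in case:
        out["expected_messages"] = expand_messages(case["expected_output"], "assistant")
    return out
```

Calling `normalize_case({"id": "simple-test", "input": "What is 2+2?"})` yields a case whose `input_messages` holds one `user` message with that string as its content.
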
- Content types: `text` (inline), `file` (relative or absolute path)
- Attachments (type: `file`) should default to the `user` role
- File paths: Relative (from eval file dir) or absolute with "/" prefix (from repo root)
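
As a rough sketch of the file path rule, the Python below resolves a `file` value against either the eval file's directory or the repo root. The function name `resolve_file_ref` and its parameters are hypothetical, and how AgentV locates the repo root is not covered here; the caller is assumed to supply it.

```python
from pathlib import Path


def resolve_file_ref(value: str, eval_file: Path, repo_root: Path) -> Path:
    """Resolve a `file` content value to a filesystem path."""
    if value.startswith("/"):
        # A leading "/" means "from the repo root", not the filesystem root.
        return repo_root / value.lstrip("/")
    # Anything else is relative to the directory that contains the eval file.
    return eval_file.parent / value
```

Under these assumptions, `/prompts/python.instructions.md` referenced from `/repo/evals/a.yaml` with repo root `/repo` resolves to `/repo/prompts/python.instructions.md`, while `./code.py` resolves next to the eval file.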

## JSONL Format

For large-scale evaluations, use JSONL (one eval case per line) instead of YAML:

**dataset.jsonl:**
```jsonl
{"id": "test-1", "expected_outcome": "Correct answer", "input_messages": [{"role": "user", "content": "What is 2+2?"}]}
{"id": "test-2", "expected_outcome": "Clear explanation", "input_messages": [{"role": "user", "content": [{"type": "text", "value": "Review this"}, {"type": "file", "value": "./code.py"}]}]}
```

**dataset.yaml (optional sidecar for defaults):**
```yaml
description: My dataset
dataset: my-tests
execution:
  target: azure_base
evaluator: llm_judge
```

Benefits: Git-friendly diffs, streaming-compatible, easy programmatic generation.
Per-case fields override sidecar defaults. See `examples/features/basic-jsonl/` for complete example.
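
To illustrate the override rule, here is a minimal Python sketch that reads JSONL text and layers each case over sidecar defaults. It is not AgentV's loader: `load_cases` is a hypothetical name, the sidecar is assumed to be parsed into a dict already, and the shallow merge (a per-case `execution` replaces the default `execution` wholesale) is an assumption.

```python
import json
from typing import Any, Iterator


def load_cases(jsonl_text: str, defaults: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield one eval case per non-empty JSONL line, layered over sidecar defaults."""
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        # Shallow merge (assumption): per-case fields override sidecar defaults.
        yield {**defaults, **case}


defaults = {"execution": {"target": "azure_base"}, "evaluator": "llm_judge"}
text = (
    '{"id": "test-1", "expected_outcome": "Correct answer"}\n'
    '{"id": "test-2", "expected_outcome": "Clear explanation", '
    '"execution": {"target": "default"}}\n'
)
cases = list(load_cases(text, defaults))
```

Here `test-1` inherits the `azure_base` target from the defaults, while `test-2` keeps its own `default` target. Because the loader is a generator, large datasets can be processed one line at a time.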

## Custom Evaluators

Configure multiple evaluators per eval case via the `execution.evaluators` array.

### Code Evaluators
Scripts that validate output programmatically:

```yaml
execution:
  evaluators:
    - name: json_format_validator
      type: code_judge
      script: uv run validate_output.py
      cwd: ../../evaluators/scripts
```

**Contract:**
- Input (stdin): JSON with `question`, `expected_outcome`, `reference_answer`, `candidate_answer`, `guideline_files`, `input_files`, `input_messages`, `expected_messages`, `output_messages`, `trace_summary`
- Output (stdout): JSON with `score` (0.0-1.0), `hits`, `misses`, `reasoning`
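
A minimal Python sketch of a script that follows this contract is shown below. It only checks that `candidate_answer` parses as a JSON object; the check itself, the helper names, and the assumption that `hits` and `misses` are lists of strings are illustrative. See `references/custom-evaluators.md` for the maintained templates.

```python
import json
import sys
from typing import Any


def result(score: float, hits: list[str], misses: list[str], reasoning: str) -> dict[str, Any]:
    """Build the output object described in the contract."""
    return {"score": score, "hits": hits, "misses": misses, "reasoning": reasoning}


def evaluate(payload: dict[str, Any]) -> dict[str, Any]:
    """Score 1.0 if candidate_answer parses as a JSON object, else 0.0."""
    candidate = payload.get("candidate_answer", "")
    try:
        parsed = json.loads(candidate)
    except (TypeError, json.JSONDecodeError):
        return result(0.0, [], ["Output is not valid JSON"], "candidate_answer failed to parse")
    if not isinstance(parsed, dict):
        return result(0.0, [], ["Output is not a JSON object"], "Parsed value is not an object")
    return result(1.0, ["Output is a valid JSON object"], [], "candidate_answer parsed cleanly")


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Read the evaluation payload from stdin and write the result to stdout."""
    raw = stdin.read()
    payload = json.loads(raw) if raw.strip() else {}
    json.dump(evaluate(payload), stdout)


if __name__ == "__main__":
    main()
```

Saved as `validate_output.py`, a script like this could be referenced from the `script` field in the YAML above.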

**Target Proxy:** Code evaluators can access an LLM through the target proxy for sophisticated evaluation logic (e.g., Contextual Precision, semantic similarity). Enable with `target: {}`:

```yaml
execution:
  evaluators:
    - name: contextual_precision
      type: code_judge
      script: bun run evaluate.ts
      target: {}           # Enable target proxy (max_calls: 50 default)
```

**RAG Evaluation Pattern:** For retrieval-based evals, pass retrieval context via `expected_messages.tool_calls`:

```yaml
expected_messages:
  - role: assistant
    tool_calls:
      - tool: vector_search
        output:
          results: ["doc1", "doc2", "doc3"]
```

**TypeScript evaluators:** Keep `.ts` source files and run them via Node-compatible loaders such as `npx --yes tsx` so global `agentv` installs stay portable.

**Templates:** See `references/custom-evaluators.md` for complete Python and TypeScript templates, target proxy usage, and command examples.

### LLM Judges
Language models evaluate response quality:

```yaml
execution:
  evaluators:
    - name: content_evaluator
      type: llm_judge
      prompt: /evaluators/prompts/correctness.md
      model: gpt-5-chat
```

### Tool Trajectory Evaluators
Validate agent tool usage patterns (requires `output_messages` with `tool_calls` from provider):

```yaml
execution:
  evaluators:
    - name: research_check
      type: tool_trajectory
      mode: any_order       # Options: any_order, in_order, exact
      minimums:             # For any_order mode
        knowledgeSearch: 2
      expected:             # For in_order/exact modes
        - tool: knowledgeSearch
        - tool: documentRetrieve
```

See `references/tool-trajectory-evaluator.md` for modes and configuration.
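
The three modes can be pictured with the Python sketch below. The semantics are an assumption inferred from the mode names and the comments in the YAML above (`any_order` checks minimum call counts, `in_order` checks that the expected tools appear as a subsequence, `exact` requires an identical sequence); the reference file is authoritative. `check_trajectory` is a hypothetical name, and the real evaluator reports a score rather than a boolean.

```python
from collections import Counter
from typing import Optional


def check_trajectory(
    called: list[str],
    mode: str,
    minimums: Optional[dict[str, int]] = None,
    expected: Optional[list[dict[str, str]]] = None,
) -> bool:
    """Return True if the list of called tool names satisfies the given mode."""
    if mode == "any_order":
        # Each listed tool must be called at least its minimum number of times.
        counts = Counter(called)
        return all(counts[tool] >= n for tool, n in (minimums or {}).items())
    wanted = [step["tool"] for step in (expected or [])]
    if mode == "exact":
        # The calls must match the expected sequence one for one.
        return called == wanted
    if mode == "in_order":
        # Subsequence check: extra calls are allowed between expected ones.
        remaining = iter(called)
        return all(tool in remaining for tool in wanted)
    raise ValueError(f"unknown mode: {mode}")
```

With calls `knowledgeSearch`, `knowledgeSearch`, `documentRetrieve`, this sketch passes the `any_order` minimum of 2 and the `in_order` expectation above, but fails `exact` because of the extra search call.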

### Multiple Evaluators
Define multiple evaluators to run sequentially. The final score is a weighted average of all results.

```yaml
execution:
  evaluators:
    - name: format_check      # Runs first
      type: code_judge
      script: uv run validate_json.py
    - name: content_check     # Runs second
      type: llm_judge
```

### Rubric Evaluator
Inline rubrics for structured criteria-based evaluation:

```yaml
evalcases:
  - id: explanation-task
    expected_outcome: Clear explanation of quicksort
    input_messages:
      - role: user
        content: Explain quicksort
    rubrics:
      - Mentions divide-and-conquer approach
      - Explains the partition step
      - id: complexity
        description: States time complexity correctly
        weight: 2.0
        required: true
```

See `references/rubric-evaluator.md` for detailed rubric configuration.

### Composite Evaluator
Combine multiple evaluators with aggregation:

```yaml
execution:
  evaluators:
    - name: release_gate
      type: composite
      evaluators:
        - name: safety
          type: llm_judge
          prompt: ./prompts/safety.md
        - name: quality
          type: llm_judge
          prompt: ./prompts/quality.md
      aggregator:
        type: weighted_average
        weights:
          safety: 0.3
          quality: 0.7
```

See `references/composite-evaluator.md` for aggregation types and patterns.
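
For intuition, the `weighted_average` aggregation in the example can be sketched in Python as follows. This is a plain weighted mean and may differ from AgentV's implementation in details; treating a missing weight as 1.0 and returning 0.0 when all weights are zero are assumptions of the sketch.

```python
def weighted_average(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine named evaluator scores into one score using per-evaluator weights."""
    # Assumption: an evaluator with no configured weight counts as 1.0.
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(score * weights.get(name, 1.0) for name, score in scores.items())
    return weighted_sum / total_weight


combined = weighted_average(
    {"safety": 1.0, "quality": 0.5},
    {"safety": 0.3, "quality": 0.7},
)
```

With the weights from the YAML above, a perfect `safety` score and a `quality` score of 0.5 combine to roughly 0.65, so `quality` dominates the result.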

### Batch CLI Evaluation
Evaluate external batch runners that process all evalcases in one invocation:

```yaml
description: Batch CLI evaluation
execution:
  target: batch_cli

evalcases:
  - id: case-001
    expected_outcome: Returns decision=CLEAR
    expected_messages:
      - role: assistant
        content:
          decision: CLEAR
    input_messages:
      - role: user
        content:
          row:
            id: case-001
            amount: 5000
    execution:
      evaluators:
        - name: decision-check
          type: code_judge
          script: bun run ./scripts/check-output.ts
          cwd: .
```

**Key pattern:**
- Batch runner reads eval YAML via `--eval` flag, outputs JSONL keyed by `id`
- Each evalcase has its own evaluator to validate its corresponding output
- Use structured `expected_messages.content` for expected output fields

See `references/batch-cli-evaluator.md` for full implementation guide.
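
The key pattern can be sketched in Python as below: index the runner's JSONL output by `id`, then compare one case's expected fields against its record. The flat record shape (`{"id": ..., "decision": ...}`) and both helper names are hypothetical; the actual output format is described in the reference guide.

```python
import json
from typing import Any


def index_by_id(jsonl_text: str) -> dict[str, dict[str, Any]]:
    """Index batch runner output records by their `id` field."""
    records = (json.loads(line) for line in jsonl_text.splitlines() if line.strip())
    return {record["id"]: record for record in records}


def mismatched_fields(expected: dict[str, Any], actual: dict[str, Any]) -> list[str]:
    """Return expected field names whose values are absent or different in the output."""
    return [key for key, value in expected.items() if actual.get(key) != value]


# Hypothetical runner output: one JSON object per evalcase.
output = '{"id": "case-001", "decision": "CLEAR"}\n{"id": "case-002", "decision": "REVIEW"}\n'
by_id = index_by_id(output)
problems = mismatched_fields({"decision": "CLEAR"}, by_id["case-001"])
```

An empty `problems` list means the record for `case-001` matches the structured `expected_messages.content` from the example above.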

## Example
```yaml
description: Example showing basic features and conversation threading
execution:
  target: default

evalcases:
  - id: code-review-basic
    expected_outcome: Assistant provides helpful code analysis
    
    input_messages:
      - role: system
        content: You are an expert code reviewer.
      - role: user
        content:
          - type: text
            value: |-
              Review this function:
              
              ```python
              def add(a, b):
                  return a + b
              ```
          - type: file
            value: /prompts/python.instructions.md
    
    expected_messages:
      - role: assistant
        content: |-
          The function is simple and correct. Suggestions:
          - Add type hints: `def add(a: int, b: int) -> int:`
          - Add docstring
          - Consider validation for edge cases
```
