evalview-agent-testing

Regression testing for AI agents using EvalView. Snapshot agent behavior, detect regressions in tool calls and output quality, and block broken agents before production.

Best use case

evalview-agent-testing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using evalview-agent-testing can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/evalview-agent-testing/SKILL.md --create-dirs "https://raw.githubusercontent.com/affaan-m/everything-claude-code/main/skills/evalview-agent-testing/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/evalview-agent-testing/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How evalview-agent-testing Compares

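| Feature / Agent | evalview-agent-testing | Standard Approach |
|-----------------|------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |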

Frequently Asked Questions

Where can I find the source code?

The source is on GitHub in the affaan-m/everything-claude-code repository, under skills/evalview-agent-testing/ (the same path used by the download URL in the Installation section).

SKILL.md Source

# EvalView Agent Testing

Automated regression testing for AI agents. EvalView snapshots your agent's behavior (tool calls, parameters, sequence, output), then diffs against the baseline after every change. When something breaks, you know immediately — before it ships.
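
To make the snapshot-and-diff idea concrete, here is a minimal sketch of comparing a recorded tool-call sequence against a baseline. It is not EvalView's implementation: the `Snapshot` class and `diff_tool_calls` function are hypothetical names invented for this illustration, and only Python's standard `difflib` is used.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical record of one agent run: ordered tool calls plus final output."""
    tool_calls: list[str] = field(default_factory=list)
    output: str = ""


def diff_tool_calls(baseline: Snapshot, current: Snapshot) -> list[str]:
    """Return unified-diff lines for the tool sequence; empty list if identical."""
    return list(
        difflib.unified_diff(
            baseline.tool_calls,
            current.tool_calls,
            fromfile="baseline",
            tofile="current",
            lineterm="",
        )
    )


baseline = Snapshot(
    ["lookup_order", "check_refund_policy", "issue_refund"], "Refund processed."
)
current = Snapshot(["lookup_order", "issue_refund"], "Refund processed.")

# Lines starting with "-" are calls the baseline made that the current run skipped.
for line in diff_tool_calls(baseline, current):
    print(line)
```

Here the current run skips `check_refund_policy`, so the diff contains the line `-check_refund_policy`. A real tool would also compare parameters and output, which this sketch leaves out.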

## When to Activate

- After modifying agent code, prompts, or tool definitions
- After a model update or provider change
- Before deploying an agent to production
- When setting up CI/CD for an agent project
- When an autonomous loop (OpenClaw, coding agents) needs a fitness function
- When agent output changes unexpectedly and you need to identify what shifted

## Core Workflow

```bash
# 1. Set up
pip install "evalview>=0.5,<1"
evalview init              # Detect agent, create starter test suite

# 2. Baseline
evalview snapshot           # Save current behavior as golden baseline

# 3. Gate every change
evalview check              # Diff against baseline — catches regressions

# 4. Monitor in production
evalview monitor --slack-webhook https://hooks.slack.com/services/...
```

## Understanding Check Results

| Status | Meaning | Action |
|--------|---------|--------|
| `PASSED` | Behavior matches baseline | Ship with confidence |
| `TOOLS_CHANGED` | Different tools called | Review the diff |
| `OUTPUT_CHANGED` | Same tools, output shifted | Review the diff |
| `REGRESSION` | Score dropped significantly | Fix before shipping |
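
One way to read the table is as a classification over three signals: the tool sequence, the output, and the score change. The sketch below illustrates that reading only. The `Status` enum, the `classify` function, the precedence of the checks, and the 10-point regression threshold are all assumptions made for this example, not EvalView's documented behavior.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    """Hypothetical mirror of the four statuses in the table above."""
    PASSED = "PASSED"
    TOOLS_CHANGED = "TOOLS_CHANGED"
    OUTPUT_CHANGED = "OUTPUT_CHANGED"
    REGRESSION = "REGRESSION"


def classify(
    baseline_tools: list[str],
    current_tools: list[str],
    baseline_output: str,
    current_output: str,
    score_delta: Optional[float] = None,
    regression_threshold: float = -10.0,  # assumed cutoff, not an EvalView default
) -> Status:
    """Map a baseline/current comparison to a status (assumed precedence)."""
    # Assumption: a large score drop outranks every other kind of change.
    if score_delta is not None and score_delta <= regression_threshold:
        return Status.REGRESSION
    if baseline_tools != current_tools:
        return Status.TOOLS_CHANGED
    if baseline_output != current_output:
        return Status.OUTPUT_CHANGED
    return Status.PASSED


tools = ["lookup_order", "issue_refund"]
print(classify(tools, tools, "Refund processed.", "Refund processed.", 0.0).value)
print(classify(tools, tools, "Refund processed.", "Refund issued.", -2.0).value)
print(classify(tools, ["lookup_order"], "Refund processed.", "Refund processed.").value)
print(classify(tools, tools, "Refund processed.", "Sorry.", -25.0).value)
```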

## Python API for Autonomous Loops

Use `gate()` as a programmatic regression gate inside agent frameworks, autonomous coding loops, or CI scripts:

```python
from evalview import gate, DiffStatus

# Full evaluation
result = gate(test_dir="tests/")
if not result.passed:
    for d in result.diffs:
        if not d.passed:
            delta = f" ({d.score_delta:+.1f})" if d.score_delta is not None else ""
            print(f"  {d.test_name}: {d.status.value}{delta}")

# Quick mode — no LLM judge, $0, sub-second
result = gate(test_dir="tests/", quick=True)
```

### Auto-Revert on Regression

```python
from evalview.openclaw import gate_or_revert

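def make_code_change() -> None:
    """Placeholder: apply the loop's next candidate edit."""


def try_alternative_approach() -> None:
    """Placeholder: pick a different strategy after a revert."""


# In an autonomous coding loop:
make_code_change()
if not gate_or_revert("tests/", quick=True):
    # Change was automatically reverted
    try_alternative_approach()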
```

> **Warning:** `gate_or_revert` runs `git checkout -- .` when a regression is detected, discarding uncommitted changes. Commit or stash work-in-progress before entering the loop. You can also pass a custom revert command: `gate_or_revert("tests/", revert_cmd="git stash")`.

## MCP Integration

EvalView exposes 8 tools via MCP and works with Claude Code, Cursor, and any other MCP client:

```bash
claude mcp add --transport stdio evalview -- evalview mcp serve
```

Tools: `create_test`, `run_snapshot`, `run_check`, `list_tests`, `validate_skill`, `generate_skill_tests`, `run_skill_test`, `generate_visual_report`

After connecting, Claude Code can proactively check for regressions after code changes:
- "Did my refactor break anything?" triggers `run_check`
- "Save this as the new baseline" triggers `run_snapshot`
- "Add a test for the weather tool" triggers `create_test`

## CI/CD Integration

```yaml
# .github/workflows/evalview.yml
name: Agent Regression Check
on: [pull_request, push]
jobs:
  check:
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install "evalview>=0.5,<1"
      - run: evalview check --fail-on REGRESSION
```

`--fail-on REGRESSION` gates on score drops only. For stricter gating that also catches tool sequence changes, use `--fail-on REGRESSION,TOOLS_CHANGED` or `--strict` (fails on any change).
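
The gating semantics described above can be modeled in a few lines. This is an illustrative sketch, not the CLI's code: `parse_fail_on` and `should_fail` are hypothetical helpers, and statuses are represented as plain strings.

```python
def parse_fail_on(value: str) -> tuple[str, ...]:
    """Split a comma-separated value such as 'REGRESSION,TOOLS_CHANGED'."""
    return tuple(part.strip().upper() for part in value.split(",") if part.strip())


def should_fail(
    statuses: list[str],
    fail_on: tuple[str, ...] = ("REGRESSION",),
    strict: bool = False,
) -> bool:
    """Decide whether a set of per-test statuses should fail the build."""
    if strict:
        # Strict mode: anything other than PASSED blocks the change.
        return any(status != "PASSED" for status in statuses)
    blocking = set(fail_on)
    return any(status in blocking for status in statuses)


results = ["PASSED", "TOOLS_CHANGED", "PASSED"]

print(should_fail(results))                                            # default gate
print(should_fail(results, parse_fail_on("REGRESSION,TOOLS_CHANGED")))  # stricter gate
print(should_fail(results, strict=True))                               # any change fails
```

With these inputs the default gate lets the change through, while the stricter gate and strict mode both block it because one test reported `TOOLS_CHANGED`.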

## Test Case Format

```yaml
name: refund-flow
input:
  query: "I need a refund for order #4812"
expected:
  tools: ["lookup_order", "check_refund_policy", "issue_refund"]
  forbidden_tools: ["delete_order"]
  output:
    contains: ["refund", "processed"]
    not_contains: ["error"]
thresholds:
  min_score: 70
```

Multi-turn tests are also supported:

```yaml
name: clarification-flow
turns:
  - query: "I want a refund"
    expected:
      output:
        contains: ["order number"]
  - query: "Order 4812"
    expected:
      tools: ["lookup_order", "issue_refund"]
```
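
To show how the `expected` blocks in both formats could be evaluated, here is a rough sketch. The matching rules are assumptions chosen for the example (exact, ordered tool match and case-insensitive substring checks) and may differ from EvalView's. `AgentRun`, `check_expectations`, and `check_turns` are hypothetical names, and the test cases are written as Python dicts to avoid a YAML dependency.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """Hypothetical record of one agent turn: tools called in order, plus reply text."""
    tools: list[str]
    output: str


def check_expectations(expected: dict, run: AgentRun) -> list[str]:
    """Return human-readable failures; an empty list means the turn passed."""
    failures = []
    want_tools = expected.get("tools")
    # Assumption: tools must match exactly, in order.
    if want_tools is not None and run.tools != want_tools:
        failures.append(f"tools: expected {want_tools}, got {run.tools}")
    for tool in expected.get("forbidden_tools", []):
        if tool in run.tools:
            failures.append(f"forbidden tool called: {tool}")
    rules = expected.get("output", {})
    text = run.output.lower()  # assumption: case-insensitive substring matching
    for phrase in rules.get("contains", []):
        if phrase.lower() not in text:
            failures.append(f"output missing: {phrase!r}")
    for phrase in rules.get("not_contains", []):
        if phrase.lower() in text:
            failures.append(f"output contains banned phrase: {phrase!r}")
    return failures


def check_turns(turns: list[dict], runs: list[AgentRun]) -> dict[int, list[str]]:
    """Check each turn's `expected` block against the run at the same index."""
    return {
        i: check_expectations(turn.get("expected", {}), run)
        for i, (turn, run) in enumerate(zip(turns, runs))
    }


expected = {
    "tools": ["lookup_order", "check_refund_policy", "issue_refund"],
    "forbidden_tools": ["delete_order"],
    "output": {"contains": ["refund", "processed"], "not_contains": ["error"]},
}
good = AgentRun(expected["tools"], "Your refund has been processed.")
bad = AgentRun(["lookup_order", "delete_order"], "An error occurred.")

turns = [
    {"query": "I want a refund", "expected": {"output": {"contains": ["order number"]}}},
    {"query": "Order 4812", "expected": {"tools": ["lookup_order", "issue_refund"]}},
]
runs = [
    AgentRun([], "Could you share your order number?"),
    AgentRun(["lookup_order"], "Done."),
]

print(check_expectations(expected, good))
for failure in check_expectations(expected, bad):
    print(failure)
print(check_turns(turns, runs))
```

The `min_score` threshold is left out because it depends on the LLM-judge score, which this sketch does not model.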

## Best Practices

- **Snapshot after every intentional change.** Baselines should reflect intended behavior.
- **Use `--preview` before snapshotting.** `evalview snapshot --preview` shows what would change without saving.
- **Quick mode for tight loops.** `gate(quick=True)` skips the LLM judge — free and fast for iterative development.
- **Full evaluation for final validation.** Run without `quick=True` before deploying to get LLM-as-judge scoring.
- **Commit `.evalview/golden/` to git.** Baselines should be versioned. Don't commit `state.json`.
- **Use variants for non-deterministic agents.** `evalview snapshot --variant v2` stores alternate valid behaviors (up to 5).
- **Monitor in production.** `evalview monitor` catches gradual drift that individual checks miss.
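
The variants practice can be pictured as "pass if the run matches any stored baseline". The sketch below is a conceptual model only. `add_variant` and `matches_any_variant` are hypothetical helpers, the in-memory dict stands in for whatever EvalView stores on disk, and treating the cap of 5 as a hard error is an assumption.

```python
from typing import Optional

MAX_VARIANTS = 5  # the cap mentioned in the best practices above


def add_variant(variants: dict[str, list[str]], name: str, tools: list[str]) -> None:
    """Store an alternate valid tool sequence under a variant name."""
    if name not in variants and len(variants) >= MAX_VARIANTS:
        raise ValueError(f"at most {MAX_VARIANTS} variants are allowed")
    variants[name] = list(tools)


def matches_any_variant(
    variants: dict[str, list[str]], current_tools: list[str]
) -> Optional[str]:
    """Return the name of the first variant the run matches, or None."""
    for name, tools in variants.items():
        if tools == current_tools:
            return name
    return None


variants: dict[str, list[str]] = {}
add_variant(variants, "default", ["lookup_order", "check_refund_policy", "issue_refund"])
add_variant(variants, "v2", ["lookup_order", "issue_refund"])

print(matches_any_variant(variants, ["lookup_order", "issue_refund"]))
print(matches_any_variant(variants, ["delete_order"]))
```

The first call matches the `v2` variant; the second matches nothing and returns `None`, which a gate would treat as a change to review.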

## Installation

```bash
pip install "evalview>=0.5,<1"
```

Package: [evalview on PyPI](https://pypi.org/project/evalview/)

Related Skills

compose-multiplatform-patterns

144923
from affaan-m/everything-claude-code

Compose Multiplatform and Jetpack Compose patterns in KMP projects: state management, navigation, theming, performance optimization, and platform-specific UI.

java-coding-standards

144923
from affaan-m/everything-claude-code

Java coding standards for Spring Boot services: naming, immutability, Optional usage, streams, exceptions, generics, and project layout.

continuous-learning

144923
from affaan-m/everything-claude-code

Automatically extracts reusable patterns from Claude Code sessions and saves them as learned skills for future use.

social-graph-ranker

144923
from affaan-m/everything-claude-code

Weighted social-graph ranking for warm intro discovery, bridge scoring, and network gap analysis across X and LinkedIn. Use when the user wants the reusable graph-ranking engine itself, not the broader outreach or network-maintenance workflow layered on top of it.

remotion-video-creation

144923
from affaan-m/everything-claude-code

Best practices for Remotion - Video creation in React. 29 domain-specific rules covering 3D, animations, audio, captions, charts, transitions, and more.

opensource-pipeline

144923
from affaan-m/everything-claude-code

Open-source pipeline: fork, sanitize, and package private projects for safe public release. Chains 3 agents (forker, sanitizer, packager). Triggers: '/opensource', 'open source this', 'make this public', 'prepare for open source'.

lead-intelligence

144923
from affaan-m/everything-claude-code

AI-native lead intelligence and outreach pipeline. Replaces Apollo, Clay, and ZoomInfo with agent-powered signal scoring, mutual ranking, warm path discovery, source-derived voice modeling, and channel-specific outreach across email, LinkedIn, and X. Use when the user wants to find, qualify, and reach high-value contacts.

hexagonal-architecture

144923
from affaan-m/everything-claude-code

Design, implement, and refactor Ports & Adapters systems with clear domain boundaries, dependency inversion, and testable use-case orchestration across TypeScript, Java, Kotlin, and Go services.

gan-style-harness

144923
from affaan-m/everything-claude-code

GAN-inspired Generator-Evaluator agent harness for building high-quality applications autonomously. Based on Anthropic's March 2026 harness design paper.

autonomous-agent-harness

144923
from affaan-m/everything-claude-code

Transform Claude Code into a fully autonomous agent system with persistent memory, scheduled operations, computer use, and task queuing. Replaces standalone agent frameworks (Hermes, AutoGPT) by leveraging Claude Code's native crons, dispatch, MCP tools, and memory. Use when the user wants continuous autonomous operation, scheduled tasks, or a self-directing agent loop.

javascript-testing-patterns

31392
from sickn33/antigravity-awesome-skills

Comprehensive guide for implementing robust testing strategies in JavaScript/TypeScript applications using modern testing frameworks and best practices.

e2e-testing-patterns

31392
from sickn33/antigravity-awesome-skills

Build reliable, fast, and maintainable end-to-end test suites that provide confidence to ship code quickly and catch regressions before users do.