agent-eval

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks, with pass rate, cost, time, and consistency metrics

144,923 stars
Complexity: easy

About this skill

The `agent-eval` skill provides a robust, data-driven framework for objectively benchmarking AI coding agents. Instead of relying on subjective 'which agent is best?' comparisons, this skill offers a systematic approach using a lightweight CLI tool. Users can define tasks declaratively in YAML, specifying what needs to be done, which files to modify, and how success should be judged. AI agents then attempt these tasks, and `agent-eval` collects crucial performance metrics such as pass rate, execution cost, time taken, and result consistency. This enables reproducible and quantifiable evaluations, making it an invaluable tool for making informed decisions about agent adoption, monitoring performance regressions, or comparing new models and tools.

Best use case

Comparing AI coding agents on custom codebases; evaluating agent performance before adopting new tools or models; running regression checks when agents update their models or tools; making data-backed agent selection decisions for teams.

A clear, data-backed comparison report detailing the performance of various AI coding agents across your defined custom tasks. The report will include metrics such as pass rate, cost, time, and consistency, enabling informed decision-making regarding agent selection and usage.

Practical example

Example input

```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests in src/http_client.py. Ensure it handles transient errors and respects a maximum retry limit.
```

Example output

```
--- Agent Evaluation Report ---
Task: add-retry-logic

Agent: Claude Code
  Pass Rate: 85%
  Average Cost: $0.15
  Average Time: 120s
  Consistency Score: 0.9

Agent: Aider
  Pass Rate: 70%
  Average Cost: $0.10
  Average Time: 90s
  Consistency Score: 0.75

Agent: Codex
  Pass Rate: 60%
  Average Cost: $0.12
  Average Time: 110s
  Consistency Score: 0.8
-------------------------------
```
(Output can also be generated as JSON or CSV for further analysis)

When to use this skill

  • When you need to objectively compare the performance of different AI coding agents (e.g., Claude Code, Aider, Codex) on specific, reproducible coding tasks within your own codebase. This skill is ideal for pre-adoption evaluation, post-update regression testing, or making data-driven decisions when selecting an agent for your team or project.

When not to use this skill

  • When you only need a quick coding assist for a single task and have no need to benchmark agent performance.
  • When you are not working with multiple coding agents or planning to switch between them.
  • When subjective preference or anecdotal evidence is sufficient for your agent selection needs.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/agent-eval/SKILL.md --create-dirs "https://raw.githubusercontent.com/affaan-m/everything-claude-code/main/docs/zh-CN/skills/agent-eval/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/agent-eval/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How agent-eval Compares

| Feature / Agent | agent-eval | Standard Approach |
|-----------------|------------|-------------------|
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | easy | N/A |

Frequently Asked Questions

What does this skill do?

It runs head-to-head comparisons of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks and reports pass rate, cost, time, and consistency metrics.

Which AI agents support this skill?

This skill is designed for Claude.

How difficult is it to install?

The installation complexity is rated as easy. You can find the installation instructions above.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

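# Agent Eval Skill

A lightweight CLI tool for head-to-head comparison of coding agents on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes; this tool makes it systematic.

## When to Use

* Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
* Measuring agent performance before adopting a new tool or model
* Running regression checks when an agent updates its model or tooling
* Making data-backed agent selection decisions for a team

## Installation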

```bash
# pinned to v0.1.0 — latest stable commit
pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
```

## Core Concepts

### YAML Task Definition

Define tasks declaratively. Each task specifies what to do, which files to modify, and how to judge success:

```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
```
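As a rough illustration of what a loader might check before running a task, the sketch below validates a task that has already been parsed into a Python dict. The `TaskSpec` class, the `parse_task` helper, the set of required keys, and the `HEAD` default for `commit` are all assumptions made for this example; they are not taken from agent-eval's source.

```python
from dataclasses import dataclass, field, fields

# Assumed set of mandatory keys; the real tool may require more or fewer.
REQUIRED_KEYS = ("name", "repo", "prompt", "judge")


@dataclass
class TaskSpec:
    """Hypothetical in-memory form of one task file."""
    name: str
    repo: str
    prompt: str
    judge: list
    files: list = field(default_factory=list)
    description: str = ""
    commit: str = "HEAD"  # assumed default when no commit is pinned


def parse_task(raw: dict) -> TaskSpec:
    """Validate a parsed task mapping and return a TaskSpec."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"task is missing required keys: {missing}")
    known = {f.name for f in fields(TaskSpec)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError(f"task has unknown keys: {unknown}")
    if not raw["judge"]:
        raise ValueError("task needs at least one judge")
    return TaskSpec(**raw)


# The same task as the YAML above, written as a dict.
EXAMPLE_TASK = {
    "name": "add-retry-logic",
    "description": "Add exponential backoff retry to the HTTP client",
    "repo": "./my-project",
    "files": ["src/http_client.py"],
    "prompt": "Add retry logic with exponential backoff to all HTTP requests.\n"
              "Max 3 retries. Initial delay 1s, max delay 30s.\n",
    "judge": [
        {"type": "pytest", "command": "pytest tests/test_http_client.py -v"},
        {"type": "grep", "pattern": "exponential_backoff|retry",
         "files": "src/http_client.py"},
    ],
    "commit": "abc1234",
}

task = parse_task(EXAMPLE_TASK)
print(task.name, task.commit, len(task.judge))
```

Rejecting unknown keys is a design choice for the sketch: it catches typos such as `judges:` early instead of silently running a task with no success criteria.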

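### Git Worktree Isolation

Each agent run gets its own git worktree; no Docker is required. This provides reproducible isolation, so agents cannot interfere with one another or corrupt the base repository.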
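One way to get this kind of isolation is with the standard `git worktree` subcommands. The sketch below only shows how such commands could be assembled and run from Python; the helper names and the example destination path are hypothetical, and agent-eval's own implementation may differ.

```python
import subprocess
from typing import List


def worktree_add_cmd(repo: str, dest: str, commit: str) -> List[str]:
    """Build the command that checks out `commit` into a new directory.

    `--detach` avoids creating a branch, so repeated runs do not collide
    on branch names.
    """
    return ["git", "-C", repo, "worktree", "add", "--detach", dest, commit]


def worktree_remove_cmd(repo: str, dest: str) -> List[str]:
    """Build the command that deletes a worktree, even if it has changes."""
    return ["git", "-C", repo, "worktree", "remove", "--force", dest]


def create_worktree(repo: str, dest: str, commit: str) -> None:
    """Run the add command; raises CalledProcessError if git fails."""
    subprocess.run(worktree_add_cmd(repo, dest, commit), check=True)


# Hypothetical destination path for the first claude-code run.
print(" ".join(worktree_add_cmd("./my-project", "/tmp/agent-eval/claude-code-1", "abc1234")))
```

Because every run starts from the same pinned commit in its own directory, edits made by one agent never show up in another agent's checkout.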

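### Metrics Collected

| Metric | What it measures |
|--------|-----------------|
| Pass rate | Did the agent's code pass the judges? |
| Cost | API spend per task (when available) |
| Time | Wall-clock seconds to completion |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |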
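To make the arithmetic behind these metrics concrete, here is a minimal aggregation sketch. `RunResult` and `summarize` are hypothetical names, and the three sample runs are illustrative values, not measured results.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class RunResult:
    """Outcome of one agent run (hypothetical record shape)."""
    passed: bool
    seconds: float
    cost_usd: Optional[float] = None  # None when the agent reports no spend


def summarize(runs: List[RunResult]) -> dict:
    """Aggregate repeated runs of one agent on one task."""
    if not runs:
        raise ValueError("need at least one run")
    passes = sum(r.passed for r in runs)
    costs = [r.cost_usd for r in runs if r.cost_usd is not None]
    return {
        "passes": passes,
        "runs": len(runs),
        # Consistency is the pass rate across repeated runs, as a percentage.
        "consistency_pct": round(100 * passes / len(runs)),
        "mean_seconds": mean(r.seconds for r in runs),
        "mean_cost_usd": mean(costs) if costs else None,
    }


# Illustrative data: two passes and one failure.
sample = [
    RunResult(True, 40, 0.08),
    RunResult(True, 36, 0.08),
    RunResult(False, 38, 0.08),
]
summary = summarize(sample)
print(f"{summary['passes']}/{summary['runs']} passed, "
      f"consistency {summary['consistency_pct']}%")
```

Cost is kept optional because the table notes that spend is only recorded when it is available.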

## Workflow

### 1. Define Tasks

Create a `tasks/` directory containing YAML files, one file per task:

```bash
mkdir tasks
# Write task definitions (see template above)
```

### 2. Run Agents

Run the agents against your tasks:

```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```

Each run:

1. Creates a fresh git worktree from the specified commit
2. Hands the prompt to the agent
3. Runs the judge criteria
4. Records pass/fail, cost, and time
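The four steps above can be sketched as a single function with the worktree setup, the agent, and the judges passed in as callables. All names here are hypothetical, and treating a run as passed only when every judge passes is an assumption of this sketch, not documented behavior.

```python
import time
from typing import Callable, List, Optional


def run_once(
    setup: Callable[[], str],
    agent: Callable[[str], Optional[float]],
    judges: List[Callable[[str], bool]],
) -> dict:
    """Execute one run and return its record."""
    workdir = setup()                       # 1. fresh worktree at the pinned commit
    start = time.perf_counter()
    cost = agent(workdir)                   # 2. hand the prompt to the agent
    seconds = time.perf_counter() - start   # wall-clock time of the agent step
    passed = all(judge(workdir) for judge in judges)  # 3. run the judges
    return {"passed": passed, "cost_usd": cost, "seconds": seconds}  # 4. record


# Stub callables stand in for real worktree creation, agents, and judges.
record = run_once(
    setup=lambda: "/tmp/agent-eval/run-1",
    agent=lambda workdir: 0.12,
    judges=[lambda workdir: True, lambda workdir: True],
)
print(record["passed"], record["cost_usd"])
```

Injecting the steps as callables keeps the loop testable without a real repository or a paid API call.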

### 3. Compare Results

Generate a comparison report:

```bash
agent-eval report --format table
```

```
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘
```

## Judge Types

### Code-Based (Deterministic)

```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```
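A deterministic judge of this kind can be reduced to "run the command in the worktree and check the exit code". The sketch below assumes that interpretation; `command_judge` and its timeout default are hypothetical and not part of agent-eval's documented API.

```python
import shlex
import subprocess


def command_judge(command: str, cwd: str, timeout: float = 600) -> bool:
    """Return True when `command` exits with status 0 inside `cwd`.

    A timeout or a missing executable counts as a failure rather than
    crashing the evaluation.
    """
    try:
        result = subprocess.run(
            shlex.split(command),
            cwd=cwd,
            capture_output=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0
```

With this shape, `command_judge("pytest tests/ -v", worktree_path)` and `command_judge("npm run build", worktree_path)` would cover both entries in the YAML above, where `worktree_path` is the directory the agent edited.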

### Pattern-Based

```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```
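A pattern judge can be approximated with a regular-expression search over the files matched by a glob. In the sketch below, `grep_judge` is a hypothetical helper, and passing as soon as any one file matches is an assumption about the intended semantics.

```python
import re
from pathlib import Path


def grep_judge(pattern: str, files_glob: str, root: str) -> bool:
    """Return True if any file under `root` matching `files_glob`
    contains a match for the regular expression `pattern`."""
    regex = re.compile(pattern)
    for path in Path(root).glob(files_glob):
        if path.is_file() and regex.search(path.read_text(errors="ignore")):
            return True
    return False
```

A pattern match only shows that certain text exists, not that the code works, which is why it pairs well with a test-based judge.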

### Model-Based (LLM as Judge)

```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```

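## Best Practices

* **Start with 3-5 tasks** that represent your real workload rather than toy examples
* **Run at least 3 trials per agent** to capture variance; agents are non-deterministic
* **Pin the commit in your task YAML** so results stay reproducible across days and weeks
* **Include at least one deterministic judge per task** (tests, build); LLM judges add noise
* **Track cost alongside pass rate**: an agent with a 95% pass rate at 10x the cost may not be the right choice
* **Version-control your task definitions**: they are test fixtures, so treat them as code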
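One simple way to weigh cost against pass rate is the expected spend per passing result: cost per run divided by pass rate. This assumes independent runs that are retried until one passes. The sketch applies it to the sample report figures shown earlier, reading the Cost column as cost per run (an assumption). `cost_per_pass` is a hypothetical helper.

```python
def cost_per_pass(pass_rate: float, cost_per_run: float) -> float:
    """Expected spend to obtain one passing run.

    With independent runs, the expected number of attempts until the
    first pass is 1 / pass_rate.
    """
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_run / pass_rate


claude_code = cost_per_pass(3 / 3, 0.12)
aider = cost_per_pass(2 / 3, 0.08)
print(f"claude-code: ${claude_code:.2f} per pass, aider: ${aider:.2f} per pass")
```

On those sample figures the cheaper-per-run agent and the more reliable agent come out level per passing result, which is the kind of trade-off that a raw pass rate or a raw cost alone would hide.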

## Links

* Repository: [github.com/joaquinhuigomez/agent-eval](https://github.com/joaquinhuigomez/agent-eval)
