acp-harness

Unified ACP client and evaluation harness. Connect to ACP-compatible agents programmatically, capture full trajectories (tools, thoughts, plans), and pipe to downstream analysis tools.

16 stars

bydiegosouzapw

View on GitHub Installation ↓

Best use case

acp-harness is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Unified ACP client and evaluation harness. Connect to ACP-compatible agents programmatically, capture full trajectories (tools, thoughts, plans), and pipe to downstream analysis tools.

Teams using acp-harness should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/acp-harness/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/tools/acp-harness/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/acp-harness/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How acp-harness Compares

Feature / Agent	acp-harness	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Unified ACP client and evaluation harness. Connect to ACP-compatible agents programmatically, capture full trajectories (tools, thoughts, plans), and pipe to downstream analysis tools.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# ACP Harness

## Purpose

This skill provides a **unified toolkit for ACP client usage and agent evaluation**, optimized for TypeScript/JavaScript projects using Bun.

1. **ACP Client API** - Headless programmatic access to ACP-compatible agents
2. **Evaluation Harness** - Run prompts against agents and capture full trajectories

**Use this when:**
- Comparing skills across different agents (Claude Code, Cursor, OpenCode, Amp, Goose, Factory)
- Evaluating built-in tools vs MCP servers vs skills for the same task
- Generating training data with full trajectory capture
- Running regression tests in CI/CD pipelines
- Building multi-agent applications on a headless transport layer

## Foundation Use Cases

The harness is a **foundation layer** - it captures trajectories; scoring happens downstream.

```mermaid
flowchart LR
    Harness["ACP Harness"] -->|"trajectories"| Scoring["Braintrust / Custom Script"]
    Scoring -->|"scores"| Decision["Informed Choices"]
```

| Use Case | Harness Provides | You Build |
|----------|------------------|-----------|
| **Cross-agent skill eval** | Same prompts → multiple agents → trajectories | Scoring pipeline (Braintrust, custom) |
| **Tool comparison** | Trajectory with tool/skill attribution | Diff analysis, preference data |
| **Training data** | Structured I/O with tool calls, plans, thoughts | SFT/DPO formatting for world-agent |
| **Regression testing** | Deterministic prompt → trajectory capture | CI integration, golden file comparison |
| **Multi-agent apps** | `createACPClient` transport layer | Session management, UI, agent switching |

### Agents Supporting Skills

Skills can be installed across multiple agents, enabling cross-agent comparison:

| Agent | Skills Directory | Install Command |
|-------|------------------|-----------------|
| Claude Code | `.claude/skills/` | `./install-workshop.sh --agent claude` |
| Cursor | `.claude/skills/` | `./install-workshop.sh --agent cursor` |
| OpenCode | `.opencode/skill/` | `./install-workshop.sh --agent opencode` |
| Amp | `.agents/skills/` | `./install-workshop.sh --agent amp` |
| Goose | `.claude/skills/` | `./install-workshop.sh --agent goose` |
| Factory | `.factory/skills/` | `./install-workshop.sh --agent factory` |

### Example: Comparing Built-in vs Skill

```bash
# Run same prompt with built-in tool
bun scripts/run-harness.ts prompts.jsonl \
  --agent claude-code-acp \
  -o results-builtin.jsonl

# Run same prompt with custom skill installed
bun scripts/run-harness.ts prompts.jsonl \
  --agent claude-code-acp \
  --cwd /project/with/typescript-lsp-skill \
  -o results-skill.jsonl

# Compare trajectories - which used better tools? faster? more accurate?
diff <(jq '.toolCalls' results-builtin.jsonl) <(jq '.toolCalls' results-skill.jsonl)
```

## Execution Environment

**Recommendation:** Run evaluations in Docker containers for consistent, isolated execution.

```bash
# Build and run with Docker Compose
docker compose -f docker-compose.acp.yml run --rm acp-harness

# Or build directly
docker build -f Dockerfile.acp -t acp-harness .
docker run --rm -e ANTHROPIC_API_KEY acp-harness
```

Docker provides:
- Consistent environment across local and CI
- Filesystem isolation without app-level sandboxing
- Reproducible results for training data generation

See [assets/](assets/) for example container configurations:
- `Dockerfile.acp` - Base container with Bun and git
- `docker-compose.acp.yml` - Compose file with volume mounts for results

## Non-Goals

This harness is optimized for TypeScript/JavaScript projects using Bun. It is **not** designed for:

- **Python projects** - Use [SWE-bench](https://github.com/SWE-bench/SWE-bench), [Braintrust Python SDK](https://www.braintrust.dev/)
- **Academic model benchmarking** - Use [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- **IDE integrations** - Use Copilot Evaluation Harness
- **SaaS observability** - Use Braintrust, Langfuse platforms directly

## Quick Reference

| Resource | Description |
|----------|-------------|
| `scripts/run-harness.ts` | Execute prompts against agent, capture trajectories |
| [client-api.md](references/client-api.md) | `createACPClient` configuration, helpers |
| [output-formats.md](references/output-formats.md) | JSONL schemas, format options |
| [downstream.md](references/downstream.md) | Integration patterns (Braintrust, jq, LLM-as-judge) |
| [llm-judge-templates.md](references/llm-judge-templates.md) | Evaluation prompt templates |

## Evaluation Workflow

```mermaid
flowchart LR
    Prompts["prompts.jsonl"] --> Harness["run-harness.ts"]
    Agent["ACP Agent"] --> Harness
    Harness --> Summary["summary.jsonl"]
    Harness --> Judge["results.md + results.full.jsonl"]
    Summary --> JQ["jq analysis"]
    Judge --> LLM["LLM-as-judge"]
```

1. **Prepare** - Create `prompts.jsonl` with evaluation cases
2. **Execute** - Run harness against target agent
3. **Capture** - Trajectories streamed to output files
4. **Analyze** - Pipe to downstream tools for scoring

## Harness Script

### Basic Usage

```bash
bun scripts/run-harness.ts <prompts.jsonl> --agent <command> [options]
```

### Arguments

| Flag | Description | Default |
|------|-------------|---------|
| `prompts.jsonl` | Input file with evaluation prompts | Required |
| `-a, --agent` | ACP agent command | `"claude-code-acp"` |
| `-o, --output` | Output file/path | stdout |
| `-c, --cwd` | Working directory for agent | current |
| `-t, --timeout` | Request timeout in ms | `60000` |
| `-f, --format` | Output format: `summary`, `judge` | `summary` |
| `--progress` | Show progress to stderr | false |
| `--append` | Append to output file | false |
| `--mcp-server` | MCP server config JSON (repeatable) | none |

### Examples

```bash
# Summary format (default) - minimal JSONL
bun scripts/run-harness.ts prompts.jsonl -o results.jsonl

# Judge format - creates two files for two-tier evaluation
bun scripts/run-harness.ts prompts.jsonl --format judge -o results
# Creates: results.md (summary with step IDs) + results.full.jsonl (complete trajectory)

# With MCP server (stdio transport)
bun scripts/run-harness.ts prompts.jsonl \
  --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem","/data"]}'

# With MCP server (HTTP transport)
bun scripts/run-harness.ts prompts.jsonl \
  --mcp-server '{"type":"http","name":"api","url":"http://localhost:3000"}'

# Different agent (Droid ACP adapter)
bun scripts/run-harness.ts prompts.jsonl --agent droid-acp -o results.jsonl

# Stream with progress
bun scripts/run-harness.ts prompts.jsonl --progress -o results.jsonl
```

## Input Format

Each line in `prompts.jsonl`:

```jsonl
{"id":"test-001","input":"Create a primary button","expected":"should contain <button>","metadata":{"category":"ui"}}
{"id":"test-002","input":"Write a bThread for form validation","metadata":{"category":"behavioral"}}
```

| Field | Required | Description |
|-------|----------|-------------|
| `id` | Yes | Unique identifier |
| `input` | Yes | Prompt text for the agent |
| `expected` | No | Expected output (for downstream scoring) |
| `metadata` | No | Tags, category, difficulty for filtering |
| `timeout` | No | Override default timeout for this prompt |

## Output Formats

### Summary Format (default)

Minimal JSONL for quick metrics and analysis:

```jsonl
{"id":"test-001","input":"Create a button","output":"I created...","toolCalls":["Write"],"status":"passed","duration":1234}
```

### Judge Format (two-tier)

Creates two files for LLM-as-judge evaluation:

**`<output>.md`** - Markdown summary with step IDs and code previews:

```markdown
## Evaluation Record: test-001

**Input:** Create a primary button

**Trajectory:**
1. [THOUGHT] I'll create a styled button template... [->test-001-step-1]
2. [TOOL:Write] -> completed (234ms) [->test-001-step-2]
   File: src/button.tsx (847 chars)
   ```tsx
   import { createStyles } from 'plaited'

   type ButtonProps = {
     label: string

   // ... 30 lines omitted ...

   export const Button = ({ label }: ButtonProps) => (
     <button {...styles.btn}>{label}</button>
   )
   ```
3. [MESSAGE] I created the button... [->test-001-step-3]

**Output:** I created the button template with primary styling.
**Metadata:** category=ui, agent=claude-code-acp
**Status:** passed
**Duration:** 1234ms

---
```

**`<output>.full.jsonl`** - Complete trajectory with step IDs for correlation:

```jsonl
{"id":"test-001","input":"...","output":"...","trajectory":[{"type":"thought","content":"...","timestamp":100,"stepId":"test-001-step-1"},{"type":"tool_call","name":"Write","status":"completed","input":{...},"output":{...},"duration":234,"stepId":"test-001-step-2"}],...}
```

**Usage patterns by judge context window:**

| Judge Model | Strategy |
|-------------|----------|
| Gemini (1M+ tokens) | Feed `results.full.jsonl` directly |
| Claude/GPT-4 (128-200k) | Use `results.full.jsonl` for most runs |
| Smaller models | Use `results.md`, retrieve specific steps by ID as needed |

## Programmatic Usage

```typescript
import { createACPClient, createPrompt, summarizeResponse } from 'plaited/acp'

// Requires: npm install -g @zed-industries/claude-code-acp
const client = createACPClient({
  command: ['claude-code-acp'],
  cwd: '/path/to/project',
})

await client.connect()
const session = await client.createSession()

const { updates } = await client.promptSync(
  session.id,
  createPrompt('Create a button with hover state')
)

// Full trajectory is in updates
const summary = summarizeResponse(updates)
console.log({
  text: summary.text,
  toolCalls: summary.completedToolCalls,
  hasErrors: summary.hasErrors
})

await client.disconnect()
```

See [client-api.md](references/client-api.md) for complete API documentation.

## Downstream Integration

The harness outputs standard JSONL that pipes to any tool:

```bash
# Filter with jq
cat results.jsonl | jq 'select(.metadata.category == "ui")'

# Count tool usage
cat results.jsonl | jq -s 'map(.toolCalls | length) | add'

# Feed full trajectory to Gemini (large context)
cat results.full.jsonl | your-gemini-judge.ts
```

See [downstream.md](references/downstream.md) for integration patterns with Braintrust, Gemini, and custom scorers.

## Evaluation Targets

| Target | How to Evaluate |
|--------|-----------------|
| **Agent capability** | Direct prompts, analyze trajectory quality |
| **Skills** | Set `--cwd` to project with skill, test skill-specific prompts |
| **MCP Servers** | Use `--mcp-server` flag, verify tool usage in trajectory |
| **Behavioral programs** | Analyze trajectory for bThread coordination patterns |

### Skill Evaluation

```bash
bun scripts/run-harness.ts skill-prompts.jsonl \
  --cwd /project/with/skill \
  -o results.jsonl
```

### MCP Server Evaluation

```bash
bun scripts/run-harness.ts mcp-prompts.jsonl \
  --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem"]}' \
  -o results.jsonl
```

## Related

- **plaited/acp** - Core ACP client module
- **world-agent** - Training workflow, trajectory generation
- **plaited-behavioral-core** - bThread patterns for agent coordination

Related Skills

harness-delivery-iac

from diegosouzapw/awesome-omni-skill

Use this skill for Harness CI/CD, IaCM, templates, connectors, integrations, pipelines, pull requests, APIs, feature flags, internal developer portal, and governance workflows.

harness-ci

from diegosouzapw/awesome-omni-skill

Harness CI (Continuous Integration) for container-native builds, test intelligence, caching, parallelization, and build infrastructure management. Activate for build pipelines, CI steps, test automation, artifact publishing, and build optimization.

harness-cd

from diegosouzapw/awesome-omni-skill

Harness CD (Continuous Delivery) for Kubernetes, Helm, Terraform, ECS, and serverless deployments with GitOps, approval gates, rollback strategies, and multi-environment promotion

harness-coder

from diegosouzapw/awesome-omni-skill

Main coding agent for autonomous harness system. Continues work from previous sessions, implements one feature per session, coordinates with testing and review skills, and maintains clean handoffs via Archon. This is the workhorse of the harness system for multi-session development.

copilot-tui-harness

from diegosouzapw/awesome-omni-skill

Expert in the Copilot SDK TUI Harness project. Use for development tasks including architecture, event system, plugins, OpenTUI components, and Copilot SDK integration. Triggers on TUI development, harness events, streaming UI, plugin system, event-driven architecture.

bgo

from diegosouzapw/awesome-omni-skill

Automated Blender build-go workflow. Automatically builds, removes old version, installs, enables, and launches Blender with your extension/add-on. Use when you want to quickly test changes, execute complete build-to-launch cycle, or run custom packaging scripts with automatic Blender launch.

Coding & Development

maintenance

from diegosouzapw/awesome-omni-skill

Cleans up and organizes project files. Use when user mentions '整理', 'cleanup', 'アーカイブ', 'archive', '肥大化', 'Plans.md', 'session-log', or asks to clean up old tasks, archive completed items, or organize files. Do NOT load for: 実装作業, レビュー, 新機能開発, デプロイ.

hello-skill

from diegosouzapw/awesome-omni-skill

每次对话开始时,声明"[Skills✏️已加载]"

zylvie-automation

from diegosouzapw/awesome-omni-skill

Automate Zylvie tasks via Rube MCP (Composio). Always search tools first for current schemas.

zoominfo-automation

from diegosouzapw/awesome-omni-skill

Automate Zoominfo tasks via Rube MCP (Composio). Always search tools first for current schemas.

zoho-invoice-automation

from diegosouzapw/awesome-omni-skill

Automate Zoho Invoice tasks via Rube MCP (Composio): invoices, estimates, expenses, clients, and payment tracking. Always search tools first for current schemas.

zoho-inventory-automation

from diegosouzapw/awesome-omni-skill

Automate Zoho Inventory tasks via Rube MCP (Composio): items, orders, warehouses, shipments, and stock management. Always search tools first for current schemas.