meta-harness-optimization
Framework for automated search over task-specific model harnesses — the code around a fixed base model that decides what to store, retrieve, and show while the model works.
Best use case
meta-harness-optimization is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Framework for automated search over task-specific model harnesses — the code around a fixed base model that decides what to store, retrieve, and show while the model works.
Teams using meta-harness-optimization should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/meta-harness-optimization/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How meta-harness-optimization Compares
| Feature / Agent | meta-harness-optimization | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Framework for automated search over task-specific model harnesses — the code around a fixed base model that decides what to store, retrieve, and show while the model works.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Meta-Harness Optimization
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.
Meta-Harness is a framework for automated end-to-end search over **model harnesses** — the scaffolding code around a fixed base model that controls what the model stores, retrieves, and sees while working on a task. Rather than hand-crafting prompts and memory systems, Meta-Harness proposes, evaluates, and evolves harness implementations automatically.
**Paper**: [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052)
**Homepage**: https://yoonholee.com/meta-harness/
## Core Concepts
| Term | Meaning |
|------|---------|
| **Harness** | All code around the base model: memory, retrieval, prompt construction, tool use |
| **Proposer Agent** | LLM (e.g. Claude Code) that proposes new harness variants |
| **Evaluator** | Runs proposed harnesses on a benchmark, returns a score |
| **Meta-Loop** | Iterative propose → evaluate → feedback cycle |
## Installation
Meta-Harness uses `uv` for dependency management. Each reference experiment is self-contained:
```bash
# Text classification experiment
cd reference_examples/text_classification
uv sync
# Terminal-Bench 2 experiment
cd reference_examples/terminal_bench_2
uv sync
```
No global pip install is needed. All dependencies are managed per-experiment via `pyproject.toml`.
## Quick Start
### Text Classification (Memory System Search)
```bash
cd reference_examples/text_classification
# Run 1 iteration of meta-harness optimization
uv run python meta_harness.py --iterations 1
# Run more iterations for better optimization
uv run python meta_harness.py --iterations 10
```
### Terminal-Bench 2 (Scaffold Evolution)
```bash
cd reference_examples/terminal_bench_2
# Smoke test with a single task
uv run bash scripts/run_eval.sh agents.baseline_kira:AgentHarness full 1 1 -i extract-elf
# General eval format:
# run_eval.sh <agent_module:AgentClass> <split> <num_tasks> <num_workers> [flags]
```
## Applying Meta-Harness to a New Domain
The recommended workflow uses the onboarding document with your AI coding assistant:
```bash
# 1. Open ONBOARDING.md in your coding assistant (Claude Code, Cursor, etc.)
# and have a conversation about your domain. This produces domain_spec.md.
# 2. domain_spec.md will contain:
# - What the harness controls in your domain
# - How to evaluate harness quality (benchmark / metric)
# - What the proposer agent should modify
# - Constraints and budget considerations
```
### Minimum Required Components for a New Domain
```
my_domain/
├── pyproject.toml # uv-managed dependencies
├── domain_spec.md # generated via ONBOARDING.md conversation
├── meta_harness.py # main optimization loop
├── harness.py # base harness implementation
├── evaluator.py # benchmark runner → numeric score
└── claude_wrapper.py # proposer agent wrapper
```
## Implementing a Harness
A harness wraps a base model and manages context/memory/tools:
```python
# harness.py — minimal harness structure
from dataclasses import dataclass, field
from typing import Any
@dataclass
class HarnessConfig:
model: str = "claude-3-5-sonnet-20241022"
memory_strategy: str = "last_k"
k: int = 5
retrieval_enabled: bool = False
system_prompt: str = "You are a helpful assistant."
class AgentHarness:
def __init__(self, config: HarnessConfig):
self.config = config
self.memory: list[dict] = []
def reset(self):
self.memory = []
def _build_context(self, new_input: str) -> list[dict]:
"""Core harness logic: what does the model see?"""
if self.config.memory_strategy == "last_k":
recent = self.memory[-self.config.k:]
elif self.config.memory_strategy == "all":
recent = self.memory[:]
else:
recent = []
return recent + [{"role": "user", "content": new_input}]
def step(self, user_input: str) -> str:
messages = self._build_context(user_input)
# Call base model with constructed context
response = call_model(
model=self.config.model,
system=self.config.system_prompt,
messages=messages
)
# Update memory
self.memory.append({"role": "user", "content": user_input})
self.memory.append({"role": "assistant", "content": response})
return response
```
## Implementing the Evaluator
```python
# evaluator.py — runs harness on benchmark, returns score
from harness import AgentHarness, HarnessConfig
def evaluate_harness(config: HarnessConfig, dataset: list[dict]) -> float:
"""
Evaluate a harness configuration on a dataset.
Returns a scalar score (higher is better).
"""
harness = AgentHarness(config)
correct = 0
for example in dataset:
harness.reset()
prediction = harness.step(example["input"])
if grade(prediction, example["label"]):
correct += 1
return correct / len(dataset)
def grade(prediction: str, label: str) -> bool:
"""Task-specific grading logic."""
return label.lower().strip() in prediction.lower()
```
## The Meta-Harness Loop
```python
# meta_harness.py — the optimization loop
import json
from pathlib import Path
from evaluator import evaluate_harness
from claude_wrapper import run_proposer
def meta_harness_loop(
iterations: int = 10,
train_dataset: list = None,
val_dataset: list = None,
):
history: list[dict] = []
best_score = 0.0
best_config = None
for i in range(iterations):
print(f"\n=== Iteration {i+1}/{iterations} ===")
# 1. Propose: ask the proposer agent for a new harness variant
proposal = run_proposer(
history=history,
task_description="Optimize the memory system for text classification.",
code_context=Path("harness.py").read_text(),
)
# 2. Evaluate: run the proposed harness
try:
new_config = parse_proposal(proposal)
score = evaluate_harness(new_config, train_dataset)
except Exception as e:
score = 0.0
print(f"Evaluation failed: {e}")
# 3. Record: log result for proposer feedback
record = {
"iteration": i + 1,
"proposal": proposal,
"score": score,
}
history.append(record)
print(f"Score: {score:.4f}")
if score > best_score:
best_score = score
best_config = new_config
print(f"New best: {best_score:.4f}")
# Final validation on held-out set
if best_config and val_dataset:
val_score = evaluate_harness(best_config, val_dataset)
print(f"\nFinal val score: {val_score:.4f}")
return best_config, history
```
## Proposer Agent Wrapper (Claude Code)
The shipped examples use Claude Code as the proposer. Adapt `claude_wrapper.py`:
```python
# claude_wrapper.py — wraps proposer agent calls
import subprocess
import json
from pathlib import Path
def run_proposer(
history: list[dict],
task_description: str,
code_context: str,
) -> str:
"""
Call Claude Code (or another proposer) to suggest harness modifications.
Logs all interactions for reproducibility.
"""
prompt = build_proposer_prompt(history, task_description, code_context)
# Example: call Claude via API
import anthropic
client = anthropic.Anthropic() # uses ANTHROPIC_API_KEY env var
response = client.messages.create(
model="claude-opus-4-5",
max_tokens=4096,
messages=[{"role": "user", "content": prompt}],
)
result = response.content[0].text
# Log for reproducibility
log_entry = {"prompt": prompt, "response": result}
with open("proposer_log.jsonl", "a") as f:
f.write(json.dumps(log_entry) + "\n")
return result
def build_proposer_prompt(
history: list[dict],
task_description: str,
code_context: str,
) -> str:
history_str = "\n".join(
f"Iteration {h['iteration']}: score={h['score']:.4f}\nProposal:\n{h['proposal']}"
for h in history[-5:] # last 5 for context window
)
return f"""You are optimizing a model harness for: {task_description}
Current harness code:
```python
{code_context}
```
Optimization history (recent):
{history_str if history_str else "No history yet — this is the first iteration."}
Propose a modified HarnessConfig or changes to the harness code that may improve performance.
Output your proposal as a JSON config dict, followed by any code changes.
"""
```
## Environment Variables
```bash
# Required for Claude-based proposer
export ANTHROPIC_API_KEY=your_key_here
# Optional: control model used
export PROPOSER_MODEL=claude-opus-4-5
export EVALUATOR_MODEL=claude-3-5-sonnet-20241022
```
## Reference Experiment Structure
### Text Classification (`reference_examples/text_classification/`)
Searches over memory system configurations for a classification task:
- Proposer modifies memory strategy, retrieval settings, prompt templates
- Evaluator scores on held-out classification benchmark
- Optimized config is saved for reuse
```bash
uv run python meta_harness.py --iterations 20 --dataset ag_news
```
### Terminal-Bench 2 (`reference_examples/terminal_bench_2/`)
Evolves agent scaffolding for computer-use / terminal tasks:
```bash
# Run baseline agent on a specific task
uv run bash scripts/run_eval.sh agents.baseline_kira:AgentHarness full 1 1 -i extract-elf
# Arguments: <module:Class> <split> <num_tasks> <num_workers> [task_filter]
# Optimized artifact: stanford-iris-lab/meta-harness-tbench2-artifact
```
## Common Patterns
### Saving and Loading Optimized Configs
```python
import json
from dataclasses import asdict
# Save
with open("best_config.json", "w") as f:
json.dump(asdict(best_config), f, indent=2)
# Load
with open("best_config.json") as f:
data = json.load(f)
config = HarnessConfig(**data)
```
### Adding Early Stopping
```python
PATIENCE = 3
no_improve = 0
for i in range(iterations):
score = evaluate_harness(config, dataset)
if score > best_score + 1e-4:
best_score = score
no_improve = 0
else:
no_improve += 1
if no_improve >= PATIENCE:
print(f"Early stop at iteration {i+1}")
break
```
### Parallel Evaluation
```python
from concurrent.futures import ProcessPoolExecutor
def batch_evaluate(configs, dataset, num_workers=4):
with ProcessPoolExecutor(max_workers=num_workers) as executor:
futures = [executor.submit(evaluate_harness, c, dataset) for c in configs]
return [f.result() for f in futures]
```
## Troubleshooting
| Problem | Likely Cause | Fix |
|---------|-------------|-----|
| `uv sync` fails | Missing Python version | Install Python 3.11+ via `pyenv` |
| Proposer returns unparseable JSON | Prompt too vague | Add explicit JSON schema to proposer prompt |
| Scores don't improve | Too few iterations or search space too large | Increase `--iterations`, narrow config space |
| API rate limits | Too many evaluator calls | Add `time.sleep()` or batch requests |
| Claude Code not found | CLI not installed | `npm install -g @anthropic-ai/claude-code` |
## Citation
```bibtex
@misc{lee2026metaharnessendtoendoptimizationmodel,
title={Meta-Harness: End-to-End Optimization of Model Harnesses},
author={Yoonho Lee and Roshen Nair and Qizheng Zhang and Kangwook Lee and Omar Khattab and Chelsea Finn},
year={2026},
eprint={2603.28052},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2603.28052},
}
```Related Skills
text-to-cad-harness
Open source harness for generating 3D CAD models from text using AI coding agents with build123d/OpenCascade, exporting STEP/STL/URDF, and previewing in a local CAD Explorer viewer.
metatron-pentest-assistant
AI-powered penetration testing assistant using local LLM (metatron-qwen via Ollama) on Parrot OS Linux
metaclaw-evolving-agent
Deploy and configure MetaClaw — an agent that meta-learns and evolves from live conversations using skills injection, RL training, and smart scheduling.
everything-claude-code-harness
Agent harness performance system for Claude Code and other AI coding agents — skills, instincts, memory, hooks, commands, and security scanning
claw-code-harness
Better Harness Tools for Claude Code — a Python (and in-progress Rust) rewrite of the Claude Code agent harness, with CLI tooling for manifest inspection, parity auditing, and tool/command inventory.
```markdown
---
zeroboot-vm-sandbox
Sub-millisecond VM sandboxes for AI agents using copy-on-write KVM forking via Zeroboot
yourvpndead-vpn-detection
Android app that detects VPN/proxy servers (VLESS/xray/sing-box) via local SOCKS5 vulnerability, exposing exit IPs and server configs without root
xata-postgres-platform
Expert skill for Xata open-source cloud-native Postgres platform with copy-on-write branching, scale-to-zero, and Kubernetes deployment
x-mentor-skill-nuwa
AI-powered X (Twitter) content strategy skill that distills methodologies from 6 top creators + open-source algorithm data into actionable writing, growth, and monetization guidance.
wx-favorites-report
End-to-end pipeline to extract, decrypt, and visualize WeChat Mac favorites from encrypted SQLite DB into an interactive HTML report.
wterm-web-terminal
Web terminal emulator with Zig/WASM core, DOM rendering, and React/vanilla JS bindings