autoforge

AutoForge is a production-grade autonomous optimization framework for AI agents. It replaces subjective "reflection" with mathematically rigorous convergence loops — tracking every iteration in TSV, cross-validating with multiple models, and stopping only when pass rates confirm real improvement. Four specialized modes: prompt (skill & doc optimization via scenario simulation), code (sandboxed test execution with measurable criteria), audit (CLI verification against live tool behavior), and project (whole-repo cross-file consistency analysis). Battle-tested across 50+ iterations on production skills. Use when: user says "autoforge", "forge", "optimize skill", "improve", "run autoforge", "optimize code", "improve script", "optimize repo", "forge project", "check project", "repo audit".

3,880 stars

byopenclaw

View on GitHub Installation ↓

Best use case

autoforge is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using autoforge should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/autoforge/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/akrimm702/autoforge/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/autoforge/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How autoforge Compares

Feature / Agent	autoforge	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

AI Agents for Coding

Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.

Best AI Skills for Claude

Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.

Best AI Skills for ChatGPT

Find the best AI skills to adapt into ChatGPT workflows for research, writing, summarization, planning, and repeatable assistant tasks.

SKILL.md Source

# AutoForge — Autonomous Optimization Framework

> Stop reflecting. Start converging. Every iteration is measured, logged, and validated — not vibed.

AutoForge replaces ad-hoc "improve this" prompts with a rigorous optimization loop: define evals, run iterations, track pass rates in TSV, report live to your channel, and stop only when math says you're done. Multi-model cross-validation prevents the "same model grades its own homework" blind spot.

**Four modes. One convergence standard.**

| Mode | What it does | Best for |
|------|-------------|----------|
| `prompt` | Simulate 5 scenarios/iter, evaluate Yes/No | SKILL.md, prompts, doc templates |
| `code` | Sandboxed test execution, measure exit/stdout/stderr | Shell scripts, Python tools, pipelines |
| `audit` | Test CLI commands live, verify SKILL.md matches reality | CLI skill documentation |
| `project` | Scan whole repo, cross-file consistency analysis | README↔CLI drift, Dockerfile↔deps, CI gaps |

---

# AutoForge — Top-Agent Architecture

## Overview

```
Agent (you)
├── State: results.tsv, current target file state, iteration counter
├── Iteration 1: evaluate → improve → write TSV → report
├── Iteration 2: evaluate → improve → write TSV → report
├── ...
└── Finish: report.sh --final → configured channel
```

### Sub-Agent = You
"Sub-Agent" is a **conceptual role**, not a separate process. You (the top-agent) execute each iteration yourself: simulate/execute → evaluate → write TSV → call report.sh. The templates below describe what you do PER ITERATION — not what you send to another agent.

For code mode, run tests using the `exec` tool.

### Multi-Model Setup (recommended for Deep Audits)

For complex audits, you can **split two roles across different models**:

| Role | Model | Task |
|------|-------|------|
| **Optimizer** | Opus / GPT-4.1 | Analyzes, finds issues, writes fixes |
| **Validator** | GPT-5 / Gemini (different model) | Checks against ground truth, provides pass rate |

**Flow:** Optimizer and Validator alternate. Optimizer iterations have status `improved`/`retained`/`discard`. Validator iterations confirm or refute the pass rate. Spawn validators as sub-agents with `sessions_spawn` and explicit `model`.

**When to use Multi-Model:** Deep Audits (>5 iterations expected), complex ground truth, or when a single model is blind to its own errors.

**When Single-Model suffices:** Simple CLI audits, prompt optimization, code with clear tests.

---

## Configuration

AutoForge uses environment variables for reporting. All are optional — without them, output goes to stdout.

| Variable | Default | Description |
|----------|---------|-------------|
| `AF_CHANNEL` | `telegram` | Messaging channel for reports |
| `AF_CHAT_ID` | _(none)_ | Chat/group ID for report delivery |
| `AF_TOPIC_ID` | _(none)_ | Thread/topic ID within the chat |

---

## Hard Invariants

These rules apply **always**, regardless of mode:

1. **TSV is mandatory.** Every iteration writes exactly one row to `results/[target]-results.tsv`.
2. **Reporting is mandatory.** Call `report.sh` immediately after every TSV row.
3. **--dry-run never overwrites the target.** Only TSV, `*-proposed.md`, and reports are written.
4. **Mode isolation is strict.** Only execute steps for the assigned mode.
5. **Iteration 1 = Baseline.** Evaluate the original version unchanged, status `baseline`.

---

## Modes — Read ONLY Your Mode!

You are assigned ONE mode. **Ignore all sections for other modes.**

| Mode | What happens | Output |
|------|-------------|--------|
| `prompt` | Mentally simulate skill/prompt, evaluate against evals | Improved prompt text |
| `code` | Run tests in sandbox, measure results | Improved code |
| `audit` | Test CLI commands (read-only only!) + verify SKILL.md against reality | Improved SKILL.md |
| `project` | Scan whole repo, cross-file analysis, fix multiple files per iteration | Improved repository |

**Your mode is in the task prompt.** Everything else is irrelevant to you.

---

## TSV Format (same for ALL modes)

### Header (once at loop start):
```bash
printf '%s\t%s\t%s\t%s\t%s\n' "iteration" "prompt_version_summary" "pass_rate" "change_description" "status" > results/[target]-results.tsv
```

### Row per iteration:
```bash
printf '%s\t%s\t%s\t%s\t%s\n' "1" "Baseline" "58%" "Original version" "baseline" >> results/[target]-results.tsv
```
> **Use `printf` not `echo -e`!** `echo -e` interprets backslashes in field values. `printf '%s'` outputs strings literally.

### 5 columns, TAB-separated, EXACTLY this order:

| # | Column | Type | Rules |
|---|--------|------|-------|
| 1 | `iteration` | Integer | 1, 2, 3, ... |
| 2 | `prompt_version_summary` | String | Max 50 Unicode chars. No tabs, no newlines. |
| 3 | `pass_rate` | String | Number + `%`: `58%`, `92%`, `100%`. Always integer. |
| 4 | `change_description` | String | Max 100 Unicode chars. No tabs, no newlines. |
| 5 | `status` | Enum | Exactly one of: `baseline` · `improved` · `retained` · `discard` |

### Escaping rules:
- **Tabs** in text fields → replace with spaces
- **Newlines** in text fields → replace with ` | `
- **Empty fields** → use hyphen `-` (never leave empty)
- **`$` and backticks** → use `printf '%s'` or escape with `\$` (prevents unintended variable interpolation)
- **Unicode/Emoji** allowed, count as 1 character (not bytes)

### Status rules (based on pass-rate comparison):
- `baseline` — **Mandatory for Iteration 1.** Evaluate original version only.
- `improved` — Pass rate **higher** than previous best → new version becomes current state
- `retained` — Pass rate **equal or marginally better** → predecessor remains
- `discard` — Pass rate **lower** → change discarded, revert to best state

---

## Reporting (same for ALL modes)

**After EVERY TSV row** (including baseline):
```bash
bash scripts/report.sh results/[target]-results.tsv "[Skill Name]"
```

**After loop ends**, additionally with `--final`:
```bash
bash scripts/report.sh results/[target]-results.tsv "[Skill Name]" --final
```

The report script reads `AF_CHANNEL`, `AF_CHAT_ID`, and `AF_TOPIC_ID` from environment. Without them, it prints to stdout with ANSI colors.

---

## Stop Conditions (for ALL modes)

Priority — first matching condition wins, top to bottom:

1. 🛑 **Minimum iterations** — If specified in task (e.g. "min 5"), this count MUST be reached. No other condition can stop before.
2. 🛑 **Max 30 iterations** — Hard safety net, stop immediately.
3. ❌ **3× `discard` in a row** → structural problem, stop + analyze.
4. ✅ **3× 100% pass rate** (after minimum) → confirmed perfect, done.
5. ➡️ **5× `retained` in a row** → converged, done.

### Counting rules:
- `3× 100%` = three iterations with `pass_rate == 100%`, not necessarily consecutive.
- `5× retained` and `3× discard` = **consecutive** (in a row).
- `baseline` counts toward no series.
- `improved` interrupts `retained` and `discard` series.

**At 100% in early iterations:** Keep going! Test harder edge cases. Only 3× 100% *after the minimum* confirms true perfection.

### Recognizing Validator Noise

In multi-model setups, the Validator can produce **false positives** — fails that aren't real issues:

- **Config path vs tool name confusion** (e.g. `agents.list[]` ≠ `agents_list` tool)
- **Inverted checks** ("no X" → Validator looks for X as required)
- **Normal English as forbidden reference** (e.g. "runtime outcome" ≠ `runtime: "acp"`)
- **Overcounting** (thread commands counted as subagent commands)

**Rule:** If after all real fixes >3 discards come in a row and the fail justifications don't hold up under scrutiny → **declare convergence**, don't validate endlessly.

---

## Execution Modes

| Flag | Behavior |
|------|----------|
| `--dry-run` (default) | Only TSV + proposed files. Target file/repo remains unchanged. |
| `--live` | Target file/repo is overwritten. Auto-backup → `results/backups/` |
| `--resume` | Read existing TSV, continue from last iteration. On invalid format: abort. |

---

## mode: prompt

> **Only read if your task contains `mode: prompt`!**

### Per Iteration: What you do
1. Read current prompt/skill
2. **Mentally simulate 5 different realistic scenarios**
3. Evaluate each scenario against **all evals** (Yes=1, No=0)
4. Pass rate = (Sum Yes) / (Eval count × 5 scenarios) × 100
5. Compare with best previous pass rate → determine status
6. On `improved`: propose **minimal, surgical** improvement
7. Write TSV row + call report.sh
8. Check stop conditions

### At the End
Best version → `results/[target]-proposed.md` + report.sh `--final`

---

## mode: code

> **Only read if your task contains `mode: code`!**

### Per Iteration: What you do
1. Create sandbox: `SCRATCH=$(mktemp -d) && cd $SCRATCH`
2. Write current code to sandbox
3. Execute test command (with `timeout 60s`)
4. Measure: exit_code, stdout, stderr, runtime
5. Evaluate against evals → calculate pass rate
6. On `improved`: minimal code improvement + verify again
7. Write TSV row + call report.sh
8. Check stop conditions

### Code Eval Types

| Eval Type | Description | Example |
|-----------|-------------|---------|
| `exit_code` | Process exit code | `exit_code == 0` |
| `output_contains` | stdout contains string | `"SUCCESS" in stdout` |
| `output_matches` | stdout matches regex | `r"Total: \d+"` |
| `test_pass` | Test framework green | `pytest exit 0` |
| `runtime` | Runtime limit | `< 5000ms` |
| `no_stderr` | No error output | `stderr == ""` |
| `file_exists` | Output file created | `result.json exists` |
| `json_valid` | Output is valid JSON | `json.loads(stdout)` |

### At the End
Best code → `results/[target]-proposed.[ext]` + report.sh `--final`

---

## mode: audit

> **Only read if your task contains `mode: audit`!**

⚠️ **DO NOT write your own code.** Only test CLI commands of the target tool (`--help` + read-only).

### Two Variants

**Simple Audit (CLI skill, clear commands):**
- 2 iterations: Baseline → Proposed Fix
- For tools with clear `--help` output and simple command structure

**Deep Audit (complex docs, many checks):**
- Iterative loop like prompt/code, same stop conditions
- For extensive documentation with many checkpoints (e.g. config keys, tool policy, parameter lists)
- Recommended: Multi-Model setup (Opus Optimizer + external Validator)

### Simple Audit Flow
1. Write TSV header
2. **Iteration 1 (Baseline):** Test every documented command → pass rate → TSV + report
3. **Iteration 2 (Proposed Fix):** Write improved SKILL.md → expected pass rate → TSV + report
4. Improved SKILL.md → `results/[target]-proposed.md`
5. Detail results → `results/[target]-audit-details.md` (NOT in TSV!)
6. report.sh `--final`

### Deep Audit Flow
1. Write TSV header
2. **Iteration 1 (Baseline):** Extract ground truth from source, define all checks, evaluate baseline
3. **Iterations 2+:** Optimizer fixes issues → Validator checks → TSV + report per iteration
4. Loop runs until stop conditions trigger (3× 100%, 5× retained, 3× discard)
5. Final version → `results/[target]-proposed.md` or `results/[target]-v1.md`
6. report.sh `--final`

### Fixed Evals (audit)
1. Completeness — Does SKILL.md cover ≥80% of real commands/config?
2. Correctness — Are ≥90% of documented commands/params syntactically correct?
3. No stale references — Does everything documented actually exist?
4. No missing core features — Are all important features covered?
5. Workflow quality — Does quick-start actually work?

---

## mode: project

> **Only read if your task contains `mode: project`!**

⚠️ **This mode operates on an ENTIRE repository/directory**, not a single file. Cross-file consistency is the core feature — this is NOT "audit on many files."

### Three Phases

Project mode runs through three sequential phases. Phases 1 and 2 happen once (in Iteration 1 = Baseline). Phase 3 is the iterative fix loop.

---

### Phase 1: Scan & Plan

1. **Analyze the repo directory:**
   ```bash
   # Discover structure
   tree -L 3 --dirsfirst [target_dir]
   ls -la [target_dir]
   ```
2. **Identify relevant files** and classify by priority:

   | Priority | Files |
   |----------|-------|
   | **critical** | README, Dockerfile, CI workflows (.github/workflows), package.json/requirements.txt, main entry points |
   | **normal** | Tests, configs, scripts, .env.example, .gitignore |
   | **low** | Docs, examples, LICENSE, CHANGELOG |

3. **Build the File-Map** — a mental inventory of what exists and what's missing.
4. **Compose eval set:** Merge user-provided evals with auto-detected evals (see Default Evals below).

---

### Phase 2: Cross-File Analysis

Run consistency checks **across** files. Each check = one eval point:

| Check | What it verifies |
|-------|-----------------|
| **README ↔ CLI** | Documented commands/flags match actual `--help` output |
| **Dockerfile ↔ deps** | `requirements.txt` / `package.json` versions match what Dockerfile installs |
| **CI ↔ project structure** | Workflow references correct paths, scripts, test commands |
| **`.env.example` ↔ code** | Every env var in code has a corresponding entry in `.env.example` |
| **Imports ↔ dependencies** | Every `import` / `require` has a matching dependency declaration |
| **Tests ↔ source** | Test files exist for critical modules |
| **`.gitignore` ↔ artifacts** | Build outputs, secrets, and caches are excluded |

**Result of Phase 2:** A complete eval checklist with per-file and cross-file checks, each scored Yes/No.

---

### Phase 3: Iterative Fix Loop

Same loop logic as prompt/code/audit — TSV, report.sh, stop conditions. Key differences:

- **Multiple files** can be changed per iteration
- **Pass rate** = aggregated over ALL evals (file-specific + cross-file)
- **Fixes are minimal and surgical** — don't refactor blindly, only fix what improves pass rate
- **`change_description`** includes which files were touched: `"Fix Dockerfile + CI workflow sync"`

### Per Iteration: What you do
1. Evaluate current repo state against **all evals** (file-specific + cross-file)
2. Calculate pass rate: (passing evals / total evals) × 100
3. Compare with best previous pass rate → determine status
4. On `improved`: apply **minimal, surgical fixes** to the fewest files necessary
5. Verify the fix didn't break other evals (re-run affected checks)
6. Write TSV row + call report.sh
7. Check stop conditions

### Dry-Run vs Live

| Flag | Behavior |
|------|----------|
| `--dry-run` (default) | Fixed files → `results/[target]-proposed/` directory (mirrors repo structure). Original repo untouched. |
| `--live` | Files overwritten in-place. Originals backed up → `results/backups/` (preserving directory structure). |

### Default Evals (auto-applied unless overridden)

These evals are **automatically used** when the user doesn't provide custom evals. The agent detects which are applicable based on what exists in the repo:

| # | Eval | Condition |
|---|------|-----------|
| 1 | README accurate? (describes actual features/commands) | README exists |
| 2 | Tests present and green? (`pytest` / `npm test` / `go test`) | Test files or test config detected |
| 3 | CI configured and syntactically correct? | `.github/workflows/` or `.gitlab-ci.yml` exists |
| 4 | No hardcoded secrets? (`grep -rE "(password|api_key|token|secret)\s*="`) | Always |
| 5 | Dependencies complete? (`requirements.txt` ↔ imports, `package.json` ↔ requires) | Dependency file exists |
| 6 | Dockerfile functional? (`docker build` succeeds or Dockerfile syntax valid) | Dockerfile exists |
| 7 | `.gitignore` sensible? (no secrets, build artifacts excluded) | `.gitignore` exists |
| 8 | License present? | Always |

### Eval Scoring

```
Pass Rate = (Passing Evals / Total Applicable Evals) × 100
```

Evals that don't apply (e.g. "Dockerfile functional?" when no Dockerfile exists) are **excluded from the total**, not counted as passes.

### At the End
- `--dry-run`: All proposed changes → `results/[target]-proposed/` directory
- `--live`: Changes already applied, backups in `results/backups/`
- report.sh `--final`
- Optionally: `results/[target]-project-details.md` with per-file findings (NOT in TSV!)

---

## Directory Structure

```
autoforge/
├── SKILL.md                    ← This file
├── results/
│   ├── [target]-results.tsv    ← TSV logs
│   ├── [target]-proposed.md    ← Proposed improvement (prompt/audit)
│   ├── [target]-proposed/      ← Proposed repo changes (project mode)
│   │   ├── README.md
│   │   ├── Dockerfile
│   │   └── ...
│   ├── [target]-v1.md          ← Deep audit final version
│   ├── [target]-audit-details.md ← Audit details (audit mode only)
│   ├── [target]-project-details.md ← Project details (project mode only)
│   └── backups/                ← Auto-backups (--live)
│       ├── [file].bak          ← Single file backups (prompt/code/audit)
│       └── [target]-backup/    ← Full directory backup (project mode)
├── scripts/
│   ├── report.sh               ← Channel reporting
│   └── visualize.py            ← PNG chart (optional)
├── references/
│   ├── eval-examples.md        ← Pre-built evals
│   └── ml-mode.md              ← ML training guide
└── examples/
    ├── demo-results.tsv        ← Demo data
    └── example-config.json     ← Example configuration
```

---

## Examples (task descriptions, NOT CLI commands)

AutoForge is not a CLI tool — it's a **skill prompt** for the agent:

```
# Optimize a prompt
"Start autoforge mode: prompt for the coding-agent skill.
 Evals: PTY correct? Workspace protected? Clearly structured?"

# Audit a CLI skill (simple)
"Start autoforge mode: audit for notebooklm-py."

# Deep audit with multi-model
"Start autoforge mode: audit (deep) for subagents docs.
 Optimizer: Opus, Validator: GPT-5
 Extract ground truth from source, validate iteratively."

# Optimize code
"Start autoforge mode: code for backup.sh.
 File: ./backup.sh
 Test: bash backup.sh personal --dry-run
 Evals: exit_code==0, backup file created, < 10s runtime"

# Optimize a whole repository
"Start autoforge mode: project for ./my-app
 Evals: Tests green? CI correct? No hardcoded secrets? README accurate?"

# Project mode with custom focus
"Start autoforge mode: project for /path/to/api-server
 Focus: Docker + CI pipeline consistency
 Evals: docker build succeeds, CI workflow references correct paths,
        .env.example covers all env vars used in code"

# Project mode dry-run (default)
"Start autoforge mode: project for ./my-tool --dry-run
 Use default evals. Show me what needs fixing."
```

---

## Eval Examples → Mode Mapping

`references/eval-examples.md` provides ready-to-use Yes/No evals grouped by category. Here's how they map to AutoForge modes:

| eval-examples.md Category | AutoForge Mode | Notes |
|---------------------------|---------------|-------|
| Briefing, Email, Calendar, Summary, Proposal | `prompt` | Mental simulation with scenario evals |
| Python Script, Shell Script, API, Data Pipeline, Build | `code` | Real execution with measurable criteria |
| CI/CD, Docker, Helm, Kubernetes, Terraform | `code` or `project` | `code` for single files, `project` for cross-file |
| Code Review, API Documentation | `audit` | Verify docs match reality |
| Project / Repository, Cross-File Consistency, Security Baseline | `project` | Whole-repo scanning and cross-file checks |

Pick evals from the matching category and paste them into your task prompt as the eval set.

---

## Tips
- Always start with `--dry-run`
- `prompt` = think, `code` = execute, `audit` = test CLI, `project` = optimize repo
- Simple Audit for clear CLI skills, Deep Audit for complex docs
- Project mode scans the whole repo — cross-file consistency is the killer feature
- Multi-Model for Deep Audits: different models cover different blind spots
- At >3 discards after all fixes: check for validator noise, declare convergence if justified
- TSV + report.sh are NOT optional — they are the user interface
- For ML training: see `references/ml-mode.md`

Related Skills

---

3891

from openclaw/skills

name: article-factory-wechat

Content & Documentation

humanizer

3891

from openclaw/skills

Remove signs of AI-generated writing from text. Use when editing or reviewing text to make it sound more natural and human-written. Based on Wikipedia's comprehensive "Signs of AI writing" guide. Detects and fixes patterns including: inflated symbolism, promotional language, superficial -ing analyses, vague attributions, em dash overuse, rule of three, AI vocabulary words, negative parallelisms, and excessive conjunctive phrases.

Content & Documentation

find-skills

3891

from openclaw/skills

Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.

General Utilities

tavily-search

3891

from openclaw/skills

Use Tavily API for real-time web search and content extraction. Use when: user needs real-time web search results, research, or current information from the web. Requires Tavily API key.

Data & Research

baidu-search

3891

from openclaw/skills

Search the web using Baidu AI Search Engine (BDSE). Use for live information, documentation, or research topics.

Data & Research

agent-autonomy-kit

3891

from openclaw/skills

Stop waiting for prompts. Keep working.

Workflow & Productivity

Meeting Prep

3891

from openclaw/skills

Never walk into a meeting unprepared again. Your agent researches all attendees before calendar events—pulling LinkedIn profiles, recent company news, mutual connections, and conversation starters. Generates a briefing doc with talking points, icebreakers, and context so you show up informed and confident. Triggered automatically before meetings or on-demand. Configure research depth, advance timing, and output format. Walking into meetings blind is amateur hour—missed connections, generic small talk, zero leverage. Use when setting up meeting intelligence, researching specific attendees, generating pre-meeting briefs, or automating your prep workflow.

Workflow & Productivity

self-improvement

3891

from openclaw/skills

Captures learnings, errors, and corrections to enable continuous improvement. Use when: (1) A command or operation fails unexpectedly, (2) User corrects Claude ('No, that's wrong...', 'Actually...'), (3) User requests a capability that doesn't exist, (4) An external API or tool fails, (5) Claude realizes its knowledge is outdated or incorrect, (6) A better approach is discovered for a recurring task. Also review learnings before major tasks.

Agent Intelligence & Learning

botlearn-healthcheck

3891

from openclaw/skills

botlearn-healthcheck — BotLearn autonomous health inspector for OpenClaw instances across 5 domains (hardware, config, security, skills, autonomy); triggers on system check, health report, diagnostics, or scheduled heartbeat inspection.

DevOps & Infrastructure