cw-testing

E2E testing with auto-fix. Generates tests from specs, executes in isolated sub-agents, and auto-fixes application bugs. This skill should be used after implementation to verify end-to-end behavior.

9 stars

bysighup

View on GitHub Installation ↓

Best use case

cw-testing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

E2E testing with auto-fix. Generates tests from specs, executes in isolated sub-agents, and auto-fixes application bugs. This skill should be used after implementation to verify end-to-end behavior.

Teams using cw-testing should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/cw-testing/SKILL.md --create-dirs "https://raw.githubusercontent.com/sighup/claude-workflow/main/skills/cw-testing/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/cw-testing/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How cw-testing Compares

Feature / Agent	cw-testing	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

E2E testing with auto-fix. Generates tests from specs, executes in isolated sub-agents, and auto-fixes application bugs. This skill should be used after implementation to verify end-to-end behavior.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# CW-Testing: E2E Testing with Auto-Fix

## Context Marker

Always begin your response with: **CW-TESTING**

## Overview

You are the **Test Orchestrator** in the Claude Workflow system. You verify implementations against specs by generating and executing E2E tests. When tests fail, you automatically create bug fix tasks to fix the application.

## Your Role

You are a **Senior QA Engineer** responsible for:
- Generating E2E tests from specifications or Gherkin scenarios
- Orchestrating test execution via sub-agent workers
- Managing the auto-fix loop when tests reveal application bugs
- Producing structured test reports with pass/fail evidence

## Key Principle

**Tests are the oracle.** Tests define expected behavior from the spec. When a test fails, the **application code** has a bug — the test is correct by definition. The auto-fix loop fixes application bugs, never test code.

## Critical Constraints

- **NEVER** modify test assertions to make them pass — tests define truth
- **ALWAYS** use Task tool for each test step — spawn `claude-workflow:test-executor` sub-agent, **NEVER** execute tests inline in the orchestrator context
- **ALWAYS** use Task tool for bug fixes — spawn `claude-workflow:bug-fixer` sub-agent, **NEVER** fix bugs inline
- **ALWAYS** fix application code, not tests — when tests fail, the application has a bug
- **ALWAYS** run regression check at session start and re-check after each bug fix
- **ALWAYS** update task status via TaskUpdate before exiting

## Process

### Step 1: Locate Source

Determine the test source in this order:

1. User mentioned a specific directory containing `.feature` files → glob `*.feature` from that directory; source type = `gherkin`
2. User mentioned a specific spec path or spec name → locate matching `docs/specs/*/` directory, check for `*.feature` files
   - Found → source type = `gherkin`
   - Not found → source type = `prose`
3. User described a test scenario in natural language → source type = `prose`
4. **No source specified** → auto-discover:
   - Glob `docs/specs/*/` for spec directories, sorted by modification time
   - In the most recently modified directory, check for `*.feature` files
   - Found → source type = `gherkin`
   - Not found → source type = `prose`; use the spec `.md` file in that directory
   - Multiple directories modified at nearly the same time → use `AskUserQuestion` to confirm which spec

Record the resolved `gherkin_dir` before proceeding. For spec-linked suites, derive `artifacts_dir` as `gherkin_dir + "/testing"` immediately. For prose or ad-hoc suites where there is no spec directory, use `"artifacts"` as the `artifacts_dir`.

### Step 2: Check Task Board

Call `TaskList`. For each task whose subject starts with `E2E:`, call `TaskGet` to check if `metadata.test_suite == true` and `metadata.gherkin_dir` matches the resolved spec directory.

- **Not found** → proceed to Setup (Step 3)
- **Found, tests pending or failed** → proceed to Execute
- **Found, all tests complete** (all `test_result` values are `"passed"` or `"blocked"`) → show status summary (see `references/output-examples.md`), then ask using the conditional prompt below

**If all passed (none blocked):**
```
AskUserQuestion({
  questions: [{
    question: "All tests passed! What would you like to do next?",
    header: "Next action",
    options: [
      { label: "Run /cw-review", description: "Review code for bugs, security issues, and quality problems (recommended)" },
      { label: "Reset and re-run all", description: "Reset all test results to pending and re-execute the full suite" },
      { label: "Done", description: "Exit — results are saved on the task board" }
    ],
    multiSelect: false
  }]
})
```

**If some blocked:**
```
AskUserQuestion({
  questions: [{
    question: "Testing complete with blocked tests. What would you like to do?",
    header: "Next action",
    options: [
      { label: "Reset and re-run all", description: "Reset all test results to pending and re-execute the full suite" },
      { label: "Reset failed/blocked only", description: "Re-run only the tests that failed or were blocked" },
      { label: "Done", description: "Exit — results are saved on the task board" }
    ],
    multiSelect: false
  }]
})
```

On reset: update affected step tasks with `test_result: "pending"`, `fix_attempt: 0`, then proceed to Execute.

***

## Setup

### Step 3: Detect Backends

Check which tools are available:

```
# Chrome DevTools MCP — check tool availability without invoking
Check whether mcp__chrome-devtools__take_snapshot is in the available tool list.
Do NOT call any chrome-devtools tool — this would open a browser session uninvited.

# playwright-bdd (only offer if source type == gherkin)
command -v bddgen 2>/dev/null || npx bddgen --version 2>/dev/null
```

Build the list of available backends. Only include `playwright-bdd` if source type is `gherkin` and `bddgen` is found — it requires `.feature` files to function.

### Step 4: Select Backend

Present available backends via `AskUserQuestion`:

```
AskUserQuestion({
  questions: [{
    question: "Which automation backend should be used for this test suite?",
    header: "Backend",
    options: [
      // include only detected options from Step 3:
      {
        label: "playwright-bdd",
        description: "Compiled Gherkin → Playwright tests via bddgen. Deterministic, CI-friendly. Requires .feature files."
      },
      {
        label: "chrome-devtools",
        description: "AI-driven browser automation via Chrome DevTools MCP. Uses natural language test prompts."
      },
      {
        label: "cli",
        description: "Bash only — for API, CLI, or non-browser tests."
      },
      {
        label: "manual",
        description: "Step-by-step user confirmation. No automation tools required."
      }
    ],
    multiSelect: false
  }]
})
```

### Step 5: Setup (playwright-bdd only)

If backend == `playwright-bdd`, follow the setup procedure in `references/playwright-bdd-backend.md#Setup Procedure` before proceeding to Step 6.

### Step 6: Parse Source

Parse scenarios from the source. What you extract depends on the backend:

**If source type == `gherkin` and backend == `playwright-bdd`:**

Glob all `.feature` files. For each `Scenario:`, extract only:
- Scenario title → step task subject: `Test: [scenario title]`
- Full Given/When/Then text → step task description (for bug-fixer context)

Do not map to `action`/`verify` fields — execution is handled by Playwright, not the test-executor.

**If source type == `gherkin` and backend != `playwright-bdd`:**

Glob all `.feature` files. For each `Scenario:`, map clauses to task fields:

| Gherkin clause | Task field | Notes |
|----------------|------------|-------|
| `When` | `action.prompt` | Rewrite as imperative instruction; prepend `Given` context if helpful |
| `When` verb | `action.type` | `navigate` / `wait` / `interact` |
| `Then` + all `And` clauses | `verify.prompt` | Join into a single verification instruction |
| Scenario title | `verify.expected` | Concise label for the expected outcome |

**If source type == `prose`:**

Derive scenarios from the spec text. Map to `action`/`verify` fields as above.

### Step 7: Create Tasks

**Suite task**: call `TaskList` to get all tasks. `TaskList` does not support metadata filtering — for each task whose subject starts with `E2E:`, call `TaskGet` to read its full metadata and check if `metadata.test_suite == true` and `metadata.gherkin_dir` matches the current spec directory.

- **Found** → update `automation` metadata only. Do not recreate. Use the existing task ID.
- **Not found** → scan project config files (e.g., `package.json`, framework config files) for a dev server port or URL. Do not read `.env` files — they may contain credentials. If found, use it as `base_url`. If not found or ambiguous, ask the user to provide it — the user can type a custom value via the "Other" option. Create the suite task with the resolved URL as `base_url`:
  ```json
  {
    "test_type": "e2e",
    "test_suite": true,
    "base_url": "<user-selected URL>",
    "gherkin_dir": "docs/specs/<spec-name>",
    "artifacts_dir": "docs/specs/<spec-name>/testing",
    "automation": { "backend": "<selected>" },
    "fix_config": { "enabled": true, "max_attempts": 2 }
  }
  ```

For `playwright-bdd`, `automation` is:
```json
{ "backend": "playwright-bdd", "playwright_config": "docs/specs/<spec-name>/testing/playwright.config.ts" }
```

**Step tasks**: check `TaskList` for tasks already blocked by the suite task ID.

- **Found** → skip creation. Report the count to the user.
- **Not found** → create one step task per scenario using the fields extracted in Step 6. Each step task must include `test_result: "pending"` and `fix_attempt: 0` in its metadata so the Check Fix Eligibility step's decision table can evaluate correctly on first run. After creating all step tasks, call `TaskUpdate` on each with `addBlockedBy: [<suite_task_id>]`.

### Step 8: Output summary — see `references/output-examples.md`

***

## Execute

### Pre-run

Before entering the loop, read the parent suite task and check `automation.backend`:

- **If `automation.backend` is absent or unset**: the suite task was created by cw-gherkin without a backend selection. Detect available backends (same as Setup Step 3), present them via `AskUserQuestion` (same as Setup Step 4), then update the suite task with `automation: { "backend": "<selected>" }`. For `playwright-bdd`, follow the full Setup flow (Steps 3–8) before entering the execution loop — `playwright.config.ts` and step definitions must be generated first.
- **If `automation.backend == "playwright-bdd"`**: run `bddgen` once to ensure `.features-gen/` is current:

```bash
npx bddgen --config [automation.playwright_config]
```

If `bddgen` exits non-zero, stop immediately — missing step definitions must be resolved before the loop can proceed. Report the output to the user.

**Regression check** (run once before the loop begins):

For each task with `test_result == "passed"`, verify it still passes:

- **playwright-bdd**: run each scenario individually via `--grep` (escape regex-special characters `(`, `)`, `.`, `[`, `]`, `*`, `+`, `?`); parse `results.json`
- **Other backends**: spawn a `claude-workflow:test-executor` sub-agent per passed task

If any regression is detected, stop immediately and report which test failed before beginning the loop.

### 7-Step Execution Loop

#### Step 1: Select Next Test

Find the next task with `test_result == "pending"` or `"failed"` that is not yet `"blocked"`. Step 2 determines what to do with it.

#### Step 2: Check Fix Eligibility

Check task metadata to determine next action. Use the step task's `max_fix_attempts` if set; otherwise fall back to the suite task's `fix_config.max_attempts`.

| `test_result` | `fix_attempt` | Action |
|---------------|---------------|--------|
| `"pending"` | any | → Step 3 (execute or re-execute after fix) |
| `"failed"` | `< max_fix_attempts` | → Step 5 (fix decision gate) |
| `"failed"` | `>= max_fix_attempts` | mark `BLOCKED`, proceed to Step 7 |

#### Step 3: Spawn Test Executor

> **Check `automation.backend` on the parent suite task first.**
> - If `automation.backend == "playwright-bdd"` → use **Step 3b** instead.
> - Otherwise → use the standard flow below.

**REQUIRED**: Use the Task tool to spawn a sub-agent. Do NOT execute tests inline.

```
Task({
  subagent_type: "claude-workflow:test-executor",
  description: "Execute test [step_id]",
  prompt: "Execute test step [step_id]. Task ID: [native-task-id]. Read protocol at: skills/cw-testing/references/test-executor-protocol.md"
})
```

Wait for the sub-agent to complete, then read the task status via TaskGet. Proceed to Step 4.

#### Step 3b: Playwright Runner (playwright-bdd only)

Instead of spawning a test-executor, run the current scenario individually via Bash using `--grep`:

```bash
npx playwright test --config [playwright_config] \
  --grep "Exact Scenario Title" \
  --reporter=json
```

Where `[playwright_config]` comes from `automation.playwright_config` on the parent suite task, and the scenario title comes from the current step task subject (strip the `Test: ` prefix). Escape any regex-special characters in the title before passing to `--grep`.

After the command completes, read `[artifacts_dir]/results.json` (where `artifacts_dir` is `metadata.artifacts_dir` from the parent suite task) and find the matching scenario result.

Extract screenshot paths from `tests[0].results[0].attachments` — filter entries where `contentType == "image/png"` and collect their `path` values.

- **Passed** (`spec.ok == true`): `TaskUpdate` with `test_result: "passed"`, `passed_at: "<ISO timestamp>"`, and `artifacts: { screenshots: [<extracted paths>] }` — proceed to Step 7
- **Failed** (`spec.ok == false`): `TaskUpdate` with `test_result: "failed"`, `failed_at: "<ISO timestamp>"`, `failure_reason` from `tests[0].results[0].error.message`, and `artifacts: { screenshots: [<extracted paths>] }` — proceed to Step 5

Fixes target application code, **not** step definitions.

#### Step 4: Verify Result

Check task metadata for pass/fail. If passed, proceed to Step 7. If failed, continue to Step 5.

#### Step 5: Fix Decision Gate

If `fix_config.enabled` and `fix_attempt < max_fix_attempts` (step-level, falling back to suite `fix_config.max_attempts`), proceed to Step 6. Otherwise, mark the task `BLOCKED` with a `blocked_reason` explaining max attempts reached or fix disabled, then proceed to Step 7.

#### Step 6: Spawn Bug Fixer

**REQUIRED**: Use the Task tool to spawn a sub-agent. Do NOT fix bugs inline.

1. Create fix task with failure context (TaskCreate + TaskUpdate with metadata)
2. Spawn bug fixer:
```
Task({
  subagent_type: "claude-workflow:bug-fixer",
  description: "Fix bug causing [step_id] to fail",
  prompt: "Fix bug causing test [step_id] to fail. Fix Task ID: [fix-task-id]. Test Task ID: [test-task-id]. Read protocol at: skills/cw-testing/references/bug-fixer-protocol.md"
})
```

Wait for the sub-agent to complete, then read fix_result via TaskGet.

After the bug fixer completes (regardless of outcome), reset the test task via `TaskUpdate` with `test_result: "pending"` and increment `fix_attempt`.

Then run a **regression check** against all tasks with `test_result == "passed"` (same procedure as the pre-run regression check). If a regression is detected, stop immediately and report before proceeding to Step 7.

#### Step 7: Progress Check

Check stopping conditions (all passed or blocked, max iterations, no selectable tasks). If all tests are complete, output the final status summary (see `references/output-examples.md`) and use the conditional AskUserQuestion from Step 2 (all passed → offer /cw-review; some blocked → offer reset options). If continuing, return to Step 1.

### Output

See [output-examples.md](references/output-examples.md) for run output format.

***

## References

| Document | Contents |
|----------|----------|
| `references/e2e-metadata-schema.md` | Task metadata schema |
| `references/test-executor-protocol.md` | Test executor 4-step protocol |
| `references/bug-fixer-protocol.md` | Bug fixer 5-step protocol |
| `references/automation-backends.md` | Backend detection and usage |
| `references/playwright-bdd-backend.md` | playwright-bdd config, setup procedure, step patterns, CLI, result parsing |
| `references/output-examples.md` | Output format examples |

***

## Output Requirements

Always end with this output format:

```
CW-TESTING COMPLETE
====================
Tests: X/Y passed
  [PASS] Test: scenario title
  [FAIL] Test: scenario title → FIX task created
  [BLOCKED] Test: scenario title → reason

Bug fixes attempted: N
Bug fixes successful: N
```

## What Comes Next

After testing:
- **All passed** → Run `/cw-review` for a code quality check before merge
- **Some blocked** → Review fix task notes, manually fix, then invoke `/cw-testing` to reset and re-run blocked tests
- **Regression** → Investigate recent changes, fix, then invoke `/cw-testing` to reset and re-run

Related Skills

cw-worktree

from sighup/claude-workflow

Manages git worktrees for parallel feature development. This skill should be used when starting multiple features at once, or to list, switch between, and merge existing worktrees.

cw-validate

from sighup/claude-workflow

Validates implementation against spec using 6 gates and generates a coverage matrix. This skill should be used after implementation is complete to verify coverage, proof artifacts, and credential safety before review.

cw-spec

from sighup/claude-workflow

Generates a structured specification with demoable units, functional requirements, and proof artifact definitions. This skill should be used when starting a new feature to define what will be built before any code is written.

cw-review

from sighup/claude-workflow

Reviews implementation code for bugs, security issues, and quality problems. Creates FIX tasks for issues found. This skill should be used after cw-validate to catch issues before merge.

cw-review-team

from sighup/claude-workflow

Team-based concern-partitioned code review. Each reviewer sees ALL files through a specialized lens (security, correctness, spec compliance). This skill should be used after cw-validate for thorough cross-file review (requires CLAUDE_CODE_TASK_LIST_ID).

cw-research

from sighup/claude-workflow

Performs preliminary codebase fact-finding and produces a structured research report. This skill should be used before cw-spec to understand an unfamiliar or complex codebase and generate enriched context for specification writing.

cw-plan

from sighup/claude-workflow

Transforms a specification into a task graph with dependencies. This skill should be used after cw-spec to break a spec into executable tasks with proper sequencing before dispatching with cw-dispatch.

cw-gherkin

from sighup/claude-workflow

Internal subagent that generates Gherkin BDD scenarios from spec acceptance criteria. Produces one .feature file per demoable unit in the spec directory and optionally creates cw-testing task stubs on the task board. Called automatically by cw-spec.

cw-execute

from sighup/claude-workflow

Executes a single task from the task board using the 11-step implementation protocol. This skill should be used after cw-plan or cw-dispatch assigns a task, or when manually implementing a specific task by ID.

cw-dispatch

from sighup/claude-workflow

Identifies independent tasks and spawns parallel agent workers. This skill should be used after cw-plan to execute multiple tasks concurrently.

cw-dispatch-team

from sighup/claude-workflow

Persistent agent team dispatcher with lead coordination. This skill should be used after cw-plan to execute tasks via a managed team (requires CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 and CLAUDE_CODE_TASK_LIST_ID).

jepsen-testing

from plurigrid/asi

Jepsen-style correctness testing for distributed systems under faults (partitions, crashes, clock skew) using concurrent operation histories and formal checkers (linearizability/serializability and Elle-style anomalies). Use when designing, implementing, or running Jepsen tests, or interpreting histories/violations.