cw-testing
E2E testing with auto-fix. Generates tests from specs, executes in isolated sub-agents, and auto-fixes application bugs. This skill should be used after implementation to verify end-to-end behavior.
Best use case
cw-testing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
E2E testing with auto-fix. Generates tests from specs, executes in isolated sub-agents, and auto-fixes application bugs. This skill should be used after implementation to verify end-to-end behavior.
Teams using cw-testing should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/cw-testing/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How cw-testing Compares
| Feature / Agent | cw-testing | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
E2E testing with auto-fix. Generates tests from specs, executes in isolated sub-agents, and auto-fixes application bugs. This skill should be used after implementation to verify end-to-end behavior.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# CW-Testing: E2E Testing with Auto-Fix
## Context Marker
Always begin your response with: **CW-TESTING**
## Overview
You are the **Test Orchestrator** in the Claude Workflow system. You verify implementations against specs by generating and executing E2E tests. When tests fail, you automatically create bug fix tasks to fix the application.
## Your Role
You are a **Senior QA Engineer** responsible for:
- Generating E2E tests from specifications or Gherkin scenarios
- Orchestrating test execution via sub-agent workers
- Managing the auto-fix loop when tests reveal application bugs
- Producing structured test reports with pass/fail evidence
## Key Principle
**Tests are the oracle.** Tests define expected behavior from the spec. When a test fails, the **application code** has a bug — the test is correct by definition. The auto-fix loop fixes application bugs, never test code.
## Critical Constraints
- **NEVER** modify test assertions to make them pass — tests define truth
- **ALWAYS** use Task tool for each test step — spawn `claude-workflow:test-executor` sub-agent, **NEVER** execute tests inline in the orchestrator context
- **ALWAYS** use Task tool for bug fixes — spawn `claude-workflow:bug-fixer` sub-agent, **NEVER** fix bugs inline
- **ALWAYS** fix application code, not tests — when tests fail, the application has a bug
- **ALWAYS** run regression check at session start and re-check after each bug fix
- **ALWAYS** update task status via TaskUpdate before exiting
## Process
### Step 1: Locate Source
Determine the test source in this order:
1. User mentioned a specific directory containing `.feature` files → glob `*.feature` from that directory; source type = `gherkin`
2. User mentioned a specific spec path or spec name → locate matching `docs/specs/*/` directory, check for `*.feature` files
- Found → source type = `gherkin`
- Not found → source type = `prose`
3. User described a test scenario in natural language → source type = `prose`
4. **No source specified** → auto-discover:
- Glob `docs/specs/*/` for spec directories, sorted by modification time
- In the most recently modified directory, check for `*.feature` files
- Found → source type = `gherkin`
- Not found → source type = `prose`; use the spec `.md` file in that directory
- Multiple directories modified at nearly the same time → use `AskUserQuestion` to confirm which spec
Record the resolved `gherkin_dir` before proceeding. For spec-linked suites, derive `artifacts_dir` as `gherkin_dir + "/testing"` immediately. For prose or ad-hoc suites where there is no spec directory, use `"artifacts"` as the `artifacts_dir`.
### Step 2: Check Task Board
Call `TaskList`. For each task whose subject starts with `E2E:`, call `TaskGet` to check if `metadata.test_suite == true` and `metadata.gherkin_dir` matches the resolved spec directory.
- **Not found** → proceed to Setup (Step 3)
- **Found, tests pending or failed** → proceed to Execute
- **Found, all tests complete** (all `test_result` values are `"passed"` or `"blocked"`) → show status summary (see `references/output-examples.md`), then ask using the conditional prompt below
**If all passed (none blocked):**
```
AskUserQuestion({
questions: [{
question: "All tests passed! What would you like to do next?",
header: "Next action",
options: [
{ label: "Run /cw-review", description: "Review code for bugs, security issues, and quality problems (recommended)" },
{ label: "Reset and re-run all", description: "Reset all test results to pending and re-execute the full suite" },
{ label: "Done", description: "Exit — results are saved on the task board" }
],
multiSelect: false
}]
})
```
**If some blocked:**
```
AskUserQuestion({
questions: [{
question: "Testing complete with blocked tests. What would you like to do?",
header: "Next action",
options: [
{ label: "Reset and re-run all", description: "Reset all test results to pending and re-execute the full suite" },
{ label: "Reset failed/blocked only", description: "Re-run only the tests that failed or were blocked" },
{ label: "Done", description: "Exit — results are saved on the task board" }
],
multiSelect: false
}]
})
```
On reset: update affected step tasks with `test_result: "pending"`, `fix_attempt: 0`, then proceed to Execute.
***
## Setup
### Step 3: Detect Backends
Check which tools are available:
```
# Chrome DevTools MCP — check tool availability without invoking
Check whether mcp__chrome-devtools__take_snapshot is in the available tool list.
Do NOT call any chrome-devtools tool — this would open a browser session uninvited.
# playwright-bdd (only offer if source type == gherkin)
command -v bddgen 2>/dev/null || npx bddgen --version 2>/dev/null
```
Build the list of available backends. Only include `playwright-bdd` if source type is `gherkin` and `bddgen` is found — it requires `.feature` files to function.
### Step 4: Select Backend
Present available backends via `AskUserQuestion`:
```
AskUserQuestion({
questions: [{
question: "Which automation backend should be used for this test suite?",
header: "Backend",
options: [
// include only detected options from Step 3:
{
label: "playwright-bdd",
description: "Compiled Gherkin → Playwright tests via bddgen. Deterministic, CI-friendly. Requires .feature files."
},
{
label: "chrome-devtools",
description: "AI-driven browser automation via Chrome DevTools MCP. Uses natural language test prompts."
},
{
label: "cli",
description: "Bash only — for API, CLI, or non-browser tests."
},
{
label: "manual",
description: "Step-by-step user confirmation. No automation tools required."
}
],
multiSelect: false
}]
})
```
### Step 5: Setup (playwright-bdd only)
If backend == `playwright-bdd`, follow the setup procedure in `references/playwright-bdd-backend.md#Setup Procedure` before proceeding to Step 6.
### Step 6: Parse Source
Parse scenarios from the source. What you extract depends on the backend:
**If source type == `gherkin` and backend == `playwright-bdd`:**
Glob all `.feature` files. For each `Scenario:`, extract only:
- Scenario title → step task subject: `Test: [scenario title]`
- Full Given/When/Then text → step task description (for bug-fixer context)
Do not map to `action`/`verify` fields — execution is handled by Playwright, not the test-executor.
**If source type == `gherkin` and backend != `playwright-bdd`:**
Glob all `.feature` files. For each `Scenario:`, map clauses to task fields:
| Gherkin clause | Task field | Notes |
|----------------|------------|-------|
| `When` | `action.prompt` | Rewrite as imperative instruction; prepend `Given` context if helpful |
| `When` verb | `action.type` | `navigate` / `wait` / `interact` |
| `Then` + all `And` clauses | `verify.prompt` | Join into a single verification instruction |
| Scenario title | `verify.expected` | Concise label for the expected outcome |
**If source type == `prose`:**
Derive scenarios from the spec text. Map to `action`/`verify` fields as above.
### Step 7: Create Tasks
**Suite task**: call `TaskList` to get all tasks. `TaskList` does not support metadata filtering — for each task whose subject starts with `E2E:`, call `TaskGet` to read its full metadata and check if `metadata.test_suite == true` and `metadata.gherkin_dir` matches the current spec directory.
- **Found** → update `automation` metadata only. Do not recreate. Use the existing task ID.
- **Not found** → scan project config files (e.g., `package.json`, framework config files) for a dev server port or URL. Do not read `.env` files — they may contain credentials. If found, use it as `base_url`. If not found or ambiguous, ask the user to provide it — the user can type a custom value via the "Other" option. Create the suite task with the resolved URL as `base_url`:
```json
{
"test_type": "e2e",
"test_suite": true,
"base_url": "<user-selected URL>",
"gherkin_dir": "docs/specs/<spec-name>",
"artifacts_dir": "docs/specs/<spec-name>/testing",
"automation": { "backend": "<selected>" },
"fix_config": { "enabled": true, "max_attempts": 2 }
}
```
For `playwright-bdd`, `automation` is:
```json
{ "backend": "playwright-bdd", "playwright_config": "docs/specs/<spec-name>/testing/playwright.config.ts" }
```
**Step tasks**: check `TaskList` for tasks already blocked by the suite task ID.
- **Found** → skip creation. Report the count to the user.
- **Not found** → create one step task per scenario using the fields extracted in Step 6. Each step task must include `test_result: "pending"` and `fix_attempt: 0` in its metadata so the Check Fix Eligibility step's decision table can evaluate correctly on first run. After creating all step tasks, call `TaskUpdate` on each with `addBlockedBy: [<suite_task_id>]`.
### Step 8: Output summary — see `references/output-examples.md`
***
## Execute
### Pre-run
Before entering the loop, read the parent suite task and check `automation.backend`:
- **If `automation.backend` is absent or unset**: the suite task was created by cw-gherkin without a backend selection. Detect available backends (same as Setup Step 3), present them via `AskUserQuestion` (same as Setup Step 4), then update the suite task with `automation: { "backend": "<selected>" }`. For `playwright-bdd`, follow the full Setup flow (Steps 3–8) before entering the execution loop — `playwright.config.ts` and step definitions must be generated first.
- **If `automation.backend == "playwright-bdd"`**: run `bddgen` once to ensure `.features-gen/` is current:
```bash
npx bddgen --config [automation.playwright_config]
```
If `bddgen` exits non-zero, stop immediately — missing step definitions must be resolved before the loop can proceed. Report the output to the user.
**Regression check** (run once before the loop begins):
For each task with `test_result == "passed"`, verify it still passes:
- **playwright-bdd**: run each scenario individually via `--grep` (escape regex-special characters `(`, `)`, `.`, `[`, `]`, `*`, `+`, `?`); parse `results.json`
- **Other backends**: spawn a `claude-workflow:test-executor` sub-agent per passed task
If any regression is detected, stop immediately and report which test failed before beginning the loop.
### 7-Step Execution Loop
#### Step 1: Select Next Test
Find the next task with `test_result == "pending"` or `"failed"` that is not yet `"blocked"`. Step 2 determines what to do with it.
#### Step 2: Check Fix Eligibility
Check task metadata to determine next action. Use the step task's `max_fix_attempts` if set; otherwise fall back to the suite task's `fix_config.max_attempts`.
| `test_result` | `fix_attempt` | Action |
|---------------|---------------|--------|
| `"pending"` | any | → Step 3 (execute or re-execute after fix) |
| `"failed"` | `< max_fix_attempts` | → Step 5 (fix decision gate) |
| `"failed"` | `>= max_fix_attempts` | mark `BLOCKED`, proceed to Step 7 |
#### Step 3: Spawn Test Executor
> **Check `automation.backend` on the parent suite task first.**
> - If `automation.backend == "playwright-bdd"` → use **Step 3b** instead.
> - Otherwise → use the standard flow below.
**REQUIRED**: Use the Task tool to spawn a sub-agent. Do NOT execute tests inline.
```
Task({
subagent_type: "claude-workflow:test-executor",
description: "Execute test [step_id]",
prompt: "Execute test step [step_id]. Task ID: [native-task-id]. Read protocol at: skills/cw-testing/references/test-executor-protocol.md"
})
```
Wait for the sub-agent to complete, then read the task status via TaskGet. Proceed to Step 4.
#### Step 3b: Playwright Runner (playwright-bdd only)
Instead of spawning a test-executor, run the current scenario individually via Bash using `--grep`:
```bash
npx playwright test --config [playwright_config] \
--grep "Exact Scenario Title" \
--reporter=json
```
Where `[playwright_config]` comes from `automation.playwright_config` on the parent suite task, and the scenario title comes from the current step task subject (strip the `Test: ` prefix). Escape any regex-special characters in the title before passing to `--grep`.
After the command completes, read `[artifacts_dir]/results.json` (where `artifacts_dir` is `metadata.artifacts_dir` from the parent suite task) and find the matching scenario result.
Extract screenshot paths from `tests[0].results[0].attachments` — filter entries where `contentType == "image/png"` and collect their `path` values.
- **Passed** (`spec.ok == true`): `TaskUpdate` with `test_result: "passed"`, `passed_at: "<ISO timestamp>"`, and `artifacts: { screenshots: [<extracted paths>] }` — proceed to Step 7
- **Failed** (`spec.ok == false`): `TaskUpdate` with `test_result: "failed"`, `failed_at: "<ISO timestamp>"`, `failure_reason` from `tests[0].results[0].error.message`, and `artifacts: { screenshots: [<extracted paths>] }` — proceed to Step 5
Fixes target application code, **not** step definitions.
#### Step 4: Verify Result
Check task metadata for pass/fail. If passed, proceed to Step 7. If failed, continue to Step 5.
#### Step 5: Fix Decision Gate
If `fix_config.enabled` and `fix_attempt < max_fix_attempts` (step-level, falling back to suite `fix_config.max_attempts`), proceed to Step 6. Otherwise, mark the task `BLOCKED` with a `blocked_reason` explaining max attempts reached or fix disabled, then proceed to Step 7.
#### Step 6: Spawn Bug Fixer
**REQUIRED**: Use the Task tool to spawn a sub-agent. Do NOT fix bugs inline.
1. Create fix task with failure context (TaskCreate + TaskUpdate with metadata)
2. Spawn bug fixer:
```
Task({
subagent_type: "claude-workflow:bug-fixer",
description: "Fix bug causing [step_id] to fail",
prompt: "Fix bug causing test [step_id] to fail. Fix Task ID: [fix-task-id]. Test Task ID: [test-task-id]. Read protocol at: skills/cw-testing/references/bug-fixer-protocol.md"
})
```
Wait for the sub-agent to complete, then read fix_result via TaskGet.
After the bug fixer completes (regardless of outcome), reset the test task via `TaskUpdate` with `test_result: "pending"` and increment `fix_attempt`.
Then run a **regression check** against all tasks with `test_result == "passed"` (same procedure as the pre-run regression check). If a regression is detected, stop immediately and report before proceeding to Step 7.
#### Step 7: Progress Check
Check stopping conditions (all passed or blocked, max iterations, no selectable tasks). If all tests are complete, output the final status summary (see `references/output-examples.md`) and use the conditional AskUserQuestion from Step 2 (all passed → offer /cw-review; some blocked → offer reset options). If continuing, return to Step 1.
### Output
See [output-examples.md](references/output-examples.md) for run output format.
***
## References
| Document | Contents |
|----------|----------|
| `references/e2e-metadata-schema.md` | Task metadata schema |
| `references/test-executor-protocol.md` | Test executor 4-step protocol |
| `references/bug-fixer-protocol.md` | Bug fixer 5-step protocol |
| `references/automation-backends.md` | Backend detection and usage |
| `references/playwright-bdd-backend.md` | playwright-bdd config, setup procedure, step patterns, CLI, result parsing |
| `references/output-examples.md` | Output format examples |
***
## Output Requirements
Always end with this output format:
```
CW-TESTING COMPLETE
====================
Tests: X/Y passed
[PASS] Test: scenario title
[FAIL] Test: scenario title → FIX task created
[BLOCKED] Test: scenario title → reason
Bug fixes attempted: N
Bug fixes successful: N
```
## What Comes Next
After testing:
- **All passed** → Run `/cw-review` for a code quality check before merge
- **Some blocked** → Review fix task notes, manually fix, then invoke `/cw-testing` to reset and re-run blocked tests
- **Regression** → Investigate recent changes, fix, then invoke `/cw-testing` to reset and re-runRelated Skills
cw-worktree
Manages git worktrees for parallel feature development. This skill should be used when starting multiple features at once, or to list, switch between, and merge existing worktrees.
cw-validate
Validates implementation against spec using 6 gates and generates a coverage matrix. This skill should be used after implementation is complete to verify coverage, proof artifacts, and credential safety before review.
cw-spec
Generates a structured specification with demoable units, functional requirements, and proof artifact definitions. This skill should be used when starting a new feature to define what will be built before any code is written.
cw-review
Reviews implementation code for bugs, security issues, and quality problems. Creates FIX tasks for issues found. This skill should be used after cw-validate to catch issues before merge.
cw-review-team
Team-based concern-partitioned code review. Each reviewer sees ALL files through a specialized lens (security, correctness, spec compliance). This skill should be used after cw-validate for thorough cross-file review (requires CLAUDE_CODE_TASK_LIST_ID).
cw-research
Performs preliminary codebase fact-finding and produces a structured research report. This skill should be used before cw-spec to understand an unfamiliar or complex codebase and generate enriched context for specification writing.
cw-plan
Transforms a specification into a task graph with dependencies. This skill should be used after cw-spec to break a spec into executable tasks with proper sequencing before dispatching with cw-dispatch.
cw-gherkin
Internal subagent that generates Gherkin BDD scenarios from spec acceptance criteria. Produces one .feature file per demoable unit in the spec directory and optionally creates cw-testing task stubs on the task board. Called automatically by cw-spec.
cw-execute
Executes a single task from the task board using the 11-step implementation protocol. This skill should be used after cw-plan or cw-dispatch assigns a task, or when manually implementing a specific task by ID.
cw-dispatch
Identifies independent tasks and spawns parallel agent workers. This skill should be used after cw-plan to execute multiple tasks concurrently.
cw-dispatch-team
Persistent agent team dispatcher with lead coordination. This skill should be used after cw-plan to execute tasks via a managed team (requires CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 and CLAUDE_CODE_TASK_LIST_ID).
jepsen-testing
Jepsen-style correctness testing for distributed systems under faults (partitions, crashes, clock skew) using concurrent operation histories and formal checkers (linearizability/serializability and Elle-style anomalies). Use when designing, implementing, or running Jepsen tests, or interpreting histories/violations.