nw-agent-testing

5-layer testing approach for agent validation including adversarial testing, security validation, and prompt injection resistance

322 stars

Best use case

nw-agent-testing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using nw-agent-testing can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/nw-agent-testing/SKILL.md --create-dirs "https://raw.githubusercontent.com/nWave-ai/nWave/main/nWave/skills/nw-agent-testing/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/nw-agent-testing/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How nw-agent-testing Compares

| Feature / Agent | nw-agent-testing | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

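It provides a 5-layer testing approach for validating agents: output quality, integration and handoff validation, adversarial output validation, peer review of the agent design, and security validation, including prompt injection resistance.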

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Agent Testing Framework

## 5-Layer Testing Approach

### Layer 1: Output Quality (Unit-Level)

Validate that the agent produces correct, well-structured outputs for typical inputs.

**Test**: Agent follows workflow phases | Outputs match expected format/structure | Domain-specific rules correctly applied | Token efficiency within bounds

**How**: Manual invocation with representative inputs. Check against acceptance criteria in agent description.
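Part of the Layer 1 check can be scripted once the acceptance criteria are written down. The following is a minimal sketch, assuming the agent emits markdown and that the criteria reduce to a set of required headings plus a length budget. The function name `check_output`, the heading names, and the word-count limit (a rough stand-in for a token budget) are all hypothetical and not part of the skill.

```python
import re

def check_output(text, required_headings, max_words):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    # Collect the titles of all markdown headings (any level) in the output.
    found = {
        m.group(1).strip()
        for m in re.finditer(r"^#+\s+(.+)$", text, re.MULTILINE)
    }
    for heading in required_headings:
        if heading not in found:
            failures.append(f"missing section: {heading}")
    # Word count is only a rough proxy for token usage.
    words = len(text.split())
    if words > max_words:
        failures.append(f"too long: {words} words > {max_words}")
    return failures

# Hypothetical agent output and acceptance criteria.
sample = "# Summary\nAll good.\n\n# Risks\nNone found.\n"
print(check_output(sample, ["Summary", "Risks", "Next Steps"], max_words=50))
# prints ['missing section: Next Steps']
```

A check like this only covers structure. Whether domain-specific rules were applied correctly still needs a manual read against the agent description.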

### Layer 2: Integration / Handoff Validation

Validate correct input/output between agents in workflows.

**Test**: Input parsing handles upstream format | Output format matches downstream expectations | Error signals propagate correctly | Subagent mode activation works (skip greet, execute autonomously)

**How**: End-to-end workflow execution through full agent chain (e.g., DISCUSS -> DESIGN -> DELIVER).
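One way to make handoff expectations explicit before running the full chain is to write the downstream agent's input contract as data and check upstream output against it. This is a sketch under assumptions: it presumes the handoff can be represented as a dictionary, and the contract, field names, and `validate_handoff` helper are hypothetical illustrations rather than anything defined by nWave.

```python
def validate_handoff(upstream_output, required_fields):
    """Check an upstream agent's output against a downstream input contract.

    required_fields maps field name -> expected Python type.
    Returns a list of problems; an empty list means the handoff is valid.
    """
    problems = []
    for name, expected_type in required_fields.items():
        if name not in upstream_output:
            problems.append(f"missing field: {name}")
        elif not isinstance(upstream_output[name], expected_type):
            actual = type(upstream_output[name]).__name__
            problems.append(
                f"wrong type for {name}: expected {expected_type.__name__}, got {actual}"
            )
    return problems

# Hypothetical contract: what a DESIGN agent expects to receive from DISCUSS.
DESIGN_INPUT_CONTRACT = {"requirements": list, "status": str}

discuss_output = {"requirements": ["login page"], "status": 200}
print(validate_handoff(discuss_output, DESIGN_INPUT_CONTRACT))
# prints ['wrong type for status: expected str, got int']
```

Keeping the contract as plain data means the same check can run at each link of the chain with a different contract.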

### Layer 3: Adversarial Output Validation

Challenge validity of agent outputs rather than accepting at face value.

**Test**: Source verification (cited sources real and accurate?) | Bias detection (favors one approach without evidence?) | Edge case coverage | Completeness (required sections present?)

**How**: Peer review by the corresponding `-reviewer` agent using structured critique dimensions.

### Layer 4: Adversarial Verification (Peer Review)

Independent review to catch biases and blind spots in agent design.

**Test**: Definition follows validation checklist? | Redundant Claude default instructions? | Over/under-specified? | Could simpler agent achieve same results?

**How**: `@nw-agent-builder` validates via 11-point checklist or `@agent-builder-reviewer` runs structured review.

### Layer 5: Security Validation

Test resilience against misuse and prompt injection.

**Test**: Tool restriction enforcement | maxTurns respected | Permission mode correctly scoped | Agent stays within declared scope

**How**: These constraints are enforced at the platform level through frontmatter fields. Verify the configuration.

## Prompt Injection Resistance

Claude Code platform provides injection resistance through: subagent isolation (own context, no sub-subagents) | Tool restriction via frontmatter `tools` | Permission modes via `permissionMode` | Hook-based validation (PreToolUse, PostToolUse)

Do NOT add prose-based injection defense. Configure platform features:

```yaml
---
tools: Read, Glob, Grep           # Only tools this agent needs
maxTurns: 30                       # Prevents runaway execution
permissionMode: default            # User approves dangerous actions
---
```
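For the hook-based validation mentioned above, the decision logic of a PreToolUse hook might look like the sketch below. It assumes the hook receives an event with `tool_name` and `tool_input` fields and that a non-zero code of 2 signals "block this call". Treat both as assumptions to confirm against the Claude Code hooks documentation. The `decide` function and the pattern list are hypothetical and deliberately incomplete.

```python
# Illustrative deny-list only; a real policy would need to be far more careful.
BLOCKED_PATTERNS = ("rm -rf", "| sh")

def decide(event):
    """Return (exit_code, message) for a PreToolUse-style event dict.

    Assumed convention: 0 allows the tool call, 2 blocks it.
    """
    if event.get("tool_name") != "Bash":
        return 0, ""
    command = event.get("tool_input", {}).get("command", "")
    for pattern in BLOCKED_PATTERNS:
        if pattern in command:
            return 2, f"blocked: command contains {pattern!r}"
    return 0, ""

# Hypothetical events, shaped as assumed above.
print(decide({"tool_name": "Bash", "tool_input": {"command": "rm -rf /tmp/x"}}))
print(decide({"tool_name": "Read", "tool_input": {"file_path": "README.md"}}))
```

In an actual hook script, the event would be read from the hook's input and the code passed to `sys.exit`. Keeping `decide` as a pure function makes it testable without the platform.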

## Security Validation Checklist

- [ ] `tools` restricted to minimum necessary (least privilege)
- [ ] `maxTurns` set to prevent runaway execution
- [ ] `permissionMode` appropriate for risk level
- [ ] No `Bash` unless agent requires command execution
- [ ] No `Write` unless agent creates/modifies files
- [ ] Description accurately describes scope
- [ ] Subagent mode handles autonomous execution correctly
- [ ] No sensitive data hardcoded in definition
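Several items in this checklist are mechanical enough to script. The following sketch assumes the agent definition starts with flat `key: value` frontmatter like the example in this document and parses it by hand to avoid a YAML dependency. The `parse_frontmatter` and `audit` helpers and the sample definition are hypothetical. Items such as scope accuracy and hardcoded sensitive data still need human review.

```python
def parse_frontmatter(definition):
    """Parse flat `key: value` frontmatter between the first two `---` lines."""
    lines = definition.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            # Drop any trailing `# comment`, then surrounding whitespace.
            fields[key.strip()] = value.split("#", 1)[0].strip()
    return fields

def audit(definition, needs_shell=False, needs_write=False):
    """Return checklist violations found in an agent definition."""
    fm = parse_frontmatter(definition)
    issues = []
    tools = [t.strip() for t in fm.get("tools", "").split(",") if t.strip()]
    if not tools:
        issues.append("tools not restricted")
    if "Bash" in tools and not needs_shell:
        issues.append("Bash granted but not required")
    if "Write" in tools and not needs_write:
        issues.append("Write granted but not required")
    if not fm.get("maxTurns", "").isdigit():
        issues.append("maxTurns not set")
    if "permissionMode" not in fm:
        issues.append("permissionMode not set")
    return issues

# Hypothetical agent definition used only for illustration.
agent = """---
tools: Read, Grep, Bash   # read-only reviewer
permissionMode: default
---
You review code.
"""
print(audit(agent))
# prints ['Bash granted but not required', 'maxTurns not set']
```

The `needs_shell` and `needs_write` flags encode the reviewer's judgment about what the agent legitimately requires, so the script flags mismatches rather than deciding policy.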

## Testing Workflow for New Agents

1. **Create** with minimal definition
2. **Layer 1**: Invoke with 2-3 representative inputs, check outputs
3. **Layer 2**: Run in workflow chain if applicable
4. **Fix** failures observed
5. **Validate**: Run 11-point checklist
6. **Iterate**: Add instructions only for observed failure modes

Related Skills

nw-property-based-testing

322
from nWave-ai/nWave

Property-based testing strategies, mutation testing, shrinking, and combined PBT+mutation workflow for test quality validation

nw-hexagonal-testing

322
from nWave-ai/nWave

5-layer agent output validation, I/O contract specification, vertical slice development, and test doubles policy with per-layer examples

nw-ux-web-patterns

322
from nWave-ai/nWave

Web UI design patterns for product owners. Load when designing web application interfaces, writing web-specific acceptance criteria, or evaluating responsive designs.

nw-ux-tui-patterns

322
from nWave-ai/nWave

Terminal UI and CLI design patterns for product owners. Load when designing command-line tools, interactive terminal applications, or writing CLI-specific acceptance criteria.

nw-ux-principles

322
from nWave-ai/nWave

Core UX principles for product owners. Load when evaluating interface designs, writing acceptance criteria with UX requirements, or reviewing wireframes and mockups.

nw-ux-emotional-design

322
from nWave-ai/nWave

Emotional design and delight patterns for product owners. Load when designing onboarding flows, empty states, first-run experiences, or evaluating the emotional quality of an interface.

nw-ux-desktop-patterns

322
from nWave-ai/nWave

Desktop application UI patterns for product owners. Load when designing native or cross-platform desktop applications, writing desktop-specific acceptance criteria, or evaluating panel layouts and keyboard workflows.

nw-user-story-mapping

322
from nWave-ai/nWave

User story mapping for backlog management and outcome-based prioritization. Load during Phase 2.5 (User Story Mapping) to produce story-map.md and prioritization.md.

nw-tr-review-criteria

322
from nWave-ai/nWave

Review dimensions and scoring for root cause analysis quality assessment

nw-tlaplus-verification

322
from nWave-ai/nWave

TLA+ formal verification for design correctness and PBT pipeline integration

nw-test-refactoring-catalog

322
from nWave-ai/nWave

Detailed refactoring mechanics with step-by-step procedures, and test code smell catalog with detection patterns and before/after examples

nw-test-organization-conventions

322
from nWave-ai/nWave

Test directory structure patterns by architecture style, language conventions, naming rules, and fixture placement. Decision tree for selecting test organization strategy.