agent-ops-testing

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

181 stars

by majiayu000


Best use case

agent-ops-testing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using agent-ops-testing can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/agent-ops-testing/SKILL.md --create-dirs "https://raw.githubusercontent.com/majiayu000/claude-skill-registry/main/skills/core/agent-ops-testing/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/agent-ops-testing/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How agent-ops-testing Compares

Feature / Agent	agent-ops-testing	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Testing Workflow

**Works with or without `aoc` CLI installed.** Issue tracking can be done via direct file editing.

## Purpose

Provide structured guidance for test design, execution, and analysis that goes beyond baseline capture. This skill covers test strategy during planning, incremental testing during implementation, and coverage analysis.

## Test Commands (from constitution)

```bash
# Python (uv/pytest)
uv run pytest                           # Run all tests
uv run pytest tests/ -v                 # Verbose output
uv run pytest tests/ -m "not slow"      # Skip slow tests
uv run pytest tests/ --tb=short -q      # Quick summary
uv run pytest --cov=src --cov-report=html  # Coverage report

# TypeScript/Node (vitest/jest)
npm run test                            # Run all tests
npm run test -- --coverage              # With coverage

# .NET (dotnet test)
dotnet test                             # Run all tests
dotnet test --collect:"XPlat Code Coverage"  # With coverage
```

## Issue Tracking (File-Based — Default)

| Operation | How to Do It |
|-----------|--------------|
| Create test issue | Append to `.agent/issues/medium.md` with type `TEST` |
| Create bug from failure | Append to `.agent/issues/high.md` with type `BUG` |
| Log test results | Edit issue's `### Log` section in priority file |

### Example: Post-Test Issue Creation (File-Based)

1. Increment `.agent/issues/.counter`
2. Append issue to appropriate priority file
3. Add log entry with test run results
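
The three steps above can be sketched in Python as follows. This is a minimal illustration, not the skill's actual implementation: the counter format (a bare integer), the `TYPE-NNNN` identifier, and the issue layout are assumptions, and `create_issue` is a hypothetical helper name.

```python
from pathlib import Path


def create_issue(issues_dir: Path, priority: str, issue_type: str,
                 title: str, log_entry: str) -> str:
    """Create an issue by following the three file-based steps.

    Assumptions: `.counter` holds a single integer, and issues are
    written as a heading followed by a `### Log` section.
    """
    # Step 1: increment the counter (a missing or empty file counts as 0).
    counter_file = issues_dir / ".counter"
    current = int(counter_file.read_text().strip() or "0") if counter_file.exists() else 0
    next_id = current + 1
    counter_file.write_text(f"{next_id}\n")

    # Step 2: append the issue to the matching priority file.
    issue_id = f"{issue_type}-{next_id:04d}"
    entry = (
        f"\n## {issue_id}: {title}\n"
        "\n### Log\n"
        # Step 3: record the test run results in the issue's log section.
        f"- {log_entry}\n"
    )
    priority_file = issues_dir / f"{priority}.md"
    with priority_file.open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return issue_id
```

Because the directory is a parameter, a test can pass a temporary directory instead of the real `.agent/issues/` folder.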

## CLI Integration (when aoc is available)

When `aoc` CLI is detected in `.agent/tools.json`, these commands provide convenience shortcuts:

| Operation | Command |
|-----------|---------|
| Create test issue | `aoc issues create --type TEST --title "Add tests for..."` |
| Create bug from failure | `aoc issues create --type BUG --priority high --title "Test failure: ..."` |
| Log test results | `aoc issues update <ID> --log "Tests: 45 pass, 2 fail"` |

## Test Isolation (MANDATORY)

**Tests must NEVER create, modify, or delete files in the project folder.**

### Unit Tests
- Use mocks/patches for ALL file system operations
- Use in-memory data structures where possible
- NEVER call real file I/O against project paths
- Use `unittest.mock.patch` for `Path`, `open()`, file operations
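
As a rough sketch of the patching rule above, the test below replaces the built-in `open` with `unittest.mock.mock_open`, so nothing on disk is read. `load_test_command` is a hypothetical function invented for this example, and the file contents are made up.

```python
from unittest.mock import mock_open, patch


def load_test_command(path: str) -> str:
    """Hypothetical helper: return the first non-empty line of a file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle.read().splitlines():
            if line.strip():
                return line.strip()
    return ""


def test_load_test_command_reads_first_line():
    fake_file = mock_open(read_data="\nuv run pytest\nnpm run test\n")
    # Patch the built-in open so the project file is never touched.
    with patch("builtins.open", fake_file):
        result = load_test_command(".agent/constitution.md")
    assert result == "uv run pytest"
    fake_file.assert_called_once_with(".agent/constitution.md", encoding="utf-8")
```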

### Integration Tests  
- ALWAYS use `pytest` `tmp_path` fixture (auto-cleaned)
- Use Docker containers for service dependencies (API, DB, etc.)
- Fixtures MUST handle cleanup on both success AND failure
- Test data lives ONLY in temp directories
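
One way to satisfy the "cleanup on both success AND failure" rule is a `try`/`finally` around the `yield`. The sketch below uses `contextlib.contextmanager` so it runs without pytest installed; in pytest the same shape would sit inside a yield fixture. `FakeService` is a hypothetical stand-in for a container or database handle.

```python
from contextlib import contextmanager


class FakeService:
    """Hypothetical stand-in for a container or database handle."""

    def __init__(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False


@contextmanager
def running_service(service: FakeService):
    """Start the service and stop it whether the body passes or raises."""
    service.start()
    try:
        yield service
    finally:
        # Executes on success and on failure, so nothing is left running.
        service.stop()
```

A test body that raises inside `with running_service(...)` still leaves the service stopped afterwards.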

### Forbidden Patterns
```python
from pathlib import Path

# ❌ NEVER do this - pollutes project
Path(".agent/test.md").write_text("test")
Path("src/data/fixture.json").write_text("{}")
open("tests/output.log", "w").write("log")

# ✅ Always use tmp_path
def test_example(tmp_path):
    test_file = tmp_path / "test.md"
    test_file.write_text("test")  # Auto-cleaned
    assert test_file.read_text() == "test"
```

### Review Checklist (before approving tests)
- [ ] No hardcoded paths to project directories
- [ ] All file operations use `tmp_path` or mocks
- [ ] Integration tests use fixtures with cleanup
- [ ] Docker fixtures auto-remove containers

## When to Use

- During planning: designing test strategy for new features
- During implementation: running incremental tests
- During review: analyzing coverage and gaps
- On demand: investigating test failures, improving test suite

## Preconditions

- `.agent/constitution.md` exists with confirmed test command
- `.agent/baseline.md` exists (for comparison)

## Test Strategy Design

### For New Features

1. **Identify test levels needed**:
   - Unit tests: isolated function/method behavior
   - Integration tests: component interaction
   - E2E tests: user-facing workflows (if applicable)

2. **Define test cases from requirements**:
   - Happy path: expected inputs → expected outputs
   - Edge cases: boundary values, empty inputs, max values
   - Error cases: invalid inputs, failure scenarios
   - Regression cases: ensure existing behavior unchanged

3. **Document in task/plan**:
   ```markdown
   ## Test Strategy
   - Unit: [list of unit test cases]
   - Integration: [list of integration scenarios]
   - Edge cases: [specific edge cases to cover]
   - Not testing: [explicitly excluded with rationale]
   ```
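
To make the happy path, edge case, and error case categories concrete, here is a small sketch against a hypothetical `clamp` function. Both the function and its requirements are invented for illustration. The error case uses a plain `try`/`except` so the block runs without pytest; with pytest, `pytest.raises(ValueError)` would be the idiomatic form.

```python
def clamp(value: int, low: int, high: int) -> int:
    """Hypothetical function under test: restrict value to [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))


def test_happy_path():
    # Expected input, expected output.
    assert clamp(5, 0, 10) == 5


def test_edge_cases():
    assert clamp(0, 0, 10) == 0      # lower boundary
    assert clamp(10, 0, 10) == 10    # upper boundary
    assert clamp(-1, 0, 10) == 0     # just below the range
    assert clamp(11, 0, 10) == 10    # just above the range


def test_error_case():
    # An inverted range is invalid input and should be rejected.
    try:
        clamp(5, 10, 0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for an inverted range")
```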

### For Bug Fixes

1. Write failing test FIRST (reproduces the bug)
2. Fix the bug
3. Verify test passes
4. Check for related regression tests needed
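
The sequence above might look like this in miniature. The bug (an `average` function that crashes on an empty list) and the chosen fix (returning `0.0`) are hypothetical; the point is that the same check fails before the fix and passes after it.

```python
def average_buggy(values):
    """Hypothetical pre-fix code: crashes on an empty list."""
    return sum(values) / len(values)


def average_fixed(values):
    """Hypothetical post-fix code: defines the empty case explicitly."""
    return sum(values) / len(values) if values else 0.0


def reproduces_bug(average) -> bool:
    """Regression check: True while the bug is still present."""
    try:
        return average([]) != 0.0
    except ZeroDivisionError:
        return True


# Step 1: the check detects the bug in the unfixed code.
assert reproduces_bug(average_buggy)
# Steps 2-3: after the fix, the same check no longer detects it.
assert not reproduces_bug(average_fixed)
# Step 4: a related regression case, non-empty input still works.
assert average_fixed([2, 4]) == 3.0
```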

## Test Execution

### Incremental Testing (during implementation)

After each implementation step:
1. Run the smallest reliable test subset covering changed code
2. If tests fail: stop, diagnose, fix before proceeding
3. Log test results in `.agent/focus.md`

### Full Test Suite (end of implementation)

1. Run complete test command from constitution
2. Compare results to baseline
3. Investigate ANY new failures (even in unrelated areas)
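
Comparing against the baseline amounts to set arithmetic on failing test names. The sketch below assumes the failing test IDs have already been extracted from `.agent/baseline.md` and from the current run. How they are extracted depends on the runner and is not shown; `compare_to_baseline` is a hypothetical helper.

```python
def compare_to_baseline(baseline_failures: set[str],
                        current_failures: set[str]) -> dict[str, list[str]]:
    """Split current results into new, pre-existing, and fixed failures."""
    return {
        # Failing now but not in the baseline: investigate every one.
        "new_failures": sorted(current_failures - baseline_failures),
        # Failing in both runs: already known.
        "pre_existing": sorted(current_failures & baseline_failures),
        # Failing in the baseline but passing now.
        "fixed": sorted(baseline_failures - current_failures),
    }
```

Any entry under `new_failures` calls for investigation, even when the test looks unrelated to the change.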

### Test Command Patterns

```bash
# Run specific test file
<test-runner> path/to/test_file.py

# Run tests matching pattern
<test-runner> -k "test_feature_name"

# Run with coverage
<test-runner> --coverage

# Run failed tests only (re-run)
<test-runner> --failed
```

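The patterns above are placeholders, and flag names differ between runners (for example, pytest re-runs failures with `--lf` and, with the pytest-cov plugin, reports coverage with `--cov`). Actual commands must come from the constitution.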

## Coverage Analysis

### Confidence-Based Coverage Thresholds (MANDATORY)

**Coverage requirements scale with confidence level:**

| Confidence | Line Coverage | Branch Coverage | Enforcement |
|------------|---------------|-----------------|-------------|
| LOW | ≥90% on changed code | ≥85% on changed code | HARD — blocks completion |
| NORMAL | ≥80% on changed code | ≥70% on changed code | SOFT — warning if missed |
| HIGH | Tests pass | N/A | MINIMAL — existing tests only |

**Rationale:**
- LOW confidence = more unknowns = more code paths to verify
- HIGH confidence = well-understood = existing tests sufficient

**Enforcement:**
```
🎯 COVERAGE CHECK — {CONFIDENCE} Confidence

Required: ≥{line_threshold}% line, ≥{branch_threshold}% branch
Actual:   {actual_line}% line, {actual_branch}% branch

[PASS] Coverage meets threshold
— OR —
[FAIL] Coverage below threshold — must add tests before completion
```

**For LOW confidence failures:**
- Coverage failure is a HARD BLOCK
- Cannot proceed until threshold is met
- Document why if threshold is truly unachievable (rare)
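
The threshold table can be encoded directly. The following is a minimal sketch under the assumption that line and branch percentages for the changed code are already available from the coverage tool; `Threshold` and `check_coverage` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    line: Optional[float]    # required line coverage in percent, None = N/A
    branch: Optional[float]  # required branch coverage in percent, None = N/A
    hard: bool               # True if a miss blocks completion


THRESHOLDS = {
    "LOW": Threshold(line=90.0, branch=85.0, hard=True),
    "NORMAL": Threshold(line=80.0, branch=70.0, hard=False),
    "HIGH": Threshold(line=None, branch=None, hard=False),
}


def check_coverage(confidence: str, actual_line: float,
                   actual_branch: float) -> tuple[bool, bool]:
    """Return (meets_threshold, blocks_completion) for changed code."""
    threshold = THRESHOLDS[confidence]
    if threshold.line is None or threshold.branch is None:
        # HIGH confidence: passing tests are enough, coverage is not checked.
        return True, False
    meets = actual_line >= threshold.line and actual_branch >= threshold.branch
    return meets, threshold.hard and not meets
```

For example, `check_coverage("LOW", 92.0, 80.0)` reports a miss that blocks completion, because branch coverage is below 85%.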

### When to Analyze Coverage

- After completing a feature (before critical review)
- When investigating untested code paths
- During improvement discovery

### Coverage Metrics to Track

| Metric | Target | Notes |
|--------|--------|-------|
| Line coverage | ≥80% for new code | Soft target at NORMAL confidence (LOW confidence enforces ≥90%); quality over quantity |
| Branch coverage | Critical paths covered | Focus on decision points |
| Uncovered lines | Document rationale | Some code legitimately untestable |

### Coverage Gaps to Flag

- New code with 0% coverage → **must address**
- Error handling paths untested → **should address**
- Complex logic untested → **investigate**
- Generated/boilerplate untested → **acceptable**

## Test Quality Checklist

### Good Tests

- [ ] Test behavior, not implementation
- [ ] Independent (no test order dependencies)
- [ ] Deterministic (same result every run)
- [ ] Fast (< 1 second per unit test)
- [ ] Readable (test name describes scenario)
- [ ] Minimal mocking (only external dependencies)

### Anti-Patterns to Avoid

- ❌ Testing implementation details (breaks on refactor)
- ❌ Excessive mocking (tests exercise the mocks, not the real code)
- ❌ Flaky tests (intermittent failures)
- ❌ Slow tests without justification
- ❌ Tests that require manual setup
- ❌ Commented-out tests

## Failure Investigation

When tests fail unexpectedly, **invoke `agent-ops-debugging`**:

1. **Apply systematic debugging process**:
   - Isolate: Run failing test alone
   - Reproduce: Confirm failure is consistent
   - Form hypothesis: What might cause this?
   - Test hypothesis: Add logging, inspect state

2. **Categorize the failure**:
   | Category | Evidence | Action |
   |----------|----------|--------|
   | Agent's change | Test passed in baseline | Fix the change |
   | Pre-existing | Test failed in baseline | Document, create issue |
   | Flaky | Intermittent, no code change | Fix test or document |
   | Environment | Works elsewhere | Check constitution assumptions |

3. **Handoff decision**:
   ```
   🔍 Test failure analysis:
   
   - Test: {test_name}
   - Category: {agent_change | pre_existing | flaky | environment}
   - Root cause: {diagnosis}
   
   Next steps:
   1. Fix and re-run (if agent's change)
   2. Create issue and continue (if pre-existing)
   3. Deep dive with /agent-debug (if unclear)
   ```
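
The categorization table can be approximated with a small decision function. This sketch covers only the categories that can be decided from baseline status and repeated isolated runs; the `environment` category needs evidence from another machine and is left out. Returning `unclear` when the test passes every isolated re-run is an assumption, not part of the table.

```python
def categorize_failure(passed_in_baseline: bool, rerun_passes: list[bool]) -> str:
    """Categorize a failing test from baseline status and isolated re-runs.

    rerun_passes holds one entry per isolated re-run with no code change:
    True for a pass, False for a failure.
    """
    if not rerun_passes:
        raise ValueError("re-run the failing test at least once")
    if any(rerun_passes) and not all(rerun_passes):
        # Mixed outcomes without a code change point to flakiness.
        return "flaky"
    if all(rerun_passes):
        # Passes alone but failed in the suite: needs a deeper look.
        return "unclear"
    # Consistent failure: the baseline decides who introduced it.
    return "agent_change" if passed_in_baseline else "pre_existing"
```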

## Output

After test activities, update:
- `.agent/focus.md`: test results summary
- `.agent/baseline.md`: if establishing new baseline

## Issue Discovery After Testing

**After test analysis, invoke `agent-ops-tasks` discovery procedure:**

1) **Collect test-related findings:**
   - Failing tests → `BUG` (high)
   - Missing test coverage → `TEST` (medium)
   - Flaky tests identified → `CHORE` (medium)
   - Test anti-patterns found → `REFAC` (low)
   - Missing edge case tests → `TEST` (medium)

2) **Present to user:**
   ```
   📋 Test analysis found {N} items:
   
   Medium:
   - [CHORE] Flaky test: PaymentService.processAsync (failed 2/10 runs)
   - [TEST] Missing coverage for error handling in UserController
   - [TEST] No edge case tests for empty input scenarios
   
   Low:
   - [REFAC] Tests have excessive mocking in OrderService.test.ts
   
   Create issues for these? [A]ll / [S]elect / [N]one
   ```

3) **After creating issues:**
   ```
   Created {N} test-related issues. What's next?
   
   1. Start fixing highest priority (CHORE-0024@abc123 - flaky test)
   2. Continue with current work
   3. Review test coverage report
   ```
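
The mapping from findings to issue types in step 1 can be written as a lookup table. The sketch below groups findings by priority and renders a plain-text summary; the finding keys such as `flaky_test` and the `summarize` helper are hypothetical names chosen for this example.

```python
# Finding kind -> (issue type, priority), following the list in step 1.
FINDING_TO_ISSUE = {
    "failing_test": ("BUG", "high"),
    "missing_coverage": ("TEST", "medium"),
    "flaky_test": ("CHORE", "medium"),
    "test_anti_pattern": ("REFAC", "low"),
    "missing_edge_case": ("TEST", "medium"),
}


def summarize(findings: list[tuple[str, str]]) -> str:
    """Render (kind, description) findings grouped by priority."""
    grouped: dict[str, list[str]] = {"high": [], "medium": [], "low": []}
    for kind, description in findings:
        issue_type, priority = FINDING_TO_ISSUE[kind]
        grouped[priority].append(f"- [{issue_type}] {description}")

    lines = [f"Test analysis found {len(findings)} items:"]
    for priority in ("high", "medium", "low"):
        if grouped[priority]:
            # Blank line, then the priority heading and its entries.
            lines.append("")
            lines.append(f"{priority.capitalize()}:")
            lines.extend(grouped[priority])
    return "\n".join(lines)
```

Priorities with no findings are omitted from the output rather than shown as empty headings.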


Related Skills

add-backend-testing

181

from majiayu000/claude-skill-registry

Add backend integration testing with Vitest to an existing app. Sets up isolated test database schema and writes tests for tRPC routers.

adb-device-testing

181

from majiayu000/claude-skill-registry

Use when testing Android apps on ADB-connected devices/emulators - UI automation, screenshots, location spoofing, navigation, app management. Triggers on ADB, emulator, Android testing, location mock, UI test, screenshot walkthrough.

act-local-testing

181

from majiayu000/claude-skill-registry

Use when testing GitHub Actions workflows locally with act. Covers act CLI usage, Docker configuration, debugging workflows, and troubleshooting common issues when running workflows on your local machine.

accessibility-testing

181

from majiayu000/claude-skill-registry

WCAG 2.2 compliance testing, screen reader validation, and inclusive design verification. Use when ensuring legal compliance (ADA, Section 508), testing for disabilities, or building accessible applications for 1 billion disabled users globally.

acceptance-testing

181

from majiayu000/claude-skill-registry

Plan and (when feasible) implement or execute user acceptance tests (UAT) / end-to-end acceptance scenarios. Converts requirements or user stories into acceptance criteria, test cases, test data, and a sign-off checklist; suggests automation (Playwright/Cypress for web, golden/snapshot tests for CLIs/APIs). Use when validating user-visible behavior for a release, or mapping requirements to acceptance coverage.

acc-testing-knowledge

181

from majiayu000/claude-skill-registry

Testing knowledge base for PHP 8.5 projects. Provides testing pyramid, AAA pattern, naming conventions, isolation principles, DDD testing guidelines, and PHPUnit patterns.

ab-testing

181

from majiayu000/claude-skill-registry

Use when designing experiments for subject lines, offers, cadences, or journeys.

ab-testing-statistician

181

from majiayu000/claude-skill-registry

Expert in statistical analysis for blind A/B and ABX audio testing. Validates randomization, calculates statistical significance, and ensures proper experimental design. Use when implementing A/B test features or analyzing test results.

ab-testing-analyzer

181

from majiayu000/claude-skill-registry

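A comprehensive A/B test analysis tool that supports experiment design, statistical testing, user segmentation analysis, and visual report generation. Used to analyze A/B test results for product redesigns, marketing campaigns, feature optimizations, and similar changes, providing statistical significance tests and in-depth insights.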

a-b-testing

181

from majiayu000/claude-skill-registry

The science of learning through controlled experimentation. A/B testing isn't about picking winners—it's about building a culture of validated learning and reducing the cost of being wrong. This skill covers experiment design, statistical rigor, feature flagging, analysis, and building experimentation into product development. The best experimenters know that every test, positive or negative, teaches something valuable. Use when "a/b test, experiment, hypothesis, statistical significance, sample size, feature flag, variant, control, treatment, p-value, conversion rate, test winner, split test, experimentation, testing, statistics, feature-flags, hypothesis, growth, optimization, learning, validation" mentioned.

webapp-testing

181

from majiayu000/claude-skill-registry

Toolkit for interacting with and testing local web applications using Playwright. Supports verifying frontend functionality, debugging UI behavior, capturing browser screenshots, and viewing browser logs.

Build Your Testing Skill

181

from majiayu000/claude-skill-registry

Create your agent-tdd skill in one prompt, then learn to improve it throughout the chapter