AI Agent Skill HUB

Codex

regression-baseline

Create and maintain regression test baselines for comparison and drift detection across versioned snapshots

104 stars

View on GitHub Installation ↓

Best use case

regression-baseline is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

It is a strong fit for teams already working in Codex.

Create and maintain regression test baselines for comparison and drift detection across versioned snapshots

Teams using regression-baseline should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/regression-baseline/SKILL.md --create-dirs "https://raw.githubusercontent.com/jmagly/aiwg/main/.agents/skills/regression-baseline/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/regression-baseline/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How regression-baseline Compares

Feature / Agent	regression-baseline	Standard Approach
Platform Support	Codex	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Create and maintain regression test baselines for comparison and drift detection across versioned snapshots

Which AI agents support this skill?

This skill is designed for Codex.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

AI Agents for Coding

Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.

Cursor vs Codex for AI Workflows

Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.

AI Agents for Marketing

Discover AI agents for marketing workflows, from SEO and content production to campaign research, outreach, and analytics.

SKILL.md Source

# regression-baseline

Create and maintain regression test baselines for comparison and drift detection.

## Triggers


Alternate expressions and non-obvious activations (primary phrases are matched automatically from the skill description):

- "snapshot tests" → baseline creation shorthand
- "capture current behavior" → baseline establishment

## Purpose

This skill manages regression baselines by:
- Capturing known-good system states as baselines
- Storing snapshots of expected outputs
- Maintaining versioned baseline history
- Comparing current state to baseline
- Detecting drift from established baselines
- Managing baseline updates and approvals

## Behavior

When triggered, this skill:

1. **Identifies baseline scope**:
   - Determine what to baseline (tests, outputs, performance)
   - Select baseline type (functional, visual, performance, API)
   - Identify affected components
   - Check for existing baselines

2. **Captures baseline data**:
   - Run tests and capture outputs
   - Take snapshots (visual, data, API responses)
   - Measure performance metrics
   - Record system configuration
   - Generate checksums/hashes

3. **Stores baseline**:
   - Version baseline with semantic versioning
   - Tag with git commit/release
   - Store in `.aiwg/testing/baselines/`
   - Update baseline manifest
   - Link to requirements/features

4. **Validates baseline quality**:
   - Ensure baseline is stable (run multiple times)
   - Verify baseline is reproducible
   - Check for known issues in baseline
   - Require approval for critical baselines

5. **Documents baseline**:
   - Record what is baselined
   - Note when and why baseline was created
   - Document known deviations
   - Link to issues/requirements

6. **Manages baseline lifecycle**:
   - Archive old baselines
   - Update baselines on intentional changes
   - Track baseline drift over time
   - Alert on excessive drift

## Baseline Types

### Functional Baseline

```yaml
functional_baseline:
  description: Expected behavior outputs
  scope: Unit and integration tests
  format: JSON snapshots

  capture:
    - test_outputs
    - assertions
    - mock_data
    - expected_errors

  example:
    test: "user registration"
    baseline: "baselines/functional/user-registration-v1.json"
    includes:
      - success_response_format
      - validation_error_messages
      - database_state_after_registration
```

### Visual Baseline

```yaml
visual_baseline:
  description: UI screenshots for visual regression
  scope: Frontend components
  format: PNG images with metadata

  capture:
    - component_screenshots
    - page_screenshots
    - responsive_breakpoints
    - interaction_states

  example:
    component: "LoginForm"
    baseline: "baselines/visual/login-form-v2.png"
    metadata:
      viewport: 1920x1080
      theme: dark
      state: default
```

### Performance Baseline

```yaml
performance_baseline:
  description: Performance benchmarks
  scope: API response times, load tests
  format: JSON metrics

  capture:
    - response_times_p50_p95_p99
    - throughput_requests_per_second
    - resource_utilization
    - database_query_times

  example:
    endpoint: "/api/users"
    baseline: "baselines/performance/api-users-v1.json"
    metrics:
      p50_response_time_ms: 45
      p95_response_time_ms: 120
      p99_response_time_ms: 250
      throughput_rps: 1000
```

### API Contract Baseline

```yaml
api_baseline:
  description: API request/response contracts
  scope: REST/GraphQL APIs
  format: OpenAPI/JSON schemas

  capture:
    - request_schemas
    - response_schemas
    - status_codes
    - headers

  example:
    endpoint: "POST /api/auth/login"
    baseline: "baselines/api/auth-login-v3.yaml"
    contract:
      request_body_schema: LoginRequest
      success_response: 200 with token
      error_responses: [400, 401, 429]
```

## Baseline Manifest

```yaml
# .aiwg/testing/baselines/manifest.yaml

baselines:
  - id: functional-auth-v1
    type: functional
    created: 2026-01-28T10:00:00Z
    created_by: test-engineer
    git_commit: abc123def
    release: v1.2.0
    scope: Authentication flows
    files:
      - baselines/functional/login-v1.json
      - baselines/functional/logout-v1.json
      - baselines/functional/register-v1.json
    status: active
    approved_by: tech-lead
    notes: Initial baseline for auth module

  - id: visual-dashboard-v2
    type: visual
    created: 2026-01-20T14:30:00Z
    created_by: frontend-engineer
    git_commit: def456ghi
    release: v1.1.0
    scope: Dashboard UI components
    files:
      - baselines/visual/dashboard-desktop-v2.png
      - baselines/visual/dashboard-mobile-v2.png
    status: active
    approved_by: design-lead
    previous_baseline: visual-dashboard-v1
    changes: Updated color scheme per design system v2

  - id: performance-api-v1
    type: performance
    created: 2026-01-15T09:00:00Z
    created_by: devops-engineer
    git_commit: ghi789jkl
    release: v1.0.0
    scope: Core API endpoints
    files:
      - baselines/performance/api-benchmarks-v1.json
    status: active
    approved_by: architect
    notes: Pre-optimization baseline
```

## Baseline Comparison Report

```markdown
# Baseline Comparison Report

**Date**: 2026-01-28
**Baseline**: functional-auth-v1
**Current State**: HEAD (commit xyz789)
**Comparison Type**: Functional

## Executive Summary

**Status**: ⚠️ Drift Detected
**Changes**: 3 outputs differ from baseline
**Severity**: Medium - Unexpected behavior changes

| Metric | Baseline | Current | Drift |
|--------|----------|---------|-------|
| Tests Passing | 45/45 | 43/45 | -2 |
| Output Matches | 100% | 93.3% | -6.7% |
| New Failures | 0 | 2 | +2 |

## Detailed Comparison

### Test: user-login

**Status**: ⚠️ Drift

**Baseline** (functional-auth-v1):
```json
{
  "status": 200,
  "body": {
    "token": "jwt.header.payload.signature",
    "user": {
      "id": "uuid",
      "email": "user@example.com"
    }
  }
}
```

**Current**:
```json
{
  "status": 200,
  "body": {
    "token": "jwt.header.payload.signature",
    "user": {
      "id": "uuid",
      "email": "user@example.com",
      "name": "John Doe"  // NEW FIELD
    }
  }
}
```

**Analysis**: New `name` field added to response
**Impact**: Breaking change for clients expecting exact schema
**Recommendation**: Update baseline if intentional, or fix if bug

### Test: user-registration-invalid-email

**Status**: ❌ Failure

**Baseline**:
```json
{
  "status": 400,
  "body": {
    "error": "Invalid email format"
  }
}
```

**Current**:
```json
{
  "status": 500,
  "body": {
    "error": "Internal server error"
  }
}
```

**Analysis**: Validation now returns 500 instead of 400
**Impact**: Critical - Server error on invalid input
**Recommendation**: Fix immediately - regression in error handling

### Test: user-logout

**Status**: ✅ Match

Output matches baseline exactly. No drift detected.

## Drift Summary

| Test | Status | Drift Type | Severity |
|------|--------|------------|----------|
| user-login | ⚠️ Drift | Schema change | Medium |
| user-registration | ✅ Match | None | - |
| user-registration-invalid | ❌ Fail | Error code | High |
| user-logout | ✅ Match | None | - |
| token-refresh | ✅ Match | None | - |

## Root Cause Analysis

### Likely Causes

1. **user-login drift**: Intentional feature addition (commit abc123)
   - PR #789: "Add user name to login response"
   - Needs baseline update

2. **invalid-email failure**: Unintentional regression
   - Validation middleware broken
   - Returns 500 instead of 400
   - Introduced in commit def456

## Recommendations

### Immediate Actions

- [ ] Fix validation error handling (returns 500 instead of 400)
- [ ] Add test for proper error codes
- [ ] Verify error handling across all validation

### Baseline Updates

- [ ] Update baseline for user-login if name field is intentional
- [ ] Document breaking change in API changelog
- [ ] Notify API consumers of schema change

### Process Improvements

- [ ] Require baseline review for API changes
- [ ] Add baseline diff to PR template
- [ ] Run baseline comparison in CI

## Approval Required

To update baseline with new outputs:

```bash
# Review changes
aiwg baseline compare functional-auth-v1

# Update if intentional
aiwg baseline update functional-auth-v1 \
  --approve-changes user-login \
  --justification "Added name field per REQ-123"

# Or create new baseline version
aiwg baseline create functional-auth-v2 \
  --based-on v1 \
  --changes "Added user name to login response"
```
```

## Usage Examples

### Create New Baseline

```
User: "Create baseline for authentication tests"

Skill executes:
1. Identify auth test suite
2. Run tests and capture outputs
3. Validate stability (run 3 times)
4. Store baseline with version
5. Update manifest

Output:
"Baseline Created: functional-auth-v1

Scope: Authentication tests (45 tests)
Format: JSON snapshots
Location: .aiwg/testing/baselines/functional/auth/

Files:
- login-v1.json
- logout-v1.json
- register-v1.json
- token-refresh-v1.json

Status: Active
Git Commit: abc123
Release: v1.2.0

Next: Baseline will be used for regression detection"
```

### Update Existing Baseline

```
User: "Update baseline for dashboard UI"

Skill checks current vs baseline:
"Comparing to visual-dashboard-v1:
- 5 screenshots differ
- Changes: Color scheme updated

Approve changes?
- dashboard-desktop.png
- dashboard-mobile.png
- widget-panel.png

Reason for update?"

User: "Design system v2 rollout"

Skill creates:
"Baseline Updated: visual-dashboard-v2

Previous: visual-dashboard-v1 (archived)
Changes: Design system v2 colors
Approved by: [current user]
Files updated: 5

New baseline active for regression tests"
```

### Compare to Baseline

```
User: "Compare current state to baseline"

Skill analyzes:
"Baseline Comparison: functional-auth-v1

Status: ⚠️ Drift Detected

Matches: 43/45 tests (95.6%)
Drifts: 2 tests show differences

Drift 1: user-login
  - New field 'name' in response
  - Severity: Medium
  - Likely intentional (commit abc123)

Drift 2: invalid-email handling
  - Returns 500 instead of 400
  - Severity: High
  - Likely regression (needs fix)

See full report: .aiwg/testing/baseline-comparison.md"
```

## Integration

This skill uses:
- `regression-bisect`: Find commits that broke baseline
- `regression-metrics`: Track baseline drift over time
- `test-coverage`: Ensure baselines cover critical paths
- `project-awareness`: Detect test framework and conventions

## Agent Orchestration

```yaml
agents:
  creation:
    agent: test-engineer
    focus: Baseline capture and validation

  analysis:
    agent: test-architect
    focus: Drift analysis and recommendations

  approval:
    agent: tech-lead
    focus: Baseline update approval
```

## Configuration

### Baseline Storage

```yaml
baseline_config:
  storage_path: .aiwg/testing/baselines/
  structure:
    - functional/
    - visual/
    - performance/
    - api/

  versioning: semantic  # v1, v2, v3
  compression: gzip     # for large baselines
  retention: 10         # keep last 10 versions
```

### Drift Thresholds

```yaml
drift_thresholds:
  functional:
    exact_match_required: true
    allow_new_fields: false  # strict schema matching

  visual:
    pixel_diff_threshold: 0.1%  # 0.1% difference allowed
    ignore_areas: [timestamp, dynamic-content]

  performance:
    p50_tolerance: 10%   # ±10% from baseline
    p95_tolerance: 15%   # ±15% from baseline
    p99_tolerance: 20%   # ±20% from baseline
```

### Approval Requirements

```yaml
approval_config:
  require_approval_for:
    - baseline_creation: true
    - baseline_updates: true
    - baseline_deletion: true

  approvers:
    - tech-lead
    - architect
    - test-lead

  approval_via:
    - pr_review
    - issue_comment
    - command_flag: --approved-by
```

## Output Locations

- Baselines: `.aiwg/testing/baselines/{type}/`
- Manifest: `.aiwg/testing/baselines/manifest.yaml`
- Comparisons: `.aiwg/testing/baseline-comparisons/`
- Archived: `.aiwg/testing/baselines/archive/`

## References

- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/schemas/test/baseline-schema.yaml
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/commands/baseline-create.md
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/agents/test-engineer.md

Related Skills

regression-visual

from jmagly/aiwg

Detect visual and UI regressions through screenshot comparison and pixel-diff analysis across browsers and viewports

regression-report

from jmagly/aiwg

Generate comprehensive regression analysis reports combining bisect, baseline, and metrics data with actionable recommendations

regression-performance

from jmagly/aiwg

Detect performance regressions by comparing benchmarks across versions with latency, throughput, and statistical significance analysis

regression-metrics

from jmagly/aiwg

Track and analyze regression statistics, trends, hotspots, and health indicators across test suites

regression-learning

from jmagly/aiwg

Improve regression detection over time through cross-task pattern recognition, test prioritization, and historical analysis

regression-cicd-hooks

from jmagly/aiwg

Integrate regression testing into CI/CD pipelines with baseline comparison and merge blocking on failure

regression-check

from jmagly/aiwg

Compare current behavior against baseline to detect regressions

regression-bisect

from jmagly/aiwg

Identify the commit that introduced a regression using git bisect with automated test execution and blame context

aiwg-orchestrate

from jmagly/aiwg

Route structured artifact work to AIWG workflows via MCP with zero parent context cost

venv-manager

from jmagly/aiwg

Create, manage, and validate Python virtual environments. Use for project isolation and dependency management.

pytest-runner

from jmagly/aiwg

Execute Python tests with pytest, supporting fixtures, markers, coverage, and parallel execution. Use for Python test automation.

vitest-runner

from jmagly/aiwg

Execute JavaScript/TypeScript tests with Vitest, supporting coverage, watch mode, and parallel execution. Use for JS/TS test automation.