AI Agent Skill HUB

Codex

reproducibility-validate

Run a workflow multiple times and compare outputs to produce a similarity score and pass/fail verdict

Best use case

reproducibility-validate is best used when you need to confirm that a repeatable AI agent workflow produces consistent output across runs, rather than for a one-off prompt.

It is a strong fit for teams already working in Codex.

Teams using reproducibility-validate can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/reproducibility-validate/SKILL.md --create-dirs "https://raw.githubusercontent.com/jmagly/aiwg/main/.agents/skills/reproducibility-validate/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it at .claude/skills/reproducibility-validate/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How reproducibility-validate Compares

| Feature | reproducibility-validate | Standard Approach |
|---------|--------------------------|-------------------|
| Platform Support | Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

It runs a workflow multiple times and compares the outputs to produce a similarity score and a pass/fail verdict.

Which AI agents support this skill?

This skill is designed for Codex.

Where can I find the source code?

The source is in the jmagly/aiwg repository on GitHub, at .agents/skills/reproducibility-validate/SKILL.md on the main branch.

SKILL.md Source

# Reproducibility Validate

You run a workflow multiple times and compare outputs to produce a similarity score and pass/fail verdict, confirming that the workflow produces consistent results across executions.

## Triggers

Alternate expressions and non-obvious activations (primary phrases are matched automatically from the skill description):

- "is this workflow stable" → run reproducibility validation with defaults
- "check if results are consistent" → run reproducibility validation
- "does this run the same way every time" → run reproducibility validation
- "test determinism" → run reproducibility validation
- "compare workflow outputs" → run reproducibility validation

## Trigger Patterns Reference

| Pattern | Example | Action |
|---------|---------|--------|
| Default validation | "validate reproducibility of onboarding-flow" | Run `aiwg reproducibility-validate onboarding-flow` |
| Custom run count | "validate with 5 runs" | Run `aiwg reproducibility-validate <id> --runs 5` |
| Custom threshold | "validate with 99% threshold" | Run `aiwg reproducibility-validate <id> --threshold 0.99` |
| Full options | "3 runs, 90% threshold" | Run `aiwg reproducibility-validate <id> --runs 3 --threshold 0.90` |
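
The mapping from a parsed request to a command line can be sketched as below. This is a minimal illustration, not the skill's actual implementation: `build_command` is a hypothetical helper, and its guards on `runs` and `threshold` are assumptions of the sketch rather than documented CLI behavior. Only the `--runs` and `--threshold` flags shown in the table are used.

```python
import shlex
from typing import List, Optional


def build_command(workflow_id: str,
                  runs: Optional[int] = None,
                  threshold: Optional[float] = None) -> List[str]:
    """Build the argv for `aiwg reproducibility-validate` (hypothetical helper)."""
    if not workflow_id:
        raise ValueError("a workflow ID is required")
    argv = ["aiwg", "reproducibility-validate", workflow_id]
    if runs is not None:
        # Assumption of this sketch: comparing outputs needs at least two runs.
        if runs < 2:
            raise ValueError("at least 2 runs are needed to compare outputs")
        argv += ["--runs", str(runs)]
    if threshold is not None:
        # Assumption of this sketch: thresholds are fractions, not percentages.
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        argv += ["--threshold", f"{threshold:g}"]
    return argv


# Omitted options are left off so the CLI applies its own defaults.
print(shlex.join(build_command("sad-generation", runs=5, threshold=0.99)))
# prints: aiwg reproducibility-validate sad-generation --runs 5 --threshold 0.99
```

Returning a list rather than a string keeps the workflow ID as a single argument if the command is later passed to a process runner.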

## Behavior

When triggered:

1. **Extract intent**:
   - What is the workflow ID or name to validate?
   - How many runs? (default: 3)
   - What similarity threshold must be met to pass? (default: 0.95)

2. **Run the command**:

   ```bash
   # Default: 3 runs, 0.95 threshold
   aiwg reproducibility-validate <workflow-id>

   # Custom run count
   aiwg reproducibility-validate <workflow-id> --runs 5

   # Custom threshold
   aiwg reproducibility-validate <workflow-id> --threshold 0.99

   # Both
   aiwg reproducibility-validate <workflow-id> --runs 5 --threshold 0.99
   ```

3. **Scoring method**:
   - **Structured outputs** (JSON, YAML): exact match required — score is 0 or 1 per artifact
   - **Text outputs** (Markdown, prose): semantic similarity score (0.0–1.0) computed across runs
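   - **Overall score**: weighted average across all artifacts in the workflow; the weights are not specified here, and the example reports below are consistent with equal weights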

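4. **Pass/fail verdict**: The validation passes only if every per-artifact score meets or exceeds the threshold. Any artifact below the threshold is flagged and fails the validation, even when the overall score clears the threshold (as in Examples 2 and 4).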

5. **Report the result** — provide the overall verdict, overall similarity score, and a per-artifact breakdown showing which artifacts passed or failed.
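
The scoring and verdict steps above can be sketched as follows. This is an illustration under stated assumptions, not the command's real implementation:

- All function names are hypothetical.
- `difflib.SequenceMatcher` is a character-level stand-in for the semantic similarity measure, which the source does not name.
- The score "across runs" is assumed to be the mean over all pairs of runs.
- Structured outputs are compared after JSON parsing, so key order and whitespace are ignored.
- Artifacts are weighted equally.
- YAML is left out because parsing it needs a third-party library.

```python
import json
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean
from typing import Dict, List, Tuple


def structured_score(outputs: List[str]) -> float:
    """Return 1.0 if every run parses to the same JSON value, else 0.0."""
    parsed = [json.loads(text) for text in outputs]
    return 1.0 if all(value == parsed[0] for value in parsed) else 0.0


def text_score(outputs: List[str]) -> float:
    """Mean pairwise similarity across runs (assumed aggregation).

    SequenceMatcher compares characters; it is NOT a semantic measure.
    """
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single run has nothing to disagree with
    return mean(SequenceMatcher(None, a, b).ratio() for a, b in pairs)


def validate(artifacts: Dict[str, List[str]],
             threshold: float = 0.95) -> Tuple[bool, float, Dict[str, float]]:
    """Score each artifact, then require every artifact to meet the threshold."""
    scores = {}
    for path, outputs in artifacts.items():
        scorer = structured_score if path.endswith(".json") else text_score
        scores[path] = scorer(outputs)
    overall = mean(scores.values())  # equal weights: an assumption
    passed = all(score >= threshold for score in scores.values())
    return passed, overall, scores


# Three runs of a two-artifact workflow (made-up outputs for illustration).
runs = {
    "plan.md": ["Deploy on Friday.", "Deploy on Friday.", "Deploy on Friday."],
    "targets.json": ['{"a": 1, "b": 2}', '{"b": 2, "a": 1}', '{"a":1,"b":2}'],
}
passed, overall, scores = validate(runs)
print("PASS" if passed else "FAIL", f"{overall:.3f}")
```

The verdict is taken from the per-artifact scores rather than from `overall`, matching step 4.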

## Examples

### Example 1: Default validation

**User**: "Validate reproducibility of the risk-management-cycle workflow"

**Extraction**: Workflow ID `risk-management-cycle`, default 3 runs, default 0.95 threshold

**Action**:
```bash
aiwg reproducibility-validate risk-management-cycle
```

**Response**:
```
Reproducibility Validation: risk-management-cycle
Runs: 3 | Threshold: 0.95 | Result: PASS

Artifact                                  Score    Status
.aiwg/risks/risk-register.md             0.97     PASS
.aiwg/risks/mitigation-plan.md           0.96     PASS
.aiwg/working/risk-analysis-draft.md     1.00     PASS (exact match)

Overall score: 0.977 — all 3 artifacts passed.
```
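
If a script needs to consume a report like the one above, for example as a CI gate, the per-artifact rows can be parsed. This is a minimal sketch, assuming the CLI prints exactly the layout shown in this example; the source does not state that the format is stable. `ArtifactResult` and `parse_rows` are hypothetical names.

```python
import re
from typing import List, NamedTuple


class ArtifactResult(NamedTuple):
    path: str
    score: float
    status: str


# A row is: path, whitespace, a two-decimal score, whitespace, PASS or FAIL.
# Header, summary, and "Runs: ..." lines do not match and are skipped.
ROW = re.compile(r"^(?P<path>\S+)\s+(?P<score>[01]\.\d{2})\s+(?P<status>PASS|FAIL)\b")


def parse_rows(report: str) -> List[ArtifactResult]:
    """Extract (path, score, status) from each artifact row of a report."""
    results = []
    for line in report.splitlines():
        match = ROW.match(line.strip())
        if match:
            results.append(
                ArtifactResult(match["path"], float(match["score"]), match["status"])
            )
    return results


REPORT = """\
Reproducibility Validation: risk-management-cycle
Runs: 3 | Threshold: 0.95 | Result: PASS

Artifact                                  Score    Status
.aiwg/risks/risk-register.md             0.97     PASS
.aiwg/risks/mitigation-plan.md           0.96     PASS
.aiwg/working/risk-analysis-draft.md     1.00     PASS (exact match)
"""

rows = parse_rows(REPORT)
failed = [row.path for row in rows if row.status == "FAIL"]
print(len(rows), "artifacts,", len(failed), "failed")
# prints: 3 artifacts, 0 failed
```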

### Example 2: Stricter threshold

**User**: "Validate the SAD generation workflow with 99% threshold and 5 runs"

**Extraction**: Workflow ID `sad-generation`, 5 runs, 0.99 threshold

**Action**:
```bash
aiwg reproducibility-validate sad-generation --runs 5 --threshold 0.99
```

**Response**:
```
Reproducibility Validation: sad-generation
Runs: 5 | Threshold: 0.99 | Result: FAIL

Artifact                                       Score    Status
.aiwg/architecture/software-architecture.md   0.98     FAIL
.aiwg/architecture/component-diagram.mmd      1.00     PASS (exact match)
.aiwg/architecture/adr-001.md                 0.99     PASS

Overall score: 0.990 — 1 of 3 artifacts failed the 0.99 threshold.
Recommendation: Review non-deterministic sections in software-architecture.md,
or lower the threshold to 0.95 if semantic variation is acceptable.
```

### Example 3: High-confidence validation before release

**User**: "Run 5 iterations of the deployment-planning workflow and check for consistency"

**Extraction**: Workflow ID `deployment-planning`, 5 runs, default 0.95 threshold

**Action**:
```bash
aiwg reproducibility-validate deployment-planning --runs 5
```

**Response**:
```
Reproducibility Validation: deployment-planning
Runs: 5 | Threshold: 0.95 | Result: PASS

Artifact                                  Score    Status
.aiwg/deployment/deployment-plan.md      0.96     PASS
.aiwg/deployment/runbook.md              0.97     PASS
.aiwg/deployment/rollback-plan.md        0.95     PASS

Overall score: 0.960 — all 3 artifacts passed across 5 runs.
```

### Example 4: Validating a failing workflow

**User**: "Test whether the test-strategy workflow is deterministic"

**Extraction**: Workflow ID `test-strategy-execution`, default 3 runs, default 0.95 threshold

**Action**:
```bash
aiwg reproducibility-validate test-strategy-execution
```

**Response**:
```
Reproducibility Validation: test-strategy-execution
Runs: 3 | Threshold: 0.95 | Result: FAIL

Artifact                               Score    Status
.aiwg/testing/test-plan.md            0.94     FAIL
.aiwg/testing/test-strategy.md        0.97     PASS
.aiwg/testing/coverage-targets.json   1.00     PASS (exact match)

Overall score: 0.970 — 1 of 3 artifacts failed the 0.95 threshold.
Recommendation: Enable `strict` execution mode (`aiwg execution-mode strict`)
to reduce variance in test-plan.md, then re-validate.
```
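
The figures in the four example reports can be recomputed as a sanity check. The sketch below assumes the overall score is an unweighted mean rounded to three decimals, and that the result is PASS only when every artifact meets the threshold. Both assumptions reproduce the reported numbers, though the source does not state the weights explicitly.

```python
from statistics import mean
from typing import List, Tuple

# (workflow, threshold, per-artifact scores, reported overall, reported result),
# copied from Examples 1-4.
EXAMPLES = [
    ("risk-management-cycle", 0.95, [0.97, 0.96, 1.00], "0.977", "PASS"),
    ("sad-generation", 0.99, [0.98, 1.00, 0.99], "0.990", "FAIL"),
    ("deployment-planning", 0.95, [0.96, 0.97, 0.95], "0.960", "PASS"),
    ("test-strategy-execution", 0.95, [0.94, 0.97, 1.00], "0.970", "FAIL"),
]


def recompute(threshold: float, scores: List[float]) -> Tuple[str, str]:
    """Return (overall score as text, verdict) under the stated assumptions."""
    overall = f"{mean(scores):.3f}"  # unweighted mean, three decimals
    result = "PASS" if all(s >= threshold for s in scores) else "FAIL"
    return overall, result


for workflow, threshold, scores, reported_overall, reported_result in EXAMPLES:
    overall, result = recompute(threshold, scores)
    agrees = (overall, result) == (reported_overall, reported_result)
    print(f"{workflow}: overall {overall}, result {result}, agrees: {agrees}")
```

In Examples 2 and 4 the overall score meets the threshold while the result is FAIL, which shows that the verdict follows the weakest artifact rather than the average.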

## Clarification Prompts

If the user's intent is ambiguous:

- "Which workflow should I validate? (e.g., `risk-management-cycle`, `sad-generation`)"
- "How many runs would you like? Default is 3; more runs give higher confidence but take longer."
- "What similarity threshold should I apply? Default is 0.95. Use 0.99 for near-exact determinism requirements."
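
When a request does state its options, the run count and threshold can be pulled out before falling back to the defaults named in the prompts above. This is a rough sketch with a hypothetical `parse_options` helper. The regular expressions are assumptions that cover only the phrasings used in this document; in practice the agent's own language understanding does this step.

```python
import re
from typing import Tuple

DEFAULT_RUNS = 3          # default from the Behavior section
DEFAULT_THRESHOLD = 0.95  # default from the Behavior section


def parse_options(request: str) -> Tuple[int, float]:
    """Return (runs, threshold) found in a request, or the defaults."""
    runs = DEFAULT_RUNS
    threshold = DEFAULT_THRESHOLD

    # "5 runs", "3 runs", "5 iterations"
    run_match = re.search(r"(\d+)\s+(?:runs?|iterations?)\b", request, re.IGNORECASE)
    if run_match:
        runs = int(run_match.group(1))

    # "99%" is converted to 0.99; "0.99" is taken as-is.
    percent = re.search(r"(\d+(?:\.\d+)?)\s*%", request)
    decimal = re.search(r"\b([01]\.\d+)\b", request)
    if percent:
        threshold = float(percent.group(1)) / 100
    elif decimal:
        threshold = float(decimal.group(1))

    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return runs, threshold


print(parse_options("Validate the SAD generation workflow with 99% threshold and 5 runs"))
# prints: (5, 0.99)
```

Requests that match neither pattern keep the defaults, which is the point at which a clarification prompt may be more appropriate than guessing.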

## References

- @$AIWG_ROOT/src/cli/handlers/subcommands.ts — Reproducibility validate command handler
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/skills/execution-mode/SKILL.md — Set execution mode to reduce variance before validating
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/skills/snapshot/SKILL.md — Capture state before running validation

Related Skills

validate-metadata

from jmagly/aiwg

Validate AIWG extension definitions against the metadata schema and report errors with field names, line numbers, and remediation hints

validate-component

from jmagly/aiwg

Validate a single AIWG component (skill, agent, or command) for completeness and correctness

validate-addon

from jmagly/aiwg

Validate an entire AIWG addon package for completeness and release readiness

soul-validate

from jmagly/aiwg

Validate a SOUL.md file against community best practices and quality criteria

setup-validate

from jmagly/aiwg

Validate a `setup.aiwg.io/v1` SetupManifest file against the schema and run cons

provenance-validate

from jmagly/aiwg

Validate provenance records and chains for completeness and consistency

prose-validate

from jmagly/aiwg

Validate an OpenProse program file against Prose contract grammar without executing it — checks frontmatter, contract structure, service references, and strategy syntax

mention-validate

from jmagly/aiwg

Validate all @-mentions resolve to existing files

devkit-validate

from jmagly/aiwg

Validate addon, framework, or extension structure and manifest

contract-validate

from jmagly/aiwg

Validate that a chain of AIWG skills has all requires: inputs satisfied by upstream ensures: outputs before execution. Catches missing dependencies at wiring time rather than at runtime.

aiwg-orchestrate

from jmagly/aiwg

Route structured artifact work to AIWG workflows via MCP with zero parent context cost

venv-manager

from jmagly/aiwg

Create, manage, and validate Python virtual environments. Use for project isolation and dependency management.