Codex

reproducibility-validate

Run a workflow multiple times and compare outputs to produce a similarity score and pass/fail verdict

104 stars

Best use case

reproducibility-validate is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

It is a strong fit for teams already working in Codex.

Run a workflow multiple times and compare outputs to produce a similarity score and pass/fail verdict

Teams using reproducibility-validate should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/reproducibility-validate/SKILL.md --create-dirs "https://raw.githubusercontent.com/jmagly/aiwg/main/.agents/skills/reproducibility-validate/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/reproducibility-validate/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How reproducibility-validate Compares

Feature / Agentreproducibility-validateStandard Approach
Platform SupportCodexLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Run a workflow multiple times and compare outputs to produce a similarity score and pass/fail verdict

Which AI agents support this skill?

This skill is designed for Codex.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# Reproducibility Validate

You run a workflow multiple times and compare outputs to produce a similarity score and pass/fail verdict, confirming that the workflow produces consistent results across executions.

## Triggers

Alternate expressions and non-obvious activations (primary phrases are matched automatically from the skill description):

- "is this workflow stable" → run reproducibility validation with defaults
- "check if results are consistent" → run reproducibility validation
- "does this run the same way every time" → run reproducibility validation
- "test determinism" → run reproducibility validation
- "compare workflow outputs" → run reproducibility validation

## Trigger Patterns Reference

| Pattern | Example | Action |
|---------|---------|--------|
| Default validation | "validate reproducibility of onboarding-flow" | Run `aiwg reproducibility-validate onboarding-flow` |
| Custom run count | "validate with 5 runs" | Run `aiwg reproducibility-validate <id> --runs 5` |
| Custom threshold | "validate with 99% threshold" | Run `aiwg reproducibility-validate <id> --threshold 0.99` |
| Full options | "3 runs, 90% threshold" | Run `aiwg reproducibility-validate <id> --runs 3 --threshold 0.90` |

## Behavior

When triggered:

1. **Extract intent**:
   - What is the workflow ID or name to validate?
   - How many runs? (default: 3)
   - What similarity threshold must be met to pass? (default: 0.95)

2. **Run the command**:

   ```bash
   # Default: 3 runs, 0.95 threshold
   aiwg reproducibility-validate <workflow-id>

   # Custom run count
   aiwg reproducibility-validate <workflow-id> --runs 5

   # Custom threshold
   aiwg reproducibility-validate <workflow-id> --threshold 0.99

   # Both
   aiwg reproducibility-validate <workflow-id> --runs 5 --threshold 0.99
   ```

3. **Scoring method**:
   - **Structured outputs** (JSON, YAML): exact match required — score is 0 or 1 per artifact
   - **Text outputs** (Markdown, prose): semantic similarity score (0.0–1.0) computed across runs
   - **Overall score**: weighted average across all artifacts in the workflow

4. **Pass/fail verdict**: If all per-artifact scores meet or exceed the threshold, the validation passes. Any artifact below the threshold is flagged.

5. **Report the result** — provide the overall verdict, overall similarity score, and a per-artifact breakdown showing which artifacts passed or failed.

## Examples

### Example 1: Default validation

**User**: "Validate reproducibility of the risk-management-cycle workflow"

**Extraction**: Workflow ID `risk-management-cycle`, default 3 runs, default 0.95 threshold

**Action**:
```bash
aiwg reproducibility-validate risk-management-cycle
```

**Response**:
```
Reproducibility Validation: risk-management-cycle
Runs: 3 | Threshold: 0.95 | Result: PASS

Artifact                                  Score    Status
.aiwg/risks/risk-register.md             0.97     PASS
.aiwg/risks/mitigation-plan.md           0.96     PASS
.aiwg/working/risk-analysis-draft.md     1.00     PASS (exact match)

Overall score: 0.977 — all 3 artifacts passed.
```

### Example 2: Stricter threshold

**User**: "Validate the SAD generation workflow with 99% threshold and 5 runs"

**Extraction**: Workflow ID `sad-generation`, 5 runs, 0.99 threshold

**Action**:
```bash
aiwg reproducibility-validate sad-generation --runs 5 --threshold 0.99
```

**Response**:
```
Reproducibility Validation: sad-generation
Runs: 5 | Threshold: 0.99 | Result: FAIL

Artifact                                       Score    Status
.aiwg/architecture/software-architecture.md   0.98     FAIL
.aiwg/architecture/component-diagram.mmd      1.00     PASS (exact match)
.aiwg/architecture/adr-001.md                 0.99     PASS

Overall score: 0.990 — 1 of 3 artifacts failed the 0.99 threshold.
Recommendation: Review non-deterministic sections in software-architecture.md,
or lower the threshold to 0.95 if semantic variation is acceptable.
```

### Example 3: High-confidence validation before release

**User**: "Run 5 iterations of the deployment-planning workflow and check for consistency"

**Extraction**: Workflow ID `deployment-planning`, 5 runs, default 0.95 threshold

**Action**:
```bash
aiwg reproducibility-validate deployment-planning --runs 5
```

**Response**:
```
Reproducibility Validation: deployment-planning
Runs: 5 | Threshold: 0.95 | Result: PASS

Artifact                                  Score    Status
.aiwg/deployment/deployment-plan.md      0.96     PASS
.aiwg/deployment/runbook.md              0.97     PASS
.aiwg/deployment/rollback-plan.md        0.95     PASS

Overall score: 0.960 — all 3 artifacts passed across 5 runs.
```

### Example 4: Validating a failing workflow

**User**: "Test whether the test-strategy workflow is deterministic"

**Extraction**: Workflow ID `test-strategy-execution`, default 3 runs, default 0.95 threshold

**Action**:
```bash
aiwg reproducibility-validate test-strategy-execution
```

**Response**:
```
Reproducibility Validation: test-strategy-execution
Runs: 3 | Threshold: 0.95 | Result: FAIL

Artifact                               Score    Status
.aiwg/testing/test-plan.md            0.94     FAIL
.aiwg/testing/test-strategy.md        0.97     PASS
.aiwg/testing/coverage-targets.json   1.00     PASS (exact match)

Overall score: 0.970 — 1 of 3 artifacts failed the 0.95 threshold.
Recommendation: Enable `strict` execution mode (`aiwg execution-mode strict`)
to reduce variance in test-plan.md, then re-validate.
```

## Clarification Prompts

If the user's intent is ambiguous:

- "Which workflow should I validate? (e.g., `risk-management-cycle`, `sad-generation`)"
- "How many runs would you like? Default is 3; more runs give higher confidence but take longer."
- "What similarity threshold should I apply? Default is 0.95. Use 0.99 for near-exact determinism requirements."

## References

- @$AIWG_ROOT/src/cli/handlers/subcommands.ts — Reproducibility validate command handler
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/skills/execution-mode/SKILL.md — Set execution mode to reduce variance before validating
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/skills/snapshot/SKILL.md — Capture state before running validation

Related Skills

validate-metadata

104
from jmagly/aiwg

Validate AIWG extension definitions against the metadata schema and report errors with field names, line numbers, and remediation hints

Codex

validate-component

104
from jmagly/aiwg

Validate a single AIWG component (skill, agent, or command) for completeness and correctness

Codex

validate-addon

104
from jmagly/aiwg

Validate an entire AIWG addon package for completeness and release readiness

Codex

soul-validate

104
from jmagly/aiwg

Validate a SOUL.md file against community best practices and quality criteria

Codex

setup-validate

104
from jmagly/aiwg

Validate a `setup.aiwg.io/v1` SetupManifest file against the schema and run cons

Codex

provenance-validate

104
from jmagly/aiwg

Validate provenance records and chains for completeness and consistency

Codex

prose-validate

104
from jmagly/aiwg

Validate an OpenProse program file against Prose contract grammar without executing it — checks frontmatter, contract structure, service references, and strategy syntax

Codex

mention-validate

104
from jmagly/aiwg

Validate all @-mentions resolve to existing files

Codex

devkit-validate

104
from jmagly/aiwg

Validate addon, framework, or extension structure and manifest

Codex

contract-validate

104
from jmagly/aiwg

Validate that a chain of AIWG skills has all requires: inputs satisfied by upstream ensures: outputs before execution. Catches missing dependencies at wiring time rather than at runtime.

Codex

aiwg-orchestrate

104
from jmagly/aiwg

Route structured artifact work to AIWG workflows via MCP with zero parent context cost

venv-manager

104
from jmagly/aiwg

Create, manage, and validate Python virtual environments. Use for project isolation and dependency management.