reproducibility-validate
Run a workflow multiple times and compare outputs to produce a similarity score and pass/fail verdict
Best use case
reproducibility-validate is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
It is a strong fit for teams already working in Codex.
Teams using reproducibility-validate can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/reproducibility-validate/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How reproducibility-validate Compares
| Feature / Agent | reproducibility-validate | Standard Approach |
|---|---|---|
| Platform Support | Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It runs a workflow multiple times and compares the outputs to produce a similarity score and a pass/fail verdict.
Which AI agents support this skill?
This skill is designed for Codex.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Reproducibility Validate

You run a workflow multiple times and compare outputs to produce a similarity score and pass/fail verdict, confirming that the workflow produces consistent results across executions.

## Triggers

Alternate expressions and non-obvious activations (primary phrases are matched automatically from the skill description):

- "is this workflow stable" → run reproducibility validation with defaults
- "check if results are consistent" → run reproducibility validation
- "does this run the same way every time" → run reproducibility validation
- "test determinism" → run reproducibility validation
- "compare workflow outputs" → run reproducibility validation

## Trigger Patterns Reference

| Pattern | Example | Action |
|---------|---------|--------|
| Default validation | "validate reproducibility of onboarding-flow" | Run `aiwg reproducibility-validate onboarding-flow` |
| Custom run count | "validate with 5 runs" | Run `aiwg reproducibility-validate <id> --runs 5` |
| Custom threshold | "validate with 99% threshold" | Run `aiwg reproducibility-validate <id> --threshold 0.99` |
| Full options | "3 runs, 90% threshold" | Run `aiwg reproducibility-validate <id> --runs 3 --threshold 0.90` |

## Behavior

When triggered:

1. **Extract intent**:
   - What is the workflow ID or name to validate?
   - How many runs? (default: 3)
   - What similarity threshold must be met to pass? (default: 0.95)

2. **Run the command**:

   ```bash
   # Default: 3 runs, 0.95 threshold
   aiwg reproducibility-validate <workflow-id>

   # Custom run count
   aiwg reproducibility-validate <workflow-id> --runs 5

   # Custom threshold
   aiwg reproducibility-validate <workflow-id> --threshold 0.99

   # Both
   aiwg reproducibility-validate <workflow-id> --runs 5 --threshold 0.99
   ```

3. **Scoring method**:
   - **Structured outputs** (JSON, YAML): exact match required, so the score is 0 or 1 per artifact
   - **Text outputs** (Markdown, prose): semantic similarity score (0.0–1.0) computed across runs
   - **Overall score**: weighted average across all artifacts in the workflow

4. **Pass/fail verdict**: If all per-artifact scores meet or exceed the threshold, the validation passes. Any artifact below the threshold is flagged.

5. **Report the result**: provide the overall verdict, overall similarity score, and a per-artifact breakdown showing which artifacts passed or failed.

## Examples

### Example 1: Default validation

**User**: "Validate reproducibility of the risk-management-cycle workflow"

**Extraction**: Workflow ID `risk-management-cycle`, default 3 runs, default 0.95 threshold

**Action**:

```bash
aiwg reproducibility-validate risk-management-cycle
```

**Response**:

```
Reproducibility Validation: risk-management-cycle
Runs: 3 | Threshold: 0.95 | Result: PASS

Artifact                                  Score   Status
.aiwg/risks/risk-register.md              0.97    PASS
.aiwg/risks/mitigation-plan.md            0.96    PASS
.aiwg/working/risk-analysis-draft.md      1.00    PASS (exact match)

Overall score: 0.977 — all 3 artifacts passed.
```

### Example 2: Stricter threshold

**User**: "Validate the SAD generation workflow with 99% threshold and 5 runs"

**Extraction**: Workflow ID `sad-generation`, 5 runs, 0.99 threshold

**Action**:

```bash
aiwg reproducibility-validate sad-generation --runs 5 --threshold 0.99
```

**Response**:

```
Reproducibility Validation: sad-generation
Runs: 5 | Threshold: 0.99 | Result: FAIL

Artifact                                       Score   Status
.aiwg/architecture/software-architecture.md    0.98    FAIL
.aiwg/architecture/component-diagram.mmd       1.00    PASS (exact match)
.aiwg/architecture/adr-001.md                  0.99    PASS

Overall score: 0.990 — 1 of 3 artifacts failed the 0.99 threshold.
Recommendation: Review non-deterministic sections in software-architecture.md, or lower the threshold to 0.95 if semantic variation is acceptable.
```

### Example 3: High-confidence validation before release

**User**: "Run 5 iterations of the deployment-planning workflow and check for consistency"

**Extraction**: Workflow ID `deployment-planning`, 5 runs, default 0.95 threshold

**Action**:

```bash
aiwg reproducibility-validate deployment-planning --runs 5
```

**Response**:

```
Reproducibility Validation: deployment-planning
Runs: 5 | Threshold: 0.95 | Result: PASS

Artifact                               Score   Status
.aiwg/deployment/deployment-plan.md    0.96    PASS
.aiwg/deployment/runbook.md            0.97    PASS
.aiwg/deployment/rollback-plan.md      0.95    PASS

Overall score: 0.960 — all 3 artifacts passed across 5 runs.
```

### Example 4: Validating a failing workflow

**User**: "Test whether the test-strategy workflow is deterministic"

**Extraction**: Workflow ID `test-strategy-execution`, default 3 runs, default 0.95 threshold

**Action**:

```bash
aiwg reproducibility-validate test-strategy-execution
```

**Response**:

```
Reproducibility Validation: test-strategy-execution
Runs: 3 | Threshold: 0.95 | Result: FAIL

Artifact                               Score   Status
.aiwg/testing/test-plan.md             0.94    FAIL
.aiwg/testing/test-strategy.md         0.97    PASS
.aiwg/testing/coverage-targets.json    1.00    PASS (exact match)

Overall score: 0.970 — 1 of 3 artifacts failed the 0.95 threshold.
Recommendation: Enable `strict` execution mode (`aiwg execution-mode strict`) to reduce variance in test-plan.md, then re-validate.
```

## Clarification Prompts

If the user's intent is ambiguous:

- "Which workflow should I validate? (e.g., `risk-management-cycle`, `sad-generation`)"
- "How many runs would you like? Default is 3; more runs give higher confidence but take longer."
- "What similarity threshold should I apply? Default is 0.95. Use 0.99 for near-exact determinism requirements."
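
Once the workflow ID, run count, and threshold have been resolved, the command line can be assembled mechanically. The following is a minimal sketch of that step, not part of the skill itself: `parse_threshold`, `format_threshold`, and `build_command` are hypothetical helper names, and only the `aiwg reproducibility-validate` command with its `--runs` and `--threshold` flags comes from the skill text. It assumes a flag is passed only when the user stated a value, leaving the CLI defaults (3 runs, 0.95) to apply otherwise.

```python
from typing import List, Optional


def parse_threshold(text: str) -> float:
    """Accept '0.99' or '99%' and return a fraction between 0 and 1."""
    text = text.strip()
    value = float(text[:-1]) / 100 if text.endswith("%") else float(text)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"threshold out of range: {text!r}")
    return value


def format_threshold(value: float) -> str:
    """Render two decimals (0.9 -> '0.90') unless that would lose precision."""
    text = f"{value:.2f}"
    return text if float(text) == value else repr(value)


def build_command(workflow_id: str,
                  runs: Optional[int] = None,
                  threshold: Optional[float] = None) -> List[str]:
    """Build the argv list; omitted options fall back to the CLI defaults."""
    if not workflow_id:
        raise ValueError("a workflow ID is required; ask the user which one")
    cmd = ["aiwg", "reproducibility-validate", workflow_id]
    if runs is not None:
        if runs < 2:
            raise ValueError("comparing outputs needs at least 2 runs")
        cmd += ["--runs", str(runs)]
    if threshold is not None:
        cmd += ["--threshold", format_threshold(threshold)]
    return cmd
```

For instance, `build_command("sad-generation", runs=5, threshold=parse_threshold("99%"))` yields the argument list for the command shown in Example 2. Returning a list rather than a string keeps the result safe to hand to `subprocess.run` without shell quoting.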
## References

- @$AIWG_ROOT/src/cli/handlers/subcommands.ts - Reproducibility validate command handler
- @$AIWG_ROOT/docs/cli-reference.md - CLI reference
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/skills/execution-mode/SKILL.md - Set execution mode to reduce variance before validating
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/skills/snapshot/SKILL.md - Capture state before running validation
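
As an illustration of the Scoring method and Pass/fail verdict steps under Behavior, here is a rough sketch of how per-artifact scores might be computed and combined. It is not the AIWG implementation. Several things are assumptions: the function names `artifact_score` and `verdict` are hypothetical, structured artifacts are detected by file extension, text similarity uses Python's `difflib.SequenceMatcher` as a lexical stand-in for the unspecified semantic measure, and the weighted average uses equal weights unless others are supplied.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean
from typing import Dict, List, Optional

# Assumption: structured formats are recognised by file extension.
STRUCTURED_SUFFIXES = (".json", ".yaml", ".yml")


def artifact_score(path: str, outputs: List[str]) -> float:
    """Score one artifact; `outputs` holds its content from each run."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need outputs from at least two runs")
    if path.endswith(STRUCTURED_SUFFIXES):
        # Structured outputs: exact match required, so the score is 0 or 1.
        return 1.0 if all(a == b for a, b in pairs) else 0.0
    # Text outputs: mean pairwise similarity between runs. SequenceMatcher
    # is lexical, not semantic; it only stands in for the real measure.
    return mean(SequenceMatcher(None, a, b).ratio() for a, b in pairs)


def verdict(scores: Dict[str, float],
            threshold: float = 0.95,
            weights: Optional[Dict[str, float]] = None) -> dict:
    """Combine per-artifact scores into an overall score and PASS/FAIL."""
    if not scores:
        raise ValueError("no artifacts to score")
    if weights is None:
        # Assumption: equal weights, since the skill does not specify them.
        weights = {path: 1.0 for path in scores}
    total = sum(weights[path] for path in scores)
    overall = sum(scores[path] * weights[path] for path in scores) / total
    # An artifact passes when its score meets or exceeds the threshold.
    failed = [path for path, score in scores.items() if score < threshold]
    return {
        "result": "FAIL" if failed else "PASS",
        "overall": round(overall, 3),
        "failed": failed,
    }
```

Feeding the three per-artifact scores from Example 4 (0.94, 0.97, 1.00) into `verdict` with the default threshold gives a FAIL result, an overall score of 0.97, and `test-plan.md` as the only flagged artifact, which matches the report shown there. The verdict depends on every artifact individually, so a high overall score does not rescue a single artifact below the threshold.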
Related Skills
validate-metadata
Validate AIWG extension definitions against the metadata schema and report errors with field names, line numbers, and remediation hints
validate-component
Validate a single AIWG component (skill, agent, or command) for completeness and correctness
validate-addon
Validate an entire AIWG addon package for completeness and release readiness
soul-validate
Validate a SOUL.md file against community best practices and quality criteria
setup-validate
Validate a `setup.aiwg.io/v1` SetupManifest file against the schema and run cons
provenance-validate
Validate provenance records and chains for completeness and consistency
prose-validate
Validate an OpenProse program file against Prose contract grammar without executing it — checks frontmatter, contract structure, service references, and strategy syntax
mention-validate
Validate all @-mentions resolve to existing files
devkit-validate
Validate addon, framework, or extension structure and manifest
contract-validate
Validate that a chain of AIWG skills has all requires: inputs satisfied by upstream ensures: outputs before execution. Catches missing dependencies at wiring time rather than at runtime.
aiwg-orchestrate
Route structured artifact work to AIWG workflows via MCP with zero parent context cost
venv-manager
Create, manage, and validate Python virtual environments. Use for project isolation and dependency management.