validate-guidelines

Empirically verify guideline changes by running before/after eval runs across multiple models and ensuring no regressions. Use when proposing or reviewing changes to runner/models/guidelines.ts, or when the user asks to validate guidelines.


by diegosouzapw


Best use case

validate-guidelines is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using validate-guidelines should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/validate-guidelines/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/tools/validate-guidelines/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/validate-guidelines/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How validate-guidelines Compares

Feature / Agent	validate-guidelines	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A


SKILL.md Source

# Validate Guidelines

## When to use

- User proposes or has made changes to `runner/models/guidelines.ts` and wants to ensure they don't regress other models
- User says "validate the guideline changes" or "run the guideline validation"
- Before committing guideline edits, to confirm improvements or no-regression across sonnet, opus, gemini, chatgpt (or a subset)

## Overview

Guideline changes are validated by running evals twice per model: once with the **current** (before) guidelines and once with the **proposed** (after) guidelines. Results are compared; any eval that passed before and fails after is a regression. The goal is to ensure changes improve or at least do not regress scores across multiple models.
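The comparison described above can be sketched in a few lines. This is a minimal illustration, not the script's actual code: the `EvalResults` shape, the `Comparison` interface, and the `compareRuns` function are hypothetical names, and the sketch assumes each run can be reduced to a map from eval id to pass/fail.

```typescript
// Hypothetical shape: eval id -> whether it passed. Not the script's real types.
type EvalResults = Record<string, boolean>;

interface Comparison {
  beforePassed: number;
  afterPassed: number;
  delta: number;
  regressions: string[];
  improvements: string[];
}

// Compare one model's "before" and "after" runs.
// Assumes both runs cover the same evals (same filter for both).
function compareRuns(before: EvalResults, after: EvalResults): Comparison {
  const regressions: string[] = [];
  const improvements: string[] = [];
  for (const name of Object.keys(before).sort()) {
    const passedBefore = before[name] === true;
    const passedAfter = after[name] === true; // missing in "after" counts as a failure
    if (passedBefore && !passedAfter) regressions.push(name);
    if (!passedBefore && passedAfter) improvements.push(name);
  }
  const countPassed = (r: EvalResults): number =>
    Object.values(r).filter((passed) => passed).length;
  const beforePassed = countPassed(before);
  const afterPassed = countPassed(after);
  return {
    beforePassed,
    afterPassed,
    delta: afterPassed - beforePassed,
    regressions,
    improvements,
  };
}

const example = compareRuns(
  { "002-queries/009": true, "002-queries/015": false, "002-queries/020": true },
  { "002-queries/009": true, "002-queries/015": true, "002-queries/020": false },
);
console.log(example.regressions); // ["002-queries/020"]
console.log(example.improvements); // ["002-queries/015"]
```

Note that the example has a delta of 0 yet still contains a regression, which is why the pass counts alone are not enough to call a change safe.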

## Step 1: Identify the change

Determine which guideline sections were modified in `runner/models/guidelines.ts` (e.g. `function_guidelines`, `query_guidelines`, `file_storage_guidelines`) and the intent (new rule, clarification, token compaction).

## Step 2: Build before and after guideline files

- **Before**: The currently committed guidelines. Run `bun run buildRelease.ts` and use the resulting `dist/AGENTS.md`, or render the compact guidelines to a temp file. If the repo is in a clean state, `dist/AGENTS.md` after the build is the "before" snapshot.
- **After**: Guidelines with the proposed changes. Either:
  - Temporarily apply the proposed edits to `runner/models/guidelines.ts`, run `bun run buildRelease.ts`, copy `dist/AGENTS.md` to a temp path (e.g. `guideline-validation/after.md`), then revert the file; or
  - Write the proposed full guideline markdown to a temp file (e.g. by building from a branch or a copy of the file).

Ensure both paths are absolute or relative to the repo root and that the script can read them.

## Step 3: Select target evals

Use the mapping below to choose a `--filter` regex or omit it for the full suite.

| Guideline section | Suggested `--filter` value (regex) |
|-------------------|--------------------------------|
| `function_guidelines` (http, validators, registration, calling, pagination) | `000-fundamentals\|006-clients` or full |
| `validator_guidelines` | `000-fundamentals/009` |
| `schema_guidelines` | `001-data_modeling` |
| `typescript_guidelines` | Omit (run all) |
| `full_text_search_guidelines` | `002-queries/009\|002-queries/020` |
| `query_guidelines` | `002-queries` |
| `mutation_guidelines` | `003-mutations` |
| `action_guidelines` | `004-actions` |
| `scheduling_guidelines` | `000-fundamentals/003\|000-fundamentals/004` |
| `file_storage_guidelines` | `000-fundamentals/007\|004-actions/004\|004-actions/005` |

- **Targeted change** (e.g. one section): use a filter that matches the evals most likely affected.
- **Broad change** (e.g. wording across many sections): omit `--filter` to run all evals.
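To preview which evals a filter would select, the matching can be approximated as below. This is a sketch under assumptions: `selectEvals` is a hypothetical helper, the eval ids are made up for illustration, and it assumes the script applies the filter as an unanchored regular expression against the `category/name` string. The `\|` in the table above appears to be markdown escaping for the table cell; on the command line the alternation would be written with a plain `|` inside quotes, since a backslash-escaped pipe matches a literal `|` character in a JavaScript regex.

```typescript
// Hypothetical helper: returns the eval ids a given filter would select.
// An omitted or empty filter selects everything (the full suite).
function selectEvals(evalIds: string[], filter?: string): string[] {
  if (filter === undefined || filter === "") return [...evalIds];
  const pattern = new RegExp(filter); // unanchored: matches anywhere in the id
  return evalIds.filter((id) => pattern.test(id));
}

// Made-up eval ids in the `category/name` form, for illustration only.
const sampleIds = [
  "000-fundamentals/003-example_a",
  "000-fundamentals/004-example_b",
  "000-fundamentals/007-example_c",
  "002-queries/009-example_d",
];

// Plain `|` gives alternation and selects the first two ids.
const scheduling = selectEvals(sampleIds, "000-fundamentals/003|000-fundamentals/004");
console.log(scheduling);

// Copying the table's `\|` verbatim matches a literal pipe, so nothing is selected.
const copiedVerbatim = selectEvals(sampleIds, "000-fundamentals/003\\|000-fundamentals/004");
console.log(copiedVerbatim.length); // 0
```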

## Step 4: Select models

Default set (preferred for validation): `claude-sonnet-4-5`, `claude-opus-4-6`, `gemini-3-pro-preview`, `gpt-5.2-codex`.

Check which API keys are set in `.env` (e.g. `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`). The script skips models whose provider key is missing and prints a warning. Use a subset if some keys are unavailable; at least two models are recommended.
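Before starting a run, it can help to work out which models will actually execute. The sketch below is illustrative only: `PROVIDER_KEY` and `runnableModels` are hypothetical, and the model-to-key mapping is an assumption based on the key names mentioned above (the real mapping lives in the repo).

```typescript
// Assumed mapping from model name to its provider's env var.
// The authoritative list is in the repo, not here.
const PROVIDER_KEY: Record<string, string> = {
  "claude-sonnet-4-5": "ANTHROPIC_API_KEY",
  "claude-opus-4-6": "ANTHROPIC_API_KEY",
  "gemini-3-pro-preview": "GOOGLE_API_KEY",
  "gpt-5.2-codex": "OPENAI_API_KEY",
};

// Split the requested models into those with a usable key and those without,
// and flag whether the "at least two models" recommendation is met.
function runnableModels(
  models: string[],
  env: Record<string, string | undefined>,
): { runnable: string[]; skipped: string[]; enough: boolean } {
  const runnable: string[] = [];
  const skipped: string[] = [];
  for (const model of models) {
    const keyName = PROVIDER_KEY[model];
    const value = keyName === undefined ? undefined : env[keyName];
    if (value !== undefined && value.trim() !== "") runnable.push(model);
    else skipped.push(model); // unknown model or missing/empty key
  }
  return { runnable, skipped, enough: runnable.length >= 2 };
}

const selection = runnableModels(
  ["claude-sonnet-4-5", "claude-opus-4-6", "gemini-3-pro-preview", "gpt-5.2-codex"],
  { ANTHROPIC_API_KEY: "placeholder-value", OPENAI_API_KEY: "" },
);
console.log(selection.runnable); // ["claude-sonnet-4-5", "claude-opus-4-6"]
console.log(selection.skipped); // ["gemini-3-pro-preview", "gpt-5.2-codex"]
```

In practice the second argument would be the process environment after `.env` has been loaded.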

## Step 5: Run the validation script and monitor to completion

Do **not** set `CONVEX_EVAL_URL` or `CONVEX_AUTH_TOKEN` so results stay local.

```bash
bun run validate:guidelines --before <path-to-before.md> --after <path-to-after.md> --models claude-sonnet-4-5,claude-opus-4-6,gemini-3-pro-preview,gpt-5.2-codex
```

With an eval filter:

```bash
bun run validate:guidelines --before <before.md> --after <after.md> --models claude-sonnet-4-5,gpt-5.2-codex --filter "002-queries"
```

Optional: `--output <path>` to write the JSON summary to a specific file. By default it is written to `guideline-validation/results/<timestamp>.json`.

The script runs each model sequentially: first all evals with "before" guidelines, then all evals with "after" guidelines. Pass/fail is collected and deltas are computed.

**IMPORTANT: You must orchestrate the entire run end-to-end.** Start the command in the background (`block_until_ms: 0`), then poll the terminal output file periodically until the run finishes (look for the `exit_code` footer or the `GUIDELINE VALIDATION SUMMARY` banner). Use exponential backoff for polling (e.g. 30s, 60s, 120s). Do NOT return to the user until the run is fully complete and you have read and analyzed the results. The user expects a complete report, not a "check back later" handoff.
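The polling behaviour described above comes down to two small pieces: a delay schedule and a completion check. The following is a sketch, not part of the script: `backoffDelays` and `isRunFinished` are hypothetical helpers, and the upper cap on the delay is an assumption added so the wait does not grow without bound.

```typescript
// Delay schedule in seconds: doubles each attempt, capped at maxSec.
function backoffDelays(initialSec: number, maxSec: number, attempts: number): number[] {
  const delays: number[] = [];
  let current = initialSec;
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(current, maxSec));
    current *= 2;
  }
  return delays;
}

// The run is treated as finished once either marker named in this step
// shows up in the captured terminal output.
function isRunFinished(output: string): boolean {
  return output.includes("GUIDELINE VALIDATION SUMMARY") || output.includes("exit_code");
}

console.log(backoffDelays(30, 300, 5)); // [30, 60, 120, 240, 300]
console.log(isRunFinished("running evals for claude-sonnet-4-5...")); // false
```

A polling loop would wait for each delay in turn, re-read the output file, and stop as soon as `isRunFinished` returns true.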

## Step 6: Parse and report results

The script prints:

1. A **comparison table**: per-model before pass count, after pass count, delta, number of regressions, number of improvements.
2. **Regressions**: evals that passed before and failed after (by model).
3. **Improvements**: evals that failed before and passed after (by model).
4. A **verdict** line: either "REGRESSIONS DETECTED" or "Safe to commit."

Read the script output and present the full summary table and verdict to the user.

- If there are regressions: list them and recommend reverting or narrowing the guideline change; optionally run analyze-eval on a regression to see why it failed.
- If there are no regressions: recommend committing the guideline change; mention any improvements.

## Step 7: Recommend next steps

- **No regressions, with or without improvements**: Safe to commit the guideline changes.
- **Any regressions**: Do not commit. Suggest reverting the change or narrowing it (e.g. only add the new rule to a subsection that doesn’t affect the regressed eval). Re-run validation after adjusting.
- **Unclear or noisy**: If only one model regresses one eval, consider re-running that model to check for flakiness, or run the full suite once more.
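The three outcomes above can be expressed as a small decision function. This is a sketch of the rules as written in this step, not the script's verdict logic: `Recommendation` and `recommend` are hypothetical names, and the input shape (model name mapped to its regressed eval ids) is an assumption.

```typescript
type Recommendation = "commit" | "do-not-commit" | "rerun-to-check-flakiness";

// Hypothetical input: model name -> ids of evals that regressed for that model.
function recommend(regressionsByModel: Record<string, string[]>): Recommendation {
  const lists = Object.values(regressionsByModel);
  const total = lists.reduce((sum, evals) => sum + evals.length, 0);
  if (total === 0) return "commit";
  const modelsWithRegressions = lists.filter((evals) => evals.length > 0).length;
  // A single regression on a single model may be noise: re-run before deciding.
  if (total === 1 && modelsWithRegressions === 1) return "rerun-to-check-flakiness";
  return "do-not-commit";
}

console.log(recommend({ "claude-sonnet-4-5": [], "gpt-5.2-codex": [] })); // "commit"
console.log(recommend({ "claude-sonnet-4-5": ["002-queries/015"], "gpt-5.2-codex": [] })); // "rerun-to-check-flakiness"
console.log(recommend({ "claude-sonnet-4-5": ["002-queries/015"], "gpt-5.2-codex": ["002-queries/015"] })); // "do-not-commit"
```

A "rerun" result is not a green light: the change stays uncommitted until the re-run shows the regression was flaky.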

## Reference: Script usage

```
bun run validate:guidelines --before <path> --after <path> --models <m1,m2,...> [--filter <regex>] [--output <path>]
```

- `--before`, `--after`: Paths to guideline markdown files (current vs proposed).
- `--models`: Comma-separated model names from `runner/models/index.ts` (e.g. `gpt-5.2-codex`, `claude-sonnet-4-5`).
- `--filter`: Optional regex on eval `category/name` (e.g. `005-idioms` or `002-queries/015`).
- `--output`: Optional path for the JSON summary file.

API keys are loaded from `.env` via dotenv (see AGENTS.md). The script does not report to Convex.
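For reference, the flag contract above can be written out as a small parser. This is a hand-rolled sketch for illustration and is not how the script necessarily parses its arguments: `ValidateArgs` and `parseValidateArgs` are hypothetical, and the sketch assumes every flag takes exactly one value.

```typescript
interface ValidateArgs {
  before: string;
  after: string;
  models: string[];
  filter: string | undefined;
  output: string | undefined;
}

// Parse `--flag value` pairs; --before, --after and --models are required.
function parseValidateArgs(argv: string[]): ValidateArgs {
  const known = new Set(["before", "after", "models", "filter", "output"]);
  const values = new Map<string, string>();
  for (let i = 0; i < argv.length; i += 2) {
    const flag = argv[i] ?? "";
    const value = argv[i + 1];
    const name = flag.startsWith("--") ? flag.slice(2) : "";
    if (!known.has(name)) throw new Error(`Unknown argument: ${flag}`);
    if (value === undefined || value.startsWith("--")) {
      throw new Error(`Missing value for ${flag}`);
    }
    values.set(name, value);
  }
  const required = (name: string): string => {
    const value = values.get(name);
    if (value === undefined) throw new Error(`--${name} is required`);
    return value;
  };
  // --models is a comma-separated list; tolerate stray spaces and empty entries.
  const models = required("models")
    .split(",")
    .map((model) => model.trim())
    .filter((model) => model !== "");
  if (models.length === 0) throw new Error("--models must list at least one model");
  return {
    before: required("before"),
    after: required("after"),
    models,
    filter: values.get("filter"),
    output: values.get("output"),
  };
}

const parsedArgs = parseValidateArgs([
  "--before", "before.md",
  "--after", "after.md",
  "--models", "claude-sonnet-4-5,gpt-5.2-codex",
  "--filter", "002-queries",
]);
console.log(parsedArgs.models); // ["claude-sonnet-4-5", "gpt-5.2-codex"]
```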
