judge-prompt
Design binary pass/fail LLM-as-Judge evaluators. Structured prompt engineering for evaluation: criteria definition, rubric construction, few-shot calibration, and bias mitigation. Produces a ready-to-deploy judge prompt with scoring instructions. Triggers on: "judge prompt", "llm judge", "evaluator prompt", "scoring prompt", "grading rubric"
Best use case
judge-prompt is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using judge-prompt should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/judge-prompt/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
Frequently Asked Questions
What does this skill do?
Design binary pass/fail LLM-as-Judge evaluators. Structured prompt engineering for evaluation: criteria definition, rubric construction, few-shot calibration, and bias mitigation. Produces a ready-to-deploy judge prompt with scoring instructions. Triggers on: "judge prompt", "llm judge", "evaluator prompt", "scoring prompt", "grading rubric"
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# /judge-prompt
> Design binary pass/fail LLM-as-Judge evaluation prompts.
## Purpose
Create a rigorous LLM-as-Judge prompt that evaluates model outputs with binary pass/fail decisions. Walks through criteria definition, rubric construction, few-shot example selection, bias mitigation, and prompt assembly. The output is a self-contained judge prompt ready for deployment in an eval pipeline, with built-in guardrails against common judge biases (position, verbosity, self-preference).
## Usage
```bash
# Interactive judge design
/judge-prompt --task "summarization quality"
# From criteria spec
/judge-prompt --criteria criteria.yaml --examples examples.jsonl
# Design judge for specific eval
/judge-prompt --eval evals/code-review/ --task "code correctness"
# Add bias mitigation
/judge-prompt --task "helpfulness" --mitigate position,verbosity
# Generate judge with calibration examples
/judge-prompt --task "factual accuracy" --calibrate labels/human-ratings.jsonl
```
## Arguments
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--task` | string | required unless `--criteria` is given | What the judge evaluates (e.g., "summarization quality") |
| `--criteria` | string | — | Path to YAML criteria specification |
| `--examples` | string | — | Path to JSONL file with labeled examples for few-shot |
| `--eval` | string | — | Path to eval directory for context |
| `--mitigate` | string | `all` | Biases to mitigate: `position`, `verbosity`, `self-preference`, `all`, `none` |
| `--calibrate` | string | — | Human labels file for calibration examples |
| `--output` | string | `judge-prompt.md` | Output path for the judge prompt |
| `--format` | enum | `markdown` | Output format: `markdown`, `yaml`, `json` |
| `--style` | enum | `binary` | Judgment style: `binary` (pass/fail), `likert` (1-5), `comparative` (A vs B) |
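For the `comparative` style, the `position` mitigation could work roughly as follows. This is a minimal sketch under stated assumptions, not the skill's actual implementation: `randomized_pair` and `unswap` are hypothetical helper names, and the judge is assumed to answer with `first` or `second`.

```python
import random


def randomized_pair(output_a, output_b, rng):
    """Randomly swap presentation order; return the pair plus a flag to undo it."""
    swapped = rng.random() < 0.5
    first, second = (output_b, output_a) if swapped else (output_a, output_b)
    return first, second, swapped


def unswap(winner, swapped):
    """Map the judge's 'first'/'second' pick back to the original 'A'/'B'."""
    if winner not in ("first", "second"):
        raise ValueError(f"unexpected winner: {winner!r}")
    picked_first = winner == "first"
    # Without a swap, 'first' is A; with a swap, 'first' is B.
    return "A" if picked_first != swapped else "B"


rng = random.Random(0)  # fixed seed so the example is reproducible
first, second, swapped = randomized_pair("answer from A", "answer from B", rng)
# The judge only ever sees `first` and `second`; its pick is mapped back afterwards.
original_winner = unswap("first", swapped)
```

Recording the `swapped` flag next to each judgment also makes it possible to check later whether the judge favors one slot regardless of content.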
## Workflow
1. **Define task** — Clarify what the judge evaluates. What does "good" look like? What are the failure modes? What is the minimum quality bar for a pass?
2. **Criteria specification** — Break the task into 2-5 concrete, observable criteria. Each criterion must be: specific (not vague), binary-testable (can answer yes/no), independent (not redundant with others). Example: for summarization — completeness, accuracy, conciseness, coherence.
3. **Rubric construction** — For each criterion, write explicit pass and fail descriptions with boundary examples. Define what a borderline case looks like and which side it falls on. Eliminate ambiguity.
4. **Few-shot examples** — Select 3-5 calibration examples: 1-2 clear passes, 1-2 clear fails, and 1 borderline case with explanation. If `--calibrate` is provided, select examples aligned with human labels.
5. **Bias mitigation** — Add structural safeguards. Position bias: randomize presentation order or require evaluation before seeing alternatives. Verbosity bias: instruct to judge content not length. Self-preference: use a different model family for judging.
6. **Prompt assembly** — Compile the judge prompt with: role definition, task description, criteria with rubric, few-shot examples, output format specification (structured JSON with verdict + reasoning), and bias mitigation instructions.
7. **Validation check** — Self-test the prompt against the few-shot examples. Verify it produces correct verdicts. Flag any inconsistencies.
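Steps 4 and 6 above can be sketched in a few lines of Python. This is only an illustration of the idea: the `Criterion` and `FewShot` classes, their field names, and both functions are assumptions made for this example, not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    passes_if: str
    fails_if: str


@dataclass
class FewShot:
    output: str
    verdict: str       # "pass" or "fail"
    borderline: bool   # True if the case is a close call
    explanation: str


def select_few_shot(pool):
    """Step 4: up to 2 clear passes, 2 clear fails, and 1 borderline case."""
    passes = [e for e in pool if e.verdict == "pass" and not e.borderline][:2]
    fails = [e for e in pool if e.verdict == "fail" and not e.borderline][:2]
    border = [e for e in pool if e.borderline][:1]
    return passes + fails + border


def assemble_prompt(task, criteria, examples):
    """Step 6: role, task, rubric, few-shot examples, output format, bias note."""
    lines = ["You are an evaluation judge.", f"Task: {task}", "", "Criteria:"]
    for c in criteria:
        lines.append(f"- {c.name}: PASS if {c.passes_if}; FAIL if {c.fails_if}")
    lines += ["", "Examples:"]
    for i, e in enumerate(examples, start=1):
        tag = "BORDERLINE -> " + e.verdict.upper() if e.borderline else e.verdict.upper()
        lines.append(f"{i}. ({tag}) {e.output} | {e.explanation}")
    keys = ", ".join(f'"{c.name}": bool' for c in criteria)
    lines += [
        "",
        "Evaluate each criterion independently. Output JSON:",
        f'{{"verdict": "pass" | "fail", "criteria": {{{keys}}}, "reasoning": "..."}}',
        "Judge the content, not the length.",
    ]
    return "\n".join(lines)
```

Keeping selection and assembly separate means the few-shot set can be swapped (for example, after calibration against human labels) without touching the prompt template.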
## Examples
### Designing a summarization judge
````
/judge-prompt --task "summarization quality" --mitigate all

## Judge Prompt — Summarization Quality
### Criteria
1. **Completeness**: Summary captures all key points from the source
2. **Accuracy**: No facts are distorted, added, or misrepresented
3. **Conciseness**: No unnecessary repetition or filler
4. **Coherence**: Summary reads naturally as standalone text
### Rubric
| Criterion | PASS | FAIL | Borderline |
|-----------|------|------|------------|
| Completeness | All main points present | Missing ≥1 key point | Minor supporting detail missing → PASS |
| Accuracy | All facts match source | Any factual error | Imprecise wording without meaning change → PASS |
| Conciseness | No redundancy | Repeats same point 2+ times | Slightly verbose but no repetition → PASS |
| Coherence | Flows naturally | Disjointed or contradictory | Awkward transition → PASS if meaning clear |
### Verdict Rule
- PASS: All 4 criteria pass
- FAIL: Any criterion fails
### Generated Prompt
```
You are an evaluation judge. Assess the following summary against its source document.
[Criteria and rubric inserted here...]
Evaluate each criterion independently. Output your judgment as:
{"verdict": "pass" | "fail", "criteria": {"completeness": bool, "accuracy": bool, "conciseness": bool, "coherence": bool}, "reasoning": "one sentence explanation"}
IMPORTANT: Judge the content, not the style. A shorter summary that captures all key points is as valid as a longer one. Do not penalize conciseness. Do not reward verbosity.
```
````
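On the consuming side, an eval pipeline might parse the judge's JSON reply and enforce the verdict rule itself. The sketch below assumes the reply contains real JSON booleans for the four criteria named in the example; `parse_judgment` is a hypothetical helper, and recomputing the verdict from the criteria (instead of trusting the reported one) is a design choice made for this example only.

```python
import json

CRITERIA = ("completeness", "accuracy", "conciseness", "coherence")


def parse_judgment(raw):
    """Parse the judge's JSON reply and check it against the verdict rule."""
    data = json.loads(raw)
    criteria = data.get("criteria", {})
    missing = [c for c in CRITERIA if c not in criteria]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    if not all(isinstance(criteria[c], bool) for c in CRITERIA):
        raise ValueError("criterion values must be JSON booleans")
    # Verdict rule: PASS only if every criterion passes.
    expected = "pass" if all(criteria[c] for c in CRITERIA) else "fail"
    reported = str(data.get("verdict", "")).lower()
    return {
        "verdict": expected,
        "consistent": reported == expected,  # False if the judge contradicts itself
        "failed": [c for c in CRITERIA if not criteria[c]],
        "reasoning": data.get("reasoning", ""),
    }


reply = (
    '{"verdict": "pass", "criteria": {"completeness": true, "accuracy": false, '
    '"conciseness": true, "coherence": true}, "reasoning": "Date is wrong."}'
)
result = parse_judgment(reply)
print(result["verdict"], result["consistent"], result["failed"])
# prints: fail False ['accuracy']
```

A reply whose `verdict` disagrees with its own per-criterion results is a useful signal that the prompt or rubric needs tightening.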
## Output
````markdown
## Judge Prompt — <task>
### Criteria
1. <criterion>: <description>
...
### Rubric
| Criterion | PASS | FAIL | Borderline → |
|-----------|------|------|--------------|
| ... | ... | ... | ... |
### Few-Shot Examples
#### Example 1 (PASS): ...
#### Example 2 (FAIL): ...
#### Example 3 (BORDERLINE → PASS): ...
### Bias Mitigations
- <mitigation strategy>
### Prompt (ready to deploy)
```text
[Complete prompt text]
```
### Validation
- Few-shot self-test: N/N correct
````
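The "Few-shot self-test: N/N correct" line could be produced by a small harness like the one below. It is a sketch only: `self_test` is a hypothetical helper, and `length_biased_judge` is a deliberately flawed stand-in for the real judge call, used here to show how a verbosity-biased judge surfaces as mismatches.

```python
def self_test(judge, examples):
    """Return (correct, total, mismatches) for a judge over labeled examples."""
    mismatches = []
    total = 0
    for text, expected in examples:
        total += 1
        got = judge(text)
        if got != expected:
            mismatches.append((text, expected, got))
    return total - len(mismatches), total, mismatches


def length_biased_judge(text):
    """Stand-in judge (not an LLM call): passes anything over 40 characters."""
    return "pass" if len(text) > 40 else "fail"


examples = [
    ("Covers every key point of the source in two sentences.", "pass"),
    ("Short but complete.", "pass"),
    ("Repeats the same point again and again and again without adding anything.", "fail"),
    ("Wrong date.", "fail"),
]

correct, total, mismatches = self_test(length_biased_judge, examples)
print(f"Few-shot self-test: {correct}/{total} correct")
for text, expected, got in mismatches:
    print(f"  expected {expected}, got {got}: {text!r}")
```

Both mismatches here come from the stand-in rewarding length, which is the kind of inconsistency the validation step is meant to flag.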
## Dependencies
- Task definition (provided via `--task` or `--criteria`)
- `/validate-evaluator` — Downstream calibration against human labels
- `/synthetic-data` — Upstream if test cases are needed for calibration
- `/error-analysis` — Downstream if judge performance needs debugging
Related Skills
prompt-cache-optimizer
Optimize token usage through prompt caching and compression
meta-prompting
Self-improving prompts through meta-level optimization