sadd:judge

Launch a sub-agent judge to evaluate results produced in the current conversation

23 stars

Best use case

sadd:judge is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using sadd:judge should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/judge/SKILL.md --create-dirs "https://raw.githubusercontent.com/christophacham/agent-skills-library/main/skills/ai-ml/judge/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/judge/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How sadd:judge Compares

| Feature / Agent | sadd:judge | Standard Approach |
|-----------------|------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Launch a sub-agent judge to evaluate results produced in the current conversation

Where can I find the source code?

You can find the source code in the christophacham/agent-skills-library repository on GitHub.

SKILL.md Source

# Judge Command

<task>
You are a coordinator launching a specialized judge sub-agent to evaluate work produced earlier in this conversation. The judge operates with isolated context, provides structured evaluation with evidence-based scoring, and returns actionable feedback.
</task>

<context>
This command implements the LLM-as-Judge pattern with context isolation:
- **Context Isolation**: Judge operates with fresh context, preventing confirmation bias from accumulated session state
- **Chain-of-Thought Scoring**: Justification comes BEFORE the score, a pattern reported to improve scoring reliability by 15-25%
- **Evidence-Based**: Every score requires specific citations from the work (file locations, line numbers)
- **Multi-Dimensional Rubric**: Weighted criteria with clear level descriptions
- **Self-Verification**: Dynamic verification questions with documented adjustments

The evaluation is **report-only** - findings are presented without automatic changes.
</context>

## Your Workflow

### Phase 1: Context Extraction

Before launching the judge, identify what needs evaluation:

1. **Identify the work to evaluate**:
   - Review conversation history for completed work
   - If arguments provided: Use them to focus on specific aspects
   - If unclear: Ask user "What work should I evaluate? (code changes, analysis, documentation, etc.)"

2. **Extract evaluation context**:
   - Original task or request that prompted the work
   - The actual output/result produced
   - Files created or modified (with brief descriptions)
   - Any constraints, requirements, or acceptance criteria mentioned

3. **Present the evaluation scope to the user**:

   ```
   Evaluation Scope:
   - Original request: [summary]
   - Work produced: [description]
   - Files involved: [list]
   - Evaluation focus: [from arguments or "general quality"]

   Launching judge sub-agent...
   ```

**IMPORTANT**: Pass only the extracted context to the judge - not the entire conversation. This prevents context pollution and enables focused assessment.
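
To make the hand-off concrete, here is a minimal sketch of the extracted context as a single structure. The class and field names are illustrative assumptions, not part of this skill's contract; they simply mirror the scope template above.

```python
from dataclasses import dataclass, field

@dataclass
class EvaluationContext:
    """Hypothetical container mirroring the scope template above.

    Only these fields cross the boundary to the judge - never the
    full conversation history.
    """
    original_task: str        # summary of the request that prompted the work
    work_output: str          # description of what was produced
    files_involved: list[str] = field(default_factory=list)  # paths with brief notes
    evaluation_focus: str = "general quality"                # from arguments, if any
```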

### Phase 2: Launch Judge Sub-Agent

Use the Task tool to spawn a single judge agent with the following prompt and context. Adjust the criteria rubric and weights to match the solution type and complexity; example criteria include:

- Code Quality
- Documentation Quality
- Test Coverage
- Security
- Performance
- Usability
- Reliability
- Maintainability
- Scalability
- Cost-effectiveness
- Compliance
- Accessibility

**Judge Agent Prompt:**

```markdown
You are an Expert Judge evaluating the quality of work produced in a development session.

## Work Under Evaluation

[ORIGINAL TASK]
{paste the original request/task}
[/ORIGINAL TASK]

[WORK OUTPUT]
{summary of what was created/modified}
[/WORK OUTPUT]

[FILES INVOLVED]
{list of files with brief descriptions}
[/FILES INVOLVED]

[EVALUATION FOCUS]
{from arguments, or "General quality assessment"}
[/EVALUATION FOCUS]

Read ${CLAUDE_PLUGIN_ROOT}/tasks/judge.md and execute.

## Evaluation Criteria

### Criterion 1: Instruction Following (weight: 0.30)

Does the work follow all explicit instructions and requirements?

**Guiding Questions**:
- Does the output fulfill the original request?
- Were all explicit requirements addressed?
- Are there gaps or unexpected deviations?

| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | All instructions followed precisely, no deviations |
| Good | 4 | Minor deviations that do not affect outcome |
| Adequate | 3 | Major instructions followed, minor ones missed |
| Poor | 2 | Significant instructions ignored |
| Failed | 1 | Fundamentally misunderstood the task |

### Criterion 2: Output Completeness (weight: 0.25)

Are all requested aspects thoroughly covered?

**Guiding Questions**:
- Are all components of the request addressed?
- Is there appropriate depth for each component?
- Are there obvious gaps or missing pieces?

| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | All aspects thoroughly covered with appropriate depth |
| Good | 4 | Most aspects covered with minor gaps |
| Adequate | 3 | Key aspects covered, some notable gaps |
| Poor | 2 | Major aspects missing |
| Failed | 1 | Fundamental aspects not addressed |

### Criterion 3: Solution Quality (weight: 0.25)

Is the approach appropriate and well-implemented?

**Guiding Questions**:
- Is the chosen approach sound and appropriate?
- Does the implementation follow best practices?
- Are there correctness issues or errors?

| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | Optimal approach, clean implementation, best practices followed |
| Good | 4 | Good approach with minor issues |
| Adequate | 3 | Reasonable approach, some quality concerns |
| Poor | 2 | Problematic approach or significant quality issues |
| Failed | 1 | Fundamentally flawed approach |

### Criterion 4: Reasoning Quality (weight: 0.10)

Is the reasoning clear, logical, and well-documented?

**Guiding Questions**:
- Is the decision-making transparent?
- Were appropriate methods/tools used?
- Can someone understand why this approach was taken?

| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | Clear, logical reasoning throughout |
| Good | 4 | Generally sound reasoning with minor gaps |
| Adequate | 3 | Basic reasoning present |
| Poor | 2 | Reasoning unclear or flawed |
| Failed | 1 | No apparent reasoning |

### Criterion 5: Response Coherence (weight: 0.10)

Is the output well-structured and easy to understand?

**Guiding Questions**:
- Is the output organized logically?
- Can someone unfamiliar with the task understand it?
- Is it professionally presented?

| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | Well-structured, clear, professional |
| Good | 4 | Generally coherent with minor issues |
| Adequate | 3 | Understandable but could be clearer |
| Poor | 2 | Difficult to follow |
| Failed | 1 | Incoherent or confusing |

```
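
To illustrate how the five weighted criteria combine into one score, here is a minimal sketch. The function and key names are assumptions for illustration, but the weights come from the rubric above and sum to 1.0; this is the same arithmetic Phase 3 re-checks when it confirms the weighted total.

```python
# Weights from the rubric above (they sum to 1.0).
WEIGHTS = {
    "instruction_following": 0.30,
    "output_completeness": 0.25,
    "solution_quality": 0.25,
    "reasoning_quality": 0.10,
    "response_coherence": 0.10,
}

def weighted_total(scores: dict[str, int]) -> float:
    """Combine per-criterion scores (1-5) into a single weighted total."""
    assert set(scores) == set(WEIGHTS), "every criterion needs exactly one score"
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Example: strong instruction following, weaker reasoning.
print(weighted_total({
    "instruction_following": 5,
    "output_completeness": 4,
    "solution_quality": 4,
    "reasoning_quality": 3,
    "response_coherence": 4,
}))  # 0.30*5 + 0.25*4 + 0.25*4 + 0.10*3 + 0.10*4 = 4.20
```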

### Phase 3: Process and Present Results

After receiving the judge's evaluation:

1. **Validate the evaluation** (see the sketch after this list):
   - Check that all criteria have scores in valid range (1-5)
   - Verify each score has supporting justification with evidence
   - Confirm weighted total calculation is correct
   - Check for contradictions between justification and score
   - Verify self-verification was completed with documented adjustments

2. **If validation fails**:
   - Note the specific issue
   - Request clarification or re-evaluation if needed

3. **Present results to user**:
   - Display the full evaluation report
   - Highlight the verdict and key findings
   - Offer follow-up options:
     - Address specific improvements
     - Request clarification on any judgment
     - Proceed with the work as-is
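
A minimal sketch of those validation checks, assuming the judge returns one record per criterion; the record shape and names are hypothetical, not dictated by this skill:

```python
from dataclasses import dataclass

@dataclass
class CriterionResult:
    name: str
    justification: str  # evidence first: file paths, line numbers, quotes
    score: int          # 1-5, assigned only after the justification

def validate(results: list[CriterionResult],
             weights: dict[str, float],
             reported_total: float) -> list[str]:
    """Return a list of problems; an empty list means the report passes."""
    problems = []
    for r in results:
        if not 1 <= r.score <= 5:
            problems.append(f"{r.name}: score {r.score} outside 1-5")
        if not r.justification.strip():
            problems.append(f"{r.name}: score lacks supporting justification")
        if r.name not in weights:
            problems.append(f"{r.name}: not a rubric criterion")
    recomputed = sum(weights[r.name] * r.score
                     for r in results if r.name in weights)
    if abs(recomputed - reported_total) > 0.005:  # allow rounding in the report
        problems.append(
            f"reported total {reported_total} != recomputed {recomputed:.2f}")
    return problems
```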

## Scoring Interpretation

| Score Range | Verdict | Interpretation | Recommendation |
|-------------|---------|----------------|----------------|
| 4.50 - 5.00 | EXCELLENT | Exceptional quality, exceeds expectations | Ready as-is |
| 4.00 - 4.49 | GOOD | Solid quality, meets professional standards | Minor improvements optional |
| 3.50 - 3.99 | ACCEPTABLE | Adequate but has room for improvement | Improvements recommended |
| 3.00 - 3.49 | NEEDS IMPROVEMENT | Below standard, requires work | Address issues before use |
| 1.00 - 2.99 | INSUFFICIENT | Does not meet basic requirements | Significant rework needed |
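
The bands above translate directly into a lookup; a small sketch, with the function name assumed:

```python
def verdict(total: float) -> str:
    """Map a weighted total (1.00-5.00) to the verdict bands above."""
    if total >= 4.50:
        return "EXCELLENT"
    if total >= 4.00:
        return "GOOD"
    if total >= 3.50:
        return "ACCEPTABLE"
    if total >= 3.00:
        return "NEEDS IMPROVEMENT"
    return "INSUFFICIENT"

print(verdict(4.20))  # GOOD: minor improvements optional
```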

## Important Guidelines

1. **Context Isolation**: Pass only relevant context to the judge - not the entire conversation
2. **Justification First**: Always require evidence and reasoning BEFORE the score
3. **Evidence-Based**: Every score must cite specific evidence (file paths, line numbers, quotes)
4. **Bias Mitigation**: Explicitly warn against length bias, verbosity bias, and authority bias
5. **Be Objective**: Base assessments on evidence and rubric definitions, not preferences
6. **Be Specific**: Cite exact locations, not vague observations
7. **Be Constructive**: Frame criticism as opportunities for improvement with impact context
8. **Consider Context**: Account for stated constraints, complexity, and requirements
9. **Report Confidence**: Lower confidence when evidence is ambiguous or criteria unclear
10. **Single Judge**: This command uses one focused judge for context isolation

## Notes

- This is a **report-only** command - it evaluates but does not modify work
- The judge operates with fresh context for unbiased assessment
- Scores are calibrated to professional development standards
- Low scores indicate improvement opportunities, not failures
- Use the evaluation to inform next steps and iterations
- Pass threshold (3.5/5.0) represents acceptable quality for general use
- Adjust threshold based on criticality (4.0+ for critical operations)
- Low confidence evaluations may warrant human review

Related Skills

All from christophacham/agent-skills-library:

  • sadd:judge-with-debate: Evaluate solutions through multi-round debate between independent judges until consensus
  • sadd:tree-of-thoughts: Execute tasks through systematic exploration, pruning, and expansion using Tree of Thoughts methodology with multi-agent evaluation
  • sadd:subagent-driven-development: Use when executing implementation plans with independent tasks in the current session, or when facing 3+ independent issues that can be investigated without shared state or dependencies; dispatches a fresh subagent for each task with code review between tasks, enabling fast iteration with quality gates
  • sadd:multi-agent-patterns: Design multi-agent architectures for complex tasks. Use when single-agent context limits are exceeded, when tasks decompose naturally into subtasks, or when specializing agents improves quality.
  • sadd:launch-sub-agent: Launch an intelligent sub-agent with automatic model selection based on task complexity, specialized agent matching, Zero-shot CoT reasoning, and mandatory self-critique verification
  • sadd:do-in-steps: Execute complex tasks through sequential sub-agent orchestration with intelligent model selection and LLM-as-a-judge verification
  • sadd:do-competitively: Execute tasks through competitive multi-agent generation, multi-judge evaluation, and evidence-based synthesis
  • sadd:do-and-judge: Execute a task with sub-agent implementation and LLM-as-a-judge verification with an automatic retry loop
  • genderapi-io-automation: Automate Genderapi IO tasks via Rube MCP (Composio). Always search tools first for current schemas.
  • gender-api-automation: Automate Gender API tasks via Rube MCP (Composio). Always search tools first for current schemas.
  • fred-economic-data: Query the FRED (Federal Reserve Economic Data) API for 800,000+ economic time series from 100+ sources. Access GDP, unemployment, inflation, interest rates, exchange rates, housing, and regional data. Use for macroeconomic analysis, financial research, policy studies, economic forecasting, and academic research requiring U.S. and international economic indicators.
  • fidel-api-automation: Automate Fidel API tasks via Rube MCP (Composio). Always search tools first for current schemas.