llm-judge

Use an LLM as an automated judge to score responses against a rubric. Use when you need nuanced quality evaluation that can't be done with exact match or regex. Triggers include "score with AI", "llm-judge", "rubric-based scoring", "quality evaluation", or any need to rate free-form responses.

7 stars

Best use case

llm-judge is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Use an LLM as an automated judge to score responses against a rubric. Use when you need nuanced quality evaluation that can't be done with exact match or regex. Triggers include "score with AI", "llm-judge", "rubric-based scoring", "quality evaluation", or any need to rate free-form responses.

Teams using llm-judge should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/llm-judge/SKILL.md --create-dirs "https://raw.githubusercontent.com/heldernoid/agentic-build-templates/main/projects/ai-llm-tools/eval-runner/skills/llm-judge/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/llm-judge/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How llm-judge Compares

Feature / Agentllm-judgeStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Use an LLM as an automated judge to score responses against a rubric. Use when you need nuanced quality evaluation that can't be done with exact match or regex. Triggers include "score with AI", "llm-judge", "rubric-based scoring", "quality evaluation", or any need to rate free-form responses.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# llm-judge

Use an LLM model to score responses against a natural language rubric. Returns a 0.0-1.0 score with reasoning.

## How it works

The judge model receives:
1. The rubric describing what a good response looks like
2. The actual response to evaluate

The judge returns a JSON object: `{"score": 0.0-1.0, "reason": "brief explanation"}`

## Configuration in YAML

```yaml
scoring:
  - method: llm-judge
    rubric: "The answer clearly explains the return policy and includes the timeframe"
    threshold: 0.8       # score >= 0.8 = pass (default: 0.7)
    weight: 2            # double weight vs other scorers (default: 1)
    endpoint: openai-prod  # override default judge endpoint
    model: gpt-4o-mini    # override default judge model
```

## Rubric Writing Tips

Write rubrics as positive statements of what a good response contains:

Good rubric:
```
"The response mentions the 30-day return window and explains the condition requirement"
```

Bad rubric (vague):
```
"The response is good"
```

Good rubric for tone:
```
"The response is empathetic, acknowledges the customer's frustration, and provides a concrete next step"
```

## Default Configuration

Set globally in Settings or environment:
- Default judge endpoint: `openai-prod` (configurable)
- Default judge model: `gpt-4o-mini` (configurable - cheaper models work well for judging)

## Prompt Injection Protection

Case input is wrapped in XML delimiters before insertion into the judge prompt to prevent injection attacks:

```
<case_input>
{user's case input}
</case_input>
```

## Cost Considerations

llm-judge makes one additional LLM call per scored case. For 100 test cases, each with one judge scorer, you incur 100 extra LLM calls. Use `gpt-4o-mini` or `claude-3-haiku` as the judge to minimize cost.

Related Skills

Skill: Uptime Monitoring

7
from heldernoid/agentic-build-templates

## Overview

Skill: Status Page

7
from heldernoid/agentic-build-templates

## Overview

Skill: unit-conversion

7
from heldernoid/agentic-build-templates

## Overview

Skill: recipe-scaler

7
from heldernoid/agentic-build-templates

## Overview

reading-list

7
from heldernoid/agentic-build-templates

Operate the reading-list API to save, manage, tag, search, and export articles.

email-digest

7
from heldernoid/agentic-build-templates

Configure, test, and troubleshoot the reading-list daily email digest delivered via nodemailer.

websocket-realtime

7
from heldernoid/agentic-build-templates

Use the WebSocket connection in poll-builder to receive live vote updates. Use when you need to stream real-time poll results, monitor a poll for new votes, or build a live dashboard. Triggers include "live results", "real-time updates", "stream votes", "watch poll", or "WebSocket".

poll-builder

7
from heldernoid/agentic-build-templates

Self-hosted poll creation tool with real-time results. Use when you need to create a poll, check vote counts, close a poll, export results, or get the shareable link for a poll. Triggers include "create poll", "vote", "poll results", "survey", "collect votes", "share poll", or any task involving polling or voting.

Skill: personal-finance

7
from heldernoid/agentic-build-templates

## Overview

Skill: csv-import

7
from heldernoid/agentic-build-templates

## Overview

Skill: Syntax Highlighting

7
from heldernoid/agentic-build-templates

## Purpose

Skill: Pastebin Core

7
from heldernoid/agentic-build-templates

## Purpose