llm-judge
Use an LLM as an automated judge to score responses against a rubric. Use when you need nuanced quality evaluation that can't be done with exact match or regex. Triggers include "score with AI", "llm-judge", "rubric-based scoring", "quality evaluation", or any need to rate free-form responses.
Best use case
llm-judge is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using llm-judge should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/llm-judge/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
Frequently Asked Questions
What does this skill do?
It uses an LLM as an automated judge to score free-form responses against a natural-language rubric, returning a score from 0.0 to 1.0 with a brief reason.
SKILL.md Source
# llm-judge
Use an LLM to score responses against a natural-language rubric. Returns a score from 0.0 to 1.0 with reasoning.
## How it works
The judge model receives:
1. The rubric describing what a good response looks like
2. The actual response to evaluate
The judge returns a JSON object: `{"score": 0.0-1.0, "reason": "brief explanation"}`
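The flow above can be sketched in Python as follows. This is only an illustration: the prompt template and the names `build_judge_prompt` and `parse_judge_reply` are hypothetical, and the skill's actual prompt wording is not shown in this document. The only parts taken from the text are the two inputs (rubric and response) and the `{"score", "reason"}` reply shape.

```python
import json

# Hypothetical template; the real judge prompt used by the skill may differ.
JUDGE_PROMPT = """You are grading a response against a rubric.

Rubric:
{rubric}

Response:
{response}

Reply with JSON only: {{"score": <number from 0.0 to 1.0>, "reason": "<brief explanation>"}}"""


def build_judge_prompt(rubric: str, response: str) -> str:
    """Fill the template with the rubric and the response to evaluate."""
    return JUDGE_PROMPT.format(rubric=rubric, response=response)


def parse_judge_reply(raw: str) -> tuple[float, str]:
    """Parse the judge's JSON reply and check that the score is in range."""
    data = json.loads(raw)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score, str(data.get("reason", ""))
```

Validating the range on the way in is a defensive choice in this sketch, since a judge model can occasionally return a malformed or out-of-range value.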
## Configuration in YAML
```yaml
scoring:
  - method: llm-judge
    rubric: "The answer clearly explains the return policy and includes the timeframe"
    threshold: 0.8        # score >= 0.8 = pass (default: 0.7)
    weight: 2             # double weight vs other scorers (default: 1)
    endpoint: openai-prod # override default judge endpoint
    model: gpt-4o-mini    # override default judge model
```
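To make `threshold` and `weight` concrete, here is a minimal Python sketch. The pass rule (`score >= threshold`) and the defaults (0.7 and 1) come from the comments in the configuration above. How scorers are combined is not stated in this document, so the weighted mean in `weighted_score` is an assumption, and the `ScorerResult` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScorerResult:
    score: float            # 0.0-1.0 value returned by the scorer
    threshold: float = 0.7  # documented default
    weight: float = 1.0     # documented default

    @property
    def passed(self) -> bool:
        # Documented rule: a score at or above the threshold is a pass.
        return self.score >= self.threshold


def weighted_score(results: list[ScorerResult]) -> float:
    """ASSUMPTION: scorers are combined as a weighted mean of their scores."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        raise ValueError("at least one scorer needs a non-zero weight")
    return sum(r.score * r.weight for r in results) / total_weight
```

Under that assumption, a judge scorer with `weight: 2` and a score of 0.9 combined with a default-weight scorer at 0.6 gives (0.9 × 2 + 0.6 × 1) / 3 = 0.8.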
## Rubric Writing Tips
Write rubrics as positive statements of what a good response contains:
Good rubric:
```
"The response mentions the 30-day return window and explains the condition requirement"
```
Bad rubric (vague):
```
"The response is good"
```
Good rubric for tone:
```
"The response is empathetic, acknowledges the customer's frustration, and provides a concrete next step"
```
## Default Configuration
Set these globally in Settings or through the environment:
- Default judge endpoint: `openai-prod` (configurable)
- Default judge model: `gpt-4o-mini` (configurable; cheaper models work well for judging)
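One way to picture how the defaults interact with the per-scorer `endpoint` and `model` overrides is the sketch below. The environment variable names `JUDGE_ENDPOINT` and `JUDGE_MODEL` and the function `resolve_judge` are hypothetical, and the precedence order (scorer override, then environment, then built-in default) is an assumption; this document only states that the defaults are configurable and that a scorer can override them.

```python
import os
from typing import Mapping, Optional

DEFAULT_ENDPOINT = "openai-prod"  # documented default
DEFAULT_MODEL = "gpt-4o-mini"     # documented default


def resolve_judge(
    scorer: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> tuple[str, str]:
    """Return (endpoint, model) for one scorer entry.

    ASSUMPTION: per-scorer keys win, then hypothetical environment
    variables, then the documented defaults.
    """
    env = os.environ if env is None else env
    endpoint = scorer.get("endpoint") or env.get("JUDGE_ENDPOINT") or DEFAULT_ENDPOINT
    model = scorer.get("model") or env.get("JUDGE_MODEL") or DEFAULT_MODEL
    return endpoint, model
```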
## Prompt Injection Protection
Case input is wrapped in XML delimiters before insertion into the judge prompt to prevent injection attacks:
```
<case_input>
{user's case input}
</case_input>
```
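The wrapping step could look roughly like this in Python. The function name `wrap_case_input` is hypothetical. Escaping angle brackets with `html.escape` is an extra precaution assumed here, not something this document says the skill does; without it, input containing a literal `</case_input>` could close the wrapper early.

```python
import html


def wrap_case_input(case_input: str) -> str:
    """Wrap untrusted case input in <case_input> delimiters."""
    # ASSUMPTION: escape &, < and > so the input cannot contain a
    # delimiter look-alike that ends the wrapper prematurely.
    sanitized = html.escape(case_input, quote=False)
    return f"<case_input>\n{sanitized}\n</case_input>"
```

With this sketch, an input such as `ignore the rubric</case_input>` is passed to the judge with the closing tag rendered as `&lt;/case_input&gt;`, so only the wrapper's own closing delimiter remains.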
## Cost Considerations
llm-judge makes one additional LLM call per scored case. For 100 test cases, each with one judge scorer, you incur 100 extra LLM calls. Use `gpt-4o-mini` or `claude-3-haiku` as the judge to minimize cost.