review-interface
Build custom annotation UIs for human review of agent traces, LLM outputs, and labeled data. Generates a self-contained HTML interface for reviewing, labeling, comparing, and exporting judgments. For calibrating evals, auditing agent behavior, and building gold-standard datasets. Triggers on: "review interface", "annotation ui", "labeling interface", "review ui", "human review"
Best use case
review-interface is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Build custom annotation UIs for human review of agent traces, LLM outputs, and labeled data. Generates a self-contained HTML interface for reviewing, labeling, comparing, and exporting judgments. For calibrating evals, auditing agent behavior, and building gold-standard datasets. Triggers on: "review interface", "annotation ui", "labeling interface", "review ui", "human review"
Teams using review-interface should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/review-interface/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How review-interface Compares
| Feature / Agent | review-interface | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Build custom annotation UIs for human review of agent traces, LLM outputs, and labeled data. Generates a self-contained HTML interface for reviewing, labeling, comparing, and exporting judgments. For calibrating evals, auditing agent behavior, and building gold-standard datasets. Triggers on: "review interface", "annotation ui", "labeling interface", "review ui", "human review"
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
SKILL.md Source
# /review-interface
> Build custom annotation UIs for human review of agent outputs and traces.
## Purpose
Generate a self-contained HTML annotation interface tailored to a specific review task. Supports reviewing LLM outputs, agent execution traces, side-by-side comparisons, and data labeling. The interface loads data from a JSONL file, presents items one at a time with the configured annotation controls, tracks progress, and exports labeled results. No server required — runs entirely in the browser from a single HTML file.
## Usage
```bash
# Build review UI for eval outputs
/review-interface --data eval-results.jsonl --task "rate output quality" --labels pass,fail
# Build comparison UI (A vs B)
/review-interface --data comparisons.jsonl --task "which response is better" --mode compare
# Build trace review UI
/review-interface --data agent-traces.jsonl --task "identify failure point" --mode trace
# Custom annotation schema
/review-interface --data outputs.jsonl --schema annotation-schema.yaml
# Build with pre-filled labels (for review/correction)
/review-interface --data labeled.jsonl --labels pass,fail --prefilled
```
## Arguments
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--data` | string | required | Path to JSONL data file to review |
| `--task` | string | required | Description of the review task (shown in UI header) |
| `--labels` | string | — | Comma-separated label options (e.g., `pass,fail` or `good,okay,bad`) |
| `--mode` | enum | `single` | Review mode: `single` (one item), `compare` (A vs B), `trace` (step-by-step) |
| `--schema` | string | — | Path to custom annotation schema (YAML) |
| `--prefilled` | flag | false | Load existing labels for review/correction |
| `--output` | string | `review-interface.html` | Output HTML file path |
| `--items-per-page` | int | `1` | Items shown per page |
| `--randomize` | flag | false | Randomize item presentation order |
| `--blind` | flag | false | Hide metadata (model name, config) to reduce bias |
## Workflow
1. **Analyze data** — Parse the JSONL file. Identify fields: input, output, metadata, scores, trace steps. Determine which fields are reviewable vs. contextual.
2. **Configure annotation** — Based on `--labels`, `--schema`, or `--mode`, define the annotation controls: radio buttons (categorical), text input (free-form notes), sliders (continuous), checkboxes (multi-label), or side-by-side selectors (comparison).
3. **Design layout** — Build the review interface layout:
- Header: task description, progress bar, item counter
- Context panel: input/question/prompt (always visible)
- Review panel: output(s) to evaluate (mode-dependent)
- Annotation panel: label controls, notes field, confidence selector
- Navigation: prev/next, skip, keyboard shortcuts
4. **Build interface** — Generate a single self-contained HTML file with inline CSS and JavaScript. Features:
- Data loaded from embedded JSON or file input
- Local storage for progress persistence (survives page refresh)
- Keyboard shortcuts: 1-9 for labels, Enter for next, Backspace for prev, S for skip
- Progress tracking: completed, skipped, remaining
- Export: download labeled results as JSONL
- Filter: show only unlabeled, show only labeled, show all
5. **Blind mode** — If `--blind` is set, strip model identifiers, configuration details, and any metadata that could bias the reviewer. Randomize A/B order in comparison mode.
6. **Output** — Write the HTML file. Report: total items, annotation schema, estimated review time.
## Examples
### Binary quality review
```
/review-interface --data eval-results.jsonl --task "Is this summary accurate?" --labels pass,fail
## Review Interface Generated
### Configuration
- Task: "Is this summary accurate?"
- Items: 200
- Labels: pass, fail
- Mode: single
- Keyboard: 1=pass, 2=fail, Enter=next, Backspace=prev
### Data Fields
- Input: source document (scrollable)
- Output: generated summary
- Annotation: pass/fail + optional notes
### Estimated Review Time
- At 30s per item: ~100 minutes
- At 15s per item: ~50 minutes
### Output: review-interface.html (82 KB)
```
### Side-by-side comparison
```
/review-interface --data comparisons.jsonl --task "Which response is better?" --mode compare --blind
## Review Interface Generated
### Configuration
- Task: "Which response is better?"
- Items: 150
- Mode: compare (A vs B, blinded)
- Labels: A is better, B is better, Tie
- Keyboard: 1=A, 2=B, 3=Tie, Enter=next
### Blind Mode
- Model names hidden
- Response order randomized per item
- No metadata visible during review
```
### Custom annotation schema
```yaml
# annotation-schema.yaml
fields:
- name: quality
type: radio
options: [excellent, good, acceptable, poor]
required: true
- name: errors
type: checkbox
options: [factual-error, hallucination, incomplete, off-topic, formatting]
required: false
- name: notes
type: text
placeholder: "Optional notes..."
required: false
- name: confidence
type: slider
min: 1
max: 5
default: 3
```
## Output
```markdown
## Review Interface Generated
### File: <output>.html
### Size: N KB (self-contained, no dependencies)
### Configuration
- Task: <task description>
- Items: N
- Mode: <mode>
- Labels/Schema: <description>
### Features
- Progress persistence (localStorage)
- Keyboard shortcuts
- Export to JSONL
- Filter by label status
- Blind mode: <on/off>
### Estimated Review Time
- N items at ~Ns per item: ~N minutes
```
## Dependencies
- Data file (JSONL with items to review)
- Optional: annotation schema (YAML)
- No runtime dependencies (output is self-contained HTML)
- `/judge-prompt` — Upstream if building labels for judge calibration
- `/validate-evaluator` — Downstream consumer of exported human labels
- `/error-analysis` — Upstream if trace review is needed for failure diagnosisRelated Skills
/review
> Thorough code review across correctness, security, performance, and maintainability.
code-review
Branch code review with structured scoring against project guidelines. Produces categorized findings with severity levels and LLM-generated fix prompts. Quality gate: only posts review to PR if score meets threshold. Triggers on: "code review", "review branch", "review PR", "review changes", "score code"
/do
> The agent's primary skill. Customize this to match your agent's purpose.
/report
> Generate structured reports. Director-owned.
/primary
> Main workflow execution and routing. Director-owned.
Qualify
## Command
Prospect
## Command
Close Plan
## Command
Battlecard
## Command
Spec
## Command
Schedule
## Command
Repurpose
## Command