agentic-eval
Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when: - Implementing self-critique and reflection loops - Building evaluator-optimizer pipelines for quality-critical generation - Creating test-driven code refinement workflows - Designing rubric-based or LLM-as-judge evaluation systems - Adding iterative improvement to agent outputs (code, reports, analysis) - Measuring and improving agent response quality
Best use case
agentic-eval is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when: - Implementing self-critique and reflection loops - Building evaluator-optimizer pipelines for quality-critical generation - Creating test-driven code refinement workflows - Designing rubric-based or LLM-as-judge evaluation systems - Adding iterative improvement to agent outputs (code, reports, analysis) - Measuring and improving agent response quality
Teams using agentic-eval should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/agentic-eval/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How agentic-eval Compares
| Feature / Agent | agentic-eval | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when: - Implementing self-critique and reflection loops - Building evaluator-optimizer pipelines for quality-critical generation - Creating test-driven code refinement workflows - Designing rubric-based or LLM-as-judge evaluation systems - Adding iterative improvement to agent outputs (code, reports, analysis) - Measuring and improving agent response quality
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
ChatGPT vs Claude for Agent Skills
Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.
SKILL.md Source
# Agentic Evaluation Patterns
Patterns for self-improvement through iterative evaluation and refinement.
## Overview
Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.
```
Generate → Evaluate → Critique → Refine → Output
↑ │
└──────────────────────────────┘
```
## When to Use
- **Quality-critical generation**: Code, reports, analysis requiring high accuracy
- **Tasks with clear evaluation criteria**: Defined success metrics exist
- **Content requiring specific standards**: Style guides, compliance, formatting
---
## Pattern 1: Basic Reflection
Agent evaluates and improves its own output through self-critique.
```python
def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
"""Generate with reflection loop."""
output = llm(f"Complete this task:\n{task}")
for i in range(max_iterations):
# Self-critique
critique = llm(f"""
Evaluate this output against criteria: {criteria}
Output: {output}
Rate each: PASS/FAIL with feedback as JSON.
""")
critique_data = json.loads(critique)
all_pass = all(c["status"] == "PASS" for c in critique_data.values())
if all_pass:
return output
# Refine based on critique
failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
output = llm(f"Improve to address: {failed}\nOriginal: {output}")
return output
```
**Key insight**: Use structured JSON output for reliable parsing of critique results.
---
## Pattern 2: Evaluator-Optimizer
Separate generation and evaluation into distinct components for clearer responsibilities.
```python
class EvaluatorOptimizer:
def __init__(self, score_threshold: float = 0.8):
self.score_threshold = score_threshold
def generate(self, task: str) -> str:
return llm(f"Complete: {task}")
def evaluate(self, output: str, task: str) -> dict:
return json.loads(llm(f"""
Evaluate output for task: {task}
Output: {output}
Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
"""))
def optimize(self, output: str, feedback: dict) -> str:
return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")
def run(self, task: str, max_iterations: int = 3) -> str:
output = self.generate(task)
for _ in range(max_iterations):
evaluation = self.evaluate(output, task)
if evaluation["overall_score"] >= self.score_threshold:
break
output = self.optimize(output, evaluation)
return output
```
---
## Pattern 3: Code-Specific Reflection
Test-driven refinement loop for code generation.
```python
class CodeReflector:
def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
code = llm(f"Write Python code for: {spec}")
tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")
for _ in range(max_iterations):
result = run_tests(code, tests)
if result["success"]:
return code
code = llm(f"Fix error: {result['error']}\nCode: {code}")
return code
```
---
## Evaluation Strategies
### Outcome-Based
Evaluate whether output achieves the expected result.
```python
def evaluate_outcome(task: str, output: str, expected: str) -> str:
return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")
```
### LLM-as-Judge
Use LLM to compare and rank outputs.
```python
def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
return llm(f"Compare outputs A and B for {criteria}. Which is better and why?")
```
### Rubric-Based
Score outputs against weighted dimensions.
```python
RUBRIC = {
"accuracy": {"weight": 0.4},
"clarity": {"weight": 0.3},
"completeness": {"weight": 0.3}
}
def evaluate_with_rubric(output: str, rubric: dict) -> float:
scores = json.loads(llm(f"Rate 1-5 for each dimension: {list(rubric.keys())}\nOutput: {output}"))
return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5
```
---
## Best Practices
| Practice | Rationale |
|----------|-----------|
| **Clear criteria** | Define specific, measurable evaluation criteria upfront |
| **Iteration limits** | Set max iterations (3-5) to prevent infinite loops |
| **Convergence check** | Stop if output score isn't improving between iterations |
| **Log history** | Keep full trajectory for debugging and analysis |
| **Structured output** | Use JSON for reliable parsing of evaluation results |
---
## Quick Start Checklist
```markdown
## Evaluation Implementation Checklist
### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)
### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop
### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully
```Related Skills
tech-stack-evaluator
Technology stack evaluation and comparison with TCO analysis, security assessment, and ecosystem health scoring. Use when comparing frameworks, evaluating technology stacks, calculating total cost of ownership, assessing migration paths, or analyzing ecosystem viability.
youtube-watcher
Fetch and read transcripts from YouTube videos. Use when you need to summarize a video, answer questions about its content, or extract information from it.
youtube-transcript
Fetch and summarize YouTube video transcripts. Use when asked to summarize, transcribe, or extract content from YouTube videos. Handles transcript fetching via residential IP proxy to bypass YouTube's cloud IP blocks.
youtube-auto-captions - YouTube 自动字幕
## 描述
youtube
YouTube Data API integration with managed OAuth. Search videos, manage playlists, access channel data, and interact with comments. Use this skill when users want to interact with YouTube. For other third party apps, use the api-gateway skill (https://clawhub.ai/byungkyu/api-gateway).
yahoo-finance
Get stock prices, quotes, fundamentals, earnings, options, dividends, and analyst ratings using Yahoo Finance. Uses yfinance library - no API key required.
xurl
A Twitter research and content intelligence skill focused on attracting WordPress and Shopify clients. Use to analyze Twitter profiles, threads, and conversations for: (1) Identifying what small agency founders and eCommerce brands are discussing; (2) Understanding pain points around WordPress performance, Shopify CRO, and development bottlenecks; (3) Extracting high-performing content angles; (4) Turning insights into authority-building posts; (5) Converting Twitter intelligence into business leverage for clear content angles, strong positioning, and qualified inbound leads.
xlsx
Use this skill any time a spreadsheet file is the primary input or output. This means any task where the user wants to: open, read, edit, or fix an existing .xlsx, .xlsm, .csv, or .tsv file (e.g., adding columns, computing formulas, formatting, charting, cleaning messy data); create a new spreadsheet from scratch or from other data sources; or convert between tabular file formats. Trigger especially when the user references a spreadsheet file by name or path — even casually (like "the xlsx in my downloads") — and wants something done to it or produced from it. Also trigger for cleaning or restructuring messy tabular data files (malformed rows, misplaced headers, junk data) into proper spreadsheets. The deliverable must be a spreadsheet file. Do NOT trigger when the primary deliverable is a Word document, HTML report, standalone Python script, database pipeline, or Google Sheets API integration, even if tabular data is involved.
xiaohongshu-mcp
Automate Xiaohongshu (RedNote) content operations using a Python client for the xiaohongshu-mcp server. Use for: (1) Publishing image, text, and video content, (2) Searching for notes and trends, (3) Analyzing post details and comments, (4) Managing user profiles and content feeds. Triggers: xiaohongshu automation, rednote content, publish to xiaohongshu, xiaohongshu search, social media management.
twitter-openclaw
Interact with Twitter/X — read tweets, search, post, like, retweet, and manage your timeline.
x-twitter-growth
X/Twitter growth engine for building audience, crafting viral content, and analyzing engagement. Use when the user wants to grow on X/Twitter, write tweets or threads, analyze their X profile, research competitors on X, plan a posting strategy, or optimize engagement. Complements social-content (generic multi-platform) with X-specific depth: algorithm mechanics, thread engineering, reply strategy, profile optimization, and competitive intelligence via web search.
akshare-online-alpha
Run Wyckoff master-style analysis from stock codes, holdings (symbol/cost/qty), cash, CSV data, and optional chart images. Use when users want online multi-source data fetching with source switching, strict Beijing-time trading-session checks, fixed system prompt analysis, single-stock analysis, holding rotation, holding add/reduce suggestions, or empty-position cash deployment suggestions.