Build Your Evaluation Skill
Create a reusable skill for evaluating fine-tuned models, benchmarking performance, and detecting quality regressions
Best use case
Build Your Evaluation Skill is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using Build Your Evaluation Skill should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/69-evaluation-quality-gates/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How Build Your Evaluation Skill Compares
| Feature / Agent | Build Your Evaluation Skill | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
SKILL.md Source
# Build Your Evaluation Skill
Before learning about model evaluation, you will build the skill that captures that knowledge. This skill-first approach means every concept you learn gets encoded into a reusable asset that becomes part of your Digital FTE toolkit.
When you fine-tune a model, how do you know it actually improved? A model might generate fluent text that completely misses the point. Evaluation frameworks provide systematic methods to measure what matters: accuracy, format compliance, reasoning quality, and safety. By the end of this chapter, you will have a skill that guides evaluation decisions for any fine-tuned model.
## Step 1: Clone Skills Lab Fresh
Every chapter starts with a clean environment. This prevents state pollution from previous work and ensures reproducible results.
```bash
# Navigate to your workspace
cd ~/workspace
# Clone fresh skills-lab (or reset if exists)
if [ -d "skills-lab-llmops" ]; then
  rm -rf skills-lab-llmops
fi
git clone https://github.com/panaversity/skills-lab.git skills-lab-llmops
cd skills-lab-llmops
# Create chapter directory
mkdir -p llmops-evaluation
cd llmops-evaluation
```
**Output:**
```
Cloning into 'skills-lab-llmops'...
remote: Enumerating objects: 156, done.
remote: Counting objects: 100% (156/156), done.
Receiving objects: 100% (156/156), 45.23 KiB | 2.26 MiB/s, done.
```
## Step 2: Write Your LEARNING-SPEC.md
Before fetching documentation, articulate what you want to learn. This specification drives focused learning.
Create `LEARNING-SPEC.md`:
```markdown
# Learning Specification: LLM Evaluation & Quality Gates
## Intent
Learn to systematically evaluate fine-tuned models to ensure they meet quality standards before deployment.
## What I Want to Learn
1. **Evaluation Taxonomy**: What metrics matter for different use cases?
2. **LLM-as-Judge**: How to use GPT-4 as an evaluator for subjective quality
3. **Benchmark Design**: How to create task-specific benchmarks for the Task API
4. **Regression Testing**: How to detect when model quality degrades
5. **Quality Gates**: How to define pass/fail thresholds for deployment
## Success Criteria
- [ ] I can select appropriate evaluation metrics for a given task
- [ ] I can implement LLM-as-Judge with structured rubrics
- [ ] I can create a custom benchmark for JSON output validation
- [ ] I can detect quality regression between model versions
- [ ] I can define quality gates that block bad deployments
## Constraints
- Must work on Colab Free Tier (T4, 15GB VRAM)
- Focus on practical evaluation, not research benchmarks
- Use lm-evaluation-harness as the primary tool
- Integrate with Task API from Chapter 40
## Prior Knowledge
- Chapter 64: SFT fundamentals
- Chapter 65-68: Various fine-tuning approaches
- Chapter 40: Task API structure
## Time Budget
- This lesson: 25 minutes (skill creation)
- Full chapter: ~4 hours (all evaluation concepts)
```
## Step 3: Fetch Official Documentation
Use Context7 to retrieve the authoritative lm-evaluation-harness documentation. This ensures your skill is grounded in official patterns, not hallucinated best practices.
```
/fetching-library-docs lm-evaluation-harness
```
**Key concepts to extract from documentation:**
| Concept | What It Means |
|---------|---------------|
| **Task** | A specific evaluation benchmark (e.g., "hellaswag", "mmlu") |
| **Model** | The model being evaluated (supports HuggingFace, OpenAI, local) |
| **Metric** | What gets measured (accuracy, perplexity, exact match) |
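| **Few-shot** | Number of worked examples included in the prompt before the question being evaluated |
| **Log-likelihood** | Log-probability the model assigns to a candidate answer; the highest-scoring candidate is compared with the correct one |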
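The Log-likelihood row is easier to picture with numbers. The following is a minimal sketch, not code from lm-evaluation-harness: it assumes each multiple-choice option has already been given a summed log-probability, takes the highest-scoring option as the prediction, and reports accuracy. The function names and all scores are hypothetical.

```python
def pick_answer(choice_logprobs):
    """Return the index of the choice with the highest log-likelihood."""
    return max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])


def accuracy(examples):
    """Fraction of examples where the top-scoring choice is the gold answer."""
    correct = sum(1 for logprobs, gold in examples if pick_answer(logprobs) == gold)
    return correct / len(examples)


# Hypothetical data: (log-likelihood per choice, index of the correct choice)
examples = [
    ([-4.2, -1.3, -6.0, -5.1], 1),  # model prefers choice 1, which is correct
    ([-2.0, -2.5, -0.9, -3.3], 2),  # model prefers choice 2, which is correct
    ([-1.1, -3.0, -2.2, -4.0], 3),  # model prefers choice 0, which is wrong
]
print(f"accuracy: {accuracy(examples):.2f}")
```

Because only the ranking of the choices matters here, the model never has to generate free text, which is why this style of metric suits classification-like benchmarks better than open-ended generation.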
## Step 4: Create Your Initial Skill
Create `llmops-evaluator/SKILL.md`:
```markdown
---
name: llmops-evaluator
description: "This skill should be used when evaluating fine-tuned LLM quality. Use when selecting metrics, designing benchmarks, running evaluations, or setting quality gates for model deployment."
---
# LLMOps Evaluator Skill
## When to Use This Skill
Invoke this skill when you need to:
- Evaluate a fine-tuned model before deployment
- Compare model versions for regression
- Design custom benchmarks for your use case
- Set pass/fail thresholds for CI/CD pipelines
- Debug why a model is underperforming
## Evaluation Decision Framework
### Step 1: Identify Evaluation Type
| Use Case | Evaluation Type | Primary Metrics |
|----------|----------------|-----------------|
| Classification | Accuracy-based | Accuracy, F1, Precision, Recall |
| Generation | Quality-based | Perplexity, BLEU, ROUGE |
| Instruction-following | LLM-as-Judge | Rubric scores (1-5) |
| JSON output | Format validation | Schema compliance rate |
| Safety | Red-teaming | Harmful response rate |
### Step 2: Select Benchmarks
**Standard Benchmarks** (for general capability):
- **MMLU**: General knowledge across domains
- **HellaSwag**: Common-sense reasoning
- **ARC**: Science reasoning
- **TruthfulQA**: Factual accuracy
**Task-Specific Benchmarks** (for your domain):
- Create custom evaluation sets matching your use case
- Minimum: 100 examples for reliable measurement
- Include edge cases and failure modes
### Step 3: Run Evaluation
```bash
# Basic evaluation with lm-eval-harness
lm_eval --model hf \
  --model_args pretrained=my-fine-tuned-model \
  --tasks hellaswag,arc_easy \
  --batch_size 8 \
  --output_path ./results
```
### Step 4: Define Quality Gates
**Deployment Thresholds**:
- Accuracy: > 85% on task-specific benchmark
- Harmful response rate: < 5%
- Schema compliance: > 95% for JSON output
- Regression: New model accuracy >= Previous model accuracy - 2 percentage points
## Common Patterns
### Pattern 1: A/B Model Comparison
```python
def compare_models(model_a_results, model_b_results, threshold=0.02):
    """Compare two models and determine if B is a regression from A."""
    delta = model_b_results['accuracy'] - model_a_results['accuracy']
    if delta < -threshold:
        return "REGRESSION", f"Model B is {abs(delta):.2%} worse"
    elif delta > threshold:
        return "IMPROVEMENT", f"Model B is {delta:.2%} better"
    else:
        return "EQUIVALENT", f"Within {threshold:.2%} threshold"
```
### Pattern 2: LLM-as-Judge Template
```python
JUDGE_PROMPT = """
Evaluate the assistant's response on a scale of 1-5:
User Request: {input}
Assistant Response: {output}
Expected Behavior: {expected}
Criteria:
- Accuracy: Does the response correctly address the request?
- Format: Does the response follow the expected format?
- Helpfulness: Is the response useful and complete?
Score (1-5):
Reasoning:
"""
```
## Quality Gate Checklist
Before deploying a fine-tuned model, verify:
- [ ] Task-specific accuracy > threshold
- [ ] No regression from previous version
- [ ] Format compliance verified
- [ ] Safety evaluation passed
- [ ] Cost/latency within budget
```
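The deployment thresholds listed in the skill's Step 4 can be turned into a single pass/fail check. This is a rough sketch under stated assumptions: the metric keys (`accuracy`, `harmful_rate`, `schema_compliance`) and the function name are hypothetical, and all values are expected as fractions between 0 and 1.

```python
def check_quality_gates(metrics, previous_accuracy=None):
    """Return (passed, failures) using the thresholds from the skill's Step 4.

    The metric keys are assumed names for illustration only.
    """
    failures = []
    if metrics["accuracy"] <= 0.85:
        failures.append("accuracy must be above 85%")
    if metrics["harmful_rate"] >= 0.05:
        failures.append("harmful response rate must be below 5%")
    if metrics["schema_compliance"] <= 0.95:
        failures.append("schema compliance must be above 95%")
    # Regression gate: allow at most a 2-point drop from the previous version
    if previous_accuracy is not None and metrics["accuracy"] < previous_accuracy - 0.02:
        failures.append("accuracy regressed by more than 2 points")
    return (not failures, failures)


# Hypothetical results for a candidate model
candidate = {"accuracy": 0.90, "harmful_rate": 0.01, "schema_compliance": 0.97}
passed, failures = check_quality_gates(candidate, previous_accuracy=0.91)
print("PASS" if passed else "FAIL", failures)
```

Returning the list of failed gates, rather than a bare boolean, makes it easier for a CI job to report why a deployment was blocked.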
## Step 5: Verify Skill Works
Confirm that the skill file exists and starts with the expected frontmatter:
```bash
# Print the first 20 lines to confirm the file exists and the frontmatter is in place
head -20 llmops-evaluator/SKILL.md
```
**Output:**
```
---
name: llmops-evaluator
description: "This skill should be used when evaluating fine-tuned LLM quality..."
---
# LLMOps Evaluator Skill
...
```
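Printing the top of the file shows the frontmatter but does not check it. A small script can do that. The sketch below is an assumption-laden illustration, not an official validator: it handles only simple `key: value` lines (it is not a YAML parser), and it treats `name` and `description`, the two fields shown above, as the ones to require. The helper names are hypothetical.

```python
def read_frontmatter(text):
    """Parse simple 'key: value' pairs between the leading '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file does not start with '---'")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    raise ValueError("frontmatter is not closed with '---'")


def missing_fields(text, required=("name", "description")):
    """Return the required frontmatter fields that are absent or empty."""
    fields = read_frontmatter(text)
    return [key for key in required if not fields.get(key)]


# Inline sample standing in for the contents of llmops-evaluator/SKILL.md
sample = (
    "---\n"
    "name: llmops-evaluator\n"
    'description: "Use when evaluating fine-tuned LLM quality."\n'
    "---\n"
    "# LLMOps Evaluator Skill\n"
)
print(missing_fields(sample))
```

To run it against the real file, read the text from `llmops-evaluator/SKILL.md` and pass it to `missing_fields`; an empty list means both fields are present.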
Your skill now exists as a starting point. As you progress through this chapter, you will add:
- Detailed evaluation taxonomy (L01)
- LLM-as-Judge implementation patterns (L02)
- Task-specific benchmark design (L03)
- Regression testing workflows (L04)
- Quality gate configurations (L05)
## Skill Evolution Map
Track how your skill grows through this chapter:
| Lesson | What Gets Added |
|--------|----------------|
| L00 (now) | Initial framework, basic decision tree |
| L01 | Evaluation taxonomy, metric selection guide |
| L02 | LLM-as-Judge prompts and rubrics |
| L03 | Custom benchmark creation patterns |
| L04 | A/B testing, regression detection |
| L05 | CI/CD gate configurations |
| L06 | Complete pipeline integration |
## Try With AI
### Prompt 1: Review Your LEARNING-SPEC
```
I wrote this LEARNING-SPEC.md for learning LLM evaluation:
[paste your LEARNING-SPEC.md]
1. Are my success criteria specific and measurable?
2. What am I missing that would be important for production evaluation?
3. Do my constraints match real-world limitations?
```
**What you are learning**: Specification refinement. Your AI partner helps identify gaps in your learning goals before you invest time in the wrong direction.
### Prompt 2: Expand the Skill Framework
```
I'm building an llmops-evaluator skill. Review my initial framework:
[paste your SKILL.md]
Suggest 3 additional decision frameworks I should include for:
1. Choosing between automated metrics vs human evaluation
2. Determining sample size for reliable benchmarks
3. Handling evaluation of creative/open-ended outputs
```
**What you are learning**: Skill architecture. Evaluation has many dimensions. Your AI partner helps identify frameworks you might not have considered.
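The sample-size question in this prompt has a quick back-of-the-envelope answer. The sketch below uses the normal approximation to the binomial, which is an assumption (it becomes less accurate for very small benchmarks or accuracies near 0 or 1), and the function names are hypothetical. It estimates the 95% margin of error on a measured accuracy and how many examples a target margin would require.

```python
import math


def accuracy_margin(accuracy, n, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n examples."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)


def examples_needed(accuracy, margin, z=1.96):
    """Approximate number of examples needed to reach a target margin of error."""
    return math.ceil(z**2 * accuracy * (1 - accuracy) / margin**2)


# With 100 examples and 85% measured accuracy, the margin is roughly 7 points
print(f"margin at n=100: {accuracy_margin(0.85, 100):.3f}")
# Resolving a 2-point difference at the same accuracy needs far more examples
print(f"examples for a 0.02 margin: {examples_needed(0.85, 0.02)}")
```

Under these assumptions, a 100-example benchmark is enough to tell whether a model is broadly working, but not enough to trust a 2-point regression signal on its own.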
### Prompt 3: Connect to Task API
```
My fine-tuned model outputs JSON for a Task API with this schema:
{
"action": "create|complete|list|delete",
"title": "string",
"priority": "low|medium|high",
"due_date": "string|null"
}
Design 5 evaluation test cases that would catch common failure modes:
- Invalid JSON
- Missing required fields
- Wrong action selection
- Inappropriate priority assignment
- Format consistency issues
```
**What you are learning**: Domain-specific evaluation design. Generic benchmarks miss your specific requirements. Your AI partner helps design tests that match your actual use case.
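Several of the failure modes in this prompt can be caught with a plain validator before any model-based judging. The following is a minimal sketch using only the standard library. It assumes all four fields in the schema are required, which the schema above does not state, and the function names and sample outputs are hypothetical.

```python
import json

ACTIONS = {"create", "complete", "list", "delete"}
PRIORITIES = {"low", "medium", "high"}
FIELDS = ("action", "title", "priority", "due_date")


def validate_output(raw):
    """Return a list of problems; an empty list means the output is compliant."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    # Assumption: every field in the schema is required
    problems = [f"missing field: {name}" for name in FIELDS if name not in data]
    action = data.get("action")
    if "action" in data and not (isinstance(action, str) and action in ACTIONS):
        problems.append("unknown action")
    if "title" in data and not isinstance(data["title"], str):
        problems.append("title is not a string")
    priority = data.get("priority")
    if "priority" in data and not (isinstance(priority, str) and priority in PRIORITIES):
        problems.append("unknown priority")
    due_date = data.get("due_date")
    if "due_date" in data and not (due_date is None or isinstance(due_date, str)):
        problems.append("due_date is not a string or null")
    return problems


def compliance_rate(outputs):
    """Fraction of raw model outputs that pass validation."""
    return sum(1 for raw in outputs if not validate_output(raw)) / len(outputs)


# Hypothetical model outputs: one valid, one truncated, one with a wrong action
outputs = [
    '{"action": "create", "title": "Buy milk", "priority": "low", "due_date": null}',
    '{"action": "create", "title": "Buy milk"',
    '{"action": "remove", "title": "Buy milk", "priority": "low", "due_date": null}',
]
for raw in outputs:
    print(validate_output(raw))
print(f"schema compliance: {compliance_rate(outputs):.2f}")
```

A check like this covers invalid JSON, missing fields, and out-of-range values. Wrong action selection and inappropriate priority still need labeled expected outputs, since a response can be schema-valid and semantically wrong.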
### Safety Note
As you build evaluation frameworks, remember that evaluation can give false confidence. A model passing benchmarks does not guarantee safety in deployment. Always include human review for novel situations and maintain logging for post-deployment monitoring.