ai-eval-design-and-iteration

Develop "quizzes" (evals) to measure model performance on specific tasks. Use these benchmarks to guide fine-tuning, determine product UX patterns, and track performance improvements over time. Use this when launching a new AI feature, switching between model versions, or optimizing for high-stakes accuracy.


by diegosouzapw


Best use case

ai-eval-design-and-iteration is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using ai-eval-design-and-iteration should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/ai-eval-design-and-iteration/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/design/ai-eval-design-and-iteration/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/ai-eval-design-and-iteration/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How ai-eval-design-and-iteration Compares

Feature / Agent	ai-eval-design-and-iteration	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

It helps you design evals ("quizzes") that measure model performance on specific tasks, then use the scores to guide fine-tuning, choose product UX patterns, and track improvements over time.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# AI Eval Design and Iteration

In traditional software, inputs and outputs are defined. In AI, inputs and outputs are fuzzy. Evals (evaluations) are the "unit tests" for AI products. They allow you to move from "vibes-based" development to metric-driven iteration. By building a rigorous "quiz" for your model, you can determine exactly how capable your product is and where it requires human-in-the-loop scaffolding.

## The Eval Workflow

### 1. Identify "Hero Use Cases"
Don't start with generic benchmarks (like MMLU). Instead, define the specific "hero" scenarios your product must master.
- Identify the 10–20 most common or high-value queries users will give your model.
- For each query, define what a "Perfect/Gold" answer looks like.
- Include edge cases where you expect the model to struggle (e.g., complex reasoning or specific formatting).

### 2. Design the "Quiz" (The Eval)
Create a set of tests to gauge how well the model knows the subject material.
- **Input:** The specific prompt or instruction.
- **Reference:** The "Gold" standard answer or a set of criteria (e.g., "Must mention X," "Must not exceed 200 words").
- **Scoring Mechanism:** Use a more powerful model (like o1 or GPT-4o) to grade the output of your production model against your criteria.
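
The three parts of a quiz item could be represented roughly as follows. This is a minimal sketch: `EvalCase`, `grade`, and the `judge` callable are hypothetical names, not part of any library, and the refund-policy case is made up for illustration. In practice `judge` would wrap a call to a stronger grading model; here it is a stub so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# A criterion is a named yes/no check on the model output (hypothetical shape).
Criterion = tuple[str, Callable[[str], bool]]


@dataclass
class EvalCase:
    prompt: str                                  # Input
    reference: Optional[str] = None              # "Gold" answer, if one exists
    criteria: list[Criterion] = field(default_factory=list)


def grade(
    case: EvalCase,
    output: str,
    judge: Optional[Callable[[str, str, str], bool]] = None,
) -> dict[str, bool]:
    """Return pass/fail per criterion, plus an optional judge verdict."""
    results = {name: check(output) for name, check in case.criteria}
    if judge is not None and case.reference is not None:
        # judge(prompt, reference, output) stands in for a stronger grading model.
        results["matches_reference"] = judge(case.prompt, case.reference, output)
    return results


# Made-up case for illustration only.
case = EvalCase(
    prompt="Summarize our refund policy.",
    reference="Refunds are available within 30 days of purchase.",
    criteria=[
        ("mentions_30_days", lambda out: "30 days" in out),
        ("under_200_words", lambda out: len(out.split()) <= 200),
    ],
)


def stub_judge(prompt: str, reference: str, output: str) -> bool:
    # Stub: a real judge would ask a grading model to compare output and reference.
    return "refund" in output.lower()


report = grade(case, "Refunds are accepted within 30 days.", judge=stub_judge)
print(report)
```

Keeping the criteria as plain predicates means cheap rule-based checks and the model-graded verdict end up in the same report, so either can be swapped out without touching the cases.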

### 3. Apply the "Hill Climbing" Process
Use the eval scores to guide your development cycle.
- Run the eval on your baseline model.
- **Fine-Tune:** If scores are low, provide 1,000+ examples of "Problem -> Good Answer" to the model to "teach" it the specific task.
- **Re-Test:** Run the eval again to see if performance increased.
- **Iterate:** If performance plateaus, break the problem down into smaller sub-tasks, optionally handled by an ensemble of specialized models, and create a specific eval for each sub-task.
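
The loop above might look like this in code. It is a sketch under stated assumptions: `run_eval` and `hill_climb` are hypothetical helpers, a "model" is any function from prompt to output, and the toy lookup-table models stand in for a baseline and two fine-tuned candidates.

```python
from typing import Callable

Model = Callable[[str], str]
Case = tuple[str, Callable[[str], bool]]  # (prompt, pass/fail check)


def run_eval(model: Model, cases: list[Case]) -> float:
    """Fraction of cases whose output passes its check."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def hill_climb(baseline: Model, candidates: list[tuple[str, Model]], cases: list[Case]):
    """Re-test each candidate and keep it only if it beats the best score so far."""
    best_model, best_score = baseline, run_eval(baseline, cases)
    history = [("baseline", best_score)]
    for name, model in candidates:
        score = run_eval(model, cases)
        history.append((name, score))
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score, history


# Toy quiz and toy "models" (lookup tables), for illustration only.
cases = [
    ("2+2", lambda out: out == "4"),
    ("3+3", lambda out: out == "6"),
    ("5+5", lambda out: out == "10"),
    ("7+1", lambda out: out == "8"),
]
baseline = lambda p: {"2+2": "4"}.get(p, "?")
tuned_v1 = lambda p: {"2+2": "4", "3+3": "6", "5+5": "10"}.get(p, "?")
tuned_v2 = lambda p: {"2+2": "4", "3+3": "6"}.get(p, "?")

best_model, best_score, history = hill_climb(
    baseline, [("tuned_v1", tuned_v1), ("tuned_v2", tuned_v2)], cases
)
print(history)
```

Recording the full history, not just the winner, makes a plateau visible: when several candidates in a row fail to raise the score, that is the signal to split the task.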

### 4. Determine UX Based on Accuracy Thresholds
The "score" of your eval dictates the product's user interface. Kevin Weil's 60/95/99.5 rule:
- **60% Accuracy:** Build a "Co-pilot" or "Draft" experience where the user must heavily edit the output.
- **95% Accuracy:** Build a "Human-in-the-loop" experience where the model does the work, and a human briefly reviews it.
- **99.5% Accuracy:** Build an "Agentic" or "Automated" experience where the model acts autonomously.
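
Read as a lookup, the thresholds above could be sketched like this. The function name and return labels are hypothetical, and two behaviours are assumptions the text does not spell out: scores between thresholds fall into the lower tier, and scores below 60% are treated as not ready to ship.

```python
def ux_pattern(accuracy: float) -> str:
    """Map an eval accuracy in [0, 1] to a UX pattern."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if accuracy >= 0.995:
        return "automated"          # model acts autonomously
    if accuracy >= 0.95:
        return "human_in_the_loop"  # model does the work, human briefly reviews
    if accuracy >= 0.60:
        return "copilot_draft"      # user heavily edits the output
    return "not_ready"              # assumption: below 60% is not shippable


for score in (0.55, 0.70, 0.97, 0.999):
    print(score, ux_pattern(score))
```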

## Examples

**Example 1: Deep Research Tool**
- **Context:** Building a tool that researches a topic for 30 minutes and writes a 20-page report.
- **The Eval:** A prompt asking to "Compare the competitive landscape of fusion energy companies in 2024."
- **Criteria:** Does it mention Helion? Does it cite sources? Is the report 15+ pages?
- **Application:** If the model gets the history right but misses current news, the team adds an eval specifically for "Recency" and fine-tunes the browsing tool.
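
The three criteria in this example are simple enough to check with rules before involving a grading model. The sketch below is illustrative only: `check_report` is a hypothetical helper, the citation pattern (a URL or a bracketed number) is a guess at what "cites sources" means, and 500 words per page is an assumed conversion.

```python
import re

WORDS_PER_PAGE = 500  # assumption: rough words-per-page conversion


def check_report(report: str) -> dict[str, bool]:
    """Rule-based checks for the fusion-landscape report criteria."""
    words = len(report.split())
    return {
        "mentions_helion": "helion" in report.lower(),
        # Assumption: a citation is a URL or a bracketed number such as [3].
        "cites_sources": bool(re.search(r"https?://\S+|\[\d+\]", report)),
        "at_least_15_pages": words / WORDS_PER_PAGE >= 15,
    }


# Tiny made-up report: passes the first two checks but is far too short.
sample = "Helion is pursuing field-reversed configurations [1]. " * 10
result = check_report(sample)
print(result)
```

A "Recency" criterion would not fit this rule-based shape as easily and is a better candidate for a model-graded check.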

**Example 2: Customer Support Agent**
- **Context:** An automated agent to handle refunds and technical questions.
- **The Eval:** 500 historic tickets with verified "correct" resolutions.
- **Application:** The team finds the model is 98% accurate on refunds but only 70% on technical debugging.
- **Output:** The UX is designed to automate refunds instantly but route all technical questions to a human agent with a "suggested" draft.
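
Scoring the tickets per category, not in aggregate, is what makes this split visible. In the sketch below the helper names are hypothetical, the 300/200 division of the 500 tickets is invented so that the accuracies match the example, and the 0.95 automation cut-off is an assumption chosen by the hypothetical team (the example automates refunds at 98%).

```python
from collections import defaultdict
from typing import Iterable


def accuracy_by_category(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """results holds one (category, is_correct) pair per graded ticket."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        correct[category] += int(ok)
    return {category: correct[category] / totals[category] for category in totals}


def route(accuracy: float, automate_at: float = 0.95) -> str:
    """Automate a category only when its accuracy clears the chosen cut-off."""
    return "automate" if accuracy >= automate_at else "draft_for_human"


# Hypothetical split of the 500 tickets: 300 refund, 200 technical.
results = (
    [("refund", True)] * 294 + [("refund", False)] * 6
    + [("technical", True)] * 140 + [("technical", False)] * 60
)

scores = accuracy_by_category(results)
plan = {category: route(acc) for category, acc in scores.items()}
print(scores)
print(plan)
```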

## Common Pitfalls

- **Using Static Evals:** AI models and user behaviors change every few months. If you don't update your "quiz" to reflect new capabilities or user errors, your metrics will become meaningless.
- **Over-Scaffolding for Today's Model:** Avoid building complex "if/then" code to fix a model's current mistake. In 2-3 months, a better model will launch that solves that mistake naturally. Build for the *next* model's capabilities.
- **Ignoring the "Human Analogy":** When an eval fails, ask: "How would I teach a human to do this?" If a human would need a checklist or a peer review, build that into your model's chain-of-thought process.
- **Relying on "Vibes" for Launch:** Never ship a model update because it "feels better" on three prompts. Only ship if the aggregate eval score shows statistically significant improvement.
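
For the last pitfall, one possible significance check is an exact McNemar (sign) test, which suits the case where the old and new model are graded on the same eval cases. The text does not prescribe a particular test, so treat this as one option; the function name and the pass/fail data are hypothetical.

```python
from math import comb


def mcnemar_exact_p(old_pass: list[bool], new_pass: list[bool]) -> float:
    """One-sided exact McNemar p-value for 'new model beats old model'."""
    if len(old_pass) != len(new_pass):
        raise ValueError("both models must be graded on the same cases")
    gained = sum(1 for o, n in zip(old_pass, new_pass) if n and not o)
    lost = sum(1 for o, n in zip(old_pass, new_pass) if o and not n)
    discordant = gained + lost
    if discordant == 0:
        return 1.0  # the models never disagree, so there is no evidence either way
    # P(X >= gained) for X ~ Binomial(discordant, 0.5)
    tail = sum(comb(discordant, k) for k in range(gained, discordant + 1))
    return tail / 2 ** discordant


# "Feels better on three prompts": three wins, no losses.
p_three = mcnemar_exact_p([False, False, False], [True, True, True])

# Hypothetical 100-case eval: 70 both pass, 20 gained, 5 lost, 5 both fail.
old = [True] * 70 + [False] * 20 + [True] * 5 + [False] * 5
new = [True] * 70 + [True] * 20 + [False] * 5 + [False] * 5
p_full = mcnemar_exact_p(old, new)

print(p_three)        # 0.125
print(p_full < 0.05)  # True
```

Three straight wins give a p-value of 0.125, which does not clear the conventional 0.05 bar, while the same direction of improvement across a larger eval set does.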
