ai-eval-design-and-iteration
Develop "quizzes" (evals) to measure model performance on specific tasks. Use these benchmarks to guide fine-tuning, determine product UX patterns, and track performance improvements over time. Use this when launching a new AI feature, switching between model versions, or optimizing for high-stakes accuracy.
Best use case
ai-eval-design-and-iteration is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using ai-eval-design-and-iteration should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/ai-eval-design-and-iteration/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How ai-eval-design-and-iteration Compares
| Feature / Agent | ai-eval-design-and-iteration | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# AI Eval Design and Iteration

In traditional software, inputs and outputs are defined. In AI, inputs and outputs are fuzzy. Evals (evaluations) are the "unit tests" for AI products. They allow you to move from "vibes-based" development to metric-driven iteration. By building a rigorous "quiz" for your model, you can determine exactly how capable your product is and where it requires human-in-the-loop scaffolding.

## The Eval Workflow

### 1. Identify "Hero Use Cases"

Don't start with generic benchmarks (like MMLU). Instead, define the specific "hero" scenarios your product must master.

- Identify the 10–20 most common or high-value queries users will give your model.
- For each query, define what a "Perfect/Gold" answer looks like.
- Include edge cases where you expect the model to struggle (e.g., complex reasoning or specific formatting).

### 2. Design the "Quiz" (The Eval)

Create a set of tests to gauge how well the model knows the subject material.

- **Input:** The specific prompt or instruction.
- **Reference:** The "Gold" standard answer or a set of criteria (e.g., "Must mention X," "Must not exceed 200 words").
- **Scoring Mechanism:** Use a more powerful model (like O1 or GPT-4o) to grade the output of your production model based on your criteria.

### 3. Apply the "Hill Climbing" Process

Use the eval scores to guide your development cycle.

- Run the eval on your baseline model.
- **Fine-Tune:** If scores are low, provide 1,000+ examples of "Problem -> Good Answer" to the model to "teach" it the specific task.
- **Re-Test:** Run the eval again to see if performance increased.
- **Iterate:** If performance plateaus, break the problem down into smaller tasks (ensembling) and create specific evals for each sub-task.

### 4. Determine UX Based on Accuracy Thresholds

The "score" of your eval dictates the product's user interface. Kevin Weil's 60/95/99 Rule:

- **60% Accuracy:** Build a "Co-pilot" or "Draft" experience where the user must heavily edit the output.
- **95% Accuracy:** Build a "Human-in-the-loop" experience where the model does the work, and a human briefly reviews it.
- **99.5% Accuracy:** Build an "Agentic" or "Automated" experience where the model acts autonomously.

## Examples

**Example 1: Deep Research Tool**

- **Context:** Building a tool that researches a topic for 30 minutes and writes a 20-page report.
- **The Eval:** A prompt asking to "Compare the competitive landscape of fusion energy companies in 2024."
- **Criteria:** Does it mention Helion? Does it cite sources? Is the report 15+ pages?
- **Application:** If the model gets the history right but misses current news, the team adds an eval specifically for "Recency" and fine-tunes the browsing tool.

**Example 2: Customer Support Agent**

- **Context:** An automated agent to handle refunds and technical questions.
- **The Eval:** 500 historic tickets with verified "correct" resolutions.
- **Application:** The team finds the model is 98% accurate on refunds but only 70% on technical debugging.
- **Output:** The UX is designed to automate refunds instantly but route all technical questions to a human agent with a "suggested" draft.

## Common Pitfalls

- **Using Static Evals:** AI models and user behaviors change every few months. If you don't update your "quiz" to reflect new capabilities or user errors, your metrics will become meaningless.
- **Over-Scaffolding for Today's Model:** Avoid building complex "if/then" code to fix a model's current mistake. In 2-3 months, a better model will launch that solves that mistake naturally. Build for the *next* model's capabilities.
- **Ignoring the "Human Analogy":** When an eval fails, ask: "How would I teach a human to do this?" If a human would need a checklist or a peer review, build that into your model's chain-of-thought process.
- **Relying on "Vibes" for Launch:** Never ship a model update because it "feels better" on three prompts. Only ship if the aggregate eval score shows statistically significant improvement.
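
The quiz design in step 2 and the accuracy thresholds in step 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the original skill: `EvalCase`, `grade`, `run_eval`, `ux_mode`, and `toy_model` are hypothetical names, the keyword-and-length check is a deterministic stand-in for a grader model, and treating the 60/95/99.5 figures as lower bounds (with a "not ready" tier below 60%) is an interpretation rather than something the rule states.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One quiz item: a prompt plus the criteria a good answer must meet."""
    prompt: str
    must_mention: List[str]
    max_words: int = 200


def grade(case: EvalCase, answer: str) -> bool:
    """Deterministic stand-in for a grader model: checks the listed criteria."""
    text = answer.lower()
    mentions_all = all(term.lower() in text for term in case.must_mention)
    within_limit = len(answer.split()) <= case.max_words
    return mentions_all and within_limit


def run_eval(cases: List[EvalCase], model: Callable[[str], str]) -> float:
    """Return the fraction of eval cases the model passes."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(grade(case, model(case.prompt)) for case in cases)
    return passed / len(cases)


def ux_mode(accuracy: float) -> str:
    """Map an eval score to a UX tier (thresholds as lower bounds: an assumption)."""
    if accuracy >= 0.995:
        return "automated"
    if accuracy >= 0.95:
        return "human-in-the-loop"
    if accuracy >= 0.60:
        return "co-pilot"
    return "not ready"


def toy_model(prompt: str) -> str:
    """Hypothetical stub standing in for a call to the production model."""
    if "fusion" in prompt:
        return "Helion and other companies are pursuing fusion."
    return "Please contact support."


cases = [
    EvalCase("Compare fusion energy companies.", must_mention=["Helion"]),
    EvalCase("Explain the refund policy.", must_mention=["refund"], max_words=50),
]

score = run_eval(cases, toy_model)
print(score, ux_mode(score))  # prints: 0.5 not ready
```

In practice `grade` would call the stronger grader model with the same criteria, and `cases` would hold the 10–20 hero queries. Keeping the grader behind a single function makes it easy to swap implementations while the rest of the loop stays unchanged.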