ai-product-evaluation-design

Transition from traditional PRDs to "Evals" (evaluations) to guide AI model behavior. Use this skill when launching new AI features, debugging unpredictable model outputs, or moving from a prompted prototype to a production-ready agent.

Best use case

ai-product-evaluation-design is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using ai-product-evaluation-design should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/ai-product-evaluation-design/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/design/ai-product-evaluation-design/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/ai-product-evaluation-design/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How ai-product-evaluation-design Compares

| Feature / Agent | ai-product-evaluation-design | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

SKILL.md Source

# AI Product Evaluation Design

In the era of LLMs, product development moves from writing static specifications to defining "correctness" through Evals. Since models are stochastic, you cannot "fix a bug" with a single line of code; instead, you must "hill climb" toward better behavior by building robust datasets that measure model performance against your product goals.

## The Three-Tier Evaluation Framework

Depending on the complexity of the feature, use one or more of these evaluation methods:

### 1. Deterministic Evals (Pass/Fail)
Best for extraction, tool-calling, or objective facts.
- **Goal:** Verify the model extracts the exact right data.
- **Example:** If the user says "Remind me to eat at 7 PM," the JSON output for `time` must be `19:00`.
- **Metric:** Accuracy % (Total correct / Total prompts).
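
A minimal sketch of how the accuracy metric above could be computed. The case list, the `fake_model` stand-in, and the single-field check on `time` are all assumptions made for illustration; a real harness would call your actual model and compare whichever fields matter for your feature.

```python
import json
from typing import Callable

# Hypothetical eval cases: (prompt, expected value of the "time" field).
CASES = [
    ("Remind me to eat at 7 PM", "19:00"),
    ("Remind me to stretch at 6:30 AM", "06:30"),
    ("Remind me to call the bank at noon", "12:00"),
]


def accuracy(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return total correct / total prompts for an exact match on `time`."""
    correct = 0
    for prompt, expected_time in cases:
        try:
            output = json.loads(model(prompt))
        except json.JSONDecodeError:
            continue  # Unparseable output counts as a failure.
        if isinstance(output, dict) and output.get("time") == expected_time:
            correct += 1
    return correct / len(cases) if cases else 0.0


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; it gets the 'noon' case wrong on purpose."""
    canned = {
        "Remind me to eat at 7 PM": '{"time": "19:00"}',
        "Remind me to stretch at 6:30 AM": '{"time": "06:30"}',
    }
    return canned.get(prompt, '{"time": "00:00"}')


print(f"Accuracy: {accuracy(fake_model, CASES):.1%}")
```

Treating malformed JSON as a failure rather than an error keeps the score honest: an output the product cannot parse is a miss from the user's point of view.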

### 2. Human Preference Evals (Side-by-Side)
Best for tone, creativity, and visual design (like the "Canvas" layout).
- **Goal:** Compare two model versions (e.g., a baseline vs. a new fine-tuned model).
- **Process:** Present a prompt and two anonymized completions. Ask a human rater: "Which is better for [Specific Goal]?"
- **Metric:** Win Rate (the percentage of comparisons in which the new model beats the baseline).
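
The side-by-side process could be sketched as below. `blind_pair` and `win_rate` are hypothetical helper names, and excluding ties from the denominator is one possible convention, not something the skill prescribes.

```python
import random
from collections import Counter


def blind_pair(baseline: str, new: str, rng: random.Random):
    """Shuffle the two completions so the rater cannot tell which model wrote which.

    Returns (shown, key): `shown` maps labels "A"/"B" to the text the rater sees,
    and `key` maps the same labels back to "baseline" or "new" for scoring.
    """
    pairs = [("baseline", baseline), ("new", new)]
    rng.shuffle(pairs)
    shown = {"A": pairs[0][1], "B": pairs[1][1]}
    key = {"A": pairs[0][0], "B": pairs[1][0]}
    return shown, key


def win_rate(votes: list[str]) -> float:
    """Share of decisive comparisons won by the new model.

    Each vote is "new", "baseline", or "tie". Ties are excluded here,
    which is an assumption of this sketch.
    """
    counts = Counter(votes)
    decisive = counts["new"] + counts["baseline"]
    return counts["new"] / decisive if decisive else 0.0


votes = ["new", "new", "baseline", "tie", "new"]
print(f"Win rate: {win_rate(votes):.0%}")
```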

### 3. Model-Graded Evals (LLM-as-a-Judge)
Best for scaling quality checks without manual labor.
- **Goal:** Use a high-reasoning model (like o1) to grade the output of a faster, cheaper model.
- **Process:** Give the "Judge" model the rubric of what a "good" response looks like and ask it to score the "Student" model on a scale of 1-5.
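
One way to wire up the judge step, shown as a sketch only. The prompt template, the `Score: N` reply format, and the `judge` callable are assumptions; `judge` stands in for whatever client you use to call the grading model, and no specific vendor API is implied.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; adapt the wording to your own rubric.
JUDGE_TEMPLATE = """You are grading an AI assistant's response.

Rubric:
{rubric}

User prompt:
{prompt}

Response to grade:
{response}

Reply with a single line in the form "Score: N", where N is an integer from 1 to 5."""


def build_judge_prompt(rubric: str, prompt: str, response: str) -> str:
    """Fill the template with the rubric and the Student model's output."""
    return JUDGE_TEMPLATE.format(rubric=rubric, prompt=prompt, response=response)


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract a 1-5 score, or None if the Judge did not follow the format."""
    match = re.search(r"Score:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def grade(judge: Callable[[str], str], rubric: str, prompt: str, response: str) -> Optional[int]:
    """Ask the Judge to score one Student response against the rubric."""
    return parse_score(judge(build_judge_prompt(rubric, prompt, response)))


# Usage with a stub Judge that always answers 4.
print(grade(lambda _: "Score: 4", "Be specific and encouraging.", "Rewrite this.", "Great start!"))
```

Returning `None` for malformed replies lets you track how often the Judge itself misbehaves instead of silently scoring those cases.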

## Step-by-Step Process for Designing Evals

### 1. Create the "Ground Truth" Dataset
Build a spreadsheet with the following columns to define the model's target behavior:
- **Input/Prompt:** What the user says (include diverse variations).
- **Baseline Behavior:** How the current model responds.
- **Ideal Behavior:** A hand-written "Golden Response" showing exactly what you want.
- **Rationale:** Why the ideal behavior is better (e.g., "The ideal response stays in chat instead of triggering the UI unnecessarily").
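
If you would rather keep the dataset in version control than in a spreadsheet, the same four columns can be written as CSV. This is a sketch: the `EvalRow` class, its field names, and the example row are illustrative assumptions.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class EvalRow:
    """One row of the ground-truth dataset (field names are illustrative)."""
    prompt: str
    baseline_behavior: str
    ideal_behavior: str
    rationale: str


def to_csv(rows: list[EvalRow]) -> str:
    """Serialize rows to CSV text with one column per dataclass field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(EvalRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


rows = [
    EvalRow(
        prompt="Who is the President?",
        baseline_behavior="Opens the document editor.",
        ideal_behavior="Answers in the standard chat interface.",
        rationale="A short factual question should not trigger the editor.",
    )
]
print(to_csv(rows))
```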

### 2. Define Decision Boundaries
For agentic features (like Canvas or Task execution), define the "Trigger Boundary":
- **Trigger Scenarios:** Prompt: "Write a 5-page essay." Result: Model opens the document editor.
- **Non-Trigger Scenarios:** Prompt: "Who is the President?" Result: Model stays in the standard chat interface.
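
A trigger boundary can be checked like any binary classifier. The sketch below assumes a hypothetical `should_trigger` callable that reports whether the model opened the editor for a given prompt; `naive_router` is a toy stand-in, not a recommended implementation.

```python
from typing import Callable

# (prompt, should the document editor open?)
BOUNDARY_CASES = [
    ("Write a 5-page essay.", True),
    ("Who is the President?", False),
]


def boundary_report(
    should_trigger: Callable[[str], bool],
    cases: list[tuple[str, bool]],
) -> dict[str, int]:
    """Count correct decisions, false triggers, and missed triggers."""
    report = {"correct": 0, "false_trigger": 0, "missed_trigger": 0}
    for prompt, expected in cases:
        actual = should_trigger(prompt)
        if actual == expected:
            report["correct"] += 1
        elif actual:
            report["false_trigger"] += 1  # Opened the editor when it should have stayed in chat.
        else:
            report["missed_trigger"] += 1  # Stayed in chat when it should have opened the editor.
    return report


def naive_router(prompt: str) -> bool:
    """Toy stand-in for the model's trigger decision."""
    return prompt.lower().startswith("write")


print(boundary_report(naive_router, BOUNDARY_CASES))
```

Separating false triggers from missed triggers matters because the two failure modes usually have different costs for the user.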

### 3. Identify Performance Regressions
When you optimize for one skill (e.g., "Being more concise"), you may accidentally "brain damage" another skill (e.g., "Formatting code correctly").
- Always run your new feature evals alongside a "General Intelligence" eval set to ensure core reasoning hasn't dropped.
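
A regression check can be as simple as comparing per-suite scores before and after a change. In this sketch the suite names, the scores, and the 0.01 tolerance are made-up values for illustration.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Return the suites whose score dropped by more than `tolerance`, with the drop."""
    regressions = {}
    for suite, old_score in baseline.items():
        new_score = candidate.get(suite)
        if new_score is None:
            continue  # Suite was not run for the candidate; nothing to compare.
        drop = old_score - new_score
        if drop > tolerance:
            regressions[suite] = round(drop, 4)
    return regressions


# Illustrative scores: conciseness improved, code formatting got worse.
baseline_scores = {"conciseness": 0.70, "code_formatting": 0.92, "general_reasoning": 0.85}
candidate_scores = {"conciseness": 0.81, "code_formatting": 0.84, "general_reasoning": 0.85}

print(find_regressions(baseline_scores, candidate_scores))
```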

## Examples

**Example 1: Deterministic Eval for a "Tasks" Tool**
- **Context:** An AI assistant that sets reminders.
- **Input:** "Remind me to call Mom in two hours." (Sent at 10:00 AM).
- **Expected Output:** `{ "action": "set_reminder", "content": "Call Mom", "time": "12:00" }`.
- **Application:** Run 100 variations of time-based language ("tonight," "in a bit," "next Tuesday") to ensure the extraction logic holds.
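
Relative phrases such as "in two hours" only have a correct answer once the send time is pinned down, so the eval needs a fixed reference clock. The sketch below assumes hypothetical helpers `expected_reminder` and `matches`, and uses an arbitrary date, since the example only specifies 10:00 AM.

```python
import json
from datetime import datetime, timedelta


def expected_reminder(content: str, sent_at: datetime, offset: timedelta) -> dict[str, str]:
    """Build the golden output for a relative-time reminder."""
    due = sent_at + offset
    return {"action": "set_reminder", "content": content, "time": due.strftime("%H:%M")}


def matches(model_output: str, expected: dict[str, str]) -> bool:
    """Pass only if the model's JSON equals the golden output exactly."""
    try:
        return json.loads(model_output) == expected
    except json.JSONDecodeError:
        return False


sent_at = datetime(2024, 1, 1, 10, 0)  # Arbitrary date; only the 10:00 AM time comes from the example.
expected = expected_reminder("Call Mom", sent_at, timedelta(hours=2))

print(expected)
print(matches('{"action": "set_reminder", "content": "Call Mom", "time": "12:00"}', expected))
```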

**Example 2: Preference Eval for Writing Style**
- **Context:** Improving the "friendly" tone of a document editor.
- **Input:** "Rewrite this paragraph to be more encouraging."
- **Model A:** "You did a good job on the report."
- **Model B:** "This report is a fantastic start! Your analysis of the data is really sharp."
- **Evaluation:** Human rater chooses Model B because it uses specific positive reinforcement instead of generic praise.

## Common Pitfalls

- **Measuring the Wrong Baseline:** Using a weak model as your baseline makes your new model look better than it actually is. Always test against the "state of the art" (SOTA).
- **Neglecting Diversity:** Training or testing only on "happy path" prompts. Include edge cases, slang, and non-English inputs to ensure the model doesn't fail in the wild.
- **The "Over-Refusal" Trap:** Training a model to be overly cautious can cause it to refuse valid requests (e.g., the "body paradox," where a model refuses to set an alarm because it "doesn't have a physical body").
- **Ignoring Latency:** A model that is 5% more accurate but 10x slower is often a net-negative for the user experience. Always include "Time to First Token" as an eval metric.
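
Time to First Token can be measured around any streaming response. This is a sketch under the assumption that your client exposes the response as an iterable of text chunks; `fake_stream` simulates one with a short delay.

```python
import time
from typing import Iterable, Iterator


def time_to_first_token(stream: Iterable[str]) -> tuple[float, str]:
    """Return (seconds until the first chunk arrived, full concatenated text)."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for chunk in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(chunk)
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    return first_token_at, "".join(parts)


def fake_stream(delay: float = 0.05) -> Iterator[str]:
    """Simulated streaming response that waits before the first chunk."""
    time.sleep(delay)
    yield "Hello"
    yield ", world"


ttft, text = time_to_first_token(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, text: {text!r}")
```

Recording this value next to each accuracy result makes the accuracy versus latency trade-off visible in the same report.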
