experiment-design

A/B testing and experimentation workflow: hypothesis design, metric selection, sample size calculation, statistical significance, common pitfalls (peeking, SRM, novelty effect), and experiment lifecycle. Complements feature-flags (implementation) with statistical rigor.

Best use case

experiment-design is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using experiment-design should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/experiment-design/SKILL.md --create-dirs "https://raw.githubusercontent.com/marvinrichter/clarc/main/skills/experiment-design/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/experiment-design/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How experiment-design Compares

| Feature / Agent | experiment-design | Standard Approach |
|-----------------|-------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

SKILL.md Source

# Experiment Design

> **Scope**: The *what* and *why* of experiments — statistical methodology, hypothesis design, and result analysis.
> For *how* to implement the flag that powers the experiment, see [feature-flags](../feature-flags/SKILL.md).

## When to Activate

- Designing an A/B test for a product change
- Choosing the right success metric for an experiment
- Calculating required sample size before launching
- Analyzing experiment results and deciding ship/kill
- Spotting and fixing a flawed experiment (SRM, peeking, novelty)
- Setting up an experimentation platform

---

## The Experiment Lifecycle

```
1. Hypothesis     →  What change, what metric, what direction?
2. Metric choice  →  Primary metric + guardrail metrics
3. Sample size    →  How many users, how long?
4. Launch         →  Random assignment via feature flag
5. Monitor        →  SRM check, guardrail watch (don't peek at results)
6. Analyze        →  Statistical significance, practical significance
7. Decision       →  Ship, iterate, or kill
8. Document       →  Record results for future reference
```

---

## Step 1: Hypothesis

A good hypothesis has three parts:

```
IF we [change],
THEN [metric] will [increase/decrease],
BECAUSE [mechanism].
```

**Good:**
> If we reduce checkout steps from 5 to 3, then checkout completion rate will increase, because fewer steps means less friction for users who abandon mid-flow.

**Bad:**
> If we redesign checkout, then users will like it more.
> *(No specific metric, no mechanism, "like" is unmeasurable)*

---

## Step 2: Metric Selection

### Primary metric (one only)
The single number that determines ship/kill. Choose the metric closest to the behavior you're changing:

| Change type | Good primary metric |
|-------------|-------------------|
| Checkout flow | Checkout completion rate |
| Email subject line | Open rate |
| Onboarding | Day-7 retention |
| Search relevance | Click-through rate |
| Pricing page | Plan upgrade rate |

### Guardrail metrics (2-5)
Metrics that must not degrade significantly:
- Revenue per user (always)
- Error rate / p99 latency (always)
- Support ticket rate
- Churn rate

A significant guardrail regression = kill the experiment, even if primary metric wins.

### Metric types

| Type | Formula | Notes |
|------|---------|-------|
| Conversion rate | converted / exposed | Binary, use z-test |
| Mean | sum / count | Continuous, use t-test |
| Ratio | numerator / denominator | Use delta method for variance |
| Retention | active at day N / cohort | Use survival analysis |
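
The "delta method" row can be made concrete with a short sketch. It assumes the ratio metric is built from per-user totals (for example clicks and page views per user) and that users are the randomization unit; the function name `ratio_metric_with_se` and the sample data are hypothetical, not part of the skill.

```python
import math

def ratio_metric_with_se(numerators, denominators):
    """Ratio of totals (e.g. clicks / page views) with a delta-method standard error.

    Each index is one randomization unit (user): numerators[i] and
    denominators[i] are that user's totals.
    """
    n = len(numerators)
    if n != len(denominators) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 users")

    mean_num = sum(numerators) / n
    mean_den = sum(denominators) / n
    ratio = mean_num / mean_den

    # Sample variances and covariance across users (n - 1 denominator).
    var_num = sum((x - mean_num) ** 2 for x in numerators) / (n - 1)
    var_den = sum((y - mean_den) ** 2 for y in denominators) / (n - 1)
    cov = sum((x - mean_num) * (y - mean_den)
              for x, y in zip(numerators, denominators)) / (n - 1)

    # First-order Taylor (delta method) approximation of Var(ratio).
    var_ratio = (var_num - 2 * ratio * cov + ratio ** 2 * var_den) / (n * mean_den ** 2)
    return ratio, math.sqrt(max(var_ratio, 0.0))

clicks = [1, 0, 3, 2, 0, 1]   # hypothetical per-user clicks
views = [4, 2, 10, 5, 3, 4]   # hypothetical per-user page views
ratio, se = ratio_metric_with_se(clicks, views)
print(f"CTR = {ratio:.3f} ± {1.96 * se:.3f}")
```

Treating every page view as an independent observation would understate the variance, because views from the same user are correlated; the delta method keeps the user as the unit of analysis.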

---

## Step 3: Sample Size Calculation

Required inputs:
- **Baseline rate** (current conversion rate, e.g. 5%)
- **MDE** (minimum detectable effect: the smallest change worth shipping, e.g. +10% relative takes a 5% baseline to 5.5%)
- **α** (false positive rate, default 0.05)
- **Power** (1 - false negative rate, default 0.80)

### Quick formula (two-proportion z-test)

```python
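import math
from statistics import NormalDist

def sample_size_per_variant(p_baseline, mde_relative, alpha=0.05, power=0.80):
    """
    Users needed per variant for a two-tailed two-proportion z-test.

    p_baseline: current conversion rate (e.g. 0.05 for 5%)
    mde_relative: minimum detectable effect as relative change (e.g. 0.10 for +10%)
    alpha: false positive rate (two-tailed)
    power: 1 - false negative rate
    """
    p_treatment = p_baseline * (1 + mde_relative)
    p_pooled = (p_baseline + p_treatment) / 2

    # Critical values follow the alpha and power arguments
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha=0.05 two-tailed
    z_power = NormalDist().inv_cdf(power)          # 0.84 for power=0.80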

    numerator = (z_alpha * math.sqrt(2 * p_pooled * (1 - p_pooled)) +
                 z_power * math.sqrt(p_baseline * (1 - p_baseline) +
                                     p_treatment * (1 - p_treatment))) ** 2
    denominator = (p_treatment - p_baseline) ** 2

    return math.ceil(numerator / denominator)

# Example: 5% baseline, need to detect +10% relative improvement
n = sample_size_per_variant(0.05, 0.10)
print(f"Need {n:,} users per variant ({n*2:,} total)")
# → Need ~31,000 per variant
```

### Runtime estimation

```
Days to run = sample_size_per_variant / (daily_exposed_users / num_variants)
```

**Rules:**
- Run for at least **1 full business cycle** (typically 2 weeks) regardless of significance
- Never end early because it looks good (peeking problem)
- Never extend because it "almost" reached significance
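
As a rough sketch, the runtime formula and the minimum-duration rule can be combined in one helper. The name `days_to_run` and the 14-day default are assumptions based on the "typically 2 weeks" guidance above; adjust `min_days` to your own business cycle.

```python
import math

def days_to_run(sample_size_per_variant, daily_exposed_users,
                num_variants=2, min_days=14):
    """Estimate runtime in whole days, never shorter than min_days."""
    if daily_exposed_users <= 0 or num_variants < 2:
        raise ValueError("need positive traffic and at least 2 variants")
    users_per_variant_per_day = daily_exposed_users / num_variants
    days_for_sample = math.ceil(sample_size_per_variant / users_per_variant_per_day)
    return max(days_for_sample, min_days)

print(days_to_run(31_000, 3_000))   # 31,000 / 1,500 per day -> 21 days
print(days_to_run(31_000, 20_000))  # about 4 days of traffic, raised to the 14-day minimum
```

Fixing the end date from this estimate before launch is what makes the "never end early" and "never extend" rules enforceable.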

---

## Step 4: Random Assignment

Use your feature flag system for assignment. Critical requirements:

1. **Consistent hashing** — same user always gets same variant
2. **Unit of randomization** matches unit of analysis:
   - User-level metric → randomize by user_id
   - Session-level metric → randomize by session_id (only if users don't span sessions)
3. **Independent across experiments** — use different hash seeds per experiment
4. **Log assignment** — record which variant each user received at assignment time

```typescript
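// murmurhash3 is assumed to come from your hashing library and to return a 32-bit integer.
function assignVariant(userId: string, experimentId: string): 'control' | 'treatment' {
  // Including experimentId in the key gives each experiment an independent split.
  const hash = murmurhash3(`${experimentId}:${userId}`);
  const bucket = (hash >>> 0) % 100; // 0-99; >>> 0 coerces to unsigned so the bucket is never negative
  return bucket < 50 ? 'control' : 'treatment';
}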
```
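
The first and third requirements (consistent assignment, independence across experiments) can be checked empirically. The sketch below is a Python illustration that swaps in SHA-256 from the standard library in place of murmurhash3; the experiment names, user ids, and the `assign_variant` helper are hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically map (experiment, user) to 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode("utf-8")).digest()
    # First 8 bytes as an unsigned integer, reduced to 10,000 buckets.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"

users = [f"user-{i}" for i in range(10_000)]
first = [assign_variant(u, "checkout-v2") for u in users]
again = [assign_variant(u, "checkout-v2") for u in users]
other = [assign_variant(u, "pricing-page") for u in users]

print("stable across calls:", first == again)
print("treatment share:", first.count("treatment") / len(users))
print("agreement with another experiment:",
      sum(a == b for a, b in zip(first, other)) / len(users))
```

If assignments are independent across experiments, agreement between two experiments should sit near 50%; a value far from that suggests the experiments share a hash key.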

---

## Step 5: SRM Check (Sample Ratio Mismatch)

Before analyzing results, verify your assignment split is as expected:

```python
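from scipy.stats import chisquare

def check_srm(control_count, treatment_count, expected_split=0.5):
    """Chi-square goodness-of-fit test of observed counts against the planned split."""
    total = control_count + treatment_count
    expected_control = total * expected_split
    expected_treatment = total * (1 - expected_split)

    # Goodness of fit (1 degree of freedom): the expected counts are a
    # reference distribution, not a second sample, so a contingency-table
    # test does not apply here.
    _, p = chisquare(
        f_obs=[control_count, treatment_count],
        f_exp=[expected_control, expected_treatment],
    )
    return p  # p < 0.01 → SRM detected, do not trust results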

p = check_srm(control_count=10234, treatment_count=9456)
if p < 0.01:
    print(f"⚠ SRM detected (p={p:.4f}) — investigate before analyzing")
```

SRM causes: logging bugs, bot traffic, cache serving stale variants, redirect chains.

---

## Step 6: Result Analysis

### Two-proportion z-test (conversion rates)

```python
from statsmodels.stats.proportion import proportions_ztest
import numpy as np

def analyze_experiment(control_conversions, control_n,
                       treatment_conversions, treatment_n):
    count = np.array([treatment_conversions, control_conversions])
    nobs  = np.array([treatment_n, control_n])

    z_stat, p_value = proportions_ztest(count, nobs)

    p_control   = control_conversions / control_n
    p_treatment = treatment_conversions / treatment_n
    lift        = (p_treatment - p_control) / p_control

    print(f"Control:   {p_control:.3%}")
    print(f"Treatment: {p_treatment:.3%}")
    print(f"Lift:      {lift:+.2%}")
    print(f"p-value:   {p_value:.4f} ({'significant' if p_value < 0.05 else 'not significant'})")

    return p_value < 0.05 and lift > 0
```

### 95% confidence interval (treatment rate)

```python
from statsmodels.stats.proportion import proportion_confint

lo, hi = proportion_confint(
    treatment_conversions, treatment_n, alpha=0.05, method='wilson'
)
print(f"95% CI for treatment rate: [{lo:.3%}, {hi:.3%}]")
```
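
The interval above covers the treatment rate alone. For a ship/kill decision, an interval on the difference between variants is often more useful. A minimal sketch using only the standard library follows; it uses the simple Wald (normal approximation) interval, and the function name `diff_confint` and the counts are hypothetical.

```python
import math
from statistics import NormalDist

def diff_confint(control_conversions, control_n,
                 treatment_conversions, treatment_n, alpha=0.05):
    """Wald confidence interval for (treatment rate - control rate)."""
    p_c = control_conversions / control_n
    p_t = treatment_conversions / treatment_n
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treatment_n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_t - p_c
    return diff - z * se, diff + z * se

# Hypothetical counts, for illustration only.
lo, hi = diff_confint(1_550, 31_000, 1_710, 31_000)
print(f"95% CI for absolute lift: [{lo:+.3%}, {hi:+.3%}]")
```

If the interval excludes zero, the result agrees with a significant two-tailed test at the same alpha; the width of the interval shows how precisely the lift is known, which a p-value alone does not.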

---

## Common Pitfalls

| Pitfall | Description | Fix |
|---------|-------------|-----|
| **Peeking** | Stopping early when you see p < 0.05 | Pre-commit to end date; use sequential testing if you need early stopping |
| **SRM** | Assignment split doesn't match expectation | Always run SRM check before analysis |
| **Novelty effect** | Users behave differently just because it's new | Run for at least 2 weeks; analyze long-term cohort |
| **Multiple testing** | Testing 5 independent metrics at α = 0.05 gives a ~23% chance that at least one shows p < 0.05 by chance alone | Pre-register one primary metric; apply Bonferroni for secondary |
| **Network effects** | User A's treatment affects user B (social features) | Use cluster randomization (by household, team, etc.) |
| **Carryover** | User saw control, now in treatment and remembers | Washout period; exclude switchers |
| **Simpson's paradox** | Aggregate shows win, but loses in every segment | Segment by platform/country/plan before reporting |
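
The multiple-testing row can be illustrated with a few lines of arithmetic. This sketch assumes the metrics are independent, which real metrics rarely are exactly; the metric names and p-values are hypothetical.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests

def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant at the Bonferroni-adjusted threshold."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}

print(f"FWER for 5 metrics: {familywise_error_rate(5):.1%}")  # 22.6%

# Hypothetical secondary-metric p-values.
secondary = {"add_to_cart": 0.004, "avg_order_value": 0.03, "bounce_rate": 0.20}
print(bonferroni_significant(secondary))
```

With three secondary metrics the adjusted threshold is 0.05 / 3 ≈ 0.0167, so a p-value of 0.03 no longer counts as significant.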

---

## Decision Framework

| Result | Action |
|--------|--------|
| Significant positive primary, no guardrail regressions | Ship |
| Significant positive primary, guardrail regression | Do not ship; investigate the regression, then fix and re-test |
| Significant negative primary | Kill |
| Not significant | Ship only if there's a strong qualitative reason; otherwise kill |
| SRM detected | Fix assignment bug, restart experiment |
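
The table above can be encoded as a small function so that the decision rule is fixed before results arrive. This is one possible reading, not a definitive implementation: checking SRM first is an assumption (the table lists it last), and the function name `decide` and its parameters are hypothetical.

```python
def decide(primary_p, primary_lift, guardrail_regressions, srm_p,
           alpha=0.05, srm_alpha=0.01):
    """Map experiment results to an action, following the decision table.

    guardrail_regressions: names of guardrail metrics that regressed significantly.
    """
    # SRM invalidates everything downstream, so it is checked first (assumption).
    if srm_p < srm_alpha:
        return "restart: fix assignment bug"
    significant = primary_p < alpha
    if significant and primary_lift < 0:
        return "kill"
    if significant and guardrail_regressions:
        return "investigate regression, fix and re-test"
    if significant:
        return "ship"
    return "kill unless there is a strong qualitative reason to ship"

print(decide(primary_p=0.01, primary_lift=0.08, guardrail_regressions=[], srm_p=0.40))
print(decide(primary_p=0.01, primary_lift=0.08,
             guardrail_regressions=["p99 latency"], srm_p=0.40))
```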

---

## Experiment Documentation Template

```markdown
## Experiment: [Name]

**Hypothesis**: If [change], then [metric] will [direction], because [mechanism]
**Primary metric**: [metric name] (baseline: X%)
**MDE**: [X]% relative
**Guardrail metrics**: [list]
**Sample size**: [N per variant]
**Runtime**: [start] → [end] (N days)

### Results
| Metric | Control | Treatment | Lift | p-value | Significant? |
|--------|---------|-----------|------|---------|--------------|
| [primary] | | | | | |
| [guardrail 1] | | | | | |

**SRM check**: p=[X] ✓ / ✗
**Decision**: [Ship / Kill / Iterate]
**Learnings**: [What did we learn?]
```

---

## Related

- [feature-flags](../feature-flags/SKILL.md) — implementation of flag-based assignment
- [analytics-workflow](../analytics-workflow/SKILL.md) — tracking experiment exposure and conversion events
- `/experiment` command — walk through designing a new experiment
