ab-test-calculator

Calculate statistical significance for A/B tests. Sample size estimation, power analysis, and conversion rate comparisons with confidence intervals.

16 stars

Best use case

ab-test-calculator is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using ab-test-calculator can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/ab-test-calculator/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/testing-security/ab-test-calculator/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/ab-test-calculator/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How ab-test-calculator Compares

| Feature / Agent | ab-test-calculator | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Calculate statistical significance for A/B tests. Sample size estimation, power analysis, and conversion rate comparisons with confidence intervals.

Where can I find the source code?

The source code is on GitHub in the diegosouzapw/awesome-omni-skill repository, under skills/testing-security/ab-test-calculator.

SKILL.md Source

# A/B Test Calculator

Statistical significance testing for A/B experiments with power analysis and sample size estimation.

## Features

- **Significance Testing**: Chi-square, Z-test, T-test for conversions
- **Sample Size Estimation**: Calculate required samples for desired power
- **Power Analysis**: Determine test power given sample size
- **Confidence Intervals**: Calculate CIs for conversion rates
- **Multiple Variants**: Support A/B/n testing
- **Bayesian Analysis**: Probability to beat baseline

## Quick Start

```python
from ab_test_calc import ABTestCalculator

calc = ABTestCalculator()

# Test significance
result = calc.test_significance(
    control_visitors=10000,
    control_conversions=500,
    variant_visitors=10000,
    variant_conversions=550
)

print(f"Significant: {result['significant']}")
print(f"P-value: {result['p_value']:.4f}")
print(f"Lift: {result['lift']:.2%}")
```

## CLI Usage

```bash
# Test significance
python ab_test_calc.py --test 10000 500 10000 550

# Calculate sample size
python ab_test_calc.py --sample-size --baseline 0.05 --mde 0.10 --power 0.8

# Power analysis
python ab_test_calc.py --power-analysis --baseline 0.05 --mde 0.10 --samples 5000

# Bayesian analysis
python ab_test_calc.py --bayesian 10000 500 10000 550

# Multiple variants
python ab_test_calc.py --test-multi 10000 500 10000 550 10000 520
```

## API Reference

### ABTestCalculator Class

```python
# Signature overview; method bodies are elided with `...`.
class ABTestCalculator:
    def __init__(self, alpha: float = 0.05) -> None: ...

    # Significance testing
    def test_significance(self, control_visitors: int, control_conversions: int,
                         variant_visitors: int, variant_conversions: int,
                         test: str = "chi_square") -> dict: ...

    # Sample size calculation
    def calculate_sample_size(self, baseline_rate: float,
                             minimum_detectable_effect: float,
                             power: float = 0.8,
                             alpha: float = 0.05) -> dict: ...

    # Power analysis
    def calculate_power(self, baseline_rate: float,
                       minimum_detectable_effect: float,
                       sample_size: int,
                       alpha: float = 0.05) -> dict: ...

    # Confidence interval
    def confidence_interval(self, visitors: int, conversions: int,
                           confidence: float = 0.95) -> dict: ...

    # Bayesian analysis
    def bayesian_analysis(self, control_visitors: int, control_conversions: int,
                         variant_visitors: int, variant_conversions: int,
                         simulations: int = 100000) -> dict: ...

    # Multiple variants
    def test_multiple_variants(self, control: tuple, variants: list,
                              correction: str = "bonferroni") -> dict: ...

    # Duration estimation
    def estimate_duration(self, daily_visitors: int, baseline_rate: float,
                         minimum_detectable_effect: float,
                         power: float = 0.8) -> dict: ...
```
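The reference above lists `confidence_interval` without saying which interval it computes. As a rough illustration only, the sketch below uses a Wilson score interval, one common choice for conversion rates. The function name `wilson_interval` and the returned keys are assumptions made for this example, not the skill's actual implementation.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(visitors: int, conversions: int, confidence: float = 0.95) -> dict:
    """Wilson score interval for a conversion rate (illustrative sketch)."""
    if visitors <= 0 or not 0 <= conversions <= visitors:
        raise ValueError("need visitors > 0 and 0 <= conversions <= visitors")
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = conversions / visitors
    denom = 1 + z**2 / visitors
    center = (p + z**2 / (2 * visitors)) / denom
    half_width = z * sqrt(p * (1 - p) / visitors + z**2 / (4 * visitors**2)) / denom
    return {"rate": p, "lower": center - half_width, "upper": center + half_width}


ci = wilson_interval(10000, 500)
print(f"{ci['rate']:.2%} [{ci['lower']:.2%}, {ci['upper']:.2%}]")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] even for very low conversion rates.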

## Test Methods

### Chi-Square Test (Default)
Best for comparing conversion rates between groups.

```python
result = calc.test_significance(
    control_visitors=10000,
    control_conversions=500,
    variant_visitors=10000,
    variant_conversions=550,
    test="chi_square"
)
```

### Z-Test for Proportions
Good for large sample sizes.

```python
result = calc.test_significance(
    control_visitors=10000,
    control_conversions=500,
    variant_visitors=10000,
    variant_conversions=550,
    test="z_test"
)
```
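To make the two methods above concrete, here is a minimal standard-library sketch of a pooled two-proportion z-test. For two groups, the squared z statistic equals the Pearson chi-square statistic without continuity correction, so both tests give the same two-sided p-value. Whether the skill applies a continuity correction is not stated, so the function `two_proportion_z_test` and its return keys are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(control_visitors, control_conversions,
                          variant_visitors, variant_conversions, alpha=0.05):
    """Pooled two-sided z-test for a difference in conversion rates (sketch)."""
    p_c = control_conversions / control_visitors
    p_v = variant_conversions / variant_visitors
    # Pooled rate under the null hypothesis that both groups convert equally
    pooled = (control_conversions + variant_conversions) / (control_visitors + variant_visitors)
    se = sqrt(pooled * (1 - pooled) * (1 / control_visitors + 1 / variant_visitors))
    if se == 0:
        raise ValueError("standard error is zero; no variation in the data")
    z = (p_v - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "z": z,
        "p_value": p_value,
        "significant": p_value < alpha,
        "lift": (p_v - p_c) / p_c,  # relative lift over control
    }


res = two_proportion_z_test(10000, 500, 10000, 550)
print(f"z={res['z']:.3f}, p={res['p_value']:.4f}, lift={res['lift']:.2%}")
```

For 500 versus 550 conversions out of 10,000 visitors each, this sketch yields a p-value of roughly 0.11, so a 10% relative lift at this sample size does not reach significance at alpha = 0.05.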

## Sample Size Estimation

Calculate the number of visitors needed per variant:

```python
result = calc.calculate_sample_size(
    baseline_rate=0.05,          # Current conversion rate (5%)
    minimum_detectable_effect=0.10,  # 10% relative improvement
    power=0.8,                   # 80% power
    alpha=0.05                   # 5% significance level
)

# Returns:
{
    "sample_size_per_variant": 31234,
    "total_sample_size": 62468,
    "baseline_rate": 0.05,
    "expected_variant_rate": 0.055,
    "minimum_detectable_effect": 0.10,
    "power": 0.8,
    "alpha": 0.05
}
```
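The document does not show how the per-variant figure is derived. One standard normal-approximation formula for comparing two proportions is sketched below; with the inputs above it comes out at about 31,200 per variant, in line with the example output. The helper name `sample_size_per_variant` is hypothetical, and the skill may use a different formula (for example via statsmodels).

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, minimum_detectable_effect,
                            power=0.8, alpha=0.05):
    """Visitors needed per variant for a two-sided test of two proportions (sketch)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + minimum_detectable_effect)  # MDE is relative
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


n = sample_size_per_variant(0.05, 0.10)
print(f"per variant: {n}, total: {2 * n}")
```

Because the required sample grows with the inverse square of the absolute difference, halving the detectable effect roughly quadruples the visitors needed.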

## Power Analysis

Calculate the probability of detecting an effect:

```python
result = calc.calculate_power(
    baseline_rate=0.05,
    minimum_detectable_effect=0.10,
    sample_size=25000,
    alpha=0.05
)

# Returns:
{
    "power": 0.72,
    "interpretation": "72% chance of detecting the effect if it exists"
}
```
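Power can be approximated by inverting the same normal-approximation formula used for sample size. The sketch below is one such approximation, not necessarily what the skill computes; `approximate_power` is a hypothetical name. For the inputs above it gives roughly 0.71, close to the example output.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(baseline_rate, minimum_detectable_effect,
                      sample_size, alpha=0.05):
    """Approximate power of a two-sided two-proportion test (sketch).

    sample_size is the number of visitors per variant.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + minimum_detectable_effect)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se_null = sqrt(2 * p_bar * (1 - p_bar) / sample_size)        # under H0
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / sample_size)  # under H1
    # Probability that the observed difference clears the critical threshold;
    # the negligible chance of rejecting in the wrong direction is ignored.
    return NormalDist().cdf((abs(p2 - p1) - z_alpha * se_null) / se_alt)


power = approximate_power(0.05, 0.10, 25000)
print(f"power: {power:.2f}")
```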

## Bayesian Analysis

Get probability that variant beats control:

```python
result = calc.bayesian_analysis(
    control_visitors=10000,
    control_conversions=500,
    variant_visitors=10000,
    variant_conversions=550
)

# Returns (illustrative values; Monte Carlo output varies between runs):
{
    "prob_variant_better": 0.9523,
    "prob_control_better": 0.0477,
    "expected_lift": 0.098,
    "credible_interval_95": [-0.03, 0.23]
}
```
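A common way to obtain such probabilities is Monte Carlo sampling from Beta posteriors. The sketch below assumes a uniform Beta(1, 1) prior on each conversion rate and uses only the standard library. The prior, the function name `prob_variant_beats_control`, and the return keys are assumptions, since the document does not describe the skill's model.

```python
import random


def prob_variant_beats_control(control_visitors, control_conversions,
                               variant_visitors, variant_conversions,
                               simulations=20000, seed=0):
    """Monte Carlo estimate of P(variant rate > control rate) (sketch)."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    wins = 0
    lifts = []
    for _ in range(simulations):
        # Posterior under a Beta(1, 1) prior: Beta(1 + successes, 1 + failures)
        c = rng.betavariate(1 + control_conversions,
                            1 + control_visitors - control_conversions)
        v = rng.betavariate(1 + variant_conversions,
                            1 + variant_visitors - variant_conversions)
        wins += v > c
        lifts.append(v / c - 1)  # relative lift for this draw
    lifts.sort()
    return {
        "prob_variant_better": wins / simulations,
        "expected_lift": sum(lifts) / simulations,
        # Equal-tailed 95% credible interval for the relative lift
        "credible_interval_95": [lifts[int(0.025 * simulations)],
                                 lifts[int(0.975 * simulations) - 1]],
    }


bayes = prob_variant_beats_control(10000, 500, 10000, 550)
print(bayes)
```

An equal-tailed 95% credible interval for the lift excludes zero only when the probability of the variant winning is above roughly 97.5%, which gives a quick consistency check on any reported output.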

## Multiple Variant Testing

Test multiple variants with correction for multiple comparisons:

```python
result = calc.test_multiple_variants(
    control=(10000, 500),          # (visitors, conversions)
    variants=[
        (10000, 550),              # Variant A
        (10000, 520),              # Variant B
        (10000, 480)               # Variant C
    ],
    correction="bonferroni"        # or "holm", "none"
)

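# Returns (illustrative; an uncorrected two-proportion test on 500 vs. 550
# conversions out of 10,000 gives p of about 0.11, which is not significant):
{
    "control": {"visitors": 10000, "conversions": 500, "rate": 0.05},
    "variants": [
        {"visitors": 10000, "conversions": 550, "rate": 0.055,
         "lift": 0.10, "p_value": 0.113, "significant": False},
        ...
    ],
    "winner": None,  # no variant is significant here; exact representation is an assumption
    "correction_method": "bonferroni"
}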
```
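The three correction options named above can be sketched as follows. Bonferroni compares every p-value against alpha divided by the number of comparisons, while Holm steps through the sorted p-values with a progressively looser threshold and stops at the first failure. The helper `adjust_significance` is a hypothetical name for this illustration and is not part of the documented API.

```python
def adjust_significance(p_values, alpha=0.05, correction="bonferroni"):
    """Return one significance flag per p-value after correction (sketch)."""
    m = len(p_values)
    if correction == "none":
        return [p < alpha for p in p_values]
    if correction == "bonferroni":
        return [p < alpha / m for p in p_values]
    if correction == "holm":
        significant = [False] * m
        # Visit p-values from smallest to largest, remembering original positions
        order = sorted(range(m), key=lambda i: p_values[i])
        for rank, idx in enumerate(order):
            if p_values[idx] < alpha / (m - rank):
                significant[idx] = True
            else:
                break  # once one test fails, all larger p-values fail too
        return significant
    raise ValueError(f"unknown correction: {correction!r}")


p_values = [0.01, 0.02, 0.04]  # made-up p-values for three variants
print(adjust_significance(p_values, correction="bonferroni"))
print(adjust_significance(p_values, correction="holm"))
```

Holm never rejects fewer hypotheses than Bonferroni at the same alpha, which is why it is often preferred when several variants are tested against one control.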

## Output Format

### Significance Test Result

```python
{
    "significant": True,
    "p_value": 0.0234,
    "control_rate": 0.05,
    "variant_rate": 0.055,
    "lift": 0.10,
    "lift_absolute": 0.005,
    "confidence_interval": {
        "lower": 0.02,
        "upper": 0.18
    },
    "test_method": "chi_square",
    "alpha": 0.05,
    "recommendation": "Variant shows significant improvement"
}
```

## Example Workflows

### Pre-Test Planning

```python
calc = ABTestCalculator()

# 1. Estimate required sample size
sample = calc.calculate_sample_size(
    baseline_rate=0.03,     # Current 3% conversion
    minimum_detectable_effect=0.15,  # Want to detect 15% lift
    power=0.8
)
print(f"Need {sample['sample_size_per_variant']} visitors per variant")

# 2. Estimate test duration
duration = calc.estimate_duration(
    daily_visitors=5000,
    baseline_rate=0.03,
    minimum_detectable_effect=0.15
)
print(f"Test will take ~{duration['days']} days")
```
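The duration step above reduces to simple arithmetic once the per-variant sample size is known. The sketch below assumes every daily visitor enters the experiment and traffic is split evenly across variants. The helper `estimate_days` and the sample size of 24,000 are hypothetical values chosen for the example.

```python
from math import ceil


def estimate_days(sample_size_per_variant, daily_visitors, variants=2):
    """Days needed to collect the full sample across all variants (sketch)."""
    if daily_visitors <= 0:
        raise ValueError("daily_visitors must be positive")
    total_needed = sample_size_per_variant * variants
    return ceil(total_needed / daily_visitors)


days = estimate_days(24000, 5000)  # hypothetical per-variant sample size
print(f"Test will take ~{days} days")
```

In practice, tests are often rounded up to whole weeks so that weekday and weekend traffic are represented equally.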

### Post-Test Analysis

```python
calc = ABTestCalculator()

# 1. Test significance
result = calc.test_significance(
    control_visitors=15000,
    control_conversions=450,
    variant_visitors=15000,
    variant_conversions=525
)

# 2. Get Bayesian probability
bayes = calc.bayesian_analysis(15000, 450, 15000, 525)

print(f"P-value: {result['p_value']:.4f}")
print(f"Lift: {result['lift']:.2%}")
print(f"Probability variant wins: {bayes['prob_variant_better']:.1%}")
```

## Dependencies

- scipy>=1.10.0
- numpy>=1.24.0
- statsmodels>=0.14.0

Related Skills

agent-test-automator

16
from diegosouzapw/awesome-omni-skill

Expert test automation engineer specializing in building robust test frameworks, CI/CD integration, and comprehensive test coverage. Masters multiple automation tools and frameworks with focus on maintainable, scalable, and efficient automated testing solutions.

agent-penetration-tester

16
from diegosouzapw/awesome-omni-skill

Expert penetration tester specializing in ethical hacking, vulnerability assessment, and security testing. Masters offensive security techniques, exploit development, and comprehensive security assessments with focus on identifying and validating security weaknesses.

agent-accessibility-tester

16
from diegosouzapw/awesome-omni-skill

Expert accessibility tester specializing in WCAG compliance, inclusive design, and universal access. Masters screen reader compatibility, keyboard navigation, and assistive technology integration with focus on creating barrier-free digital experiences.

add-unit-tests

16
from diegosouzapw/awesome-omni-skill

Guide for adding unit tests to AReaL. Use when user wants to add tests for new functionality or increase test coverage.

accessibility-testing

16
from diegosouzapw/awesome-omni-skill

WCAG compliance testing and accessibility quality assurance workflows for iOS apps. Use when validating accessibility labels, testing VoiceOver compatibility, checking contrast ratios, or ensuring WCAG 2.1 compliance. Covers accessibility tree analysis, semantic validation, and automated accessibility testing patterns.

accessibility-tester

16
from diegosouzapw/awesome-omni-skill

Expert accessibility tester specializing in WCAG compliance, inclusive design, and universal access. Masters screen reader compatibility, keyboard navigation, and assistive technology integration with focus on creating barrier-free digital experiences.

accessibility-test-axe

16
from diegosouzapw/awesome-omni-skill

Expert in a11y testing. Use for axe-core, automated testing, and accessibility audits.

acceptance-tester

16
from diegosouzapw/awesome-omni-skill

Execute systematic acceptance testing to verify implementations against acceptance criteria. Use this skill when tasks mention "驗收測試", "acceptance testing", "驗收", "validate implementation", or when Gherkin scenarios need to be executed.

acceptance-test-writing

16
from diegosouzapw/awesome-omni-skill

Guide for writing high-quality acceptance criteria and acceptance tests using industry-standard BDD (Behavior-Driven Development) and ATDD (Acceptance Test-Driven Development) practices. Use this skill when creating acceptance criteria for user stories, writing Gherkin scenarios, or implementing acceptance test specifications following Given-When-Then format.

acceptance-test-driven-development

16
from diegosouzapw/awesome-omni-skill

Write acceptance tests before unit tests to ensure you're building the right thing

acc-testing-knowledge

16
from diegosouzapw/awesome-omni-skill

Testing knowledge base for PHP 8.5 projects. Provides testing pyramid, AAA pattern, naming conventions, isolation principles, DDD testing guidelines, and PHPUnit patterns.

acc-detect-test-smells

16
from diegosouzapw/awesome-omni-skill

Detects test antipatterns and code smells in PHP test suites. Identifies 15 smells (Logic in Test, Mock Overuse, Fragile Tests, Mystery Guest, etc.) with fix recommendations and refactoring patterns for testability.