hypothesis-testing

Applies the scientific method to debugging by helping users form specific, testable hypotheses, design targeted experiments, and systematically confirm or reject theories to find root causes. Use when a user says their code isn't working, they're getting an error, something broke, they want to troubleshoot a bug, or they're trying to figure out what's causing an issue. Concrete actions include isolating failing components, forming and testing hypotheses, analyzing error messages, tracing execution paths, and interpreting test results to narrow down root causes.

737 stars

Best use case

hypothesis-testing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.


Teams using hypothesis-testing can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/hypothesis-testing/SKILL.md --create-dirs "https://raw.githubusercontent.com/rohitg00/skillkit/main/packages/core/src/methodology/packs/debugging/hypothesis-testing/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/hypothesis-testing/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How hypothesis-testing Compares

| Feature / Agent | hypothesis-testing | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

It applies the scientific method to debugging: it helps you form specific, testable hypotheses, design targeted experiments, and confirm or reject each theory until the root cause is found. Typical actions include isolating failing components, analyzing error messages, tracing execution paths, and interpreting test results.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.


SKILL.md Source

# Hypothesis-Driven Debugging

You are applying the scientific method to debugging. Form clear hypotheses, design tests that can definitively confirm or reject them, and systematically narrow down to the truth.

## Core Principle

**Every debugging action should test a specific hypothesis. Random changes are not debugging.**

## The Scientific Debugging Method

### 1. Observe - Gather Facts

Before forming hypotheses, collect observations:

- What exactly happens? (specific symptoms)
- When does it happen? (timing, frequency)
- Where does it happen? (environment, component)
- What changed recently? (code, config, data)

**Write down observations objectively:**
```
Observations:
- API returns 500 error on POST /orders
- Happens only when cart has > 10 items
- Started after deployment on 2024-01-15
- Works fine in staging environment
- Error logs show "connection refused" to inventory service
```

### 2. Hypothesize - Form Testable Theories

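State each hypothesis so that a single test could prove it wrong: name the component, the triggering condition, and the suspected mechanism.

**Examples (bad → good):**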
- ~~"Something is wrong with the network"~~ → "The inventory service connection pool is exhausted when processing orders with >10 items"
- ~~"There might be a race condition"~~ → "The order processing timeout (5s) is insufficient for large orders"

### 3. Predict - Define Expected Results

For each hypothesis, define what you expect to observe if it is true versus false:

```
Hypothesis: Connection pool exhausted for large orders

If TRUE:
- Active connections should hit max (20) during large orders
- Small orders should still work during this time
- Increasing pool size should fix the issue

If FALSE:
- Connection count stays well below max
- Small orders also fail during the issue
- Pool size change has no effect
```
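
One way to keep predictions honest is to write them down as checks before running any test, then compare the observations against them afterwards. The following is a minimal sketch of that idea; the `Observation`, `Prediction`, and `evaluate` names are hypothetical, and the pool limit of 20 is taken from the example above.

```typescript
// Hypothetical shapes for illustration; they are not part of the skill itself.
type Verdict = 'supported' | 'contradicted' | 'ambiguous';

interface Observation {
  peakActiveConnections: number;
  smallOrdersFailed: boolean;
}

interface Prediction {
  description: string;
  // True when the observation matches what a TRUE hypothesis predicts.
  holds: (obs: Observation) => boolean;
}

const MAX_POOL_SIZE = 20; // value from the example hypothesis above

const predictions: Prediction[] = [
  {
    description: 'Active connections hit max during large orders',
    holds: (obs) => obs.peakActiveConnections >= MAX_POOL_SIZE,
  },
  {
    description: 'Small orders still work during this time',
    holds: (obs) => !obs.smallOrdersFailed,
  },
];

// All predictions matching supports the hypothesis, none matching
// contradicts it, and a mix means the test was not conclusive.
function evaluate(preds: Prediction[], obs: Observation): Verdict {
  const matched = preds.filter((p) => p.holds(obs)).length;
  if (matched === preds.length) return 'supported';
  if (matched === 0) return 'contradicted';
  return 'ambiguous';
}
```

A mixed result is reported as `ambiguous` rather than forced into a verdict, since a partial match usually means the hypothesis or the test needs refining.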

### 4. Test - Experiment Systematically

Design tests that can definitively confirm or reject the hypothesis:

```
Test Plan for Connection Pool Hypothesis:

1. Add connection pool monitoring
   - Log active connections before/after each request
   - Expected if true: Count reaches 20 during failures

2. Artificial stress test
   - Send 5 large orders simultaneously
   - Expected if true: Failures start when pool exhausted

3. Increase pool size to 50
   - Repeat stress test
   - Expected if true: Failures stop or threshold moves

4. Control test with small orders
   - Send 20 small orders simultaneously
   - Expected if true: No failures (faster processing)
```
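
The shape of this plan (stress run, changed-variable run, control run) can be rehearsed on a toy model before touching a real system. The sketch below is only an illustration under loud assumptions: `FakePool` and `simulateBurst` are hypothetical, and the model assumes each large order holds 11 connections at once while a small order holds 1. A real service would need real monitoring instead.

```typescript
// Toy model: every name and number here is a hypothetical stand-in.
class FakePool {
  private active = 0;
  private readonly max: number;
  peak = 0;

  constructor(max: number) {
    this.max = max;
  }

  // Returns false instead of throwing when the pool is exhausted.
  acquire(): boolean {
    if (this.active >= this.max) return false;
    this.active += 1;
    this.peak = Math.max(this.peak, this.active);
    return true;
  }
}

interface BurstResult {
  failedOrders: number;
  peakActive: number;
}

// Simulates `orderCount` orders arriving at once, each needing
// `connectionsPerOrder` connections held at the same time.
function simulateBurst(
  poolSize: number,
  orderCount: number,
  connectionsPerOrder: number,
): BurstResult {
  const pool = new FakePool(poolSize);
  let failedOrders = 0;
  for (let order = 0; order < orderCount; order++) {
    let ok = true;
    for (let c = 0; c < connectionsPerOrder; c++) {
      if (!pool.acquire()) {
        ok = false;
        break;
      }
    }
    if (!ok) failedOrders += 1;
  }
  return { failedOrders, peakActive: pool.peak };
}

const stress = simulateBurst(20, 5, 11); // step 2: five large orders
const largerPool = simulateBurst(50, 5, 11); // step 3: same load, bigger pool
const control = simulateBurst(20, 20, 1); // step 4: twenty small orders
```

Under these assumptions, the stress run fails four orders, the larger pool fails one (the threshold moves), and the control run fails none. Only one variable changes between runs, which is what lets each result count for or against the hypothesis.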

### 5. Analyze - Interpret Results

After testing:

- Did results match predictions for TRUE or FALSE?
- Are results conclusive or ambiguous?
- Do results suggest a different hypothesis?

```
Results:
- Connection count reached 20/20 during failures ✓
- Small orders succeeded during same period ✓
- Pool size increase to 50 → failures stopped ✓

Conclusion: Hypothesis CONFIRMED
Connection pool exhaustion is the proximate cause.

New question: Why do large orders exhaust the pool?
New hypothesis: Large orders make multiple inventory calls per item
```

## Hypothesis Tracking Template

```markdown
## Bug: [Description]

### Hypothesis 1: [Theory]
**Status:** Testing | Confirmed | Rejected
**Probability:** High | Medium | Low

**Evidence For:**
- [Evidence 1]
- [Evidence 2]

**Evidence Against:**
- [Evidence 1]

**Test Plan:**
1. [Test 1] - Expected result if true
2. [Test 2] - Expected result if false

**Test Results:**
- [Result 1]: [Supports/Contradicts]
- [Result 2]: [Supports/Contradicts]

**Conclusion:** [Confirmed/Rejected] because [reasoning]

---

### Hypothesis 2: [Next Theory]
...
```
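
If hypotheses are tracked in code rather than by hand, the template can be generated from a typed record. This is a rough sketch covering only the header and evidence sections; `HypothesisRecord` and `renderHypothesis` are hypothetical names, not something the skill provides.

```typescript
type Status = 'Testing' | 'Confirmed' | 'Rejected';
type Probability = 'High' | 'Medium' | 'Low';

// Hypothetical record mirroring the fields of the template above.
interface HypothesisRecord {
  theory: string;
  status: Status;
  probability: Probability;
  evidenceFor: string[];
  evidenceAgainst: string[];
}

// Renders one hypothesis as markdown in the template's layout.
function renderHypothesis(index: number, h: HypothesisRecord): string {
  const bullets = (items: string[]): string =>
    items.length > 0 ? items.map((item) => `- ${item}`).join('\n') : '- (none yet)';
  return [
    `### Hypothesis ${index}: ${h.theory}`,
    `**Status:** ${h.status}`,
    `**Probability:** ${h.probability}`,
    '',
    '**Evidence For:**',
    bullets(h.evidenceFor),
    '',
    '**Evidence Against:**',
    bullets(h.evidenceAgainst),
  ].join('\n');
}

const example = renderHypothesis(1, {
  theory: 'Connection pool exhausted for large orders',
  status: 'Testing',
  probability: 'High',
  evidenceFor: ['Error logs show "connection refused" to inventory service'],
  evidenceAgainst: [],
});
```

Restricting `status` and `probability` to union types keeps entries consistent with the values the template allows.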

## Testing Techniques by Hypothesis Type

### Testing Timing Hypotheses
```typescript
// Add timing instrumentation
const start = performance.now();
await suspectedSlowOperation();
const duration = performance.now() - start;
console.log(`Operation took ${duration}ms`);
// Hypothesis confirmed if duration > expected
```
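
A single timing sample can be skewed by warm-up or a one-off stall, so it may help to take several samples and compare the median against the expected budget. The helpers below are a sketch of that approach; `median`, `collectDurations`, and `exceedsBudget` are hypothetical names, and the run count and budget are for the caller to choose.

```typescript
// Median of a non-empty list of samples (less sensitive to outliers than the mean).
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error('No samples collected');
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Runs `operation` sequentially `runs` times and returns each duration in ms.
async function collectDurations(
  operation: () => Promise<unknown>,
  runs: number,
): Promise<number[]> {
  const durations: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await operation();
    durations.push(performance.now() - start);
  }
  return durations;
}

// The timing hypothesis is supported when the median exceeds the budget.
function exceedsBudget(durations: number[], budgetMs: number): boolean {
  return median(durations) > budgetMs;
}
```

For example, samples of 100 ms, 120 ms, and 5000 ms against a 200 ms budget do not support a "this operation is consistently slow" hypothesis, because the median is 120 ms. The outlier may deserve a hypothesis of its own.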

### Testing Data Hypotheses
```typescript
// Validate data at key points
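function processWithValidation(data: { id?: unknown; items?: unknown[]; total?: unknown }) {
  // console.assert only logs on failure; it does not throw or stop execution
  console.assert(data.id != null, 'Missing id');
  console.assert((data.items?.length ?? 0) > 0, 'Empty items');
  console.assert(typeof data.total === 'number', 'Invalid total');
  // If any assertion logs a failure, the data hypothesis is likely true
}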
```

### Testing State Hypotheses
```typescript
// Snapshot state before and after (only JSON-serializable state is captured)
const stateBefore = JSON.stringify(currentState);
suspectedStateMutation();
const stateAfter = JSON.stringify(currentState);
if (stateBefore !== stateAfter) {
  // Hypothesis supported: the call mutated state
  console.log('State changed:', { before: stateBefore, after: stateAfter });
}
```

## Decision Tree

```
Is the hypothesis testable?
├── NO → Refine it to be more specific
└── YES → Can I test it without side effects?
    ├── NO → Design a safe test (staging, logs-only)
    └── YES → Run the test
        └── Results conclusive?
            ├── NO → Design a better test
            └── YES → Hypothesis confirmed or rejected?
                ├── CONFIRMED → Root cause found?
                │   ├── YES → Fix and verify
                │   └── NO → Form next hypothesis (why?)
                └── REJECTED → Form next hypothesis
```
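
The tree above can also be read as a pure function from the current state of an investigation to the next action. The sketch below assumes a hypothetical `InvestigationState` shape in which fields stay `undefined` until they are known. The returned strings are the tree's own labels.

```typescript
// Hypothetical state shape; optional fields are undefined until known.
interface InvestigationState {
  testable: boolean;
  safeToRun: boolean;
  conclusive?: boolean; // undefined until the test has been run
  confirmed?: boolean;
  rootCauseFound?: boolean;
}

// Walks the decision tree top to bottom and returns the next action.
function nextAction(state: InvestigationState): string {
  if (!state.testable) return 'Refine it to be more specific';
  if (!state.safeToRun) return 'Design a safe test (staging, logs-only)';
  if (state.conclusive === undefined) return 'Run the test';
  if (!state.conclusive) return 'Design a better test';
  if (!state.confirmed) return 'Form next hypothesis';
  return state.rootCauseFound ? 'Fix and verify' : 'Form next hypothesis (why?)';
}
```

For the worked example earlier, a confirmed pool-exhaustion hypothesis with no root cause yet gives `'Form next hypothesis (why?)'`, which matches the follow-up question about why large orders exhaust the pool.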

## Integration with Other Skills

- **root-cause-analysis**: Hypothesis testing is a key technique within RCA
- **trace-and-isolate**: Use tracing to gather evidence for hypotheses
- **testing/red-green-refactor**: Write a failing test that reproduces the bug before fixing it

Related Skills

testing-anti-patterns

737
from rohitg00/skillkit

Reviews test code to identify and fix common testing anti-patterns including flaky tests, over-mocking, brittle assertions, test interdependency, and hidden test logic. Flags bad patterns, explains the specific defect, and provides corrected implementations. Use when reviewing test code, debugging intermittent or unreliable test failures, or when the user mentions flaky tests, test smells, brittle tests, test isolation issues, mock overuse, slow tests, or test maintenance problems.

find-skills

737
from rohitg00/skillkit

Discovers, searches, and installs skills from multiple AI agent skill marketplaces (400K+ skills) using the SkillKit CLI. Supports browsing official partner collections (Anthropic, Vercel, Supabase, Stripe, and more) and community repositories, searching by domain or technology, and installing specific skills from GitHub. Use when the user wants to find, browse, or install new agent skills, plugins, extensions, or add-ons; asks 'is there a skill for X' or 'find a skill for X'; wants to explore a skill store or marketplace; needs to extend agent capabilities in areas like React, testing, DevOps, security, or APIs; or says 'browse skills', 'search skill marketplace', 'install a skill', or 'what skills are available'.

test-patterns

737
from rohitg00/skillkit

Applies proven testing patterns — Arrange-Act-Assert (AAA), Given-When-Then, Test Data Builders, Object Mother, parameterized tests, fixtures, spies, and test doubles — to help write maintainable, reliable, and readable test suites. Use when the user asks about writing unit tests, integration tests, or end-to-end tests; structuring test cases or test suites; applying TDD or BDD practices; working with mocks, stubs, spies, or fakes; improving test coverage or reducing flakiness; or needs guidance on test organization, naming conventions, or assertions in frameworks like Jest, Vitest, pytest, or similar.

red-green-refactor

737
from rohitg00/skillkit

Guides the red-green-refactor TDD workflow: write a failing test first, implement the minimum code to make it pass, then refactor while keeping tests green. Use when a user asks to practice TDD, write tests first, follow red-green-refactor, do test-driven development, write failing tests before code, or phrases like 'make the test pass', 'test coverage', or 'unit tests before implementation'.

verification-gates

737
from rohitg00/skillkit

Creates explicit validation checkpoints (verification gates) between project phases to catch errors early and ensure quality before proceeding. Use when the user asks about quality gates, milestone checks, phase transitions, approval steps, go/no-go decision points, or preventing cascading errors across a multi-step workflow. Produces acceptance criteria checklists, automated CI gate configurations, manual sign-off requirements, and conditional review rules for scenarios such as security changes, API changes, or database migrations.

task-decomposition

737
from rohitg00/skillkit

Breaks down complex software, writing, or research tasks into small, atomic, independently completable units with dependency graphs and milestone breakdowns. Use when the user asks to plan a project, decompose a feature, create subtasks, split up work, or needs help organizing a large piece of work into a step-by-step plan. Triggered by phrases like "break down", "decompose", "where do I start", "too big", "split into tasks", "work breakdown", or "task list".

design-first

737
from rohitg00/skillkit

Guides the creation of technical design documents before writing code, producing architecture diagrams, data models, API interface definitions, implementation plans, and multi-option trade-off analyses. Use when the user asks to plan a feature, architect a system, design an API, explore implementation approaches, or requests a technical design or spec before coding — especially for complex features involving multiple components, ambiguous requirements, or significant architectural changes.

skill-authoring

737
from rohitg00/skillkit

Creates and structures SKILL.md files for AI coding agents, including YAML frontmatter, trigger phrases, directive instructions, decision trees, code examples, and verification checklists. Use when the user asks to write a new skill, create a skill file, author agent capabilities, generate skill documentation, or define a skill template for Claude Code agents.

trace-and-isolate

737
from rohitg00/skillkit

Applies systematic tracing and isolation techniques to pinpoint exactly where a bug originates in code. Use when a bug is hard to locate, code is not working as expected, an error or crash appears with unclear cause, a regression was introduced between recent commits, or you need to narrow down which component, function, or line is faulty. Covers binary search debugging, git bisect for regressions, strategic logging with [TRACE] patterns, data and control flow tracing, component isolation, minimal reproduction cases, conditional breakpoints, and watch expressions across TypeScript, SQL, and bash.

root-cause-analysis

737
from rohitg00/skillkit

Performs systematic root cause analysis to identify the true source of bugs, errors, and unexpected behavior through structured investigation phases — not just treating symptoms. Use when a user reports a bug, crash, error, or broken behavior and needs to debug, troubleshoot, or investigate why something is not working; especially for complex or intermittent issues across multiple components. Applies the Five Whys method, hypothesis-driven testing, stack trace analysis, git blame/log evidence gathering, and causal chain documentation to isolate and confirm root causes before applying any fix.

structured-code-review

737
from rohitg00/skillkit

Performs a structured five-stage code review covering requirements compliance, correctness, code quality, testing, and security/performance. Each stage uses targeted checklists and categorized feedback (Blocker/Major/Minor/Nit) with actionable suggestions and rationale. Use when the user asks for code review, PR feedback, pull request review, or wants their code checked for bugs, style issues, or vulnerabilities — triggered by phrases like "review my code", "check this PR", "review my changes", "pull request review", or "code feedback".

parallel-investigation

737
from rohitg00/skillkit

Coordinates parallel investigation threads to simultaneously explore multiple hypotheses or root causes across different system areas. Use when debugging production incidents, slow API performance, multi-system integration failures, or complex bugs where the root cause is unclear and multiple plausible theories exist; when serial troubleshooting is too slow; or when multiple investigators can divide root-cause analysis work. Provides structured phases for problem decomposition, thread assignment, sync points with Continue/Pivot/Converge decisions, and final report synthesis.