flaky-detect

Identify flaky tests from CI history and test execution patterns. Use when debugging intermittent test failures, auditing test reliability, or improving CI stability.

104 stars

Best use case

flaky-detect is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

It is a strong fit for teams already working in Codex.

Teams using flaky-detect can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/flaky-detect/SKILL.md --create-dirs "https://raw.githubusercontent.com/jmagly/aiwg/main/.agents/skills/flaky-detect/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/flaky-detect/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How flaky-detect Compares

| Feature / Agent | flaky-detect | Standard Approach |
|-----------------|--------------|-------------------|
| Platform Support | Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

Which AI agents support this skill?

This skill is designed for Codex.

Where can I find the source code?

The source is in the jmagly/aiwg repository on GitHub, at .agents/skills/flaky-detect/SKILL.md (the same file the installation command above downloads).

SKILL.md Source

# Flaky Detect Skill

## Purpose

Identify flaky tests (tests that pass and fail non-deterministically without any code change) by analyzing CI history, execution patterns, and test characteristics. According to the Google data cited below, about 4.56% of test failures were caused by flaky tests, so at that scale the cost in re-runs and investigation time is substantial.

## Research Foundation

| Finding | Source | Reference |
|---------|--------|-----------|
| 4.56% flaky rate | Google (2016) | [Flaky Tests at Google](https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html) |
| ML Classification | FlaKat (2024) | [arXiv:2403.01003](https://arxiv.org/abs/2403.01003) - 85%+ accuracy |
| LLM Auto-repair | FlakyFix (2023) | [arXiv:2307.00012](https://arxiv.org/html/2307.00012v4) |
| Flaky Taxonomy | Luo et al. (2014) | "An Empirical Analysis of Flaky Tests" |

## When This Skill Applies

- User reports "tests sometimes fail" or "intermittent failures"
- CI has been unstable or unreliable
- User wants to audit test suite reliability
- Pre-release quality assessment
- Debugging non-deterministic behavior

## Trigger Phrases

| Natural Language | Action |
|------------------|--------|
| "Find flaky tests" | Analyze CI history for flaky patterns |
| "Why does CI keep failing?" | Identify flaky tests causing failures |
| "Test suite is unreliable" | Full flaky test audit |
| "This test sometimes passes" | Analyze specific test for flakiness |
| "Audit test reliability" | Comprehensive flaky detection |
| "Quarantine flaky tests" | Identify and isolate flaky tests |

## Flaky Test Taxonomy (Google Research)

| Category | Percentage | Root Causes |
|----------|------------|-------------|
| **Async/Timing** | 45% | Race conditions, insufficient waits, timeouts |
| **Test Order** | 20% | Shared state, execution order dependencies |
| **Environment** | 15% | File system, network, configuration differences |
| **Resource Limits** | 10% | Memory, threads, connection pools |
| **Non-deterministic** | 10% | Random values, timestamps, UUIDs |

## Detection Methods

### 1. CI History Analysis

Parse GitHub Actions / CI logs to find inconsistent results:

```python
from collections import defaultdict


def analyze_ci_history(repo, days=30):
    """Analyze CI runs for flaky patterns.

    Assumes a helper get_ci_runs(repo, days), not defined here, that returns
    runs whose .tests entries expose .name and .passed.
    """
    runs = get_ci_runs(repo, days)
    test_results = defaultdict(lambda: {"pass": 0, "fail": 0})

    # Tally passes and failures per test across all runs
    for run in runs:
        for test in run.tests:
            outcome = "pass" if test.passed else "fail"
            test_results[test.name][outcome] += 1

    # Identify flaky tests (pass rate strictly between 5% and 95%)
    flaky = []
    for test, results in test_results.items():
        total = results["pass"] + results["fail"]
        if total >= 5:  # Enough data
            pass_rate = results["pass"] / total
            if 0.05 < pass_rate < 0.95:
                flaky.append({
                    "test": test,
                    "pass_rate": pass_rate,
                    "total_runs": total
                })

    # Least reliable tests first
    return sorted(flaky, key=lambda x: x["pass_rate"])
```
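
A pass rate between 5% and 95% can also come from a test that was genuinely broken on some commits and fixed on others. A stricter signal is a test that both passed and failed on the same commit. The following is a minimal sketch of that check, assuming each CI result can be reduced to a hypothetical `(commit_sha, test_name, passed)` tuple; the source does not define this shape, and `find_same_commit_flakes` is an illustrative name.

```python
from collections import defaultdict


def find_same_commit_flakes(results):
    """Return test names that both passed and failed on at least one commit.

    `results` is an iterable of (commit_sha, test_name, passed) tuples.
    This tuple shape is an assumption made for illustration.
    """
    outcomes = defaultdict(set)  # (commit_sha, test_name) -> set of observed outcomes
    for commit_sha, test_name, passed in results:
        outcomes[(commit_sha, test_name)].add(bool(passed))

    flaky = {
        test_name
        for (_, test_name), seen in outcomes.items()
        if seen == {True, False}
    }
    return sorted(flaky)


sample = [
    ("abc123", "test_login", True),
    ("abc123", "test_login", False),   # same commit, different outcome: flaky
    ("abc123", "test_signup", True),
    ("def456", "test_signup", False),  # different commit: may be a real regression
]
print(find_same_commit_flakes(sample))
```

Here only `test_login` is reported, because `test_signup` changed outcome across commits rather than within one.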

### 2. Code Pattern Analysis

Scan test code for flaky patterns:

```python
import re
from pathlib import Path

# Each entry: (regex, category, description).
# These are heuristics, so expect false positives.
FLAKY_PATTERNS = [
    # Timing issues
    (r'setTimeout|sleep|delay', "timing", "Uses explicit delays"),
    (r'Date\.now\(\)|new Date\(\)', "timing", "Uses current time"),

    # Async issues (checked line by line)
    (r'^(?!.*\b(?:await|return)\b).*\.then\(', "async", "Promise chain neither awaited nor returned"),
    (r'\basync\b(?!.*\bawait\b)', "async", "Async with no await on the same line"),

    # Non-deterministic values
    (r'Math\.random\(\)', "random", "Uses random values"),
    (r'uuid|nanoid', "random", "Uses generated IDs"),

    # Environment
    (r'process\.env', "environment", "Environment-dependent"),
    (r'fs\.(read|write)', "environment", "File system access"),
    (r'fetch\(|axios\.|http\.', "network", "Network calls"),
]

def scan_for_flaky_patterns(test_file):
    """Scan a test file and return every heuristic pattern that matches."""
    content = Path(test_file).read_text(encoding="utf-8")
    matches = []

    for pattern, category, description in FLAKY_PATTERNS:
        # MULTILINE lets ^ anchor at each line start for the line-based checks
        if re.search(pattern, content, re.MULTILINE):
            matches.append({
                "category": category,
                "description": description,
                "pattern": pattern
            })

    return matches
```
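
The report format later in this document identifies tests by file and line, so a scanner that records where each pattern matched is more useful than a yes/no answer per file. The following is a sketch of a line-by-line variant, using a shortened pattern list and a hypothetical function name `scan_lines`. It takes the source text directly so it can be tried without a file on disk.

```python
import re

# Shortened, illustrative pattern list: (compiled regex, category)
LINE_PATTERNS = [
    (re.compile(r"Date\.now\(\)|new Date\(\)"), "timing"),
    (re.compile(r"Math\.random\(\)"), "random"),
    (re.compile(r"process\.env"), "environment"),
]


def scan_lines(source):
    """Return (line_number, category, stripped_line) for every matching line."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for regex, category in LINE_PATTERNS:
            if regex.search(line):
                hits.append((line_number, category, line.strip()))
    return hits


example = """it('expires', () => {
  const expiry = Date.now() + 3600000;
  const id = Math.random();
});"""

for hit in scan_lines(example):
    print(hit)
```

Line numbers start at 1, which matches how editors and the `file:line` references in the report count.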

### 3. Re-run Analysis

Run tests multiple times to detect flakiness:

```bash
# Run tests 10 times, track results
for i in {1..10}; do
  npm test -- --reporter=json >> test-results.jsonl
done

# Analyze for inconsistency
python analyze_reruns.py test-results.jsonl
```
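
The source does not show `analyze_reruns.py`. The following is a sketch of what such a script might do, assuming each line of the JSONL file is one run encoded as a JSON object that maps a test name to `"passed"` or `"failed"`. That record shape is an assumption; real reporter output differs by test runner and would need to be converted first.

```python
import json
from collections import defaultdict


def inconsistent_tests(jsonl_text):
    """Return {test_name: (passes, failures)} for tests with mixed outcomes.

    Assumed input: one JSON object per line, mapping test name to
    "passed" or "failed". This is a hypothetical format.
    """
    counts = defaultdict(lambda: [0, 0])  # test_name -> [passes, failures]
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        for test_name, status in json.loads(line).items():
            counts[test_name][0 if status == "passed" else 1] += 1

    # Keep only tests that both passed and failed across the reruns
    return {
        name: (passes, failures)
        for name, (passes, failures) in counts.items()
        if passes and failures
    }


runs = "\n".join([
    '{"login expiry": "passed", "db connect": "passed"}',
    '{"login expiry": "failed", "db connect": "passed"}',
    '{"login expiry": "passed", "db connect": "passed"}',
])
print(inconsistent_tests(runs))
```

A test that fails in every run is excluded on purpose: it is consistently broken, not flaky.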

## Output Format

````markdown
## Flaky Test Report

**Analysis Period**: Last 30 days
**Total Tests**: 450
**Flaky Tests Found**: 12 (2.7%)

### Critical Flaky Tests (< 50% pass rate)

#### 1. `test/api/login.test.ts:45`
**Pass Rate**: 42% (21/50 runs)
**Category**: Timing
**Pattern**: Uses `Date.now()` for token expiry

```typescript
// Flaky code
it('should expire token after 1 hour', () => {
  const token = createToken();
  const expiry = Date.now() + 3600000;  // Flaky!
  expect(token.expiresAt).toBe(expiry);
});
```

**Root Cause**: `createToken()` and the test each read the clock separately. The assertion passes only when both reads land in the same millisecond.

**Recommended Fix**: Use mocked time
```typescript
it('should expire token after 1 hour', () => {
  vi.useFakeTimers();
  vi.setSystemTime(new Date('2024-01-01T00:00:00Z'));
  const token = createToken();
  expect(token.expiresAt).toBe(new Date('2024-01-01T01:00:00Z').getTime());
  vi.useRealTimers();
});
```

### High Flaky Tests (50-80% pass rate)

#### 2. `test/db/connection.test.ts:23`
**Pass Rate**: 68% (34/50 runs)
**Category**: Resource
**Pattern**: Connection pool exhaustion

[... more tests ...]

### Summary by Category

| Category | Count | Impact |
|----------|-------|--------|
| Timing | 5 | HIGH |
| Async | 3 | HIGH |
| Environment | 2 | MEDIUM |
| Order | 1 | MEDIUM |
| Network | 1 | LOW |

### Recommendations

1. **Quick Win**: Fix 5 timing tests with `vi.setSystemTime()` (+0.5% stability)
2. **Medium Effort**: Add proper async handling (+0.3% stability)
3. **Infrastructure**: Add test isolation for DB tests (+0.2% stability)

### Quarantine Candidates

These tests should be skipped in CI until fixed:

```typescript
// vitest.config.ts
import { configDefaults, defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    exclude: [
      ...configDefaults.exclude,      // Keep Vitest's default excludes
      'test/api/login.test.ts',       // Timing flaky
      'test/db/connection.test.ts',   // Resource flaky
    ],
  },
});
```

**Note**: Track quarantined tests in `.aiwg/testing/flaky-quarantine.md`
````

## Quarantine Process

### 1. Identify

```bash
# Run flaky detection
python scripts/flaky_detect.py --ci-history 30 --threshold 95
```
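
Once pass rates are available, the report groups tests into tiers: "Critical" below 50% and "High" from 50% to 80%. The following is a small sketch of that bucketing. The `severity` function name, the inclusive upper bound at 80%, and the "Moderate" label for everything above it are assumptions, since the report template does not specify them.

```python
def severity(pass_rate):
    """Map a pass rate in [0.0, 1.0] to a report tier.

    "Critical" and "High" follow the report headings; "Moderate" and the
    inclusive 80% boundary are assumed for illustration.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0.0 and 1.0")
    if pass_rate < 0.50:
        return "Critical"
    if pass_rate <= 0.80:
        return "High"
    return "Moderate"


# The two sample tests from the report template
print(severity(21 / 50))  # login.test.ts, 42% pass rate
print(severity(34 / 50))  # connection.test.ts, 68% pass rate
```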

### 2. Quarantine

```javascript
// Mark test as flaky
describe.skip('flaky: login expiry', () => {
  // FLAKY: https://github.com/org/repo/issues/123
  // Root cause: timing-dependent
  // Fix in progress: PR #456
});
```

### 3. Track

Create tracking issue:
```markdown
## Flaky Test: test/api/login.test.ts:45

- **Pass Rate**: 42%
- **Category**: Timing
- **Root Cause**: Uses real system time
- **Quarantined**: 2024-12-12
- **Fix PR**: #456
- **Target Unquarantine**: 2024-12-15
```

### 4. Fix and Unquarantine

After fix:
```bash
# Verify fix with multiple runs; stop at the first failure
for i in {1..20}; do
  npm test -- test/api/login.test.ts || { echo "Run $i failed"; break; }
done

# Remove from quarantine only if all 20 runs pass
```
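
Twenty clean runs lower the chance that the test is still flaky but do not eliminate it. If a test fails independently with probability `f` on each run, the chance that `n` consecutive runs all pass is `(1 - f)^n`. The following is a short sketch of that arithmetic, assuming runs are independent (shared CI state can violate this); both function names are illustrative.

```python
import math


def prob_all_pass(failure_rate, runs):
    """Probability that `runs` independent runs all pass."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, confidence=0.95):
    """Smallest number of runs that exposes at least one failure
    with probability >= confidence, for a given per-run failure rate."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


print(round(prob_all_pass(0.05, 20), 3))
print(runs_needed(0.05))
```

For a test that still fails 5% of the time, 20 runs all pass with probability of about 0.358, and it takes 59 runs to see a failure with 95% confidence. A clean 20-run loop is therefore good evidence against frequent flakiness, but weak evidence against rare flakiness.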

## Integration Points

- Works with `flaky-fix` skill for automated repairs
- Reports to CI dashboard
- Feeds into `/flow-gate-check` for release decisions
- Tracks in `.aiwg/testing/flaky-registry.md`

## Script Reference

### flaky_detect.py
Analyze CI history for flaky tests:
```bash
python scripts/flaky_detect.py --repo owner/repo --days 30
```

### flaky_scanner.py
Scan code for flaky patterns:
```bash
python scripts/flaky_scanner.py --target test/
```

## References

- @$AIWG_ROOT/agentic/code/addons/testing-quality/README.md — Testing quality addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/research-before-decision.md — Research-first approach for root cause analysis
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC framework context for quality gates
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference

Related Skills

All from jmagly/aiwg:

  • research-gap-detect: Build the mutual citation graph, find connected components, identify isolated clusters, and optionally search for bridge candidates and file gap issues. Automates the manual cluster analysis workflow.
  • prose-detect: Locate an existing OpenProse installation using a prioritized signal chain: env var, AIWG config, AIWG-local install, project plugin manifest, user home directory, or global CLI. Returns the resolved PROSE_ROOT path. Does not install OpenProse; triggers prose-setup if no installation is found.
  • flaky-fix: Suggest and apply fixes for flaky tests based on detected patterns. Use after flaky-detect identifies unreliable tests that need repair.
  • ai-pattern-detection: Detects AI-generated writing patterns and suggests authentic alternatives. Auto-applies when reviewing content, editing documents, generating text, or when user mentions writing quality, AI detection, authenticity, or natural voice.
  • aiwg-orchestrate: Route structured artifact work to AIWG workflows via MCP with zero parent context cost.
  • venv-manager: Create, manage, and validate Python virtual environments. Use for project isolation and dependency management.
  • pytest-runner: Execute Python tests with pytest, supporting fixtures, markers, coverage, and parallel execution. Use for Python test automation.
  • vitest-runner: Execute JavaScript/TypeScript tests with Vitest, supporting coverage, watch mode, and parallel execution. Use for JS/TS test automation.
  • eslint-checker: Run ESLint for JavaScript/TypeScript code quality and style enforcement. Use for static analysis and auto-fixing.
  • repo-analyzer: Analyze GitHub repositories for structure, documentation, dependencies, and contribution patterns. Use for codebase understanding and health assessment.
  • pr-reviewer: Review GitHub pull requests for code quality, security, and best practices. Use for automated PR feedback and approval workflows.
  • YouTube Acquisition: yt-dlp patterns for acquiring content from YouTube and video platforms.