root-cause-analysis

Find the true source, not symptoms — systematic debugging from observation to permanent fix

Best use case

root-cause-analysis is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using root-cause-analysis can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/root-cause-analysis/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/development/root-cause-analysis/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/root-cause-analysis/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How root-cause-analysis Compares

Feature / Agent	root-cause-analysis	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Find the true source, not symptoms — systematic debugging from observation to permanent fix

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Root Cause Analysis Skill

> If you fixed it but it came back, you fixed a symptom.

## Core Principle

Every symptom has a cause. Every cause has a deeper cause. Keep digging until you reach something you can *prevent*, not just fix.

## 5 Whys — Extended Example

| # | Question | Answer |
| - | -------- | ------ |
| 1 | Why did the page crash? | JavaScript threw a TypeError on null |
| 2 | Why was the value null? | The API returned an empty response |
| 3 | Why did the API return empty? | The database query timed out |
| 4 | Why did the query time out? | Missing index on a 10M-row table |
| 5 | Why was the index missing? | No performance review in the PR process |

**Root cause**: A process gap (no performance review in the PR process), not the missing index itself.

**Fix the system**: Add a performance checklist to the PR template in addition to adding the index.

### 5 Whys Traps

| Trap | Example | How to Avoid |
| ---- | ------- | ------------ |
| Stopping at human error | "Dev forgot to add the index" | Ask *why was it possible to forget?* |
| Single chain only | Only follow one branch | Branch at each Why if multiple causes |
| Speculation without evidence | "Probably because of..." | Each answer must have evidence |
| Going too deep | Why #12: "Because physics" | Stop when you reach an actionable system change |
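
The "each answer must have evidence" rule can be checked mechanically. The sketch below is one possible way to record a 5 Whys chain and flag unsupported answers; the `Why` class, the `unsupported_steps` helper, and the evidence strings are hypothetical illustrations, not part of this skill.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One level of a 5 Whys chain (hypothetical structure)."""
    question: str
    answer: str
    evidence: str = ""  # e.g. a log excerpt, metric, or commit hash


def unsupported_steps(chain):
    """Return the 1-based positions of answers with no evidence attached."""
    return [i for i, step in enumerate(chain, start=1) if not step.evidence.strip()]


# Illustrative chain; the evidence values are placeholders.
chain = [
    Why("Why did the page crash?", "JavaScript threw a TypeError on null", "console stack trace"),
    Why("Why was the value null?", "The API returned an empty response", "captured HTTP response"),
    Why("Why did the API return empty?", "The database query timed out", ""),
]

print(unsupported_steps(chain))  # [3]
```

Any position the helper returns is speculation until someone attaches evidence to it.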

## Cause Categories

| Category | Common Patterns | Investigation Tools |
| -------- | --------------- | ------------------- |
| Code | Null reference, off-by-one, race condition, type mismatch | Debugger, unit tests, static analysis |
| Data | Corrupt input, unexpected format, encoding issues | Query logs, data validation, sample inspection |
| Infrastructure | Disk full, memory exhaustion, network partition | Metrics dashboards, health endpoints, `top`/`df` |
| Dependencies | Breaking change, version mismatch, transitive conflict | Lockfile diff, changelog review, `npm ls` |
| Configuration | Wrong env var, feature flag state, missing secret | Config diff, environment comparison |
| Process | Missing review, unclear ownership, no runbook | Post-mortem patterns, team interviews |

## Investigation Techniques

### Binary Search Debugging

When you don't know where the bug is, halve the search space:

1. Identify the last known good state (commit, deploy, timestamp)
2. `git bisect` between good and bad
3. Each step: does the bug exist? Yes → mark `git bisect bad` (search earlier). No → mark `git bisect good` (search later).
4. Result: the exact commit that introduced the bug.
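
The halving logic in the steps above can be sketched in a few lines. This is a minimal illustration, not how `git bisect` is implemented: `first_bad`, the commit labels, and the `is_bad` check are hypothetical. It assumes the first commit is known good, the last is known bad, and the bug stays present once it is introduced.

```python
def first_bad(commits, is_bad):
    """Return the index of the first commit where is_bad(commit) is true.

    Assumes commits[0] is good, commits[-1] is bad, and the bug never
    disappears again after it first appears.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # bug present: the culprit is at mid or earlier
        else:
            lo = mid  # bug absent: the culprit is later
    return hi


# Hypothetical history in which the bug was introduced at "c5".
commits = [f"c{i}" for i in range(8)]
culprit = first_bad(commits, lambda c: int(c[1:]) >= 5)
print(commits[culprit])  # c5
```

With 8 commits this takes 3 checks rather than up to 7. If the "does the bug exist?" check can be scripted, `git bisect run` can drive the same search automatically.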

### Timeline Reconstruction

| Time | Event | Source |
| ---- | ----- | ------ |
| T-24h | Deploy v2.3.1 | CI/CD logs |
| T-12h | Config change: cache TTL 60→30s | Config audit log |
| T-2h | First user report | Support tickets |
| T-0 | Alert fired | Monitoring |

**Key question**: What changed between "working" and "broken"?
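
One way to answer that question is to merge events from every source into a single sorted timeline and list what happened before the first symptom. The sketch below is illustrative only: `changes_before` is a hypothetical helper, and the timestamps are invented so the offsets match the table above.

```python
from datetime import datetime, timedelta


def changes_before(events, first_symptom):
    """Sort (time, event, source) tuples and return the events that
    occurred strictly before the first observed symptom."""
    timeline = sorted(events, key=lambda e: e[0])
    return [what for when, what, _source in timeline if when < first_symptom]


alert = datetime(2024, 1, 2, 12, 0)  # hypothetical T-0
events = [  # deliberately unordered, as when gathered from separate sources
    (alert - timedelta(hours=2), "First user report", "Support tickets"),
    (alert, "Alert fired", "Monitoring"),
    (alert - timedelta(hours=24), "Deploy v2.3.1", "CI/CD logs"),
    (alert - timedelta(hours=12), "Config change: cache TTL", "Config audit log"),
]

suspects = changes_before(events, first_symptom=alert - timedelta(hours=2))
print(suspects)  # ['Deploy v2.3.1', 'Config change: cache TTL']
```

The result is a list of candidates, not a verdict: each change still needs evidence before it is named as the cause.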

### Correlation vs Causation

| Evidence Type | Confidence | Example |
| ------------- | ---------- | ------- |
| Reproduces on demand | High | "Every time I submit this form..." |
| Correlates with a deploy | Medium | "Started after we deployed" |
| Timing coincidence | Low | "Started Monday" (traffic patterns?) |
| "It's never done this before" | Very Low | Memory is unreliable — check logs |

## Fix + Prevent Pattern

| Phase | Purpose | Example | Deadline |
| ----- | ------- | ------- | -------- |
| **Immediate** | Stop the bleeding | Rollback, disable feature, redirect traffic | Now |
| **Permanent** | Fix root cause | Add missing index, fix validation, patch dependency | This sprint |
| **Prevention** | Stop recurrence | Add CI check, monitoring alert, runbook, PR checklist | Next sprint |

**Test the fix**: The permanent fix should make the immediate fix unnecessary. If you remove the band-aid and the symptom returns, you haven't found root cause.

## Common Symptom → Root Cause Patterns

| Symptom | Obvious Cause | Deeper Root Cause |
| ------- | ------------- | ----------------- |
| Memory leak | Unclosed resource | No resource cleanup pattern in codebase |
| N+1 queries | Missing join | ORM hides query count, no query logging |
| Intermittent test failure | Timing-dependent | Shared mutable state between tests |
| "Works on my machine" | Different environment | No environment parity tooling (Docker, etc.) |
| Data corruption | Missing validation | Validation in UI only, not at API boundary |
| Slow deploys | Large artifact | No build caching, monorepo without selective builds |

## Post-Mortem Integration

The RCA section of a post-mortem should include:

1. **The 5 Whys chain** (with evidence for each level)
2. **Contributing factors** (things that made it worse, not the direct cause)
3. **What we were lucky about** (things that could have made it much worse)
4. **Action items** with owners and dates for permanent fix + prevention

## Synapses

See [synapses.json](synapses.json) for connections.

Related Skills

stride-analysis-patterns

from diegosouzapw/awesome-omni-skill

Apply STRIDE methodology to systematically identify threats. Use when analyzing system security, conducting threat modeling sessions, or creating security documentation.

statistical-analysis-spa

from diegosouzapw/awesome-omni-skill

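Skill for developing a web-based statistical analysis SPA. Implements a React application that performs outlier detection and row statistics analysis. Supports Z-Score, IQR, MAD, Grubbs, and Winsorize outlier detection, plus T-test and ANOVA statistical analysis. Data is entered by copy and paste or by dragging and dropping CSV/TXT files, and visualization uses Recharts. All data is processed locally only, with no network transmission.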

smiles_comprehensive_analysis

from diegosouzapw/awesome-omni-skill

SMILES Comprehensive Analysis - Comprehensive SMILES analysis: validate, convert name, compute all molecular descriptors, and predict ADMET. Use this skill for cheminformatics tasks involving is valid smiles ChemicalStructureAnalyzer calculate mol basic info pred molecule admet. Combines 4 tools from 3 SCP server(s).

rhetorical-analysis

from diegosouzapw/awesome-omni-skill

Rhetorical and epistemological analysis of articles, speeches, and argumentative texts. Use this skill when the user asks to analyze the argumentative quality of a text, identify fallacies or biases, evaluate the reliability of cited sources, deconstruct the logic of an argument, or produce a structured critical rewrite of a document.

regulatory-community-analysis-ChIA-PET

from diegosouzapw/awesome-omni-skill

This skill performs protein-mediated regulatory community analysis from ChIA-PET datasets and provides a way to visualize the communities. Use this skill when you have an annotated peak file (in BED format) from a ChIA-PET experiment and want to identify the protein-mediated regulatory communities from the BED and BEDPE files.

project-analysis

from diegosouzapw/awesome-omni-skill

Analyzes any project to understand its structure, tech stack, patterns, and conventions. Use when starting work on a new codebase, onboarding, or when asked "how does this project work?" or "what's the architecture?"

prd-analysis

from diegosouzapw/awesome-omni-skill

PRD parsing and task decomposition patterns for intake workflows.

manifold-analysis

from diegosouzapw/awesome-omni-skill

Analyze Manifold Markets prediction market data. Use when processing HTML exports or trade history from manifold.markets to create visualizations of trading volume, trader leaderboards, probability movements, and market dynamics. Triggers on requests involving Manifold Markets data, prediction market analysis, or when user uploads Manifold HTML files.

error-root-analyzer

from diegosouzapw/awesome-omni-skill

Comprehensive error analysis and root cause resolution. Use when programs fail, crash, or produce errors during execution. This skill performs deep debugging by identifying root causes (not just surface-level symptoms), conducting thorough module reviews to uncover related bugs and exceptions, and implementing holistic fixes that address all discovered issues.

error-diagnostics-error-analysis

from diegosouzapw/awesome-omni-skill

You are an expert error analysis specialist with deep expertise in debugging distributed systems, analyzing production incidents, and implementing comprehensive observability solutions.

error-debugging-error-analysis

from diegosouzapw/awesome-omni-skill

You are an expert error analysis specialist with deep expertise in debugging distributed systems, analyzing production incidents, and implementing comprehensive observability solutions.

codebase-analysis

from diegosouzapw/awesome-omni-skill

Systematically analyze codebase structure, complexity, dependencies, and architectural patterns to understand project organization