incident-triage
Classify incidents by severity, assemble response teams, and coordinate initial response actions and comms
Best use case
incident-triage is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
It is a strong fit for teams already working in Codex.
Classify incidents by severity, assemble response teams, and coordinate initial response actions and comms
Teams using incident-triage should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/incident-triage/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How incident-triage Compares
| Feature / Agent | incident-triage | Standard Approach |
|---|---|---|
| Platform Support | Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Classify incidents by severity, assemble response teams, and coordinate initial response actions and comms
Which AI agents support this skill?
This skill is designed for Codex.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Cursor vs Codex for AI Workflows
Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.
AI Agents for Marketing
Discover AI agents for marketing workflows, from SEO and content production to campaign research, outreach, and analytics.
SKILL.md Source
# incident-triage
Rapid incident classification, severity assessment, and response coordination.
## Triggers
Alternate expressions and non-obvious activations (primary phrases are matched automatically from the skill description):
- "P0" / "P1" / "SEV1" / "SEV2" → severity-based incident triage
- "we got paged" → production incident response
- "war room" → incident coordination setup
## Purpose
This skill provides rapid incident response coordination by:
- Classifying incident type and severity
- Assembling response team
- Coordinating initial response actions
- Tracking timeline and status
- Facilitating communication
- Preparing post-incident review
## Behavior
When triggered, this skill:
1. **Gathers incident details**:
- What is happening?
- When did it start?
- Who/what is affected?
- What changed recently?
2. **Classifies severity**:
- Assess customer impact
- Determine scope
- Assign severity level
- Calculate business impact
3. **Assembles response team**:
- Identify required responders
- Notify on-call personnel
- Establish incident commander
4. **Initiates response**:
- Create incident channel/bridge
- Start timeline documentation
- Coordinate initial diagnosis
5. **Manages communication**:
- Internal status updates
- Customer communication (if needed)
- Executive notifications (for high severity)
6. **Tracks resolution**:
- Document actions taken
- Track mitigation progress
- Confirm resolution
- Schedule post-incident review
## Severity Levels
### SEV1 / P0 - Critical
```yaml
sev1:
name: Critical
alias: [P0, SEV1, Critical]
criteria:
- Complete service outage
- Data loss or corruption
- Security breach
- >50% customers affected
- Revenue-impacting
response:
response_time: 15 minutes
update_frequency: 15 minutes
executive_notification: immediate
customer_communication: within 30 minutes
escalation:
- incident_commander: required
- engineering_manager: required
- vp_engineering: within 30 minutes
- cto: within 1 hour (if unresolved)
target_resolution: 4 hours
```
### SEV2 / P1 - High
```yaml
sev2:
name: High
alias: [P1, SEV2, High]
criteria:
- Major feature unavailable
- Significant degradation
- 10-50% customers affected
- Workaround exists but painful
response:
response_time: 30 minutes
update_frequency: 30 minutes
executive_notification: within 1 hour
customer_communication: within 2 hours (if extended)
escalation:
- incident_commander: required
- engineering_manager: within 1 hour
target_resolution: 8 hours
```
### SEV3 / P2 - Medium
```yaml
sev3:
name: Medium
alias: [P2, SEV3, Medium]
criteria:
- Feature partially degraded
- <10% customers affected
- Workaround available
- Non-critical path affected
response:
response_time: 2 hours
update_frequency: 2 hours
executive_notification: daily summary
customer_communication: as needed
escalation:
- team_lead: within 4 hours
target_resolution: 24 hours
```
### SEV4 / P3 - Low
```yaml
sev4:
name: Low
alias: [P3, SEV4, Low]
criteria:
- Minor issue
- Cosmetic problem
- Edge case affected
- Easy workaround
response:
response_time: next business day
update_frequency: daily
executive_notification: weekly summary
escalation: standard ticket flow
target_resolution: 1 week
```
## Incident Response Flow
```
┌─────────────────────────────────────────────────────────────┐
│ 1. DETECTION & TRIAGE │
│ • Alert received or issue reported │
│ • Gather initial details │
│ • Classify severity │
│ • Create incident record │
│ • Time: <15 minutes │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. MOBILIZATION │
│ • Page on-call responders │
│ • Establish incident commander │
│ • Create communication channel │
│ • Notify stakeholders per severity │
│ • Time: <5 minutes after triage │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. INVESTIGATION │
│ • Review recent changes │
│ • Check monitoring/logs │
│ • Identify affected components │
│ • Form hypothesis │
│ • Time: ongoing, status updates per SLA │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. MITIGATION │
│ • Implement workaround if available │
│ • Rollback if change-related │
│ • Scale resources if capacity issue │
│ • Isolate affected components │
│ • Goal: Reduce customer impact │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. RESOLUTION │
│ • Implement permanent fix │
│ • Verify fix is effective │
│ • Monitor for recurrence │
│ • Update status to resolved │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 6. POST-INCIDENT │
│ • Schedule post-incident review │
│ • Document timeline and actions │
│ • Identify root cause │
│ • Create follow-up action items │
│ • Update runbooks/documentation │
└─────────────────────────────────────────────────────────────┘
```
## Incident Record Format
```markdown
# Incident Report: INC-2025-001234
## Summary
| Field | Value |
|-------|-------|
| Title | Database connection pool exhaustion |
| Severity | SEV1 (Critical) |
| Status | Resolved |
| Start Time | 2025-12-08 14:32 UTC |
| Detected | 2025-12-08 14:35 UTC |
| Resolved | 2025-12-08 15:47 UTC |
| Duration | 1h 15m |
| Impact | 100% of API requests failing |
| Customers Affected | ~45,000 |
## Incident Commander
**Name**: Sarah Chen
**Role**: Senior SRE
## Response Team
| Role | Name | Joined |
|------|------|--------|
| Incident Commander | Sarah Chen | 14:38 |
| Backend Lead | David Kim | 14:40 |
| DBA | Elena Rodriguez | 14:45 |
| Comms Lead | James Wilson | 14:50 |
## Impact Assessment
### Customer Impact
- **Scope**: All customers using web and mobile apps
- **Severity**: Complete service outage
- **Duration**: 1h 15m
- **Affected Features**: All authenticated features
### Business Impact
- **Revenue Loss**: Estimated $XX,XXX
- **SLA Breach**: Yes (99.9% monthly target affected)
- **Customer Complaints**: 127 support tickets
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 14:32 | First customer reports of errors |
| 14:35 | PagerDuty alert for 5xx spike |
| 14:38 | Incident declared, Sarah Chen IC |
| 14:40 | Investigation begins |
| 14:45 | Identified: DB connection pool exhausted |
| 14:52 | Root cause: Runaway query from batch job |
| 15:00 | Mitigation: Batch job killed |
| 15:10 | Connection pool recovering |
| 15:30 | 50% traffic restored |
| 15:47 | Full service restored |
| 15:50 | Monitoring confirms stable |
| 16:00 | Incident closed |
## Root Cause
**Summary**: A scheduled batch job contained an inefficient query that held database connections indefinitely, exhausting the connection pool.
**Details**:
- Batch job deployed at 14:00 with new query
- Query had missing index, causing full table scan
- Each scan held connection for 30+ seconds
- 100 concurrent requests × 30s = pool exhausted
- New requests could not get connections → 5xx errors
**Contributing Factors**:
1. Missing index migration in batch job deploy
2. No query timeout configured
3. Connection pool size not tuned for load
4. Batch job ran during peak hours
## Resolution
**Immediate Actions**:
1. Killed runaway batch job
2. Restarted application servers to reset connections
3. Verified service restoration
**Permanent Fixes** (follow-ups):
- [ ] Add missing index (INC-001-01)
- [ ] Configure query timeouts (INC-001-02)
- [ ] Increase connection pool size (INC-001-03)
- [ ] Move batch jobs to off-peak hours (INC-001-04)
- [ ] Add connection pool monitoring alerts (INC-001-05)
## Communication Log
| Time | Channel | Message |
|------|---------|---------|
| 14:45 | #incident-2025-001234 | Incident declared, investigating API failures |
| 15:00 | Status Page | Investigating service disruption |
| 15:15 | Status Page | Identified cause, implementing fix |
| 15:30 | #incident-2025-001234 | Service recovering, 50% restored |
| 15:50 | Status Page | Service fully restored |
| 16:00 | Email to customers | Incident resolved, apology + explanation |
## Post-Incident Review
**Scheduled**: 2025-12-10 10:00 UTC
**Attendees**: Response team + Engineering Manager
**Document**: .aiwg/incidents/INC-2025-001234-pir.md
## Lessons Learned
### What Went Well
- Fast detection (3 minutes)
- Clear incident commander
- Good team coordination
### What Could Improve
- Batch job should have been tested with prod-like data
- Missing connection pool alerts
- No query timeout safety net
## Action Items
| ID | Action | Owner | Due | Status |
|----|--------|-------|-----|--------|
| INC-001-01 | Add missing index | Elena | 2025-12-09 | Done |
| INC-001-02 | Configure query timeouts | David | 2025-12-10 | In Progress |
| INC-001-03 | Tune connection pool | Sarah | 2025-12-12 | Planned |
| INC-001-04 | Reschedule batch jobs | David | 2025-12-15 | Planned |
| INC-001-05 | Add pool monitoring | Sarah | 2025-12-12 | Planned |
```
## Usage Examples
### Declare Incident
```
User: "Production is down, API returning 500 errors"
Skill initiates:
1. Gather details
2. Classify severity
3. Create incident record
4. Notify responders
Output:
"🚨 INCIDENT DECLARED: INC-2025-001234
Severity: SEV1 (Critical) - Complete API failure
Impact: All customers affected
Immediate Actions:
1. ✅ Incident record created
2. ✅ Paging on-call: Sarah Chen (SRE)
3. ✅ Slack channel: #incident-2025-001234
4. ⏳ Assembling response team
You are Incident Commander until relieved.
Quick Checklist:
- [ ] What changed in last 2 hours?
- [ ] Check: logs, metrics, recent deploys
- [ ] Status page update needed?
Update command: 'incident update <message>'"
```
### Severity Assessment
```
User: "Is this SEV1 or SEV2? Users can't checkout but browsing works"
Skill evaluates:
- Core function affected: Yes (checkout = revenue)
- Workaround: No (can't complete purchase)
- Customer %: All attempting checkout
- Revenue impact: Direct
Output:
"Severity Assessment: SEV1 (Critical)
Rationale:
✓ Revenue-impacting feature down
✓ No workaround for affected flow
✓ Checkout is business-critical path
✓ Affects all customers attempting purchase
SEV1 Criteria Met:
- Complete feature outage: checkout
- Revenue impact: direct
- No workaround: cannot complete purchases
Response Requirements:
- 15-minute update cadence
- Executive notification: immediate
- Customer communication: within 30 minutes
- Target resolution: 4 hours"
```
### Status Update
```
User: "Incident update: found the issue, deploying fix"
Skill logs:
1. Add to timeline
2. Update status
3. Notify stakeholders
Output:
"Incident Update: INC-2025-001234
Status: Mitigating
Time: 15:15 UTC
Update Logged:
'Found root cause, deploying fix'
Next Actions:
- [ ] Update status page
- [ ] Notify executive stakeholders
- [ ] Continue timeline documentation
Time Since Start: 43 minutes
Next Update Due: 15:30 UTC"
```
## Integration
This skill uses:
- `project-awareness`: Context for system topology
- `artifact-metadata`: Track incident artifacts
## Agent Orchestration
```yaml
agents:
incident_commander:
agent: incident-responder
focus: Overall coordination and decisions
technical_lead:
agent: debugger
focus: Root cause investigation
reliability:
agent: reliability-engineer
focus: System stability and monitoring
communications:
agent: support-lead
focus: Customer and stakeholder communication
```
## Configuration
### Notification Channels
```yaml
notifications:
sev1:
pagerduty: true
slack: "#incidents-critical"
email: [engineering-leads, on-call-manager]
sms: [incident-commander, vp-engineering]
sev2:
pagerduty: true
slack: "#incidents"
email: [engineering-leads]
sev3:
slack: "#incidents"
email: [team-lead]
sev4:
slack: "#incidents-low"
```
### Escalation Paths
```yaml
escalation:
sev1:
- {time: 0, to: on-call-engineer}
- {time: 15m, to: engineering-manager}
- {time: 30m, to: vp-engineering}
- {time: 1h, to: cto}
sev2:
- {time: 0, to: on-call-engineer}
- {time: 1h, to: engineering-manager}
- {time: 4h, to: vp-engineering}
```
## Output Locations
- Incident records: `.aiwg/incidents/INC-{year}-{id}.md`
- Post-incident reviews: `.aiwg/incidents/INC-{year}-{id}-pir.md`
- Action items: `.aiwg/incidents/action-items.md`
- Metrics: `.aiwg/incidents/metrics/`
## References
- Incident response template: templates/operations/incident-template.md
- Post-incident review template: templates/operations/pir-template.md
- On-call schedule: .aiwg/team/on-call.yaml
- Runbooks: .aiwg/deployment/runbooks/Related Skills
forensics-triage
Quick triage investigation following RFC 3227 volatility order
flow-incident-response
Orchestrate production incident triage, escalation, resolution, and post-incident review using ITIL best practices
aiwg-orchestrate
Route structured artifact work to AIWG workflows via MCP with zero parent context cost
venv-manager
Create, manage, and validate Python virtual environments. Use for project isolation and dependency management.
pytest-runner
Execute Python tests with pytest, supporting fixtures, markers, coverage, and parallel execution. Use for Python test automation.
vitest-runner
Execute JavaScript/TypeScript tests with Vitest, supporting coverage, watch mode, and parallel execution. Use for JS/TS test automation.
eslint-checker
Run ESLint for JavaScript/TypeScript code quality and style enforcement. Use for static analysis and auto-fixing.
repo-analyzer
Analyze GitHub repositories for structure, documentation, dependencies, and contribution patterns. Use for codebase understanding and health assessment.
pr-reviewer
Review GitHub pull requests for code quality, security, and best practices. Use for automated PR feedback and approval workflows.
YouTube Acquisition
yt-dlp patterns for acquiring content from YouTube and video platforms
Quality Filtering
Accept/reject logic and quality scoring heuristics for media content
Provenance Tracking
W3C PROV-O patterns for tracking media derivation chains and production history