notion-incident-runbook
Execute Notion incident response procedures with triage, mitigation, and postmortem. Use when responding to Notion API outages, investigating errors, or running post-incident reviews for Notion integration failures. Trigger with phrases like "notion incident", "notion outage", "notion down", "notion on-call", "notion emergency", "notion broken".
Best use case
notion-incident-runbook is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Execute Notion incident response procedures with triage, mitigation, and postmortem. Use when responding to Notion API outages, investigating errors, or running post-incident reviews for Notion integration failures. Trigger with phrases like "notion incident", "notion outage", "notion down", "notion on-call", "notion emergency", "notion broken".
Teams using notion-incident-runbook should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/notion-incident-runbook/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How notion-incident-runbook Compares
| Feature / Agent | notion-incident-runbook | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Execute Notion incident response procedures with triage, mitigation, and postmortem. Use when responding to Notion API outages, investigating errors, or running post-incident reviews for Notion integration failures. Trigger with phrases like "notion incident", "notion outage", "notion down", "notion on-call", "notion emergency", "notion broken".
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
ChatGPT vs Claude for Agent Skills
Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.
SKILL.md Source
# Notion Incident Runbook
## Overview
Rapid incident response procedures for Notion API failures. This runbook covers a structured triage flow (under 5 minutes), automated health checks against both status.notion.so and your own integration, a decision tree for classifying failures (Notion-side vs. integration-side), per-error-type mitigation with real `Client` code, cached fallback patterns, communication templates, and postmortem structure.
## Prerequisites
- Access to application monitoring dashboards and log aggregator
- `NOTION_TOKEN` environment variable set for diagnostic API calls
- `curl` and `jq` installed for quick CLI triage
- Python alternative: `notion-client` (`pip install notion-client`)
- Communication channels configured (Slack webhook, PagerDuty, etc.)
## Instructions
### Step 1: Quick Triage (Under 5 Minutes)
Run this diagnostic script to determine if the issue is Notion-side or integration-side:
```bash
#!/bin/bash
# notion-triage.sh — run at first alert
set -euo pipefail
echo "=== Notion Incident Triage ==="
echo "Time: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
# 1. Check Notion's public status page
echo -e "\n--- Notion Platform Status ---"
STATUS=$(curl -sf https://status.notion.so/api/v2/status.json \
| jq -r '.status.description' 2>/dev/null || echo "UNREACHABLE")
echo "Notion Status: $STATUS"
INCIDENTS=$(curl -sf https://status.notion.so/api/v2/incidents/unresolved.json \
| jq '.incidents | length' 2>/dev/null || echo "UNKNOWN")
echo "Active Incidents: $INCIDENTS"
if [ "$INCIDENTS" != "0" ] && [ "$INCIDENTS" != "UNKNOWN" ]; then
echo "INCIDENT DETAILS:"
curl -sf https://status.notion.so/api/v2/incidents/unresolved.json \
| jq -r '.incidents[] | " - \(.name) (\(.status)): \(.incident_updates[0].body)"'
fi
# 2. Test our integration authentication
echo -e "\n--- Integration Auth Check ---"
AUTH_HTTP=$(curl -sf -o /dev/null -w "%{http_code}" \
https://api.notion.com/v1/users/me \
-H "Authorization: Bearer ${NOTION_TOKEN}" \
-H "Notion-Version: 2022-06-28" 2>/dev/null || echo "000")
echo "Auth HTTP Status: $AUTH_HTTP"
if [ "$AUTH_HTTP" = "200" ]; then
BOT_NAME=$(curl -sf https://api.notion.com/v1/users/me \
-H "Authorization: Bearer ${NOTION_TOKEN}" \
-H "Notion-Version: 2022-06-28" | jq -r '.name')
echo "Bot Name: $BOT_NAME"
fi
# 3. Test database query (if test DB configured)
echo -e "\n--- API Responsiveness ---"
if [ -n "${NOTION_TEST_DATABASE_ID:-}" ]; then
QUERY_RESULT=$(curl -sf -o /dev/null -w "%{http_code} %{time_total}s" \
-X POST "https://api.notion.com/v1/databases/${NOTION_TEST_DATABASE_ID}/query" \
-H "Authorization: Bearer ${NOTION_TOKEN}" \
-H "Notion-Version: 2022-06-28" \
-H "Content-Type: application/json" \
-d '{"page_size": 1}' 2>/dev/null || echo "000 0.000s")
echo "Database Query: $QUERY_RESULT"
else
echo "NOTION_TEST_DATABASE_ID not set — skipping query test"
fi
# 4. Classification
echo -e "\n--- Triage Result ---"
if [ "$STATUS" != "All Systems Operational" ] && [ "$STATUS" != "UNREACHABLE" ]; then
echo "CLASSIFICATION: Notion-side issue. Enable fallback mode."
elif [ "$AUTH_HTTP" = "401" ]; then
echo "CLASSIFICATION: Token expired or revoked. Rotate immediately."
elif [ "$AUTH_HTTP" = "429" ]; then
echo "CLASSIFICATION: Rate limited. Reduce concurrency."
elif [ "$AUTH_HTTP" = "000" ]; then
echo "CLASSIFICATION: Network/DNS issue. Check firewall and DNS."
else
echo "CLASSIFICATION: Integration-side issue. Check application logs."
fi
```
**TypeScript — programmatic triage:**
```typescript
import { Client, isNotionClientError, APIErrorCode } from '@notionhq/client';
async function triageNotionHealth(token: string): Promise<{
classification: string;
notionStatus: string;
authStatus: string;
latencyMs: number;
}> {
// Check Notion status page
let notionStatus = 'unknown';
try {
const res = await fetch('https://status.notion.so/api/v2/status.json');
const data = await res.json();
notionStatus = data.status.description;
} catch { notionStatus = 'unreachable'; }
// Test our authentication
const client = new Client({ auth: token, timeoutMs: 10_000 });
const start = Date.now();
let authStatus = 'unknown';
let classification = 'unknown';
try {
await client.users.me({});
authStatus = 'authenticated';
classification = 'integration-side';
} catch (error) {
if (isNotionClientError(error)) {
authStatus = `${error.code} (HTTP ${error.status})`;
switch (error.code) {
case APIErrorCode.Unauthorized:
classification = 'token-expired';
break;
case APIErrorCode.RateLimited:
classification = 'rate-limited';
break;
case APIErrorCode.ServiceUnavailable:
classification = 'notion-down';
break;
default:
classification = 'api-error';
}
} else {
authStatus = 'network-error';
classification = 'network-issue';
}
}
if (notionStatus !== 'All Systems Operational') {
classification = 'notion-side';
}
return {
classification,
notionStatus,
authStatus,
latencyMs: Date.now() - start,
};
}
```
### Step 2: Decision Tree and Mitigation
```
Is status.notion.so showing an incident?
|
+-- YES --> Notion-side outage
| +-- Enable cached/fallback mode
| +-- Notify users of degraded service
| +-- Monitor status page for resolution
| +-- DO NOT restart or rotate tokens
|
+-- NO --> Our integration issue
|
+-- Auth returning 401?
| +-- YES --> Token expired or revoked
| | +-- Regenerate at notion.so/my-integrations
| | +-- Update secret manager (see below)
| | +-- Restart application
| +-- NO --> Continue
|
+-- Getting 429 rate limits?
| +-- YES --> Exceeding 3 req/s average
| | +-- Check for runaway loops or webhook storms
| | +-- Reduce concurrency to 1
| | +-- Add exponential backoff
| +-- NO --> Continue
|
+-- Getting 404 on specific resources?
| +-- YES --> Pages unshared or deleted
| | +-- Re-share pages with integration via Connections menu
| | +-- Check if pages were moved to trash
| +-- NO --> Continue
|
+-- Getting 400 validation errors?
| +-- YES --> Database schema changed in Notion UI
| | +-- Re-fetch schema (databases.retrieve)
| | +-- Compare with expected properties
| | +-- Update property mappings in code
| +-- NO --> Investigate application logs
```
**Token rotation:**
```bash
# AWS Secrets Manager
aws secretsmanager update-secret \
--secret-id notion/production \
--secret-string '{"token":"ntn_NEW_TOKEN_HERE"}'
# GCP Secret Manager
echo -n "ntn_NEW_TOKEN_HERE" | \
gcloud secrets versions add notion-token-prod --data-file=-
# Restart to pick up new token
kubectl rollout restart deployment/my-app # Kubernetes
# or: gcloud run services update my-service --no-traffic # Cloud Run
```
**Cached fallback for Notion outages:**
```typescript
import { Client, isNotionClientError } from '@notionhq/client';
const notion = new Client({ auth: process.env.NOTION_TOKEN! });
const cache = new Map<string, { data: any; timestamp: number }>();
const CACHE_TTL_MS = 5 * 60 * 1000; // 5 minutes
async function queryWithFallback(dbId: string, filter?: any) {
const cacheKey = `query:${dbId}:${JSON.stringify(filter)}`;
try {
const result = await notion.databases.query({
database_id: dbId,
filter,
page_size: 100,
});
// Update cache on success
cache.set(cacheKey, { data: result, timestamp: Date.now() });
return { data: result, source: 'live' as const };
} catch (error) {
// Fall back to cache on any API error
const cached = cache.get(cacheKey);
if (cached && Date.now() - cached.timestamp < CACHE_TTL_MS) {
console.warn(`Notion unavailable, serving cached data (age: ${
Math.round((Date.now() - cached.timestamp) / 1000)
}s)`);
return { data: cached.data, source: 'cache' as const };
}
// No cache available — re-throw
throw error;
}
}
// Schema change detection
async function detectSchemaChanges(dbId: string, expectedProps: string[]) {
const db = await notion.databases.retrieve({ database_id: dbId });
const actualProps = Object.keys(db.properties);
const missing = expectedProps.filter(p => !actualProps.includes(p));
const unexpected = actualProps.filter(p => !expectedProps.includes(p));
if (missing.length > 0 || unexpected.length > 0) {
console.error(JSON.stringify({
event: 'schema_change_detected',
database_id: dbId,
missing_properties: missing,
new_properties: unexpected,
}));
}
return { missing, unexpected, current: actualProps };
}
```
### Step 3: Communication and Postmortem
**Internal Slack notification template:**
```
:rotating_light: P[1-4] INCIDENT: Notion Integration
Status: [INVESTIGATING | MITIGATING | RESOLVED]
Impact: [specific user-facing impact]
Root Cause: [Notion outage | Token expired | Rate limited | Schema change]
Action: [current remediation step]
ETA: [estimated resolution or "monitoring"]
Dashboard: [link to monitoring dashboard]
Thread: [link to incident channel thread]
```
**External status page update:**
```
Notion Integration Service Disruption
We are experiencing [brief description of impact]. [Specific feature]
may be unavailable or show stale data.
Workaround: [if available, e.g., "Cached data is being served"]
Next update: [time, e.g., "in 30 minutes or sooner if resolved"]
[ISO 8601 timestamp]
```
**Postmortem template:**
```markdown
## Incident: Notion [Error Type] — [Date]
**Duration:** X hours Y minutes
**Severity:** P[1-4]
**Detection:** [Alert name] / [User report]
### Summary
[1-2 sentence description of what happened and the user impact]
### Timeline (all times UTC)
- HH:MM — First alert fired ([alert name])
- HH:MM — On-call acknowledged, began triage
- HH:MM — Root cause identified: [description]
- HH:MM — Mitigation applied: [action taken]
- HH:MM — Service fully restored
### Root Cause
[Technical explanation — e.g., "Integration token was rotated in Notion
dashboard by a team member without updating the secret manager, causing
all API calls to return 401 Unauthorized."]
### Impact
- Users affected: N
- Duration of degraded service: X minutes
- Data loss: [none | description]
### Action Items
| Priority | Action | Owner | Due |
|----------|--------|-------|-----|
| P1 | [Preventive measure] | @name | YYYY-MM-DD |
| P2 | [Detection improvement] | @name | YYYY-MM-DD |
| P3 | [Process improvement] | @name | YYYY-MM-DD |
```
## Output
- Automated triage script classifying incidents in under 5 minutes
- Decision tree mapping HTTP status codes to root causes
- Per-error-type mitigation procedures with real code
- Cached fallback mode for Notion outages
- Schema change detection for 400 validation errors
- Communication templates for internal and external stakeholders
- Postmortem template with timeline and action items
## Error Handling
| Scenario | Triage Signal | Immediate Action |
|----------|--------------|------------------|
| Notion platform outage | status.notion.so incident | Enable fallback mode, notify users |
| Token expired/revoked | All requests return 401 | Rotate token in secret manager, restart |
| Rate limited | 429 errors spiking | Reduce concurrency to 1, check for loops |
| Schema changed | 400 on specific operations | Run `databases.retrieve`, update mappings |
| Network/DNS issue | Timeouts, no HTTP response | Check firewall, DNS resolution, proxy config |
| Pages unshared | 404 on previously working pages | Re-share via Connections menu in Notion |
## Examples
### One-Line Health Check
```bash
curl -sf https://api.notion.com/v1/users/me \
-H "Authorization: Bearer ${NOTION_TOKEN}" \
-H "Notion-Version: 2022-06-28" \
| jq '{name: .name, type: .type}' \
|| echo "UNHEALTHY: Notion API unreachable or auth failed"
```
### Python Quick Triage
```python
from notion_client import Client, APIResponseError
import os
def quick_triage():
try:
client = Client(auth=os.environ["NOTION_TOKEN"], timeout_ms=10_000)
me = client.users.me()
print(f"OK: Connected as {me['name']}")
except APIResponseError as e:
print(f"ERROR: {e.code} (HTTP {e.status}): {e.message}")
except Exception as e:
print(f"NETWORK ERROR: {e}")
quick_triage()
```
## Resources
- [Notion Status Page](https://status.notion.so) — real-time platform status
- [Notion API Error Codes](https://developers.notion.com/reference/errors) — full error reference
- [Notion Request Limits](https://developers.notion.com/reference/request-limits) — 3 req/s average
- [Statuspage API](https://www.atlassianstatuspage.io/api) — programmatic status checks
## Next Steps
For data handling and privacy compliance, see `notion-data-handling`.Related Skills
responding-to-security-incidents
Analyze and guide security incident response, investigation, and remediation processes. Use when you need to handle security breaches, classify incidents, develop response playbooks, gather forensic evidence, or coordinate remediation efforts. Trigger with phrases like "security incident response", "ransomware attack response", "data breach investigation", "incident playbook", or "security forensics".
windsurf-incident-runbook
Execute Windsurf incident response when AI features fail or cause production issues. Use when Cascade breaks code, Windsurf service is down, AI-generated code causes production incidents, or team needs emergency Windsurf troubleshooting. Trigger with phrases like "windsurf incident", "windsurf outage", "windsurf broke production", "cascade caused bug", "windsurf emergency".
webflow-incident-runbook
Execute Webflow incident response — triage by HTTP status (401/403/429/500), circuit breaker activation, cached fallback, Webflow status page checks, communication templates, and postmortem process. Trigger with phrases like "webflow incident", "webflow outage", "webflow down", "webflow on-call", "webflow emergency", "webflow broken".
vercel-incident-runbook
Vercel incident response procedures with triage, instant rollback, and postmortem. Use when responding to Vercel-related outages, investigating production errors, or running post-incident reviews for deployment failures. Trigger with phrases like "vercel incident", "vercel outage", "vercel down", "vercel on-call", "vercel emergency", "vercel broken".
veeva-incident-runbook
Veeva Vault incident runbook for enterprise operations. Use when implementing advanced Veeva Vault patterns. Trigger: "veeva incident runbook".
vastai-incident-runbook
Execute Vast.ai incident response for GPU instance failures and outages. Use when responding to instance failures, investigating training crashes, or handling spot preemption emergencies. Trigger with phrases like "vastai incident", "vastai outage", "vastai down", "vastai emergency", "vastai instance failed".
twinmind-incident-runbook
Incident response for TwinMind failures: transcription not starting, audio not captured, sync failures, and calendar disconnect. Use when implementing incident runbook, or managing TwinMind meeting AI operations. Trigger with phrases like "twinmind incident runbook", "twinmind incident runbook".
supabase-incident-runbook
Execute Supabase incident response: dashboard health checks, connection pool status, pg_stat_activity queries, RLS debugging, Edge Function logs, storage health, and escalation. Use when responding to Supabase outages, investigating production errors, debugging connection issues, or preparing evidence for Supabase support escalation. Trigger: "supabase incident", "supabase outage", "supabase down", "supabase on-call", "supabase emergency", "supabase broken", "supabase connection issues".
speak-incident-runbook
Incident response for Speak API outages: triage, fallback to offline mode, and recovery procedures. Use when implementing incident runbook, or managing Speak language learning platform operations. Trigger with phrases like "speak incident runbook", "speak incident runbook".
snowflake-incident-runbook
Execute Snowflake incident response with triage, rollback, and postmortem using real SQL diagnostics. Use when responding to Snowflake outages, investigating query failures, or running post-incident reviews for pipeline failures. Trigger with phrases like "snowflake incident", "snowflake outage", "snowflake down", "snowflake on-call", "snowflake emergency".
shopify-incident-runbook
Execute Shopify incident response with triage using Shopify status page, API health checks, and rate limit diagnosis. Trigger with phrases like "shopify incident", "shopify outage", "shopify down", "shopify on-call", "shopify emergency", "shopify not responding".
sentry-incident-runbook
Execute incident response procedures using Sentry error monitoring. Use when investigating production outages, triaging error spikes, classifying incident severity, or building postmortem reports from Sentry data. Trigger with phrases like "sentry incident", "sentry triage", "investigate sentry error", "sentry runbook", "production incident sentry".