oraclecloud-incident-runbook

Self-service incident runbook for OCI outages — health probes, instance recovery, cross-AD/region failover. Use when OCI instances go down, the status page is silent, or you need automated recovery without waiting for support. Trigger with "oraclecloud incident", "oci outage runbook", "oci failover", "oci instance recovery".

1,868 stars

Best use case

oraclecloud-incident-runbook is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Self-service incident runbook for OCI outages — health probes, instance recovery, cross-AD/region failover. Use when OCI instances go down, the status page is silent, or you need automated recovery without waiting for support. Trigger with "oraclecloud incident", "oci outage runbook", "oci failover", "oci instance recovery".

Teams using oraclecloud-incident-runbook should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/oraclecloud-incident-runbook/SKILL.md --create-dirs "https://raw.githubusercontent.com/jeremylongshore/claude-code-plugins-plus-skills/main/plugins/saas-packs/oraclecloud-pack/skills/oraclecloud-incident-runbook/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/oraclecloud-incident-runbook/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How oraclecloud-incident-runbook Compares

Feature / Agentoraclecloud-incident-runbookStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Self-service incident runbook for OCI outages — health probes, instance recovery, cross-AD/region failover. Use when OCI instances go down, the status page is silent, or you need automated recovery without waiting for support. Trigger with "oraclecloud incident", "oci outage runbook", "oci failover", "oci instance recovery".

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# Oracle Cloud Incident Runbook

## Overview

Self-service runbook for when OCI instances go down and the status page stays green. OCI's status page has a history of not acknowledging outages in real time (London Jan 2026 — 502s and instances disappearing for 10 minutes with no status update). OCI Support response times average 4+ hours for Sev-1 tickets. This runbook gives you health probes, automated instance recovery, cross-AD failover, and cross-region failover — all executable without waiting on Oracle.

**Purpose:** Detect OCI service degradation independently, recover instances automatically, and fail over to alternate availability domains or regions when the primary is impacted.

## Prerequisites

- **OCI CLI installed and configured** — `~/.oci/config` validated (see `oraclecloud-install-auth`)
- **Python 3.8+** with the OCI SDK — `pip install oci`
- **Pre-configured resources**: at least one compute instance, a VCN with subnets in multiple ADs
- **IAM policies**: `manage instances`, `manage volumes`, `inspect work-requests` in the target compartment
- **Boot volume backups** enabled (recovery depends on having a recent backup)

## Instructions

### Step 1: Independent Health Probes

Do not trust the OCI status page alone. Run your own health checks against the OCI API:

```python
import oci
import time

config = oci.config.from_file("~/.oci/config")

def probe_oci_health(config):
    """Probe OCI API endpoints independently of the status page."""
    results = {}

    # Probe 1: Identity service (lightest call)
    try:
        start = time.time()
        identity = oci.identity.IdentityClient(config)
        identity.list_regions()
        results["identity"] = {"status": "healthy", "latency_ms": int((time.time() - start) * 1000)}
    except oci.exceptions.ServiceError as e:
        results["identity"] = {"status": "degraded", "error": str(e.status)}

    # Probe 2: Compute service
    try:
        start = time.time()
        compute = oci.core.ComputeClient(config)
        compute.list_instances(compartment_id=config["tenancy"], limit=1)
        results["compute"] = {"status": "healthy", "latency_ms": int((time.time() - start) * 1000)}
    except oci.exceptions.ServiceError as e:
        results["compute"] = {"status": "degraded", "error": str(e.status)}

    # Probe 3: Networking service
    try:
        start = time.time()
        network = oci.core.VirtualNetworkClient(config)
        network.list_vcns(compartment_id=config["tenancy"], limit=1)
        results["networking"] = {"status": "healthy", "latency_ms": int((time.time() - start) * 1000)}
    except oci.exceptions.ServiceError as e:
        results["networking"] = {"status": "degraded", "error": str(e.status)}

    return results

health = probe_oci_health(config)
for service, status in health.items():
    print(f"  {service}: {status['status']} ({status.get('latency_ms', 'N/A')}ms)")
```

### Step 2: Severity Classification

| Level | Condition | Detection | Response Time |
|-------|-----------|-----------|---------------|
| **P1 — Total Outage** | All API probes fail, instances unreachable | All 3 probes return `degraded` | Immediate — trigger cross-region failover |
| **P2 — Partial Degradation** | Some APIs slow (>2s latency), instance agent unresponsive | Latency >2000ms on any probe | 15 min — attempt instance recovery, prepare failover |
| **P3 — Intermittent** | Sporadic 429/500 errors, some requests succeed | Occasional probe failures | 1 hour — enable retries, monitor trend |

### Step 3: Instance Recovery Actions

```bash
# Check current instance state
oci compute instance get --instance-id "$INSTANCE_OCID" \
  --query 'data."lifecycle-state"' --raw-output

# Action: Reset (hard reboot) — fastest recovery for hung instances
oci compute instance action --instance-id "$INSTANCE_OCID" \
  --action RESET

# Action: Reboot (graceful) — for instances still responding to ACPI
oci compute instance action --instance-id "$INSTANCE_OCID" \
  --action SOFTRESET

# Action: Stop then Start — forces reallocation to new host hardware
oci compute instance action --instance-id "$INSTANCE_OCID" \
  --action STOP
# Wait for STOPPED state
oci compute instance action --instance-id "$INSTANCE_OCID" \
  --action START
```

### Step 4: Cross-AD Failover

When an entire availability domain is impacted, launch a replacement instance in a different AD using the latest boot volume backup:

```python
import oci

config = oci.config.from_file("~/.oci/config")
compute = oci.core.ComputeClient(config)
blockstorage = oci.core.BlockstorageClient(config)

# Find latest boot volume backup
backups = blockstorage.list_boot_volume_backups(
    compartment_id="COMPARTMENT_OCID",
    boot_volume_id="BOOT_VOLUME_OCID",
    sort_by="TIMECREATED",
    sort_order="DESC",
    limit=1,
).data

if not backups:
    raise RuntimeError("No boot volume backups found — cannot failover")

latest_backup = backups[0]
print(f"Using backup: {latest_backup.id} from {latest_backup.time_created}")

# Launch replacement instance in alternate AD
launch_details = oci.core.models.LaunchInstanceDetails(
    availability_domain="AD-2",  # Different from the impacted AD
    compartment_id="COMPARTMENT_OCID",
    shape="VM.Standard.E4.Flex",
    shape_config=oci.core.models.LaunchInstanceShapeConfigDetails(ocpus=2, memory_in_gbs=16),
    display_name="failover-instance",
    source_details=oci.core.models.InstanceSourceViaBootVolumeDetails(
        source_type="bootVolume",
        boot_volume_id=latest_backup.id,
    ),
    create_vnic_details=oci.core.models.CreateVnicDetails(
        subnet_id="SUBNET_OCID_IN_AD2",
    ),
)

response = compute.launch_instance(launch_details)
print(f"Failover instance launching: {response.data.id}")
```

### Step 5: Cross-Region Failover

For region-wide outages, use a pre-replicated boot volume in the DR region:

```bash
# List available boot volume replicas in DR region
oci bv boot-volume-backup list \
  --compartment-id "$COMPARTMENT_OCID" \
  --query 'data[0].id' --raw-output \
  --region us-phoenix-1

# Launch instance in DR region
oci compute instance launch \
  --compartment-id "$COMPARTMENT_OCID" \
  --availability-domain "AD-1" \
  --shape "VM.Standard.E4.Flex" \
  --shape-config '{"ocpus":2,"memoryInGBs":16}' \
  --display-name "dr-failover-instance" \
  --source-boot-volume-id "$DR_BACKUP_OCID" \
  --subnet-id "$DR_SUBNET_OCID" \
  --region us-phoenix-1
```

### Step 6: Status Monitoring Script

Run this in a loop during an incident to detect recovery:

```bash
#!/bin/bash
while true; do
  STATUS=$(oci compute instance get --instance-id "$INSTANCE_OCID" \
    --query 'data."lifecycle-state"' --raw-output 2>&1)
  TIMESTAMP=$(date -u +%H:%M:%S)
  echo "$TIMESTAMP | Instance: $STATUS"
  if [ "$STATUS" = "RUNNING" ]; then
    echo "$TIMESTAMP | Instance recovered."
    break
  fi
  sleep 30
done
```

## Output

Successful execution produces:
- Independent health probe results for Identity, Compute, and Networking services
- Severity classification (P1/P2/P3) based on probe results
- Instance recovery action (reset, reboot, or stop/start) applied to the impacted instance
- Cross-AD failover instance launched from the latest boot volume backup (if AD-level failure)
- Cross-region failover instance launched in the DR region (if region-level failure)
- Continuous monitoring log tracking instance lifecycle state until recovery

## Error Handling

| Error | Code | Cause | Solution |
|-------|------|-------|----------|
| NotAuthenticated | 401 | API key expired during incident | Use `oci session authenticate` for token-based short-term auth |
| NotAuthorizedOrNotFound | 404 | IAM policy missing for recovery actions | Pre-create policy: `allow group sre to manage instances in compartment prod` |
| TooManyRequests | 429 | Rate limited during mass recovery | OCI has no Retry-After header — back off 30 seconds between calls |
| InternalError | 500 | OCI service itself is degraded | This confirms the outage — proceed to cross-region failover |
| No boot volume backups | — | Backups not configured | Cannot failover without backups — enable in Block Storage > Boot Volumes > Backup Policies |
| ServiceError status -1 | — | API timeout (region unreachable) | Switch to DR region immediately: `--region us-phoenix-1` |

## Examples

**Quick health check (CLI one-liner):**

```bash
# Test if OCI API is responsive
oci iam region list --output table && echo "OCI API: OK" || echo "OCI API: UNREACHABLE"
```

**Automated recovery wrapper:**

```bash
#!/bin/bash
set -euo pipefail
INSTANCE_OCID="${1:?Usage: recover.sh <instance-ocid>}"
STATE=$(oci compute instance get --instance-id "$INSTANCE_OCID" \
  --query 'data."lifecycle-state"' --raw-output)
case "$STATE" in
  RUNNING) echo "Instance is healthy" ;;
  STOPPED) oci compute instance action --instance-id "$INSTANCE_OCID" --action START ;;
  *)       oci compute instance action --instance-id "$INSTANCE_OCID" --action RESET ;;
esac
```

## Resources

- [OCI Instance Recovery](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/restartinginstance.htm) — stop, start, reboot, and reset actions
- [OCI Status Page](https://ocistatus.oraclecloud.com) — official service health (may lag actual incidents)
- [OCI CLI Reference](https://docs.oracle.com/en-us/iaas/Content/API/Concepts/cliconcepts.htm) — command-line interface documentation
- [OCI Python SDK Reference](https://docs.oracle.com/en-us/iaas/tools/python/latest/) — ComputeClient, BlockstorageClient APIs
- [OCI Known Issues](https://docs.oracle.com/en-us/iaas/Content/knownissues.htm) — platform-known issues and workarounds
- [SDK Troubleshooting](https://docs.oracle.com/en-us/iaas/Content/API/Concepts/sdk_troubleshooting.htm) — connectivity and timeout debugging

## Next Steps

After stabilizing the incident, run `oraclecloud-debug-bundle` to collect forensic data for the postmortem, or review `oraclecloud-prod-checklist` to harden your environment against future outages.

Related Skills

responding-to-security-incidents

1868
from jeremylongshore/claude-code-plugins-plus-skills

Analyze and guide security incident response, investigation, and remediation processes. Use when you need to handle security breaches, classify incidents, develop response playbooks, gather forensic evidence, or coordinate remediation efforts. Trigger with phrases like "security incident response", "ransomware attack response", "data breach investigation", "incident playbook", or "security forensics".

windsurf-incident-runbook

1868
from jeremylongshore/claude-code-plugins-plus-skills

Execute Windsurf incident response when AI features fail or cause production issues. Use when Cascade breaks code, Windsurf service is down, AI-generated code causes production incidents, or team needs emergency Windsurf troubleshooting. Trigger with phrases like "windsurf incident", "windsurf outage", "windsurf broke production", "cascade caused bug", "windsurf emergency".

webflow-incident-runbook

1868
from jeremylongshore/claude-code-plugins-plus-skills

Execute Webflow incident response — triage by HTTP status (401/403/429/500), circuit breaker activation, cached fallback, Webflow status page checks, communication templates, and postmortem process. Trigger with phrases like "webflow incident", "webflow outage", "webflow down", "webflow on-call", "webflow emergency", "webflow broken".

vercel-incident-runbook

1868
from jeremylongshore/claude-code-plugins-plus-skills

Vercel incident response procedures with triage, instant rollback, and postmortem. Use when responding to Vercel-related outages, investigating production errors, or running post-incident reviews for deployment failures. Trigger with phrases like "vercel incident", "vercel outage", "vercel down", "vercel on-call", "vercel emergency", "vercel broken".

veeva-incident-runbook

1868
from jeremylongshore/claude-code-plugins-plus-skills

Veeva Vault incident runbook for enterprise operations. Use when implementing advanced Veeva Vault patterns. Trigger: "veeva incident runbook".

vastai-incident-runbook

1868
from jeremylongshore/claude-code-plugins-plus-skills

Execute Vast.ai incident response for GPU instance failures and outages. Use when responding to instance failures, investigating training crashes, or handling spot preemption emergencies. Trigger with phrases like "vastai incident", "vastai outage", "vastai down", "vastai emergency", "vastai instance failed".

twinmind-incident-runbook

1868
from jeremylongshore/claude-code-plugins-plus-skills

Incident response for TwinMind failures: transcription not starting, audio not captured, sync failures, and calendar disconnect. Use when implementing incident runbook, or managing TwinMind meeting AI operations. Trigger with phrases like "twinmind incident runbook", "twinmind incident runbook".

supabase-incident-runbook

1868
from jeremylongshore/claude-code-plugins-plus-skills

Execute Supabase incident response: dashboard health checks, connection pool status, pg_stat_activity queries, RLS debugging, Edge Function logs, storage health, and escalation. Use when responding to Supabase outages, investigating production errors, debugging connection issues, or preparing evidence for Supabase support escalation. Trigger: "supabase incident", "supabase outage", "supabase down", "supabase on-call", "supabase emergency", "supabase broken", "supabase connection issues".

speak-incident-runbook

1868
from jeremylongshore/claude-code-plugins-plus-skills

Incident response for Speak API outages: triage, fallback to offline mode, and recovery procedures. Use when implementing incident runbook, or managing Speak language learning platform operations. Trigger with phrases like "speak incident runbook", "speak incident runbook".

snowflake-incident-runbook

1868
from jeremylongshore/claude-code-plugins-plus-skills

Execute Snowflake incident response with triage, rollback, and postmortem using real SQL diagnostics. Use when responding to Snowflake outages, investigating query failures, or running post-incident reviews for pipeline failures. Trigger with phrases like "snowflake incident", "snowflake outage", "snowflake down", "snowflake on-call", "snowflake emergency".

shopify-incident-runbook

1868
from jeremylongshore/claude-code-plugins-plus-skills

Execute Shopify incident response with triage using Shopify status page, API health checks, and rate limit diagnosis. Trigger with phrases like "shopify incident", "shopify outage", "shopify down", "shopify on-call", "shopify emergency", "shopify not responding".

sentry-incident-runbook

1868
from jeremylongshore/claude-code-plugins-plus-skills

Execute incident response procedures using Sentry error monitoring. Use when investigating production outages, triaging error spikes, classifying incident severity, or building postmortem reports from Sentry data. Trigger with phrases like "sentry incident", "sentry triage", "investigate sentry error", "sentry runbook", "production incident sentry".