AI Agent Skill HUB

DevOps & Infrastructure

IT Disaster Recovery Plan Generator

Build production-ready disaster recovery plans that actually get followed when things break.

3,891 stars

Complexity: easy


About this skill

This AI agent skill generates detailed, actionable IT disaster recovery (DR) plans covering infrastructure, data stores, applications, and communication protocols. The output includes specific RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets, failover procedures, testing schedules, and cost modeling.

Use it to produce DR documentation for compliance frameworks such as SOC 2, ISO 27001, or HIPAA; to close gaps exposed by a real outage; to standardize recovery procedures when onboarding a new infrastructure team; or to keep an existing plan current during an annual review.

The skill reduces the manual effort of drafting a DR plan and gives teams predefined procedures to follow during a disruption. It only generates the plan; it does not implement or execute recovery procedures.

Best use case

The primary use case is to generate comprehensive and actionable IT disaster recovery plans, crucial for ensuring business continuity and compliance. This skill benefits IT managers, DevOps teams, compliance officers, and system architects who need to establish robust recovery strategies, meet regulatory requirements, or rapidly respond to outages with predefined procedures.


Users should expect a detailed and structured IT Disaster Recovery Plan document, complete with risk assessments, recovery tiers, and step-by-step procedures.

Practical example

Example input

Generate a disaster recovery plan for our SaaS platform. Stack: AWS (us-east-1 primary, eu-west-1 secondary), PostgreSQL RDS, Redis, S3. RTO target: 4 hours. RPO target: 1 hour. Team size: 8 engineers.

Example output

A structured output beginning with a Risk Assessment Matrix (e.g., Threat: Region outage, Impact: 5, Mitigation: Multi-region active-active) followed by Recovery Tier Classification for critical applications (e.g., Tier 1 - Critical (RTO < 1hr): Authentication service, Payment processing).

When to use this skill

Building DR documentation for compliance (SOC 2, ISO 27001, HIPAA)
After an outage exposed gaps in your recovery process
Onboarding a new infrastructure team
Annual DR plan review and update

When not to use this skill

For real-time incident response and execution during an active disaster.
To automatically implement DR procedures; it only generates the plan.
When a detailed, highly customized, manual DR plan is already complete and up-to-date.
For non-critical personal projects that don't require formal DR documentation.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/afrexai-disaster-recovery/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/1kalin/afrexai-disaster-recovery/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/afrexai-disaster-recovery/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How IT Disaster Recovery Plan Generator Compares

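| Feature / Agent | IT Disaster Recovery Plan Generator | Standard Approach |
|-----------------|-------------------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Easy | N/A |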

Frequently Asked Questions

What does this skill do?

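It generates a complete DR plan covering infrastructure, data, applications, and communications. The output includes RTO/RPO targets, failover procedures, testing schedules, and cost modeling.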

How difficult is it to install?

The installation complexity is rated as easy. You can find the installation instructions above.

Where can I find the source code?

The source is the SKILL.md file in the openclaw/skills repository on GitHub; the full URL appears in the Installation section.


SKILL.md Source

# IT Disaster Recovery Plan Generator

Build production-ready disaster recovery plans that actually get followed when things break.

## What This Does

Generates a complete DR plan covering infrastructure, data, applications, and communications. Output includes RTO/RPO targets, failover procedures, testing schedules, and cost modeling.

## When to Use

- Building DR documentation for compliance (SOC 2, ISO 27001, HIPAA)
- After an outage exposed gaps in your recovery process
- Onboarding a new infrastructure team
- Annual DR plan review and update

## How to Use

Tell the agent what you need. Be specific about your stack and requirements.

### Quick Start
```
Generate a disaster recovery plan for our SaaS platform. Stack: AWS (us-east-1 primary, eu-west-1 secondary), PostgreSQL RDS, Redis, S3. RTO target: 4 hours. RPO target: 1 hour. Team size: 8 engineers.
```

### Inputs to Provide
- **Infrastructure**: Cloud provider, regions, key services
- **Data stores**: Databases, object storage, message queues
- **RTO target**: Maximum acceptable downtime
- **RPO target**: Maximum acceptable data loss
- **Team size**: Who's available during an incident
- **Compliance**: Which frameworks apply (SOC 2, ISO 27001, HIPAA, PCI DSS)
- **Budget tier**: Startup ($5K-$15K/yr) | Growth ($15K-$50K/yr) | Enterprise ($50K+/yr)
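
The inputs above can be assembled into a request programmatically. The following is a minimal sketch, not part of the skill itself: `build_prompt` and its parameter names are hypothetical, and the sentence layout simply mirrors the Quick Start example.

```python
def build_prompt(platform, stack, rto, rpo, team_size,
                 compliance=None, budget_tier=None):
    """Assemble a request string from the inputs listed above (hypothetical helper)."""
    parts = [
        f"Generate a disaster recovery plan for our {platform}.",
        f"Stack: {stack}.",
        f"RTO target: {rto}.",
        f"RPO target: {rpo}.",
        f"Team size: {team_size} engineers.",
    ]
    # Optional inputs are appended only when supplied.
    if compliance:
        parts.append(f"Compliance: {', '.join(compliance)}.")
    if budget_tier:
        parts.append(f"Budget tier: {budget_tier}.")
    return " ".join(parts)


prompt = build_prompt(
    platform="SaaS platform",
    stack="AWS (us-east-1 primary, eu-west-1 secondary), PostgreSQL RDS, Redis, S3",
    rto="4 hours",
    rpo="1 hour",
    team_size=8,
)
print(prompt)
```

Called with only the required arguments, this reproduces the Quick Start request; passing `compliance` or `budget_tier` appends the corresponding sentence.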

## Output Structure

### 1. Risk Assessment Matrix
| Threat | Likelihood (1-5) | Impact (1-5) | Risk Score | Mitigation |
|--------|------------------|--------------|------------|------------|
| Region outage | 2 | 5 | 10 | Multi-region active-active |
| Database corruption | 3 | 5 | 15 | Point-in-time recovery + cross-region replicas |
| Ransomware | 3 | 5 | 15 | Immutable backups + air-gapped copies |
| DNS failure | 2 | 4 | 8 | Multiple DNS providers |
| Key person unavailable | 4 | 3 | 12 | Runbook documentation + cross-training |
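
The Risk Score column in the matrix above is the product of likelihood and impact. As an illustration only, the sketch below recomputes the scores and ranks the threats; the `Risk` class and `ranked` function are hypothetical names, and the data is copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    threat: str
    likelihood: int  # 1-5
    impact: int      # 1-5
    mitigation: str

    @property
    def score(self) -> int:
        # Risk score as used in the matrix: likelihood x impact.
        return self.likelihood * self.impact


RISKS = [
    Risk("Region outage", 2, 5, "Multi-region active-active"),
    Risk("Database corruption", 3, 5, "Point-in-time recovery + cross-region replicas"),
    Risk("Ransomware", 3, 5, "Immutable backups + air-gapped copies"),
    Risk("DNS failure", 2, 4, "Multiple DNS providers"),
    Risk("Key person unavailable", 4, 3, "Runbook documentation + cross-training"),
]


def ranked(risks):
    # Highest score first; sorted() is stable, so ties keep their input order.
    return sorted(risks, key=lambda r: r.score, reverse=True)


for risk in ranked(RISKS):
    print(f"{risk.score:>2}  {risk.threat}")
```

Ranking by score is one reasonable way to decide which mitigations to fund first; a team may prefer to weight impact more heavily than likelihood.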

### 2. Recovery Tier Classification
**Tier 1 — Critical (RTO < 1hr)**
- Authentication service
- Payment processing
- Core API

**Tier 2 — Important (RTO < 4hr)**
- Admin dashboard
- Reporting
- Email delivery

**Tier 3 — Standard (RTO < 24hr)**
- Analytics
- Internal tools
- Dev/staging environments
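
The tier boundaries above can be expressed as a small lookup. This is a sketch under one stated assumption: the source does not say what happens to a service whose RTO is 24 hours or more, so the hypothetical `recovery_tier` function returns "Unclassified" for that case.

```python
def recovery_tier(rto_hours):
    """Map a service's RTO in hours to the tier thresholds listed above."""
    if rto_hours <= 0:
        raise ValueError("RTO must be positive")
    if rto_hours < 1:
        return "Tier 1"
    if rto_hours < 4:
        return "Tier 2"
    if rto_hours < 24:
        return "Tier 3"
    # Assumption: the tier list does not cover RTOs of 24 hours or more.
    return "Unclassified"


print(recovery_tier(0.5))  # an authentication service with a 30-minute RTO
print(recovery_tier(2))    # an admin dashboard with a 2-hour RTO
```

Because every threshold is a strict "less than", a service with an RTO of exactly 1 hour falls into Tier 2, not Tier 1.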

### 3. Failover Procedures
For each Tier 1 service, generate step-by-step runbooks:
- Pre-failover health checks
- DNS/load balancer switchover steps
- Data consistency verification
- Post-failover smoke tests
- Rollback procedure if failover fails
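
Since each runbook step is expected to carry a time estimate, the estimates can be summed and compared with the service's RTO. The sketch below shows the idea; the `Step` class and the minute values are illustrative assumptions, not figures from the skill, and the rollback step is left out because it only runs when failover fails.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    estimate_min: int  # estimated duration in minutes (hypothetical values below)


RUNBOOK = [
    Step("Pre-failover health checks", 5),
    Step("DNS/load balancer switchover", 10),
    Step("Data consistency verification", 15),
    Step("Post-failover smoke tests", 10),
]


def total_minutes(steps):
    return sum(step.estimate_min for step in steps)


def fits_rto(steps, rto_minutes):
    # A runbook is only credible if its estimated duration fits inside the RTO.
    return total_minutes(steps) <= rto_minutes


print(total_minutes(RUNBOOK))  # 40
print(fits_rto(RUNBOOK, 60))   # True for a Tier 1 service (RTO < 1hr)
```

A runbook that does not fit is a signal to either automate steps or renegotiate the RTO before an incident forces the question.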

### 4. Backup Strategy
| Data Store | Backup Frequency | Retention | Location | Recovery Test Frequency |
|-----------|-----------------|-----------|----------|----------------------|
| Primary DB | Continuous (WAL) | 30 days | Cross-region | Monthly |
| Object Storage | Cross-region replication | Indefinite | Secondary region | Quarterly |
| Config/Secrets | On change | 90 days | Encrypted S3 + local | Monthly |
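
Backup frequency matters because the age of the newest restorable backup bounds how much data can be lost. A minimal sketch of that check follows; `meets_rpo` is a hypothetical helper and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def meets_rpo(last_backup, now, rpo):
    """True if restoring from last_backup would lose no more than rpo of data."""
    return now - last_backup <= rpo


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rpo = timedelta(hours=1)

print(meets_rpo(now - timedelta(minutes=20), now, rpo))  # True: backup is 20 minutes old
print(meets_rpo(now - timedelta(hours=3), now, rpo))     # False: backup is 3 hours old
```

A check like this could run on a schedule against each data store in the table, alerting when the newest backup is older than the RPO.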

### 5. Communication Plan
- **Internal escalation**: PagerDuty/Opsgenie chain with backup contacts
- **Status page**: Auto-update triggers at incident declaration
- **Customer notification**: Templates for P1-P4 severity levels
- **Executive briefing**: 15-min cadence during P1, hourly during P2
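
The executive briefing cadence above (every 15 minutes for P1, hourly for P2) can be turned into concrete times once an incident is declared. The sketch below assumes the first briefing happens one interval after declaration; `briefing_times` and `BRIEFING_INTERVAL` are hypothetical names.

```python
from datetime import datetime, timedelta

# Cadence from the communication plan: 15 minutes for P1, hourly for P2.
BRIEFING_INTERVAL = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
}


def briefing_times(declared_at, severity, count):
    """Return the next `count` briefing times after an incident is declared."""
    interval = BRIEFING_INTERVAL[severity]
    # Assumption: the first briefing is due one full interval after declaration.
    return [declared_at + interval * i for i in range(1, count + 1)]


declared = datetime(2024, 1, 1, 9, 0)
for t in briefing_times(declared, "P1", 3):
    print(t.strftime("%H:%M"))
```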

### 6. Testing Schedule
| Test Type | Frequency | Scope | Duration |
|-----------|-----------|-------|----------|
| Tabletop exercise | Quarterly | Full team walkthrough | 2 hours |
| Component failover | Monthly | Individual service | 1 hour |
| Full DR simulation | Annually | Complete failover | 4-8 hours |
| Backup restore | Monthly | Random data store | 1 hour |
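
To budget for the schedule above, multiply each test's frequency by its duration. The sketch below does that arithmetic; it counts elapsed drill hours per year, not engineer-hours, since the table does not say how many people attend each test.

```python
RUNS_PER_YEAR = {"Monthly": 12, "Quarterly": 4, "Annually": 1}

# (test type, frequency, minimum hours, maximum hours), copied from the table.
SCHEDULE = [
    ("Tabletop exercise", "Quarterly", 2, 2),
    ("Component failover", "Monthly", 1, 1),
    ("Full DR simulation", "Annually", 4, 8),
    ("Backup restore", "Monthly", 1, 1),
]


def annual_drill_hours(schedule):
    """Return the (low, high) range of drill hours per year."""
    low = sum(RUNS_PER_YEAR[freq] * lo for _, freq, lo, _ in schedule)
    high = sum(RUNS_PER_YEAR[freq] * hi for _, freq, _, hi in schedule)
    return low, high


print(annual_drill_hours(SCHEDULE))  # (36, 40)
```

Multiplying each entry by the number of participants gives the engineer-hours figure that feeds the testing line of the cost model.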

### 7. Cost Model
Break down DR spending by category:
- Infrastructure (standby capacity, cross-region replication)
- Tooling (monitoring, alerting, backup software)
- Testing (engineer hours, cloud costs during drills)
- Training (onboarding, annual refreshers)

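Benchmark: as a rough rule of thumb, DR costs 15-25% of primary infrastructure spend, though the actual share depends heavily on architecture and recovery targets. A commonly cited industry estimate puts the average cost of downtime at $5,600/minute.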
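
The benchmark figures above lend themselves to a quick back-of-the-envelope comparison. The sketch below applies them to a hypothetical $200,000/year primary infrastructure spend and a 4-hour outage; both inputs are illustrative, and the function names are not part of the skill.

```python
def dr_budget_range(primary_spend_usd):
    """Apply the 15-25% benchmark to annual primary infrastructure spend."""
    low = primary_spend_usd * 15 / 100
    high = primary_spend_usd * 25 / 100
    return low, high


def downtime_cost(minutes, cost_per_minute_usd=5600):
    """Estimate outage cost using the $5,600/minute figure quoted above."""
    return minutes * cost_per_minute_usd


low, high = dr_budget_range(200_000)
print(f"DR budget: ${low:,.0f} to ${high:,.0f} per year")  # $30,000 to $50,000
print(f"4-hour outage: ${downtime_cost(4 * 60):,.0f}")     # $1,344,000
```

The per-minute figure is an industry-wide average, so a team should substitute its own revenue and SLA penalty numbers where it has them.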

## Compliance Mapping

Map each DR control to framework requirements:
- **SOC 2 CC7.4/CC7.5**: Incident response and recovery
- **ISO 27001 A.17**: Information security continuity (2013 Annex A numbering; the 2022 edition covers this under A.5.29 and A.5.30)
- **HIPAA §164.308(a)(7)**: Contingency plan
- **PCI DSS 12.10**: Incident response plan

## Rules
- Always include specific commands and CLI examples (not just "failover the database")
- Include estimated time for each step in runbooks
- Flag single points of failure explicitly
- Default to the 3-2-1 backup rule: 3 copies, 2 media types, 1 offsite
- Include cost estimates in USD for each recommendation
- Never assume unlimited budget — tier recommendations by cost
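
The 3-2-1 rule in the list above is easy to verify mechanically. This is a minimal sketch, assuming the list of copies includes the primary data copy; `BackupCopy` and `satisfies_3_2_1` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    media: str     # e.g. "disk", "object-storage", "tape"
    offsite: bool  # True if stored away from the primary site


def satisfies_3_2_1(copies):
    """Check for at least 3 copies, 2 media types, and 1 offsite copy."""
    return (
        len(copies) >= 3
        and len({copy.media for copy in copies}) >= 2
        and any(copy.offsite for copy in copies)
    )


copies = [
    BackupCopy("disk", offsite=False),            # primary data
    BackupCopy("disk", offsite=False),            # local snapshot
    BackupCopy("object-storage", offsite=True),   # cross-region copy
]
print(satisfies_3_2_1(copies))      # True
print(satisfies_3_2_1(copies[:2]))  # False: two copies, one media type, none offsite
```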

## Next Steps

Want to go deeper? Check out the full [AI Context Packs](https://afrexai-cto.github.io/context-packs/) — pre-built knowledge bases for SaaS, Healthcare, Legal, Manufacturing, and more. $47 per industry pack, or grab all 10 for $197.

Calculate what manual DR planning costs your team: [AI Revenue Calculator](https://afrexai-cto.github.io/ai-revenue-calculator/)

Set up your agent stack in 5 minutes: [Agent Setup Wizard](https://afrexai-cto.github.io/agent-setup/)
