IT Disaster Recovery Plan Generator
Build production-ready disaster recovery plans that actually get followed when things break.
About this skill
This AI agent skill generates detailed, actionable IT disaster recovery (DR) plans covering infrastructure, data stores, applications, and communications. The output includes RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets, failover procedures, testing schedules, and cost modeling. Use it to produce DR documentation for compliance frameworks such as SOC 2, ISO 27001, or HIPAA, to close gaps exposed by a real outage, to standardize recovery procedures when onboarding a new infrastructure team, or to refresh an existing plan during an annual review. The skill reduces the manual effort of drafting a DR plan; it does not implement the procedures it describes.
Best use case
The primary use case is generating IT disaster recovery plans for business continuity and compliance. It suits IT managers, DevOps teams, compliance officers, and system architects who need to define recovery strategies, meet regulatory requirements, or have predefined procedures in place before an outage occurs.
Users should expect a detailed and structured IT Disaster Recovery Plan document, complete with risk assessments, recovery tiers, and step-by-step procedures.
Practical example
Example input
Generate a disaster recovery plan for our SaaS platform. Stack: AWS (us-east-1 primary, eu-west-1 secondary), PostgreSQL RDS, Redis, S3. RTO target: 4 hours. RPO target: 1 hour. Team size: 8 engineers.
Example output
A structured output beginning with a Risk Assessment Matrix (e.g., Threat: Region outage, Impact: 5, Mitigation: Multi-region active-active) followed by Recovery Tier Classification for critical applications (e.g., Tier 1 - Critical (RTO < 1hr): Authentication service, Payment processing).
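The risk matrix in this output ranks threats by a score. Below is a minimal sketch of that calculation, assuming the score is likelihood multiplied by impact, each on a 1-5 scale, which is the convention the sample matrix in the skill source follows. The `Threat` class and `rank_threats` function are illustrative names, not something the skill defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    name: str
    likelihood: int  # 1-5 scale
    impact: int      # 1-5 scale

    @property
    def risk_score(self) -> int:
        # Assumed convention: score = likelihood x impact.
        return self.likelihood * self.impact


def rank_threats(threats):
    """Return threats sorted from highest to lowest risk score."""
    return sorted(threats, key=lambda t: t.risk_score, reverse=True)


# Values taken from the sample matrix in the skill source.
threats = [
    Threat("Region outage", 2, 5),
    Threat("Database corruption", 3, 5),
    Threat("Ransomware", 3, 5),
    Threat("DNS failure", 2, 4),
    Threat("Key person unavailable", 4, 3),
]

for threat in rank_threats(threats):
    print(f"{threat.risk_score:>2}  {threat.name}")
```

Sorting by score gives a simple priority order for mitigation work; threats with equal scores keep their input order.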
When to use this skill
- Building DR documentation for compliance (SOC 2, ISO 27001, HIPAA)
- After an outage exposed gaps in your recovery process
- Onboarding a new infrastructure team
- Annual DR plan review and update
When not to use this skill
- For real-time incident response and execution during an active disaster.
- To automatically implement DR procedures; it only generates the plan.
- When a detailed, highly customized, manual DR plan is already complete and up-to-date.
- For non-critical personal projects that don't require formal DR documentation.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/afrexai-disaster-recovery/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
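The manual steps above amount to copying SKILL.md into a fixed directory under the project root. The following sketch does that with the Python standard library, assuming the file has already been downloaded locally (the download location is not specified here, so none is assumed). The `install_skill` function is a hypothetical helper, not something shipped with the skill.

```python
import shutil
from pathlib import Path


def install_skill(downloaded_file: Path, project_root: Path) -> Path:
    """Copy a downloaded SKILL.md into the project's skill directory.

    Returns the path of the installed file.
    """
    target_dir = project_root / ".claude" / "skills" / "afrexai-disaster-recovery"
    # Create .claude/skills/... if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "SKILL.md"
    shutil.copyfile(downloaded_file, target)
    return target
```

Calling `install_skill(Path("SKILL.md"), Path.cwd())` from the project root would place the file where the agent looks for it; the agent still needs a restart afterwards.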
How IT Disaster Recovery Plan Generator Compares
| Feature / Agent | IT Disaster Recovery Plan Generator | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | easy | N/A |
Frequently Asked Questions
What does this skill do?
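It generates a complete disaster recovery plan covering infrastructure, data, applications, and communications, including RTO/RPO targets, failover procedures, testing schedules, and cost modeling.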
How difficult is it to install?
The installation complexity is rated as easy. You can find the installation instructions above.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Cursor vs Codex for AI Workflows
Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.
Top AI Agents for Productivity
See the top AI agent skills for productivity, workflow automation, operational systems, documentation, and everyday task execution.
SKILL.md Source
# IT Disaster Recovery Plan Generator

Build production-ready disaster recovery plans that actually get followed when things break.

## What This Does

Generates a complete DR plan covering infrastructure, data, applications, and communications. Output includes RTO/RPO targets, failover procedures, testing schedules, and cost modeling.

## When to Use

- Building DR documentation for compliance (SOC 2, ISO 27001, HIPAA)
- After an outage exposed gaps in your recovery process
- Onboarding a new infrastructure team
- Annual DR plan review and update

## How to Use

Tell the agent what you need. Be specific about your stack and requirements.

### Quick Start

```
Generate a disaster recovery plan for our SaaS platform. Stack: AWS (us-east-1 primary, eu-west-1 secondary), PostgreSQL RDS, Redis, S3. RTO target: 4 hours. RPO target: 1 hour. Team size: 8 engineers.
```

### Inputs to Provide

- **Infrastructure**: Cloud provider, regions, key services
- **Data stores**: Databases, object storage, message queues
- **RTO target**: Maximum acceptable downtime
- **RPO target**: Maximum acceptable data loss
- **Team size**: Who's available during an incident
- **Compliance**: Which frameworks apply (SOC 2, ISO 27001, HIPAA, PCI DSS)
- **Budget tier**: Startup ($5K-$15K/yr) | Growth ($15K-$50K/yr) | Enterprise ($50K+/yr)

## Output Structure

### 1. Risk Assessment Matrix

| Threat | Likelihood (1-5) | Impact (1-5) | Risk Score | Mitigation |
|---|---|---|---|---|
| Region outage | 2 | 5 | 10 | Multi-region active-active |
| Database corruption | 3 | 5 | 15 | Point-in-time recovery + cross-region replicas |
| Ransomware | 3 | 5 | 15 | Immutable backups + air-gapped copies |
| DNS failure | 2 | 4 | 8 | Multiple DNS providers |
| Key person unavailable | 4 | 3 | 12 | Runbook documentation + cross-training |

### 2. Recovery Tier Classification

**Tier 1 - Critical (RTO < 1hr)**

- Authentication service
- Payment processing
- Core API

**Tier 2 - Important (RTO < 4hr)**

- Admin dashboard
- Reporting
- Email delivery

**Tier 3 - Standard (RTO < 24hr)**

- Analytics
- Internal tools
- Dev/staging environments

### 3. Failover Procedures

For each Tier 1 service, generate step-by-step runbooks:

- Pre-failover health checks
- DNS/load balancer switchover steps
- Data consistency verification
- Post-failover smoke tests
- Rollback procedure if failover fails

### 4. Backup Strategy

| Data Store | Backup Frequency | Retention | Location | Recovery Test Frequency |
|---|---|---|---|---|
| Primary DB | Continuous (WAL) | 30 days | Cross-region | Monthly |
| Object Storage | Cross-region replication | Indefinite | Secondary region | Quarterly |
| Config/Secrets | On change | 90 days | Encrypted S3 + local | Monthly |

### 5. Communication Plan

- **Internal escalation**: PagerDuty/Opsgenie chain with backup contacts
- **Status page**: Auto-update triggers at incident declaration
- **Customer notification**: Templates for P1-P4 severity levels
- **Executive briefing**: 15-min cadence during P1, hourly during P2

### 6. Testing Schedule

| Test Type | Frequency | Scope | Duration |
|---|---|---|---|
| Tabletop exercise | Quarterly | Full team walkthrough | 2 hours |
| Component failover | Monthly | Individual service | 1 hour |
| Full DR simulation | Annually | Complete failover | 4-8 hours |
| Backup restore | Monthly | Random data store | 1 hour |

### 7. Cost Model

Break down DR spending by category:

- Infrastructure (standby capacity, cross-region replication)
- Tooling (monitoring, alerting, backup software)
- Testing (engineer hours, cloud costs during drills)
- Training (onboarding, annual refreshers)

Benchmark: DR typically costs 15-25% of primary infrastructure spend. Companies without DR plans face average downtime costs of $5,600/minute.

## Compliance Mapping

Map each DR control to framework requirements:

- **SOC 2 CC7.4/CC7.5**: Incident response and recovery
- **ISO 27001 A.17**: Information security continuity
- **HIPAA §164.308(a)(7)**: Contingency plan
- **PCI DSS 12.10**: Incident response plan

## Rules

- Always include specific commands and CLI examples (not just "failover the database")
- Include estimated time for each step in runbooks
- Flag single points of failure explicitly
- Default to the 3-2-1 backup rule: 3 copies, 2 media types, 1 offsite
- Include cost estimates in USD for each recommendation
- Never assume unlimited budget; tier recommendations by cost

## Next Steps

Want to go deeper? Check out the full [AI Context Packs](https://afrexai-cto.github.io/context-packs/): pre-built knowledge bases for SaaS, Healthcare, Legal, Manufacturing, and more. $47 per industry pack, or grab all 10 for $197.

Calculate what manual DR planning costs your team: [AI Revenue Calculator](https://afrexai-cto.github.io/ai-revenue-calculator/)

Set up your agent stack in 5 minutes: [Agent Setup Wizard](https://afrexai-cto.github.io/agent-setup/)
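The skill source above defaults to the 3-2-1 backup rule (3 copies, 2 media types, 1 offsite). As a rough illustration, the check below tests a backup inventory against that rule. It is a sketch under stated assumptions: the `BackupCopy` class and `check_3_2_1` function are hypothetical names, the primary data is counted as one of the three copies, and the media labels are free-form strings chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str   # where the copy lives, e.g. a region name
    media: str      # storage type, e.g. "block storage" or "object storage"
    offsite: bool   # True if stored away from the primary site


def check_3_2_1(copies):
    """Return a list of 3-2-1 violations; an empty list means the rule is met."""
    problems = []
    if len(copies) < 3:
        problems.append(f"need 3 copies, found {len(copies)}")
    media_types = {c.media for c in copies}
    if len(media_types) < 2:
        problems.append(f"need 2 media types, found {len(media_types)}")
    if not any(c.offsite for c in copies):
        problems.append("need 1 offsite copy, found 0")
    return problems


# Illustrative inventory: primary volume plus one local snapshot.
inventory = [
    BackupCopy("us-east-1 primary volume", "block storage", False),
    BackupCopy("us-east-1 snapshot", "object storage", False),
]
print(check_3_2_1(inventory))

# Adding a cross-region copy satisfies all three conditions.
inventory.append(BackupCopy("eu-west-1 replica", "object storage", True))
print(check_3_2_1(inventory))
```

Returning the list of violations rather than a single boolean makes it easier to flag each gap explicitly in a generated plan.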
Related Skills
Incident Postmortem Generator
Generate blameless incident postmortems from raw notes, Slack threads, or bullet points.
botlearn-healthcheck
botlearn-healthcheck — BotLearn autonomous health inspector for OpenClaw instances across 5 domains (hardware, config, security, skills, autonomy); triggers on system check, health report, diagnostics, or scheduled heartbeat inspection.
Post-Mortem & Incident Review Framework
Run structured post-mortems that actually prevent repeat failures. Blameless analysis, root cause identification, and action tracking.
afrexai-performance-engineering
Complete performance engineering system — profiling, optimization, load testing, capacity planning, and performance culture. Use when diagnosing slow applications, optimizing code/queries/infrastructure, load testing before launch, planning capacity, or building performance into CI/CD. Covers Node.js, Python, Go, Java, databases, APIs, and frontend.
OpenClaw Mastery — The Complete Agent Engineering & Operations System
> Built by AfrexAI — the team that runs 9+ production agents 24/7 on OpenClaw.
Legacy System Modernization Engine
Complete methodology for assessing, planning, and executing legacy system modernization — from monolith decomposition to cloud migration. Works for any tech stack, any scale.
Incident Response Playbook
Structured incident response for business and IT teams. Guides you through detection, triage, containment, resolution, and post-mortem — with auto-generated timelines and action items.
Git Engineering & Repository Strategy
You are a Git Engineering expert. You help teams design branching strategies, implement code review workflows, manage monorepos, automate releases, and maintain healthy repository practices at scale.
Django Production Engineering
Complete methodology for building, scaling, and operating production Django applications. From project structure to deployment, security to performance — every decision framework a Django team needs.
afrexai-api-architect
Design, build, test, document, and secure production-grade APIs. Covers the full lifecycle from schema design through deployment, monitoring, and versioning. Use when designing new APIs, reviewing existing ones, generating OpenAPI specs, building test suites, or debugging production issues.
Agent Ops Runbook
Generate a production-ready operations runbook for deploying AI agents. Covers pre-deployment checklists, shadow mode → supervised → autonomous rollout stages, monitoring dashboards, rollback procedures, cost management, and incident response templates.
node-red-manager
Manage Node-RED instances via Admin API or CLI. Automate flow deployment, install nodes, and troubleshoot issues. Use when user wants to "build automation", "connect devices", or "fix node-red".