afrexai-self-hosting-mastery
Complete self-hosting and homelab operating system. Deploy, secure, monitor, and maintain self-hosted services with production-grade reliability. Use when setting up home servers, Docker infrastructure, reverse proxies, backups, monitoring, or evaluating self-hosted alternatives to SaaS.
Best use case
afrexai-self-hosting-mastery is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Users should expect a more consistent workflow output, faster repeated execution, and less time spent rewriting prompts from scratch.
Practical example
Example input
Use the "afrexai-self-hosting-mastery" skill to help with this workflow task. Context: Complete self-hosting and homelab operating system. Deploy, secure, monitor, and maintain self-hosted services with production-grade reliability. Use when setting up home servers, Docker infrastructure, reverse proxies, backups, monitoring, or evaluating self-hosted alternatives to SaaS.
Example output
A structured workflow result with clearer steps, more consistent formatting, and an output that is easier to reuse in the next run.
When to use this skill
- Use this skill when you want a reusable workflow rather than writing the same prompt again and again.
When not to use this skill
- Do not use this when you only need a one-off answer and do not need a reusable workflow.
- Do not use it if you cannot install or maintain the related files, repository context, or supporting tools.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/afrexai-self-hosting-mastery/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How afrexai-self-hosting-mastery Compares
| Feature / Agent | afrexai-self-hosting-mastery | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Complete self-hosting and homelab operating system. Deploy, secure, monitor, and maintain self-hosted services with production-grade reliability. Use when setting up home servers, Docker infrastructure, reverse proxies, backups, monitoring, or evaluating self-hosted alternatives to SaaS.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Self-Hosting Mastery
Complete system for building and operating reliable self-hosted infrastructure — from first server to multi-node homelab.
## Phase 1: Infrastructure Assessment
### Server Profile YAML
```yaml
server_profile:
  name: ""
  hardware:
    cpu: "" # e.g., "Intel i5-12400" or "Raspberry Pi 5"
    ram_gb: 0
    storage:
      - device: "" # e.g., "/dev/sda"
        type: "" # ssd | hdd | nvme
        size_gb: 0
        role: "" # boot | data | backup
    network: "" # 1gbe | 2.5gbe | 10gbe
  os: "" # debian | ubuntu | proxmox | unraid | truenas
  location: "" # home | closet | rack | colo | vps
  power:
    ups: false
    wattage_idle: 0
    wattage_load: 0
    monthly_cost_estimate: "" # electricity
  network:
    public_ip: "" # static | dynamic | cgnat
    domain: ""
    dns_provider: "" # cloudflare | duckdns | custom
    isp_ports_open: true # some ISPs block 80/443
  goals:
    - "" # media server, smart home, dev environment, etc.
  budget_monthly: "" # electricity + domain + any VPS
```
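The `monthly_cost_estimate` field can be filled in from the measured wattage. The sketch below is one way to do that, assuming a 30-day month and a flat price per kWh; the function name `monthly_power_cost` and the example price are illustrative and not part of the skill.

```bash
#!/usr/bin/env bash
# Estimate monthly electricity cost from average power draw.
# Assumptions (not from the guide): 30-day month, flat price per kWh.
monthly_power_cost() {
  local watts="$1" price_per_kwh="$2"
  # watts * 24 h * 30 days / 1000 = kWh per month, then multiply by the price
  awk -v w="$watts" -v p="$price_per_kwh" \
    'BEGIN { printf "%.2f\n", w * 24 * 30 / 1000 * p }'
}

# Example: 30 W idle draw at 0.30 per kWh
monthly_power_cost 30 0.30
```

Use the idle wattage for an always-on server that is mostly quiet, and the load wattage for an upper bound.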
### Hardware Decision Matrix
| Budget | RAM | Storage | Good For | Example Hardware |
|--------|-----|---------|----------|-----------------|
| $0 | 4-8GB | 64GB+ | Pi-hole, AdGuard, small tools | Raspberry Pi 4/5 |
| $50-150 | 8-16GB | 256GB+ | Docker host, 5-10 services | Used SFF PC (Dell Optiplex, Lenovo Tiny) |
| $150-400 | 16-32GB | 1TB+ | NAS + services, media server | Mini PC (Intel NUC, Beelink) |
| $400-800 | 32-64GB | 4TB+ | Full homelab, VMs + containers | Used enterprise (Dell R720, HP DL380) |
| $800+ | 64GB+ | 10TB+ | Multi-node, Proxmox cluster | Multiple nodes, dedicated NAS |
### Self-Host vs SaaS Decision
Ask before self-hosting anything:
1. **Data sensitivity** — Does keeping data local matter? (passwords, health, finance = yes)
2. **Reliability need** — Can you tolerate occasional downtime? (email = risky, media = fine)
3. **Maintenance budget** — Do you have 2-4 hours/month for updates?
4. **Skill level** — Can you debug Docker/networking issues?
5. **Cost comparison** — Is the SaaS < $10/mo? Often not worth self-hosting for trivial savings.
**Always self-host**: Password manager, DNS/ad-blocking, VPN, bookmarks, notes
**Usually self-host**: Media server, file sync, photo backup, monitoring, git
**Think twice**: Email (deliverability hell), calendar (sync complexity), chat (uptime expectations)
**Rarely worth it**: Search engine (resource hungry), social media (no network effect)
---
## Phase 2: OS & Virtualization
### OS Selection Guide
| OS | Best For | Learning Curve | Notes |
|----|----------|---------------|-------|
| Debian 12 | Docker-only host | Low | Stable, minimal, just works |
| Ubuntu Server 24.04 | Beginners, wide docs | Low | More packages, snap controversy |
| Proxmox VE | VMs + containers | Medium | Free, enterprise features, ZFS |
| Unraid | NAS + Docker + VMs | Medium | $59-129, great UI, parity array |
| TrueNAS Scale | ZFS NAS + Docker | Medium | Free, ZFS-first, apps improving |
| NixOS | Reproducible configs | High | Declarative, steep learning curve |
### Proxmox Quick Setup
```bash
# Post-install essentials
# 1. Remove enterprise repo (if no subscription)
sed -i 's/^deb/#deb/' /etc/apt/sources.list.d/pve-enterprise.list
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" > /etc/apt/sources.list.d/pve-no-subscription.list
apt update && apt upgrade -y
# 2. Create a Docker LXC (lightweight container)
# Download template: Datacenter → Storage → CT Templates → Download → debian-12
# Create CT: 2 cores, 2GB RAM, 32GB disk, bridge vmbr0
# Inside CT: install Docker
apt install -y curl
curl -fsSL https://get.docker.com | sh
# 3. Enable IOMMU for GPU passthrough (if needed)
# Edit /etc/default/grub: GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
# update-grub && reboot
```
### VM vs LXC vs Docker Decision
| Factor | VM | LXC | Docker |
|--------|----|-----|--------|
| Isolation | Full (own kernel) | Partial (shared kernel) | Process-level |
| Overhead | High (1-2GB base) | Low (50-200MB) | Minimal |
| Use when | Different OS, GPU passthrough, untrusted workloads | Dedicated service host, ZFS datasets | Most services |
| Avoid when | RAM-constrained | Need Windows, custom kernel | Stateful databases (use LXC/VM) |
**Rule**: Docker for 90% of services. LXC for Docker hosts or isolated environments. VM for Windows, different kernel needs, or GPU passthrough.
---
## Phase 3: Docker Infrastructure
### Docker Compose Project Structure
```
/opt/stacks/                    # or ~/docker/
├── traefik/
│   ├── docker-compose.yml
│   ├── .env
│   ├── config/
│   │   └── traefik.yml
│   └── data/
│       ├── acme.json           # chmod 600
│       └── dynamic/
├── monitoring/
│   ├── docker-compose.yml
│   ├── .env
│   └── config/
├── media/
│   ├── docker-compose.yml
│   ├── .env
│   └── config/
├── productivity/
│   ├── docker-compose.yml
│   ├── .env
│   └── config/
└── scripts/
    ├── backup.sh
    ├── update-all.sh
    └── health-check.sh
```
### Docker Compose Best Practices
```yaml
# Template: production-grade service
services:
  app:
    image: vendor/app:1.2.3 # ALWAYS pin version
    container_name: app # Explicit name
    restart: unless-stopped # Auto-restart
    networks:
      - proxy # Traefik network
      - internal # Backend network
    volumes:
      - ./config:/config # Bind mount for config
      - app-data:/data # Named volume for data
    environment:
      - TZ=Europe/London # Always set timezone
      - PUID=1000 # Match host user
      - PGID=1000
    env_file:
      - .env # Secrets in .env (gitignored)
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.app.rule=Host(`app.example.com`)"
      - "traefik.http.routers.app.tls.certresolver=letsencrypt"
      - "traefik.http.services.app.loadbalancer.server.port=8080"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    deploy:
      resources:
        limits:
          memory: 512M # Prevent OOM cascades
    security_opt:
      - no-new-privileges:true # Security hardening
    read_only: true # Where possible
    tmpfs:
      - /tmp
volumes:
  app-data:
networks:
  proxy:
    external: true
  internal:
```
### Docker Security Checklist
- [ ] Pin all image versions (never `:latest` in production)
- [ ] Set `restart: unless-stopped` on all services
- [ ] Use `.env` files for secrets (never hardcode in compose)
- [ ] Set memory limits on all containers
- [ ] Use `security_opt: no-new-privileges:true`
- [ ] Use `read_only: true` where possible + tmpfs for /tmp
- [ ] Create separate Docker networks per stack
- [ ] Never expose database ports to 0.0.0.0
- [ ] Run containers as non-root (PUID/PGID or `user:`)
- [ ] Enable Docker content trust: `export DOCKER_CONTENT_TRUST=1`
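- [ ] Prune unused images monthly: `docker system prune -af` (this does not touch volumes unless you add `--volumes`; review unused volumes separately before deleting them)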
- [ ] Use named volumes (not anonymous) for all persistent data
- [ ] Set `TZ` environment variable on every container
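
The first checklist item can be spot-checked mechanically. The following is a minimal sketch, not a full YAML parser: it only pattern-matches `image:` lines, and `find_unpinned_images` is a hypothetical helper name. It reports images with no tag or with `:latest`, and treats digest-pinned references as pinned.

```bash
#!/usr/bin/env bash
# List image references that have no tag or use :latest.
# Hypothetical helper: matches "image:" lines only, does not parse YAML.
find_unpinned_images() {
  local compose_file="$1"
  grep -E '^[[:space:]]*image:' "$compose_file" \
    | sed -E -e 's/^[[:space:]]*image:[[:space:]]*//' \
             -e 's/[[:space:]]+#.*$//' \
             -e "s/[\"']//g" \
    | awk '{
        ref = $1
        n = split(ref, parts, "/")
        last = parts[n]                       # the tag lives in the last path segment
        if (index(ref, "@sha256:") > 0) next  # digest-pinned, treat as pinned
        if (last !~ /:/ || last ~ /:latest$/) print ref
      }'
}

# Usage (path follows the stack layout used in this guide):
# find_unpinned_images /opt/stacks/media/docker-compose.yml
```

An empty result means every `image:` line carries an explicit tag or digest; it says nothing about whether that tag is current.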
---
## Phase 4: Reverse Proxy & SSL
### Reverse Proxy Selection
| Proxy | Best For | SSL | Config Style | Learning Curve |
|-------|----------|-----|-------------|---------------|
| Traefik | Docker-native, auto-discovery | Auto (ACME) | Labels + YAML | Medium |
| Caddy | Simplicity, auto-SSL | Auto (built-in) | Caddyfile | Low |
| Nginx Proxy Manager | GUI preference | Auto (UI) | Web UI | Very Low |
| Nginx (manual) | Maximum control | Manual/certbot | Config files | High |
**Recommendation**: Traefik for Docker power users. Caddy for simplicity. NPM for beginners.
### Traefik Production Config
```yaml
# traefik/config/traefik.yml
api:
  dashboard: true
  insecure: false
entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: websecure
          scheme: https
  websecure:
    address: ":443"
    http:
      tls:
        certResolver: letsencrypt
certificatesResolvers:
  letsencrypt:
    acme:
      email: you@example.com
      storage: /data/acme.json
      # Use DNS challenge if ISP blocks port 80
      # dnsChallenge:
      #   provider: cloudflare
      httpChallenge:
        entryPoint: web
providers:
  docker:
    exposedByDefault: false # Explicit opt-in per service
    network: proxy
  file:
    directory: /data/dynamic
    watch: true
log:
  level: WARN
accessLog:
  filePath: /data/access.log
  bufferingSize: 100
```
### Cloudflare Tunnel (Zero Port Forwarding)
For CGNAT or ISPs blocking ports — expose services without opening firewall:
```yaml
# cloudflared/docker-compose.yml
services:
  cloudflared:
    image: cloudflare/cloudflared:2024.1.0
    container_name: cloudflared
    restart: unless-stopped
    command: tunnel run
    environment:
      - TUNNEL_TOKEN=${CF_TUNNEL_TOKEN}
    networks:
      - proxy
```
**When to use Cloudflare Tunnel vs port forwarding**:
- CGNAT (no public IP) → Tunnel (only option)
- ISP blocks 80/443 → Tunnel or DNS challenge + non-standard ports
- Security-first → Tunnel (no open ports)
- Performance-first → Direct (lower latency)
- LAN-only access → Neither (use Tailscale/WireGuard)
---
## Phase 5: Essential Services Stack
### Tier 1 — Deploy First (Foundation)
| Service | Purpose | Image | RAM | Notes |
|---------|---------|-------|-----|-------|
| Traefik/Caddy | Reverse proxy + SSL | traefik:v3.0 | 64MB | Gateway to everything |
| Pi-hole/AdGuard | DNS + ad blocking | pihole/pihole | 128MB | Network-wide ad blocking |
| Authelia/Authentik | SSO + 2FA | authelia/authelia | 128MB | Protect services without built-in auth |
| Uptime Kuma | Monitoring | louislam/uptime-kuma | 128MB | Know when things break |
| Watchtower | Auto-updates | containrrr/watchtower | 32MB | Optional — some prefer manual |
### Tier 2 — Core Services
| Service | Purpose | Alt | RAM |
|---------|---------|-----|-----|
| Vaultwarden | Password manager | Bitwarden | 64MB |
| Nextcloud | File sync + office | Seafile (lighter) | 512MB |
| Immich | Photo backup | PhotoPrism | 1-4GB |
| Jellyfin | Media server | Plex (less free) | 512MB-2GB |
| Paperless-ngx | Document management | - | 256MB |
| Home Assistant | Smart home | - | 512MB |
### Tier 3 — Power User
| Service | Purpose | RAM |
|---------|---------|-----|
| Gitea/Forgejo | Git hosting | 256MB |
| n8n | Workflow automation | 256MB |
| Grafana + Prometheus | Metrics & dashboards | 512MB |
| Tandoor | Recipe management | 256MB |
| Mealie | Meal planning | 128MB |
| Linkwarden/Hoarder | Bookmark manager | 256MB |
| Stirling PDF | PDF tools | 512MB |
| IT-Tools | Developer utilities | 64MB |
### RAM Planning
```
Total RAM needed ≈ OS base (1-2GB) + sum of service RAM + 20% headroom
Example 16GB server:
OS + Docker: 2 GB
Traefik: 0.1 GB
Pi-hole: 0.1 GB
Authelia: 0.1 GB
Uptime Kuma: 0.1 GB
Vaultwarden: 0.1 GB
Nextcloud: 0.5 GB
Immich: 2.0 GB
Jellyfin: 1.0 GB
Paperless: 0.3 GB
Home Assistant: 0.5 GB
──────────────────────
Total: 6.8 GB → 8.2 GB with headroom
Available: ~7.8 GB free for more services
```
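The arithmetic above can be reproduced for any service list. This is a small sketch under the same formula (sum of estimates plus 20% headroom); `ram_plan` is a hypothetical helper, and the first argument is the server's total RAM in GB.

```bash
#!/usr/bin/env bash
# Sum per-service RAM estimates (GB) and add 20% headroom.
ram_plan() {
  local total_gb="$1"; shift
  printf '%s\n' "$@" | awk -v total="$total_gb" '
    { sum += $1 }
    END {
      need = sum * 1.2   # 20% headroom on top of the summed estimates
      printf "sum=%.1f need=%.1f free=%.1f\n", sum, need, total - need
    }'
}

# 16 GB server: OS + Docker first, then each service from the example above
ram_plan 16 2 0.1 0.1 0.1 0.1 0.1 0.5 2.0 1.0 0.3 0.5
```

For the example list this prints `sum=6.8 need=8.2 free=7.8`, matching the figures in the table.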
---
## Phase 6: Networking & DNS
### DNS Architecture
```
Internet → Cloudflare DNS → Your Public IP → Router → Server
                                                        ↓
                                             Reverse Proxy (Traefik)
                                                        ↓
                                     ┌──────────────────┼──────────────────┐
                                     ↓                  ↓                  ↓
                              app.domain.com    files.domain.com   media.domain.com
```
### Split DNS (Access Services Locally Without Hairpin NAT)
```
# Pi-hole/AdGuard: Local DNS rewrites
# Point *.home.example.com → 192.168.1.100 (server LAN IP)
# External: Cloudflare points to public IP
# Result: LAN traffic stays local, external goes through internet
```
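On a dnsmasq-based resolver, the wildcard rewrite described above is usually a single `address=` rule, which answers for the domain and every subdomain. The sketch below only generates that line. Where it goes is an assumption that depends on your setup: Pi-hole has historically read extra files from `/etc/dnsmasq.d/`, while AdGuard Home configures the equivalent as a DNS rewrite in its web UI. `split_dns_rule` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Print a dnsmasq rule that resolves a domain and all its subdomains to a LAN IP.
split_dns_rule() {
  local domain="$1" lan_ip="$2"
  printf 'address=/%s/%s\n' "$domain" "$lan_ip"
}

# Values taken from the example above
split_dns_rule home.example.com 192.168.1.100
```

After applying the rule, `dig app.home.example.com` from a LAN client should return the server's LAN IP, while external clients still get the public IP from Cloudflare.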
### VPN for Remote Access
| Solution | Type | Best For | Complexity |
|----------|------|----------|-----------|
| Tailscale | Mesh VPN | Easiest setup, multi-device | Very Low |
| WireGuard | Point-to-point | Performance, full control | Medium |
| Headscale | Self-hosted Tailscale | Privacy, no vendor lock | Medium-High |
**Recommendation**: Start with Tailscale (free for 3 users). Move to Headscale when you want full control.
### Firewall Rules (UFW)
```bash
# Default deny incoming
ufw default deny incoming
ufw default allow outgoing
# Allow SSH (change port from 22!)
ufw allow 2222/tcp comment 'SSH'
# Allow HTTP/HTTPS for reverse proxy
ufw allow 80/tcp comment 'HTTP redirect'
ufw allow 443/tcp comment 'HTTPS'
# Allow local network for discovery
ufw allow from 192.168.1.0/24 comment 'LAN'
# Enable
ufw enable
```
---
## Phase 7: Backup Strategy
### 3-2-1 Rule Implementation
```
3 copies: Live data + Local backup + Remote backup
2 media: SSD/HDD (server) + External drive or NAS
1 offsite: Cloud (Backblaze B2, Wasabi) or second location
```
### Backup Script Template
```bash
#!/bin/bash
# /opt/stacks/scripts/backup.sh
set -euo pipefail
BACKUP_DIR="/mnt/backup/docker"
STACKS_DIR="/opt/stacks"
DATE=$(date +%Y-%m-%d_%H%M)
RETENTION_DAYS=30
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"; }
# Make sure the target directories exist before writing to them
mkdir -p "$BACKUP_DIR/volumes" "$BACKUP_DIR/configs"
# Restart the database even if a later step fails
# (with set -e, a failed step would otherwise leave it stopped)
start_db() { (cd "$STACKS_DIR/productivity" && docker compose start db); }
trap start_db EXIT
# 1. Stop services that need consistent backups
log "Stopping database services..."
(cd "$STACKS_DIR/productivity" && docker compose stop db)
# 2. Backup Docker volumes
log "Backing up volumes..."
for vol in $(docker volume ls -q); do
  docker run --rm \
    -v "$vol":/source:ro \
    -v "$BACKUP_DIR/volumes":/backup \
    alpine tar czf "/backup/${vol}_${DATE}.tar.gz" -C /source .
done
# 3. Backup compose files and configs
log "Backing up configs..."
tar czf "$BACKUP_DIR/configs/stacks_${DATE}.tar.gz" \
  --exclude='*.log' \
  --exclude='node_modules' \
  "$STACKS_DIR"
# 4. Restart services
log "Restarting services..."
start_db
trap - EXIT
# 5. Cleanup old backups
log "Cleaning up backups older than ${RETENTION_DAYS} days..."
find "$BACKUP_DIR" -name "*.tar.gz" -mtime +"$RETENTION_DAYS" -delete
# 6. Sync to remote (Backblaze B2 example)
# rclone sync "$BACKUP_DIR" b2:my-backups/docker/ --transfers 4
# 7. Verify
BACKUP_SIZE=$(du -sh "$BACKUP_DIR" | cut -f1)
log "Backup complete. Total size: $BACKUP_SIZE"
# 8. Send notification (optional)
# curl -s "https://ntfy.sh/my-backups" -d "Backup complete: $BACKUP_SIZE"
```
### Backup Schedule
| What | Frequency | Retention | Method |
|------|-----------|-----------|--------|
| Docker volumes | Daily 3 AM | 30 days | Script + cron |
| Compose files + configs | Daily 3 AM | 90 days | Script + cron |
| Database dumps | Every 6 hours | 7 days | pg_dump/mysqldump |
| Full disk image | Monthly | 3 months | Clonezilla/dd |
| Offsite sync | Daily 5 AM | 60 days | rclone to B2/Wasabi |
### Backup Verification (Monthly)
- [ ] Pick a random backup from last week
- [ ] Restore to a test VM/container
- [ ] Verify data integrity (check file counts, DB row counts)
- [ ] Time the restore process (document RTO)
- [ ] Log results in backup-verification.md
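
Before a full restore test, a quick integrity pass catches truncated or corrupt archives. The sketch below assumes the `.tar.gz` archives produced by the backup script; `verify_backup` is a hypothetical helper, and it checks readability only, not that the contents are the data you expect.

```bash
#!/usr/bin/env bash
# Check that a .tar.gz backup is readable and report how many entries it holds.
verify_backup() {
  local archive="$1" count
  # gzip -t validates the compressed stream without extracting anything
  if ! gzip -t "$archive" 2>/dev/null; then
    echo "CORRUPT $archive"
    return 1
  fi
  # tar tzf lists the archive contents; count the entries
  count="$(tar tzf "$archive" | wc -l | tr -d ' ')"
  echo "OK $archive entries=$count"
}

# Usage: pass any archive from the backup directory
# verify_backup /mnt/backup/docker/configs/<archive>.tar.gz
```

Comparing the entry count against the previous run is a cheap way to notice a backup that suddenly shrank.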
---
## Phase 8: Monitoring & Alerting
### Monitoring Stack (Docker Compose)
```yaml
# monitoring/docker-compose.yml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: uptime-kuma
    restart: unless-stopped
    volumes:
      - uptime-data:/app/data
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.uptime.rule=Host(`status.example.com`)"
  prometheus:
    image: prom/prometheus:v2.49.0
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
  grafana:
    image: grafana/grafana:10.3.0
    container_name: grafana
    restart: unless-stopped
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.0
    container_name: cadvisor
    restart: unless-stopped
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
volumes:
  uptime-data:
  prometheus-data:
  grafana-data:
```
### Alert Rules
| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| Disk usage | >80% | >90% | Cleanup or expand |
| RAM usage | >85% | >95% | Identify memory leak, add RAM |
| CPU sustained | >80% 5min | >95% 5min | Check runaway process |
| Container restart | >2/hour | >5/hour | Check logs, fix root cause |
| SSL cert expiry | <14 days | <3 days | Renew cert |
| Backup age | >26 hours | >48 hours | Check backup script/cron |
| Service down | >2 min | >10 min | Investigate, restart |
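
The percentage rows in the table reduce to a two-threshold comparison. Here is a minimal sketch for ad hoc checks, using integer percentages and the table's strict "greater than" semantics; `classify_usage` is a hypothetical helper and is not a replacement for Prometheus alert rules.

```bash
#!/usr/bin/env bash
# Map an integer percentage to ok / warning / critical.
classify_usage() {
  local value="$1" warn="$2" crit="$3"
  if   [ "$value" -gt "$crit" ]; then echo critical
  elif [ "$value" -gt "$warn" ]; then echo warning
  else echo ok
  fi
}

# Root filesystem usage as an integer percentage (POSIX df -P output)
disk_pct="$(df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')"
echo "disk: ${disk_pct}% -> $(classify_usage "$disk_pct" 80 90)"
```

The same function covers the RAM row with `classify_usage "$ram_pct" 85 95`, given a RAM percentage from whatever source you already trust.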
### Notification Channels
| Channel | Service | Best For |
|---------|---------|----------|
| Push notification | ntfy.sh (self-hosted) | Mobile alerts |
| Chat | Discord/Slack webhook | Team alerts |
| Email | Uptime Kuma built-in | Formal notifications |
| Dashboard | Grafana + Uptime Kuma | Visual monitoring |
---
## Phase 9: Security Hardening
### Server Hardening Checklist
```bash
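# 1. SSH hardening: set these directives in /etc/ssh/sshd_config
#    (they are config lines, not shell commands), validate with `sshd -t`,
#    then restart the SSH service
#    Port 2222                    # Change default port
#    PermitRootLogin no           # No root SSH
#    PasswordAuthentication no    # Key-only
#    MaxAuthTries 3
#    AllowUsers yourusername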
# 2. Install fail2ban
apt install fail2ban -y
systemctl enable fail2ban
# 3. Automatic security updates
apt install unattended-upgrades -y
dpkg-reconfigure -plow unattended-upgrades
# 4. Disable unused services
systemctl list-unit-files --state=enabled
# Disable anything you don't need
```
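The SSH directives above can be checked against a config file with a few lines of shell. This is a sketch under stated assumptions: it reads a single file and ignores any `Include` drop-in directories, so `sshd -T` (run as root) remains the authoritative view of the effective settings. `check_sshd_config` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Report whether selected sshd_config directives have the hardened values.
check_sshd_config() {
  local file="$1" rc=0 key want got
  while read -r key want; do
    # sshd uses the first occurrence of a keyword; keywords are case-insensitive
    got="$(awk -v k="$key" 'tolower($1) == tolower(k) { print $2; exit }' "$file")"
    if [ "$got" = "$want" ]; then
      echo "PASS $key $want"
    else
      echo "FAIL $key want=$want got=${got:-unset}"
      rc=1
    fi
  done <<'EOF'
PermitRootLogin no
PasswordAuthentication no
MaxAuthTries 3
EOF
  return "$rc"
}

# Usage:
# check_sshd_config /etc/ssh/sshd_config
```

A non-zero exit status makes the function easy to wire into a weekly maintenance script.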
### Authentication Architecture
```
Internet → Traefik → Authelia/Authentik → Service
                              ↓
                     Check: authenticated?
                     Yes → Forward to service
                     No  → Redirect to login page + 2FA
```
**Authelia** (lightweight, YAML config) — good for smaller setups
**Authentik** (full IdP, web UI) — good for many users/services, SAML/OIDC
### Security Scoring (0-100)
| Dimension | Weight | Score Guide |
|-----------|--------|-------------|
| SSH hardened (keys, non-root, non-22) | 15 | 0=default, 15=fully hardened |
| Firewall active (deny-by-default) | 15 | 0=none, 15=UFW/iptables configured |
| Reverse proxy (no direct port exposure) | 15 | 0=ports exposed, 15=all behind proxy |
| SSL/TLS on all services | 10 | 0=HTTP, 10=HTTPS everywhere |
| Auth on all public services | 15 | 0=open, 15=SSO/2FA on everything |
| Container security (non-root, limits) | 10 | 0=default, 10=hardened |
| Auto-updates enabled | 10 | 0=manual, 10=automated |
| Secrets management (.env, not hardcoded) | 10 | 0=in compose, 10=.env + restricted perms |
**Score**: 0-40 = Vulnerable, 41-70 = Acceptable, 71-90 = Good, 91-100 = Hardened
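
Since the eight weights add up to 100, the total is a plain sum of the per-dimension scores. A minimal sketch of the scoring and band lookup, with `security_band` as a hypothetical helper that takes the scores in table order:

```bash
#!/usr/bin/env bash
# Sum the eight dimension scores and map the total to a band.
security_band() {
  local total=0 s
  for s in "$@"; do total=$((total + s)); done
  if   [ "$total" -le 40 ]; then echo "$total Vulnerable"
  elif [ "$total" -le 70 ]; then echo "$total Acceptable"
  elif [ "$total" -le 90 ]; then echo "$total Good"
  else echo "$total Hardened"
  fi
}

# Order: SSH, firewall, proxy, TLS, auth, containers, auto-updates, secrets
security_band 15 15 10 10 5 5 10 5
```

The example scores add up to 75, which falls in the Good band.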
---
## Phase 10: Maintenance & Updates
### Update Strategy
**Option A: Manual (Recommended for critical services)**
```bash
#!/bin/bash
# Update script: /opt/stacks/scripts/update-all.sh
set -euo pipefail
STACKS_DIR="/opt/stacks"
LOG="/var/log/docker-updates.log"
for stack in "$STACKS_DIR"/*/; do
  if [ -f "$stack/docker-compose.yml" ]; then
    echo "[$(date)] Updating $(basename "$stack")..." | tee -a "$LOG"
    cd "$stack"
    docker compose pull 2>&1 | tee -a "$LOG"
    docker compose up -d 2>&1 | tee -a "$LOG"
  fi
done
docker image prune -f | tee -a "$LOG"
echo "[$(date)] Update complete" | tee -a "$LOG"
```
**Option B: Watchtower (Automated — use with caution)**
```yaml
services:
  watchtower:
    image: containrrr/watchtower:1.7.1
    container_name: watchtower
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - WATCHTOWER_SCHEDULE=0 0 4 * * MON # Monday 4 AM
      - WATCHTOWER_CLEANUP=true
      - WATCHTOWER_NOTIFICATIONS=shoutrrr
      - WATCHTOWER_NOTIFICATION_URL=discord://webhook
      - WATCHTOWER_LABEL_ENABLE=true # Only update labeled containers
# Add label to containers: com.centurylinklabs.watchtower.enable=true
```
### Weekly Maintenance Checklist
- [ ] Check Uptime Kuma for any downtime events
- [ ] Review disk usage (`df -h`)
- [ ] Check container health (`docker ps --filter health=unhealthy`)
- [ ] Review fail2ban bans (`fail2ban-client status`)
- [ ] Check backup logs (last successful backup)
- [ ] Review Docker logs for errors (`docker logs --since 7d <container>`)
- [ ] Prune unused resources (`docker system prune -f`)
### Monthly Maintenance
- [ ] Update all container images (read changelogs first!)
- [ ] Update host OS (`apt update && apt upgrade`)
- [ ] Test a backup restore
- [ ] Review and rotate secrets/passwords
- [ ] Check SSL certificate expiry dates
- [ ] Review Grafana dashboards for trends
- [ ] Clean up unused Docker networks/volumes
---
## Phase 11: Advanced Patterns
### Multi-Node Architecture
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Node 1    │    │   Node 2    │    │   Node 3    │
│ (Proxy/DNS) │────│ (Services)  │────│    (NAS)    │
│   Traefik   │    │    Apps     │    │   TrueNAS   │
│   Pi-hole   │    │  Databases  │    │   NFS/SMB   │
│  Authelia   │    │    Media    │    │   Backup    │
└─────────────┘    └─────────────┘    └─────────────┘
       ↑                  ↑                  ↑
       └────────── Tailscale Mesh ───────────┘
```
### Docker Compose Includes (Compose v2.20+)
```yaml
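# Shared files: each included file is a complete Compose file whose
# services, networks, and volumes are merged into this project
include:
  - path: ../common/traefik-labels.yml
  - path: ../common/logging.yml
services:
  app:
    # can depend on services defined in the included files;
    # to reuse per-service settings such as labels or logging, use `extends`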
```
### GitOps for Homelab
```
homelab-configs/              # Git repo
├── .github/
│   └── workflows/
│       └── deploy.yml        # CI: lint + push to server
├── stacks/
│   ├── traefik/
│   ├── monitoring/
│   └── media/
├── scripts/
└── README.md
```
**Workflow**: Edit compose locally → commit → push → CI deploys to server
**Tools**: Flux/ArgoCD (overkill), or simple `git pull && docker compose up -d` via webhook
### Hardware Redundancy
| Component | Solution | Cost |
|-----------|----------|------|
| Power | UPS (APC Back-UPS 600VA+) | $60-150 |
| Storage | RAID1/ZFS mirror (not RAID0!) | 2x disk cost |
| Network | Dual NIC, managed switch | $30-100 |
| Server | Second node (cold spare or active) | $100-400 |
**Rule**: RAID is NOT backup. It protects against disk failure only, not ransomware/deletion/corruption.
---
## Phase 12: Troubleshooting
### Common Issues Decision Tree
```
Service not accessible?
├── Can you ping the server? → No → Network/firewall issue
├── Is the container running? (`docker ps`) → No → Check logs: `docker logs <name>`
├── Is the port exposed? (`docker port <name>`) → No → Check compose ports/networks
├── Is Traefik routing? (Check Traefik dashboard) → No → Check labels, network
├── Is DNS resolving? (`dig app.example.com`) → No → Check DNS provider
└── SSL error? → Check acme.json permissions (chmod 600), cert resolver logs
```
### Docker Debug Commands
```bash
# Container not starting
docker logs <name> --tail 50
docker inspect <name> | jq '.[0].State'
# Network issues
docker network ls
docker network inspect <network>
docker exec <name> ping other-container
# Resource issues
docker stats # Live resource usage
docker system df # Disk usage
docker volume ls -f dangling=true # Orphaned volumes
# Nuclear options (use carefully)
docker compose down && docker compose up -d # Full restart
docker system prune -af --volumes # Deletes all unused images, networks, build cache, and unused volumes (their data is gone)
```
### Performance Optimization
| Symptom | Likely Cause | Fix |
|---------|-------------|-----|
| Slow file access | HDD for database | Move DB to SSD |
| High CPU at idle | Monitoring scrapes too frequently | Increase scrape intervals |
| OOM kills | No memory limits | Set `deploy.resources.limits.memory` |
| Slow Nextcloud | Missing Redis cache | Add Redis container |
| Jellyfin buffering | No hardware transcoding | Enable GPU passthrough |
| Slow Docker builds | No layer caching | Use multi-stage + .dockerignore |
---
## Service Configuration Quick Reference
### Vaultwarden (Password Manager)
```yaml
services:
  vaultwarden:
    image: vaultwarden/server:1.30.5
    container_name: vaultwarden
    restart: unless-stopped
    volumes:
      - vaultwarden-data:/data
    environment:
      - SIGNUPS_ALLOWED=false # Disable after creating your account
      - WEBSOCKET_ENABLED=true
      - ADMIN_TOKEN=${ADMIN_TOKEN} # Generate: openssl rand -base64 48
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.vault.rule=Host(`vault.example.com`)"
```
### Immich (Photo Backup)
```yaml
# Use their official docker-compose.yml from:
# https://github.com/immich-app/immich/releases/latest/download/docker-compose.yml
# Key settings:
# - Set UPLOAD_LOCATION to a large storage mount
# - Enable hardware transcoding if GPU available
# - Set IMMICH_MACHINE_LEARNING_URL for face detection
```
### Paperless-ngx (Document Management)
```yaml
services:
  paperless:
    image: ghcr.io/paperless-ngx/paperless-ngx:2.4
    container_name: paperless
    restart: unless-stopped
    volumes:
      - paperless-data:/usr/src/paperless/data
      - paperless-media:/usr/src/paperless/media
      - ./consume:/usr/src/paperless/consume # Drop PDFs here
      - ./export:/usr/src/paperless/export
    environment:
      - PAPERLESS_OCR_LANGUAGE=eng
      - PAPERLESS_TIME_ZONE=Europe/London
      - PAPERLESS_ADMIN_USER=${ADMIN_USER}
      - PAPERLESS_ADMIN_PASSWORD=${ADMIN_PASS}
```
---
## Homelab Quality Rubric (0-100)
| Dimension | Weight | 0 (Poor) | 50 (Decent) | 100 (Excellent) |
|-----------|--------|----------|-------------|-----------------|
| Security | 20% | Default passwords, open ports | Firewall + SSL | Hardened SSH, SSO/2FA, no-new-privileges |
| Backups | 20% | None | Local only, untested | 3-2-1, automated, verified monthly |
| Monitoring | 15% | None | Uptime Kuma only | Full stack: metrics + logs + alerts |
| Documentation | 10% | Nothing written | README per stack | GitOps, full runbook, diagrams |
| Updates | 10% | Never updated | Manual quarterly | Scheduled weekly, changelogs reviewed |
| Reliability | 10% | Frequent crashes | Mostly stable | UPS, auto-restart, health checks |
| Performance | 10% | Slow, OOM kills | Adequate | Resource limits, SSD, HW transcoding |
| Scalability | 5% | Single machine, no plan | Compose organized | Multi-node ready, IaC |
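
The rubric is a weighted average: each dimension is scored 0-100 and multiplied by its weight. A small sketch of that calculation follows, with `rubric_score` as a hypothetical helper; the weights are copied from the table and the scores are passed in table order.

```bash
#!/usr/bin/env bash
# Weighted homelab score from eight 0-100 dimension scores.
rubric_score() {
  printf '%s\n' "$@" | awk '
    BEGIN { split("20 20 15 10 10 10 10 5", w, " ") }  # weights in percent
    { total += $1 * w[NR] / 100 }
    END { printf "%.1f\n", total }'
}

# Order: security, backups, monitoring, documentation,
#        updates, reliability, performance, scalability
rubric_score 50 100 50 50 100 50 50 0
```

The example works out to 62.5: strong backups and updates lift the score, while the other dimensions sit at the "Decent" level and scalability contributes nothing.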
---
## 10 Self-Hosting Mistakes
| # | Mistake | Fix |
|---|---------|-----|
| 1 | Using `:latest` tag | Pin versions: `image:1.2.3` |
| 2 | No backups | 3-2-1 backup rule, test restores |
| 3 | Exposing ports directly | Everything behind reverse proxy |
| 4 | Default passwords | Change immediately, use password manager |
| 5 | No monitoring | Uptime Kuma minimum, Grafana for depth |
| 6 | RAID = backup mentality | RAID protects disks, not data |
| 7 | Over-engineering day 1 | Start small, add complexity as needed |
| 8 | No documentation | Document every service, every port, every cron |
| 9 | Ignoring updates | Security patches matter, schedule updates |
| 10 | Running as root | Non-root containers, restricted SSH |
---
## Natural Language Commands
| Say | Agent Does |
|-----|-----------|
| "Set up a new service" | Guide through compose file creation with security best practices |
| "Audit my homelab security" | Run through security scoring checklist |
| "Plan my backup strategy" | Design 3-2-1 backup plan for your setup |
| "What should I self-host?" | Assess needs and recommend services by tier |
| "My container keeps crashing" | Walk through troubleshooting decision tree |
| "Help me set up Traefik" | Generate production Traefik config with SSL |
| "Compare NAS options" | Compare TrueNAS vs Unraid vs DIY for your needs |
| "Optimize my Docker setup" | Review compose files for security and performance |
| "Set up monitoring" | Deploy Uptime Kuma + Prometheus + Grafana stack |
| "Plan a hardware upgrade" | Assess current usage, recommend hardware by budget |
| "Migrate from cloud to self-hosted" | Plan migration with data export and service mapping |
| "Set up remote access" | Compare and deploy VPN/Tailscale for secure remote access |