multiAI Summary Pending

afrexai-self-hosting-mastery

Complete self-hosting and homelab operating system. Deploy, secure, monitor, and maintain self-hosted services with production-grade reliability. Use when setting up home servers, Docker infrastructure, reverse proxies, backups, monitoring, or evaluating self-hosted alternatives to SaaS.

3,556 stars

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/afrexai-self-hosting-mastery/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/1kalin/afrexai-self-hosting-mastery/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/afrexai-self-hosting-mastery/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How afrexai-self-hosting-mastery Compares

Feature / Agentafrexai-self-hosting-masteryStandard Approach
Platform SupportmultiLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Complete self-hosting and homelab operating system. Deploy, secure, monitor, and maintain self-hosted services with production-grade reliability. Use when setting up home servers, Docker infrastructure, reverse proxies, backups, monitoring, or evaluating self-hosted alternatives to SaaS.

Which AI agents support this skill?

This skill is compatible with multi.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Self-Hosting Mastery

Complete system for building and operating reliable self-hosted infrastructure — from first server to multi-node homelab.

## Phase 1: Infrastructure Assessment

### Server Profile YAML

```yaml
server_profile:
  name: ""
  hardware:
    cpu: ""              # e.g., "Intel i5-12400" or "Raspberry Pi 5"
    ram_gb: 0
    storage:
      - device: ""       # e.g., "/dev/sda"
        type: ""         # ssd | hdd | nvme
        size_gb: 0
        role: ""         # boot | data | backup
    network: ""          # 1gbe | 2.5gbe | 10gbe
  os: ""                 # debian | ubuntu | proxmox | unraid | truenas
  location: ""           # home | closet | rack | colo | vps
  power:
    ups: false
    wattage_idle: 0
    wattage_load: 0
    monthly_cost_estimate: ""  # electricity
  network:
    public_ip: ""        # static | dynamic | cgnat
    domain: ""
    dns_provider: ""     # cloudflare | duckdns | custom
    isp_ports_open: true # some ISPs block 80/443
  goals:
    - ""                 # media server, smart home, dev environment, etc.
  budget_monthly: ""     # electricity + domain + any VPS
```

### Hardware Decision Matrix

| Budget | RAM | Storage | Good For | Example Hardware |
|--------|-----|---------|----------|-----------------|
| $0 | 4-8GB | 64GB+ | Pi-hole, AdGuard, small tools | Raspberry Pi 4/5 |
| $50-150 | 8-16GB | 256GB+ | Docker host, 5-10 services | Used SFF PC (Dell Optiplex, Lenovo Tiny) |
| $150-400 | 16-32GB | 1TB+ | NAS + services, media server | Mini PC (Intel NUC, Beelink) |
| $400-800 | 32-64GB | 4TB+ | Full homelab, VMs + containers | Used enterprise (Dell R720, HP DL380) |
| $800+ | 64GB+ | 10TB+ | Multi-node, Proxmox cluster | Multiple nodes, dedicated NAS |

### Self-Host vs SaaS Decision

Ask before self-hosting anything:
1. **Data sensitivity** — Does keeping data local matter? (passwords, health, finance = yes)
2. **Reliability need** — Can you tolerate occasional downtime? (email = risky, media = fine)
3. **Maintenance budget** — Do you have 2-4 hours/month for updates?
4. **Skill level** — Can you debug Docker/networking issues?
5. **Cost comparison** — Is the SaaS < $10/mo? Often not worth self-hosting for trivial savings.

**Always self-host**: Password manager, DNS/ad-blocking, VPN, bookmarks, notes
**Usually self-host**: Media server, file sync, photo backup, monitoring, git
**Think twice**: Email (deliverability hell), calendar (sync complexity), chat (uptime expectations)
**Rarely worth it**: Search engine (resource hungry), social media (no network effect)

---

## Phase 2: OS & Virtualization

### OS Selection Guide

| OS | Best For | Learning Curve | Notes |
|----|----------|---------------|-------|
| Debian 12 | Docker-only host | Low | Stable, minimal, just works |
| Ubuntu Server 24.04 | Beginners, wide docs | Low | More packages, snap controversy |
| Proxmox VE | VMs + containers | Medium | Free, enterprise features, ZFS |
| Unraid | NAS + Docker + VMs | Medium | $59-129, great UI, parity array |
| TrueNAS Scale | ZFS NAS + Docker | Medium | Free, ZFS-first, apps improving |
| NixOS | Reproducible configs | High | Declarative, steep learning curve |

### Proxmox Quick Setup

```bash
# Post-install essentials
# 1. Remove enterprise repo (if no subscription)
sed -i 's/^deb/#deb/' /etc/apt/sources.list.d/pve-enterprise.list
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" > /etc/apt/sources.list.d/pve-no-subscription.list
apt update && apt upgrade -y

# 2. Create a Docker LXC (lightweight container)
# Download template: Datacenter → Storage → CT Templates → Download → debian-12
# Create CT: 2 cores, 2GB RAM, 32GB disk, bridge vmbr0
# Inside CT: install Docker
apt install -y curl
curl -fsSL https://get.docker.com | sh

# 3. Enable IOMMU for GPU passthrough (if needed)
# Edit /etc/default/grub: GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
# update-grub && reboot
```

### VM vs LXC vs Docker Decision

| Factor | VM | LXC | Docker |
|--------|----|-----|--------|
| Isolation | Full (own kernel) | Partial (shared kernel) | Process-level |
| Overhead | High (1-2GB base) | Low (50-200MB) | Minimal |
| Use when | Different OS, GPU passthrough, untrusted workloads | Dedicated service host, ZFS datasets | Most services |
| Avoid when | RAM-constrained | Need Windows, custom kernel | Stateful databases (use LXC/VM) |

**Rule**: Docker for 90% of services. LXC for Docker hosts or isolated environments. VM for Windows, different kernel needs, or GPU passthrough.

---

## Phase 3: Docker Infrastructure

### Docker Compose Project Structure

```
/opt/stacks/           # or ~/docker/
├── traefik/
│   ├── docker-compose.yml
│   ├── .env
│   ├── config/
│   │   └── traefik.yml
│   └── data/
│       ├── acme.json          # chmod 600
│       └── dynamic/
├── monitoring/
│   ├── docker-compose.yml
│   ├── .env
│   └── config/
├── media/
│   ├── docker-compose.yml
│   ├── .env
│   └── config/
├── productivity/
│   ├── docker-compose.yml
│   ├── .env
│   └── config/
└── scripts/
    ├── backup.sh
    ├── update-all.sh
    └── health-check.sh
```

### Docker Compose Best Practices

```yaml
# Template: production-grade service
services:
  app:
    image: vendor/app:1.2.3           # ALWAYS pin version
    container_name: app               # Explicit name
    restart: unless-stopped           # Auto-restart
    networks:
      - proxy                         # Traefik network
      - internal                      # Backend network
    volumes:
      - ./config:/config              # Bind mount for config
      - app-data:/data                # Named volume for data
    environment:
      - TZ=Europe/London              # Always set timezone
      - PUID=1000                     # Match host user
      - PGID=1000
    env_file:
      - .env                          # Secrets in .env (gitignored)
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.app.rule=Host(`app.example.com`)"
      - "traefik.http.routers.app.tls.certresolver=letsencrypt"
      - "traefik.http.services.app.loadbalancer.server.port=8080"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    deploy:
      resources:
        limits:
          memory: 512M               # Prevent OOM cascades
    security_opt:
      - no-new-privileges:true        # Security hardening
    read_only: true                   # Where possible
    tmpfs:
      - /tmp

volumes:
  app-data:

networks:
  proxy:
    external: true
  internal:
```

### Docker Security Checklist

- [ ] Pin all image versions (never `:latest` in production)
- [ ] Set `restart: unless-stopped` on all services
- [ ] Use `.env` files for secrets (never hardcode in compose)
- [ ] Set memory limits on all containers
- [ ] Use `security_opt: no-new-privileges:true`
- [ ] Use `read_only: true` where possible + tmpfs for /tmp
- [ ] Create separate Docker networks per stack
- [ ] Never expose database ports to 0.0.0.0
- [ ] Run containers as non-root (PUID/PGID or `user:`)
- [ ] Enable Docker content trust: `export DOCKER_CONTENT_TRUST=1`
- [ ] Prune unused images/volumes monthly: `docker system prune -af`
- [ ] Use named volumes (not anonymous) for all persistent data
- [ ] Set `TZ` environment variable on every container

---

## Phase 4: Reverse Proxy & SSL

### Reverse Proxy Selection

| Proxy | Best For | SSL | Config Style | Learning Curve |
|-------|----------|-----|-------------|---------------|
| Traefik | Docker-native, auto-discovery | Auto (ACME) | Labels + YAML | Medium |
| Caddy | Simplicity, auto-SSL | Auto (built-in) | Caddyfile | Low |
| Nginx Proxy Manager | GUI preference | Auto (UI) | Web UI | Very Low |
| Nginx (manual) | Maximum control | Manual/certbot | Config files | High |

**Recommendation**: Traefik for Docker power users. Caddy for simplicity. NPM for beginners.

### Traefik Production Config

```yaml
# traefik/config/traefik.yml
api:
  dashboard: true
  insecure: false

entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: websecure
          scheme: https
  websecure:
    address: ":443"
    http:
      tls:
        certResolver: letsencrypt

certificatesResolvers:
  letsencrypt:
    acme:
      email: you@example.com
      storage: /data/acme.json
      # Use DNS challenge if ISP blocks port 80
      # dnsChallenge:
      #   provider: cloudflare
      httpChallenge:
        entryPoint: web

providers:
  docker:
    exposedByDefault: false    # Explicit opt-in per service
    network: proxy
  file:
    directory: /data/dynamic
    watch: true

log:
  level: WARN

accessLog:
  filePath: /data/access.log
  bufferingSize: 100
```

### Cloudflare Tunnel (Zero Port Forwarding)

For CGNAT or ISPs blocking ports — expose services without opening firewall:

```yaml
# cloudflared/docker-compose.yml
services:
  cloudflared:
    image: cloudflare/cloudflared:2024.1.0
    container_name: cloudflared
    restart: unless-stopped
    command: tunnel run
    environment:
      - TUNNEL_TOKEN=${CF_TUNNEL_TOKEN}
    networks:
      - proxy
```

**When to use Cloudflare Tunnel vs port forwarding**:
- CGNAT (no public IP) → Tunnel (only option)
- ISP blocks 80/443 → Tunnel or DNS challenge + non-standard ports
- Security-first → Tunnel (no open ports)
- Performance-first → Direct (lower latency)
- LAN-only access → Neither (use Tailscale/WireGuard)

---

## Phase 5: Essential Services Stack

### Tier 1 — Deploy First (Foundation)

| Service | Purpose | Image | RAM | Notes |
|---------|---------|-------|-----|-------|
| Traefik/Caddy | Reverse proxy + SSL | traefik:v3.0 | 64MB | Gateway to everything |
| Pi-hole/AdGuard | DNS + ad blocking | pihole/pihole | 128MB | Network-wide ad blocking |
| Authelia/Authentik | SSO + 2FA | authelia/authelia | 128MB | Protect services without built-in auth |
| Uptime Kuma | Monitoring | louislam/uptime-kuma | 128MB | Know when things break |
| Watchtower | Auto-updates | containrrr/watchtower | 32MB | Optional — some prefer manual |

### Tier 2 — Core Services

| Service | Purpose | Alt | RAM |
|---------|---------|-----|-----|
| Vaultwarden | Password manager | Bitwarden | 64MB |
| Nextcloud | File sync + office | Seafile (lighter) | 512MB |
| Immich | Photo backup | PhotoPrism | 1-4GB |
| Jellyfin | Media server | Plex (less free) | 512MB-2GB |
| Paperless-ngx | Document management | - | 256MB |
| Home Assistant | Smart home | - | 512MB |

### Tier 3 — Power User

| Service | Purpose | RAM |
|---------|---------|-----|
| Gitea/Forgejo | Git hosting | 256MB |
| n8n | Workflow automation | 256MB |
| Grafana + Prometheus | Metrics & dashboards | 512MB |
| Tandoor | Recipe management | 256MB |
| Mealie | Meal planning | 128MB |
| Linkwarden/Hoarder | Bookmark manager | 256MB |
| Stirling PDF | PDF tools | 512MB |
| IT-Tools | Developer utilities | 64MB |

### RAM Planning

```
Total RAM needed ≈ OS base (1-2GB) + sum of service RAM + 20% headroom
Example 16GB server:
  OS + Docker:     2 GB
  Traefik:         0.1 GB
  Pi-hole:         0.1 GB
  Authelia:        0.1 GB
  Uptime Kuma:     0.1 GB
  Vaultwarden:     0.1 GB
  Nextcloud:       0.5 GB
  Immich:          2.0 GB
  Jellyfin:        1.0 GB
  Paperless:       0.3 GB
  Home Assistant:  0.5 GB
  ──────────────────────
  Total:           6.8 GB → 8.2 GB with headroom
  Available:       ~7.8 GB free for more services
```

---

## Phase 6: Networking & DNS

### DNS Architecture

```
Internet → Cloudflare DNS → Your Public IP → Router → Server
                                                        ↓
                                             Reverse Proxy (Traefik)
                                                        ↓
                                     ┌──────────────────┼──────────────────┐
                                     ↓                  ↓                  ↓
                                app.domain.com   files.domain.com   media.domain.com
```

### Split DNS (Access Services Locally Without Hairpin NAT)

```
# Pi-hole/AdGuard: Local DNS rewrites
# Point *.home.example.com → 192.168.1.100 (server LAN IP)
# External: Cloudflare points to public IP
# Result: LAN traffic stays local, external goes through internet
```

### VPN for Remote Access

| Solution | Type | Best For | Complexity |
|----------|------|----------|-----------|
| Tailscale | Mesh VPN | Easiest setup, multi-device | Very Low |
| WireGuard | Point-to-point | Performance, full control | Medium |
| Headscale | Self-hosted Tailscale | Privacy, no vendor lock | Medium-High |

**Recommendation**: Start with Tailscale (free for 3 users). Move to Headscale when you want full control.

### Firewall Rules (UFW)

```bash
# Default deny incoming
ufw default deny incoming
ufw default allow outgoing

# Allow SSH (change port from 22!)
ufw allow 2222/tcp comment 'SSH'

# Allow HTTP/HTTPS for reverse proxy
ufw allow 80/tcp comment 'HTTP redirect'
ufw allow 443/tcp comment 'HTTPS'

# Allow local network for discovery
ufw allow from 192.168.1.0/24 comment 'LAN'

# Enable
ufw enable
```

---

## Phase 7: Backup Strategy

### 3-2-1 Rule Implementation

```
3 copies:  Live data + Local backup + Remote backup
2 media:   SSD/HDD (server) + External drive or NAS
1 offsite: Cloud (Backblaze B2, Wasabi) or second location
```

### Backup Script Template

```bash
#!/bin/bash
# /opt/stacks/scripts/backup.sh
set -euo pipefail

BACKUP_DIR="/mnt/backup/docker"
STACKS_DIR="/opt/stacks"
DATE=$(date +%Y-%m-%d_%H%M)
RETENTION_DAYS=30

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"; }

# 1. Stop services that need consistent backups
log "Stopping database services..."
cd "$STACKS_DIR/productivity" && docker compose stop db

# 2. Backup Docker volumes
log "Backing up volumes..."
for vol in $(docker volume ls -q); do
    docker run --rm \
        -v "$vol":/source:ro \
        -v "$BACKUP_DIR/volumes":/backup \
        alpine tar czf "/backup/${vol}_${DATE}.tar.gz" -C /source .
done

# 3. Backup compose files and configs
log "Backing up configs..."
tar czf "$BACKUP_DIR/configs/stacks_${DATE}.tar.gz" \
    --exclude='*.log' \
    --exclude='node_modules' \
    "$STACKS_DIR"

# 4. Restart services
log "Restarting services..."
cd "$STACKS_DIR/productivity" && docker compose start db

# 5. Cleanup old backups
log "Cleaning up backups older than ${RETENTION_DAYS} days..."
find "$BACKUP_DIR" -name "*.tar.gz" -mtime +$RETENTION_DAYS -delete

# 6. Sync to remote (Backblaze B2 example)
# rclone sync "$BACKUP_DIR" b2:my-backups/docker/ --transfers 4

# 7. Verify
BACKUP_SIZE=$(du -sh "$BACKUP_DIR" | cut -f1)
log "Backup complete. Total size: $BACKUP_SIZE"

# 8. Send notification (optional)
# curl -s "https://ntfy.sh/my-backups" -d "Backup complete: $BACKUP_SIZE"
```

### Backup Schedule

| What | Frequency | Retention | Method |
|------|-----------|-----------|--------|
| Docker volumes | Daily 3 AM | 30 days | Script + cron |
| Compose files + configs | Daily 3 AM | 90 days | Script + cron |
| Database dumps | Every 6 hours | 7 days | pg_dump/mysqldump |
| Full disk image | Monthly | 3 months | Clonezilla/dd |
| Offsite sync | Daily 5 AM | 60 days | rclone to B2/Wasabi |

### Backup Verification (Monthly)

- [ ] Pick a random backup from last week
- [ ] Restore to a test VM/container
- [ ] Verify data integrity (check file counts, DB row counts)
- [ ] Time the restore process (document RTO)
- [ ] Log results in backup-verification.md

---

## Phase 8: Monitoring & Alerting

### Monitoring Stack (Docker Compose)

```yaml
# monitoring/docker-compose.yml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: uptime-kuma
    restart: unless-stopped
    volumes:
      - uptime-data:/app/data
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.uptime.rule=Host(`status.example.com`)"

  prometheus:
    image: prom/prometheus:v2.49.0
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:10.3.0
    container_name: grafana
    restart: unless-stopped
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.0
    container_name: cadvisor
    restart: unless-stopped
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

volumes:
  uptime-data:
  prometheus-data:
  grafana-data:
```

### Alert Rules

| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| Disk usage | >80% | >90% | Cleanup or expand |
| RAM usage | >85% | >95% | Identify memory leak, add RAM |
| CPU sustained | >80% 5min | >95% 5min | Check runaway process |
| Container restart | >2/hour | >5/hour | Check logs, fix root cause |
| SSL cert expiry | <14 days | <3 days | Renew cert |
| Backup age | >26 hours | >48 hours | Check backup script/cron |
| Service down | >2 min | >10 min | Investigate, restart |

### Notification Channels

| Channel | Service | Best For |
|---------|---------|----------|
| Push notification | ntfy.sh (self-hosted) | Mobile alerts |
| Chat | Discord/Slack webhook | Team alerts |
| Email | Uptime Kuma built-in | Formal notifications |
| Dashboard | Grafana + Uptime Kuma | Visual monitoring |

---

## Phase 9: Security Hardening

### Server Hardening Checklist

```bash
# 1. SSH hardening
# /etc/ssh/sshd_config
Port 2222                          # Change default port
PermitRootLogin no                 # No root SSH
PasswordAuthentication no          # Key-only
MaxAuthTries 3
AllowUsers yourusername

# 2. Install fail2ban
apt install fail2ban -y
systemctl enable fail2ban

# 3. Automatic security updates
apt install unattended-upgrades -y
dpkg-reconfigure -plow unattended-upgrades

# 4. Disable unused services
systemctl list-unit-files --state=enabled
# Disable anything you don't need
```

### Authentication Architecture

```
Internet → Traefik → Authelia/Authentik → Service
                         ↓
                    Check: authenticated?
                    Yes → Forward to service
                    No → Redirect to login page + 2FA
```

**Authelia** (lightweight, YAML config) — good for smaller setups
**Authentik** (full IdP, web UI) — good for many users/services, SAML/OIDC

### Security Scoring (0-100)

| Dimension | Weight | Score Guide |
|-----------|--------|-------------|
| SSH hardened (keys, non-root, non-22) | 15 | 0=default, 15=fully hardened |
| Firewall active (deny-by-default) | 15 | 0=none, 15=UFW/iptables configured |
| Reverse proxy (no direct port exposure) | 15 | 0=ports exposed, 15=all behind proxy |
| SSL/TLS on all services | 10 | 0=HTTP, 10=HTTPS everywhere |
| Auth on all public services | 15 | 0=open, 15=SSO/2FA on everything |
| Container security (non-root, limits) | 10 | 0=default, 10=hardened |
| Auto-updates enabled | 10 | 0=manual, 10=automated |
| Secrets management (.env, not hardcoded) | 10 | 0=in compose, 10=.env + restricted perms |

**Score**: 0-40 = Vulnerable, 41-70 = Acceptable, 71-90 = Good, 91-100 = Hardened

---

## Phase 10: Maintenance & Updates

### Update Strategy

**Option A: Manual (Recommended for critical services)**
```bash
# Update script: /opt/stacks/scripts/update-all.sh
#!/bin/bash
set -euo pipefail

STACKS_DIR="/opt/stacks"
LOG="/var/log/docker-updates.log"

for stack in "$STACKS_DIR"/*/; do
    if [ -f "$stack/docker-compose.yml" ]; then
        echo "[$(date)] Updating $(basename $stack)..." | tee -a "$LOG"
        cd "$stack"
        docker compose pull 2>&1 | tee -a "$LOG"
        docker compose up -d 2>&1 | tee -a "$LOG"
    fi
done

docker image prune -f | tee -a "$LOG"
echo "[$(date)] Update complete" | tee -a "$LOG"
```

**Option B: Watchtower (Automated — use with caution)**
```yaml
services:
  watchtower:
    image: containrrr/watchtower:1.7.1
    container_name: watchtower
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - WATCHTOWER_SCHEDULE=0 0 4 * * MON  # Monday 4 AM
      - WATCHTOWER_CLEANUP=true
      - WATCHTOWER_NOTIFICATIONS=shoutrrr
      - WATCHTOWER_NOTIFICATION_URL=discord://webhook
      - WATCHTOWER_LABEL_ENABLE=true    # Only update labeled containers
    # Add label to containers: com.centurylinklabs.watchtower.enable=true
```

### Weekly Maintenance Checklist

- [ ] Check Uptime Kuma for any downtime events
- [ ] Review disk usage (`df -h`)
- [ ] Check container health (`docker ps --filter health=unhealthy`)
- [ ] Review fail2ban bans (`fail2ban-client status`)
- [ ] Check backup logs (last successful backup)
- [ ] Review Docker logs for errors (`docker logs --since 7d <container>`)
- [ ] Prune unused resources (`docker system prune -f`)

### Monthly Maintenance

- [ ] Update all container images (read changelogs first!)
- [ ] Update host OS (`apt update && apt upgrade`)
- [ ] Test a backup restore
- [ ] Review and rotate secrets/passwords
- [ ] Check SSL certificate expiry dates
- [ ] Review Grafana dashboards for trends
- [ ] Clean up unused Docker networks/volumes

---

## Phase 11: Advanced Patterns

### Multi-Node Architecture

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Node 1    │     │   Node 2    │     │   Node 3    │
│ (Proxy/DNS) │────│ (Services)  │────│   (NAS)     │
│ Traefik     │     │ Apps        │     │ TrueNAS     │
│ Pi-hole     │     │ Databases   │     │ NFS/SMB     │
│ Authelia    │     │ Media       │     │ Backup      │
└─────────────┘     └─────────────┘     └─────────────┘
       ↑                   ↑                   ↑
       └───────── Tailscale Mesh ──────────────┘
```

### Docker Compose Includes (Compose v2.20+)

```yaml
# Shared fragments
include:
  - path: ../common/traefik-labels.yml
  - path: ../common/logging.yml

services:
  app:
    # inherits common configs
```

### GitOps for Homelab

```
homelab-configs/           # Git repo
├── .github/
│   └── workflows/
│       └── deploy.yml     # CI: lint + push to server
├── stacks/
│   ├── traefik/
│   ├── monitoring/
│   └── media/
├── scripts/
└── README.md
```

**Workflow**: Edit compose locally → commit → push → CI deploys to server
**Tools**: Flux/ArgoCD (overkill), or simple `git pull && docker compose up -d` via webhook

### Hardware Redundancy

| Component | Solution | Cost |
|-----------|----------|------|
| Power | UPS (APC Back-UPS 600VA+) | $60-150 |
| Storage | RAID1/ZFS mirror (not RAID0!) | 2x disk cost |
| Network | Dual NIC, managed switch | $30-100 |
| Server | Second node (cold spare or active) | $100-400 |

**Rule**: RAID is NOT backup. It protects against disk failure only, not ransomware/deletion/corruption.

---

## Phase 12: Troubleshooting

### Common Issues Decision Tree

```
Service not accessible?
├── Can you ping the server? → No → Network/firewall issue
├── Is the container running? (`docker ps`) → No → Check logs: `docker logs <name>`
├── Is the port exposed? (`docker port <name>`) → No → Check compose ports/networks
├── Is Traefik routing? (Check Traefik dashboard) → No → Check labels, network
├── Is DNS resolving? (`dig app.example.com`) → No → Check DNS provider
└── SSL error? → Check acme.json permissions (chmod 600), cert resolver logs
```

### Docker Debug Commands

```bash
# Container not starting
docker logs <name> --tail 50
docker inspect <name> | jq '.[0].State'

# Network issues
docker network ls
docker network inspect <network>
docker exec <name> ping other-container

# Resource issues
docker stats                          # Live resource usage
docker system df                      # Disk usage
docker volume ls -f dangling=true     # Orphaned volumes

# Nuclear options (use carefully)
docker compose down && docker compose up -d    # Full restart
docker system prune -af --volumes              # Clean EVERYTHING
```

### Performance Optimization

| Symptom | Likely Cause | Fix |
|---------|-------------|-----|
| Slow file access | HDD for database | Move DB to SSD |
| High CPU idle | Monitoring too frequent | Increase scrape intervals |
| OOM kills | No memory limits | Set `deploy.resources.limits.memory` |
| Slow Nextcloud | Missing Redis cache | Add Redis container |
| Jellyfin buffering | No hardware transcoding | Enable GPU passthrough |
| Slow Docker builds | No layer caching | Use multi-stage + .dockerignore |

---

## Service Configuration Quick Reference

### Vaultwarden (Password Manager)

```yaml
services:
  vaultwarden:
    image: vaultwarden/server:1.30.5
    container_name: vaultwarden
    restart: unless-stopped
    volumes:
      - vaultwarden-data:/data
    environment:
      - SIGNUPS_ALLOWED=false       # Disable after creating your account
      - WEBSOCKET_ENABLED=true
      - ADMIN_TOKEN=${ADMIN_TOKEN}  # Generate: openssl rand -base64 48
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.vault.rule=Host(`vault.example.com`)"
```

### Immich (Photo Backup)

```yaml
# Use their official docker-compose.yml from:
# https://github.com/immich-app/immich/releases/latest/download/docker-compose.yml
# Key settings:
# - Set UPLOAD_LOCATION to a large storage mount
# - Enable hardware transcoding if GPU available
# - Set IMMICH_MACHINE_LEARNING_URL for face detection
```

### Paperless-ngx (Document Management)

```yaml
services:
  paperless:
    image: ghcr.io/paperless-ngx/paperless-ngx:2.4
    container_name: paperless
    restart: unless-stopped
    volumes:
      - paperless-data:/usr/src/paperless/data
      - paperless-media:/usr/src/paperless/media
      - ./consume:/usr/src/paperless/consume  # Drop PDFs here
      - ./export:/usr/src/paperless/export
    environment:
      - PAPERLESS_OCR_LANGUAGE=eng
      - PAPERLESS_TIME_ZONE=Europe/London
      - PAPERLESS_ADMIN_USER=${ADMIN_USER}
      - PAPERLESS_ADMIN_PASSWORD=${ADMIN_PASS}
```

---

## Homelab Quality Rubric (0-100)

| Dimension | Weight | 0 (Poor) | 50 (Decent) | 100 (Excellent) |
|-----------|--------|----------|-------------|-----------------|
| Security | 20% | Default passwords, open ports | Firewall + SSL | Hardened SSH, SSO/2FA, no-new-privileges |
| Backups | 20% | None | Local only, untested | 3-2-1, automated, verified monthly |
| Monitoring | 15% | None | Uptime Kuma only | Full stack: metrics + logs + alerts |
| Documentation | 10% | Nothing written | README per stack | GitOps, full runbook, diagrams |
| Updates | 10% | Never updated | Manual quarterly | Scheduled weekly, changelogs reviewed |
| Reliability | 10% | Frequent crashes | Mostly stable | UPS, auto-restart, health checks |
| Performance | 10% | Slow, OOM kills | Adequate | Resource limits, SSD, HW transcoding |
| Scalability | 5% | Single machine, no plan | Compose organized | Multi-node ready, IaC |

---

## 10 Self-Hosting Mistakes

| # | Mistake | Fix |
|---|---------|-----|
| 1 | Using `:latest` tag | Pin versions: `image:1.2.3` |
| 2 | No backups | 3-2-1 backup rule, test restores |
| 3 | Exposing ports directly | Everything behind reverse proxy |
| 4 | Default passwords | Change immediately, use password manager |
| 5 | No monitoring | Uptime Kuma minimum, Grafana for depth |
| 6 | RAID = backup mentality | RAID protects disks, not data |
| 7 | Over-engineering day 1 | Start small, add complexity as needed |
| 8 | No documentation | Document every service, every port, every cron |
| 9 | Ignoring updates | Security patches matter, schedule updates |
| 10 | Running as root | Non-root containers, restricted SSH |

---

## Natural Language Commands

| Say | Agent Does |
|-----|-----------|
| "Set up a new service" | Guide through compose file creation with security best practices |
| "Audit my homelab security" | Run through security scoring checklist |
| "Plan my backup strategy" | Design 3-2-1 backup plan for your setup |
| "What should I self-host?" | Assess needs and recommend services by tier |
| "My container keeps crashing" | Walk through troubleshooting decision tree |
| "Help me set up Traefik" | Generate production Traefik config with SSL |
| "Compare NAS options" | Compare TrueNAS vs Unraid vs DIY for your needs |
| "Optimize my Docker setup" | Review compose files for security and performance |
| "Set up monitoring" | Deploy Uptime Kuma + Prometheus + Grafana stack |
| "Plan a hardware upgrade" | Assess current usage, recommend hardware by budget |
| "Migrate from cloud to self-hosted" | Plan migration with data export and service mapping |
| "Set up remote access" | Compare and deploy VPN/Tailscale for secure remote access |