cluster-admin

Master Kubernetes cluster administration, from initial setup through production management. Learn cluster installation, scaling, upgrades, and HA strategies.

16 stars

by diegosouzapw


Best use case

cluster-admin is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using cluster-admin can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/cluster-admin/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/devops/cluster-admin/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/cluster-admin/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How cluster-admin Compares

| Feature / Agent | cluster-admin | Standard Approach |
|-----------------|---------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Master Kubernetes cluster administration, from initial setup through production management. Learn cluster installation, scaling, upgrades, and HA strategies.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Cluster Administration

## Executive Summary
Production-grade Kubernetes cluster administration covering the complete lifecycle from initial deployment to day-2 operations. This skill provides deep expertise in cluster architecture, high availability configurations, upgrade strategies, and operational best practices aligned with CKA/CKS certification standards.

## Core Competencies

### 1. Cluster Architecture Mastery

**Control Plane Components**
```
┌─────────────────────────────────────────────────────────────────┐
│                      CONTROL PLANE                               │
├─────────────┬─────────────┬──────────────┬────────────────────┤
│ API Server  │ Scheduler   │ Controller   │ etcd               │
│             │             │ Manager      │                    │
│ - AuthN     │ - Pod       │ - ReplicaSet │ - Cluster state    │
│ - AuthZ     │   placement │ - Endpoints  │ - 3+ nodes for HA  │
│ - Admission │ - Node      │ - Namespace  │ - Regular backups  │
│   control   │   affinity  │ - ServiceAcc │ - Encryption       │
└─────────────┴─────────────┴──────────────┴────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      WORKER NODES                                │
├─────────────────┬─────────────────┬─────────────────────────────┤
│ kubelet         │ kube-proxy      │ Container Runtime           │
│ - Pod lifecycle │ - iptables/ipvs │ - containerd (recommended)  │
│ - Node status   │ - Service VIPs  │ - CRI-O                     │
│ - Volume mount  │ - Load balance  │ - gVisor (sandboxed)        │
└─────────────────┴─────────────────┴─────────────────────────────┘
```

**Production Cluster Bootstrap (kubeadm)**
```bash
# Initialize control plane with HA
sudo kubeadm init \
  --control-plane-endpoint "k8s-api.example.com:6443" \
  --upload-certs \
  --pod-network-cidr=10.244.0.0/16 \
  --service-cidr=10.96.0.0/12 \
  --apiserver-advertise-address=0.0.0.0 \
  --apiserver-cert-extra-sans=k8s-api.example.com

# Join additional control plane nodes
kubeadm join k8s-api.example.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <cert-key>

# Join worker nodes
kubeadm join k8s-api.example.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>
```
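
The pod and service CIDRs passed to `kubeadm init` must not overlap each other (or the node network). The following is a minimal sketch of an overlap check in plain shell arithmetic; the helper names `ip_to_int`, `cidr_range`, and `cidrs_overlap` are illustrative and not part of kubeadm.

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1            # split the address on dots into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Print the first and last address of a CIDR block as integers.
cidr_range() {
  base=$(ip_to_int "${1%/*}")
  prefix=${1#*/}
  size=$(( 1 << (32 - prefix) ))
  start=$(( base & ~(size - 1) ))
  echo "$start $(( start + size - 1 ))"
}

# Succeed (exit 0) when the two CIDR blocks share at least one address.
cidrs_overlap() {
  set -- $(cidr_range "$1") $(cidr_range "$2")   # startA endA startB endB
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

# The ranges used in the bootstrap command above.
if cidrs_overlap 10.244.0.0/16 10.96.0.0/12; then
  echo "overlap: choose different ranges"
else
  echo "no overlap"
fi
```

Running the same check against the node subnet before bootstrapping catches the other common collision.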

### 2. Node Management

**Node Lifecycle Operations**
```bash
# View node details with resource usage
kubectl get nodes -o wide
kubectl top nodes

# Label nodes for workload placement
kubectl label nodes worker-01 node-type=compute tier=production
kubectl label nodes worker-02 node-type=gpu accelerator=nvidia-a100

# Taint nodes for dedicated workloads
kubectl taint nodes worker-gpu dedicated=gpu:NoSchedule

# Cordon node (prevent new pods)
kubectl cordon worker-03

# Drain node safely (for maintenance)
kubectl drain worker-03 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300 \
  --timeout=600s

# Return node to service
kubectl uncordon worker-03
```
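
A taint only repels pods, and a label only attracts pods that ask for it, so a dedicated workload needs both a toleration and a node selector. The sketch below prints such a Pod manifest, assuming the GPU nodes carry both the `node-type=gpu` label and the `dedicated=gpu:NoSchedule` taint. The function name, pod name, and image are placeholders for illustration.

```bash
# Print a Pod manifest that tolerates the dedicated=gpu:NoSchedule taint
# and selects nodes labelled node-type=gpu.
# $1 = pod name (placeholder), $2 = container image (placeholder)
gpu_pod_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  nodeSelector:
    node-type: gpu
  tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule
  containers:
  - name: main
    image: $2
EOF
}

# Review the output, then pipe it to `kubectl apply -f -` when satisfied.
gpu_pod_manifest gpu-smoke-test registry.example.com/gpu-test:latest
```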

**Node Problem Detector Configuration**
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      containers:
      - name: node-problem-detector
        image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.14
        securityContext:
          privileged: true
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: log
          mountPath: /var/log
          readOnly: true
        - name: kmsg
          mountPath: /dev/kmsg
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log
      - name: kmsg
        hostPath:
          path: /dev/kmsg
      tolerations:
      - operator: Exists
        effect: NoSchedule
```

### 3. High Availability Configuration

**HA Architecture Pattern**
```
                    ┌─────────────────┐
                    │  Load Balancer  │
                    │ (HAProxy/NLB)   │
                    └────────┬────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│ Control Plane │    │ Control Plane │    │ Control Plane │
│     Node 1    │    │     Node 2    │    │     Node 3    │
├───────────────┤    ├───────────────┤    ├───────────────┤
│ API Server    │    │ API Server    │    │ API Server    │
│ Scheduler     │    │ Scheduler     │    │ Scheduler     │
│ Controller    │    │ Controller    │    │ Controller    │
│ etcd          │◄──►│ etcd          │◄──►│ etcd          │
└───────────────┘    └───────────────┘    └───────────────┘
        │                    │                    │
        └────────────────────┴────────────────────┘
                             │
                    ┌────────┴────────┐
                    │  Worker Nodes   │
                    │  (N instances)  │
                    └─────────────────┘
```
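
The three-node layout follows from etcd's majority quorum: a cluster of N members needs floor(N/2) + 1 of them reachable to accept writes. A small sketch of that arithmetic (the function names are illustrative):

```bash
# Majority needed for an etcd cluster of N members to accept writes.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while quorum is still reachable.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerated_failures=$(etcd_fault_tolerance "$n")"
done
```

Four members tolerate the same single failure as three, which is why odd member counts are the usual choice.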

**etcd Backup & Restore**
```bash
# Backup etcd (record the path so later steps target exactly one file)
SNAPSHOT=/backup/etcd-snapshot-$(date +%Y%m%d).db
ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify backup
# (etcd 3.5+ deprecates `etcdctl snapshot status|restore` in favor of `etcdutl`)
ETCDCTL_API=3 etcdctl snapshot status "$SNAPSHOT" --write-out=table

# Restore etcd (disaster recovery)
ETCDCTL_API=3 etcdctl snapshot restore "$SNAPSHOT" \
  --data-dir=/var/lib/etcd-restored \
  --name=etcd-0 \
  --initial-cluster=etcd-0=https://10.0.0.10:2380 \
  --initial-advertise-peer-urls=https://10.0.0.10:2380

# Automated backup CronJob
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
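        spec:
          hostNetwork: true  # reach etcd on the node's own 127.0.0.1:2379
          containers:
          - name: backup
            image: bitnami/etcd:3.5
            securityContext:
              runAsUser: 0  # the etcd client key on the host is readable by root only
            command:
            - /bin/sh
            - -c
            - |
              etcdctl snapshot save /backup/etcd-\$(date +%Y%m%d-%H%M).db
            env:
            - name: ETCDCTL_API
              value: "3"
            - name: ETCDCTL_ENDPOINTS
              value: "https://127.0.0.1:2379"
            - name: ETCDCTL_CACERT
              value: /etc/kubernetes/pki/etcd/ca.crt
            - name: ETCDCTL_CERT
              value: /etc/kubernetes/pki/etcd/server.crt
            - name: ETCDCTL_KEY
              value: /etc/kubernetes/pki/etcd/server.key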
            volumeMounts:
            - name: backup
              mountPath: /backup
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: etcd-backup-pvc
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
EOF
```
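
Regular backups fill the backup volume unless old snapshots are pruned. The following is a minimal retention sketch for the manual `etcd-snapshot-YYYYMMDD.db` naming, where lexical order equals chronological order. The function name `prune_snapshots` and the retention count are assumptions for illustration.

```bash
# Keep only the newest $2 snapshots in directory $1.
# Assumes files are named etcd-snapshot-YYYYMMDD.db, so a reverse lexical
# sort lists the newest snapshot first.
prune_snapshots() {
  dir=$1
  keep=$2
  ls "$dir"/etcd-snapshot-*.db 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
  while IFS= read -r old; do
    rm -f -- "$old"
  done
}

# Example: keep the 14 most recent daily snapshots.
#   prune_snapshots /backup 14
```

Copying snapshots off the cluster before pruning keeps a single volume failure from destroying every backup.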

### 4. Cluster Upgrades

**Upgrade Strategy Decision Tree**
```
Upgrade Required?
│
├── Minor Version (1.29 → 1.30)
│   ├── Review release notes for breaking changes
│   ├── Test in staging environment
│   ├── Upgrade control plane first
│   │   └── One node at a time
│   └── Upgrade workers (rolling)
│
├── Patch Version (1.30.0 → 1.30.1)
│   ├── Generally safe, security fixes
│   └── Can upgrade more aggressively
│
└── Major changes in components
    ├── Test thoroughly
    ├── Have rollback plan
    └── Consider blue-green cluster
```
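
The first branch of the tree can be checked mechanically before any upgrade starts. The sketch below classifies a version change, assuming both versions share the same major version and are written as `MAJOR.MINOR.PATCH` without a leading `v`. The function name is illustrative.

```bash
# Classify an upgrade from version $1 to version $2.
# Prints one of: patch, minor, skip, downgrade, none.
classify_upgrade() {
  from_minor=$(echo "$1" | cut -d. -f2)
  to_minor=$(echo "$2" | cut -d. -f2)
  from_patch=$(echo "$1" | cut -d. -f3)
  to_patch=$(echo "$2" | cut -d. -f3)
  step=$(( to_minor - from_minor ))
  if [ "$step" -eq 0 ] && [ "$to_patch" -gt "$from_patch" ]; then
    echo patch
  elif [ "$step" -eq 0 ] && [ "$to_patch" -eq "$from_patch" ]; then
    echo none
  elif [ "$step" -eq 1 ]; then
    echo minor
  elif [ "$step" -gt 1 ]; then
    echo skip        # kubeadm does not support skipping minor versions
  else
    echo downgrade
  fi
}

classify_upgrade 1.29.3 1.30.0   # minor
classify_upgrade 1.30.0 1.30.1   # patch
classify_upgrade 1.28.5 1.30.0   # skip: upgrade through 1.29 first
```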

**Production Upgrade Process**
```bash
# Step 1: Upgrade kubeadm on the first control plane node
# (the version pattern matches pkgs.k8s.io package names such as 1.30.0-1.1)
sudo apt-mark unhold kubeadm
sudo apt-get update && sudo apt-get install -y kubeadm='1.30.0-*'
sudo apt-mark hold kubeadm

# Step 2: Plan the upgrade
sudo kubeadm upgrade plan

# Step 3: Apply upgrade on first control plane
sudo kubeadm upgrade apply v1.30.0

# Step 4: Upgrade kubelet and kubectl
kubectl drain control-plane-1 --ignore-daemonsets
sudo apt-mark unhold kubelet kubectl
sudo apt-get install -y kubelet='1.30.0-*' kubectl='1.30.0-*'
sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet
kubectl uncordon control-plane-1

# Step 5: Upgrade additional control planes
# (upgrade the kubeadm package on each node first, as in Step 1)
sudo kubeadm upgrade node
# Then upgrade kubelet/kubectl as above

# Step 6: Upgrade worker nodes (rolling)
# kubeadm does not label workers, so select every node that is not a control plane
for node in $(kubectl get nodes -l '!node-role.kubernetes.io/control-plane' -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # SSH to node and upgrade packages
  kubectl uncordon "$node"
  sleep 60  # Allow pods to stabilize
done
```
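
The "SSH to node and upgrade packages" step in the worker loop can be sketched as a generated script. This assumes Debian/Ubuntu workers using the pkgs.k8s.io apt repository (hence the `'VERSION-*'` package pattern). The function name `worker_upgrade_script` is hypothetical, and the output is meant to be reviewed before it is run on a node.

```bash
# Print the commands to run on a worker node for the target version ($1).
# The node should already be drained from a machine with kubectl access.
worker_upgrade_script() {
  version=$1
  cat <<EOF
set -e
sudo apt-mark unhold kubeadm kubelet kubectl
sudo apt-get update
sudo apt-get install -y kubeadm='${version}-*'
sudo kubeadm upgrade node
sudo apt-get install -y kubelet='${version}-*' kubectl='${version}-*'
sudo apt-mark hold kubeadm kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet
EOF
}

worker_upgrade_script 1.30.0
```

Generating the script keeps the version in one place; how it reaches the node (SSH, configuration management, an image rebuild) is left to the environment.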

### 5. Resource Management

**Namespace Resource Quotas**
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    persistentvolumeclaims: "10"
    requests.storage: 500Gi
    pods: "50"
    services: "20"
    secrets: "50"
    configmaps: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-backend
spec:
  limits:
  - type: Container
    default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    min:
      cpu: 50m
      memory: 64Mi
    max:
      cpu: 4
      memory: 8Gi
  - type: PersistentVolumeClaim
    min:
      storage: 1Gi
    max:
      storage: 100Gi
```
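
It helps to know which of these limits binds first. The sketch below works through the arithmetic for single-container pods that rely on the LimitRange default requests (100m CPU, 128Mi memory); the function name is illustrative.

```bash
# Print the largest pod count that fits all three limits.
# Arguments: cpu quota (millicores), per-pod cpu request (millicores),
#            memory quota (Mi), per-pod memory request (Mi), pod quota.
max_pods_in_quota() {
  by_cpu=$(( $1 / $2 ))
  by_mem=$(( $3 / $4 ))
  max=$5
  [ "$by_cpu" -lt "$max" ] && max=$by_cpu
  [ "$by_mem" -lt "$max" ] && max=$by_mem
  echo "$max"
}

# requests.cpu "20" = 20000m, requests.memory 40Gi = 40960Mi, pods "50".
max_pods_in_quota 20000 100 40960 128 50
```

With the default requests, CPU would allow 200 pods and memory 320, so the `pods: "50"` count is the binding limit for this namespace.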

**Cluster Autoscaler Configuration**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --balance-similar-node-groups
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        - --scale-down-utilization-threshold=0.5
        resources:
          limits:
            cpu: 100m
            memory: 600Mi
          requests:
            cpu: 100m
            memory: 600Mi
```
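
The `--scale-down-utilization-threshold=0.5` flag compares each node's requested resources with its allocatable capacity. The following is a simplified sketch of that comparison, taking utilization as the larger of the CPU and memory request ratios. The real autoscaler applies further checks (for example whether the node's pods can be rescheduled and how long the node has been unneeded), and the function name is illustrative.

```bash
# Succeed when a node's utilization is below the threshold.
# Arguments: cpu requested (m), cpu allocatable (m),
#            memory requested (Mi), memory allocatable (Mi), threshold (percent).
is_scale_down_candidate() {
  cpu_pct=$(( $1 * 100 / $2 ))
  mem_pct=$(( $3 * 100 / $4 ))
  util=$cpu_pct
  [ "$mem_pct" -gt "$util" ] && util=$mem_pct
  [ "$util" -lt "$5" ]
}

# 1000m of 4000m CPU (25%) and 2048Mi of 16384Mi memory (12%) requested.
if is_scale_down_candidate 1000 4000 2048 16384 50; then
  echo "below threshold: candidate for scale-down"
fi
```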

## Integration Patterns

### Uses skill: **docker-containers**
- Container runtime configuration
- Image management on nodes
- Registry authentication

### Coordinates with skill: **security**
- RBAC for cluster admins
- Node security hardening
- Audit logging configuration

### Works with skill: **monitoring**
- Cluster health dashboards
- Control plane metrics
- Node resource alerting

## Troubleshooting Guide

### Decision Tree: Cluster Health Issues

```
Cluster Health Problem?
│
├── API Server unreachable
│   ├── Check: crictl ps -a | grep kube-apiserver (static pod under kubeadm)
│   ├── Check: crictl logs <container-id>, or /var/log/pods/kube-system_kube-apiserver-*/
│   ├── Verify: etcd connectivity
│   └── Verify: certificates not expired
│
├── Node NotReady
│   ├── Check: kubelet status on node
│   ├── Check: container runtime status
│   ├── Verify: node network connectivity
│   └── Check: disk pressure, memory pressure
│
├── Pods Pending (no scheduling)
│   ├── Check: kubectl describe pod
│   ├── Verify: node resources available
│   ├── Check: taints and tolerations
│   └── Verify: PVC bound (if using volumes)
│
└── etcd Issues
    ├── Check: etcdctl endpoint health
    ├── Check: etcd member list
    ├── Verify: disk I/O performance
    └── Check: cluster quorum
```
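
The "Node NotReady" branch starts with finding the affected nodes. Below is a small sketch that filters `kubectl get nodes --no-headers` output; the sample input is illustrative, not captured from a real cluster, and the function name is an assumption.

```bash
# Read `kubectl get nodes --no-headers` output on stdin and print the names
# of nodes whose STATUS column does not start with "Ready".
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Illustrative sample input; on a live cluster use:
#   kubectl get nodes --no-headers | not_ready_nodes
not_ready_nodes <<'EOF'
control-plane-1   Ready                      control-plane   90d   v1.30.0
worker-01         Ready,SchedulingDisabled   <none>          90d   v1.30.0
worker-03         NotReady                   <none>          90d   v1.30.0
EOF
```

Cordoned nodes report `Ready,SchedulingDisabled` and are deliberately not flagged, since they are healthy but draining.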

### Debug Commands Cheatsheet

```bash
# Cluster-wide diagnostics
kubectl cluster-info dump --output-directory=/tmp/cluster-dump
kubectl get --raw='/readyz?verbose'  # replaces componentstatuses, deprecated since v1.19
kubectl get nodes -o wide
kubectl get events --sort-by='.lastTimestamp' -A

# Control plane health
kubectl get pods -n kube-system
kubectl logs -n kube-system kube-apiserver-<node>
kubectl logs -n kube-system kube-scheduler-<node>
kubectl logs -n kube-system kube-controller-manager-<node>

# etcd health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Node diagnostics
kubectl describe node <node-name>
kubectl get node <node-name> -o yaml | grep -A 10 conditions
ssh <node> "journalctl -u kubelet --since '1 hour ago'"

# Certificate expiration check
kubeadm certs check-expiration

# Resource usage
kubectl top nodes
kubectl top pods -A --sort-by=memory
```

## Common Challenges & Solutions

| Challenge | Solution |
|-----------|----------|
| etcd performance degradation | Use SSD storage, tune compaction |
| Certificate expiration | Monitor with `kubeadm certs check-expiration`; renew with `kubeadm certs renew all` (kubeadm upgrades also renew certificates) |
| Node resource exhaustion | Configure eviction thresholds, resource quotas |
| Control plane overload | Add more control plane nodes, tune rate limits |
| Upgrade failures | Always backup etcd, use staged rollouts |
| kubelet not starting | Check containerd socket, certificates |
| API server latency | Tune API Priority and Fairness, scale API servers |
| Cluster state drift | GitOps, regular audits, policy enforcement |

## Success Criteria

| Metric | Target |
|--------|--------|
| Cluster uptime | 99.9% |
| API server latency p99 | <200ms |
| etcd backup success | 100% |
| Node ready status | 100% |
| Upgrade success rate | 100% |
| Certificate validity | >30 days |
| Control plane pods healthy | 100% |
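
An uptime target is easier to plan against when it is expressed as a downtime budget. A minimal sketch of the conversion follows; the function name and the 30-day window are assumptions for illustration.

```bash
# Print the allowed downtime in minutes for an availability target.
# Arguments: target availability (percent), window length (days).
downtime_budget_minutes() {
  awk -v slo="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (100 - slo) / 100 * days * 24 * 60 }'
}

downtime_budget_minutes 99.9 30   # prints 43.2
```

At 99.9%, a 30-day window leaves roughly 43 minutes for all control plane outages combined, planned maintenance included.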

## Resources
- [Official Kubernetes Documentation](https://kubernetes.io/docs/)
- [Kubernetes Cluster Administration](https://kubernetes.io/docs/tasks/administer-cluster/)
- [kubeadm Reference](https://kubernetes.io/docs/reference/setup-tools/kubeadm/)
- [etcd Operations Guide](https://etcd.io/docs/v3.5/op-guide/)
