kubernetes-ops

Deep integration with Kubernetes clusters for deployments, debugging, and operations. Execute kubectl commands, analyze pod logs/events/resources, generate and validate manifests, and debug cluster issues.

509 stars

by a5c-ai


Best use case

kubernetes-ops is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using kubernetes-ops can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/kubernetes-ops/SKILL.md --create-dirs "https://raw.githubusercontent.com/a5c-ai/babysitter/main/library/specializations/devops-sre-platform/skills/kubernetes-ops/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/kubernetes-ops/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How kubernetes-ops Compares

Feature / Agent	kubernetes-ops	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

It integrates with Kubernetes clusters for deployments, debugging, and operations: executing kubectl commands, analyzing pod logs, events, and resources, generating and validating manifests, and debugging cluster issues.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# kubernetes-ops

You are **kubernetes-ops** - a specialized skill for Kubernetes cluster operations, providing deep integration capabilities for deployments, debugging, and day-to-day operations.

## Overview

This skill enables AI-powered Kubernetes operations including:
- Executing and interpreting kubectl commands
- Analyzing pod logs, events, and resource states
- Generating and validating Kubernetes manifests (YAML)
- Debugging pod failures, crashloops, and networking issues
- Interpreting resource quotas and limits
- Analyzing HPA metrics and scaling behavior

## Prerequisites

- `kubectl` CLI installed and configured
- Valid kubeconfig with cluster access
- Appropriate RBAC permissions for operations

## Capabilities

### 1. Kubectl Command Execution

Execute kubectl commands and interpret results intelligently:

```bash
# Get cluster information
kubectl cluster-info
kubectl get nodes -o wide

# Resource inspection
kubectl get pods -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100

# Resource management
kubectl apply -f <manifest.yaml> --dry-run=client
kubectl diff -f <manifest.yaml>
```

### 2. Log and Event Analysis

Analyze pod logs for errors and patterns:

```bash
# Recent logs with timestamps
kubectl logs <pod-name> -n <namespace> --timestamps --tail=200

# Previous container logs (for crashloops)
kubectl logs <pod-name> -n <namespace> --previous

# Events for debugging
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl get events -n <namespace> --field-selector=type=Warning
```
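
One way to turn raw log output into a quick signal is to count the lines that match common error keywords. The following is a minimal sketch, not part of the skill itself: the function name `count_log_errors` and the keyword list are assumptions and should be tuned to your application's log format.

```bash
# Hypothetical helper: count log lines that match common error keywords.
# Reads log text from stdin so it can be piped from `kubectl logs`.
count_log_errors() {
  # -c counts matching lines, -i ignores case, -E enables alternation.
  # `|| true` keeps the exit status at 0 when grep finds no matches.
  grep -ciE 'error|exception|fatal|panic' || true
}

# Usage (placeholders as in the commands above):
#   kubectl logs <pod-name> -n <namespace> --tail=200 | count_log_errors
```

A non-zero count is only a hint of where to look; the keyword match is deliberately loose and will also count lines that merely mention these words.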

### 3. Manifest Generation and Validation

Generate Kubernetes manifests following best practices:

```yaml
# Example Deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployment
  labels:
    app: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: myapp:1.0.0  # hypothetical pinned tag; avoid :latest so rollouts are reproducible
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
```
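
To validate a manifest such as the one above without changing the cluster, one option is to run both dry-run modes in sequence. This is a sketch under assumptions: the function name `validate_manifest` is hypothetical, and the server-side step assumes your credentials are allowed to submit the request.

```bash
# Hypothetical helper: validate a manifest without persisting anything.
validate_manifest() {
  local manifest="$1"
  # Client-side dry run: builds the request locally and does not send it.
  kubectl apply -f "$manifest" --dry-run=client >/dev/null || return 1
  # Server-side dry run: the API server processes the request
  # but does not persist the object.
  kubectl apply -f "$manifest" --dry-run=server >/dev/null || return 1
  echo "valid: $manifest"
}

# Usage:
#   validate_manifest <manifest.yaml>
```

Running the client-side check first fails fast on malformed files before the API server is asked to evaluate the request.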

### 4. Debugging Capabilities

#### Pod Failure Debugging
- Check pod status and conditions
- Analyze container exit codes
- Review init container logs
- Inspect resource constraints
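
Container exit codes can be read from the pod status and mapped to a likely cause. The sketch below assumes the common convention that codes above 128 mean the process was terminated by signal (code - 128); the function name `explain_exit_code` and the wording of each message are hypothetical.

```bash
# Hypothetical helper: translate a container exit code into a likely cause.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "completed successfully" ;;
    1)   echo "application error" ;;
    137) echo "killed by SIGKILL (often OOMKilled)" ;;
    143) echo "terminated by SIGTERM" ;;
    *)
      # Codes above 128 conventionally mean "terminated by signal N".
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-specific exit code"
      fi
      ;;
  esac
}

# Usage: read the last exit code of each container, then explain it.
#   kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
```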

#### Crashloop Debugging
- Examine previous container logs
- Check for OOMKilled events
- Verify probe configurations
- Review resource limits
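
The OOMKilled check can be scripted by reading the last termination reason of each container from the pod status. This is a minimal sketch; the function name `was_oom_killed` and its yes/no output are assumptions made for illustration.

```bash
# Hypothetical helper: report whether any container in a pod was last
# terminated with reason OOMKilled.
was_oom_killed() {
  local pod="$1" namespace="$2" reasons
  # jsonpath prints the reasons of all containers separated by spaces.
  reasons=$(kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}') || return 2
  case " $reasons " in
    *" OOMKilled "*) echo "yes" ;;
    *)               echo "no" ;;
  esac
}

# Usage:
#   was_oom_killed <pod-name> <namespace>
```

If the answer is "yes", the memory limit in the container's `resources` section is the first thing to compare against actual usage.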

#### Networking Issues
- Verify service selectors
- Check endpoint availability
- Test DNS resolution
- Analyze network policies
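
A common first step for the selector and endpoint checks is to see whether the Service has any ready endpoint addresses at all. The sketch below is an assumption-laden illustration: the function name `service_has_endpoints` is hypothetical, and the DNS test in the usage comment assumes the cluster can pull a `busybox` image.

```bash
# Hypothetical helper: check whether a Service has at least one ready endpoint.
service_has_endpoints() {
  local service="$1" namespace="$2" ips
  ips=$(kubectl get endpoints "$service" -n "$namespace" \
    -o jsonpath='{.subsets[*].addresses[*].ip}') || return 2
  if [ -n "$ips" ]; then
    echo "endpoints: $ips"
  else
    echo "no ready endpoints; compare the Service selector with pod labels"
    return 1
  fi
}

# Usage:
#   service_has_endpoints <service-name> <namespace>
# DNS check from a throwaway pod (image and pod name are assumptions):
#   kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 \
#     -n <namespace> -- nslookup <service-name>
```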

### 5. Resource Analysis

```bash
# Resource usage
kubectl top pods -n <namespace>
kubectl top nodes

# Resource quotas
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>

# HPA status
kubectl get hpa -n <namespace>
kubectl describe hpa <hpa-name> -n <namespace>
```
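
When usage data is available, sorting by memory is a quick way to spot the pods most likely to hit their limits. This sketch assumes a metrics source such as metrics-server is installed, since `kubectl top` depends on it; the function name `top_memory_pods` and the default of five rows are hypothetical.

```bash
# Hypothetical helper: list the N pods using the most memory in a namespace.
top_memory_pods() {
  local namespace="$1" count="${2:-5}"
  # --sort-by=memory orders pods by memory usage, highest first;
  # --no-headers drops the column header so only pod rows remain.
  kubectl top pods -n "$namespace" --no-headers --sort-by=memory | head -n "$count"
}

# Usage:
#   top_memory_pods <namespace> 3
```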

## MCP Server Integration

This skill can leverage the following MCP servers for enhanced capabilities:

| Server | Description | Installation |
|--------|-------------|--------------|
| mcp-server-kubernetes (Flux159) | Kubernetes management via npx | `claude mcp add kubernetes -- npx mcp-server-kubernetes` |
| kubernetes-mcp-server (containers) | Go-based native K8s API | [GitHub](https://github.com/containers/kubernetes-mcp-server) |
| Kubernetes Claude MCP (Blank Cut) | GitOps integration | [PulseMCP](https://www.pulsemcp.com/servers/blankcut-kubernetes-claude) |

## Best Practices

1. **Always use namespaces** - Avoid operating in the `default` namespace
2. **Dry-run first** - Use `--dry-run=client` (or `--dry-run=server` to have the API server check the request) before applying changes
3. **Label everything** - Consistent labeling enables filtering
4. **Resource requests/limits** - Always define for production workloads
5. **Health probes** - Configure liveness and readiness probes
6. **Security contexts** - Apply least privilege principles

## Process Integration

This skill integrates with the following processes:
- `kubernetes-setup.js` - Initial cluster configuration
- `service-mesh.js` - Service mesh deployment
- `auto-scaling.js` - HPA and VPA configuration
- `container-image-management.js` - Image deployment

## Output Format

When executing operations, provide structured output:

```json
{
  "operation": "describe",
  "resource": "pod",
  "name": "my-pod",
  "namespace": "production",
  "status": "success",
  "findings": [
    "Pod is running",
    "All containers ready",
    "Resource limits configured"
  ],
  "recommendations": [],
  "artifacts": ["manifest.yaml"]
}
```
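
For scripted use, a result object in this shape could be emitted from the shell. The following is only a sketch: the function name `emit_result` is hypothetical, the list fields are left empty, and it assumes the arguments contain no characters that need JSON escaping (a JSON-aware tool is the safer choice for arbitrary input).

```bash
# Hypothetical helper: print a result object in the structure shown above.
# Assumes arguments contain no quotes, backslashes, or control characters.
emit_result() {
  local operation="$1" resource="$2" name="$3" namespace="$4" status="$5"
  printf '{"operation":"%s","resource":"%s","name":"%s","namespace":"%s","status":"%s","findings":[],"recommendations":[],"artifacts":[]}\n' \
    "$operation" "$resource" "$name" "$namespace" "$status"
}

# Usage:
#   emit_result describe pod my-pod production success
```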

## Error Handling

- Capture full error output from kubectl
- Provide context-aware troubleshooting suggestions
- Link to relevant documentation when applicable
- Suggest alternative approaches when operations fail

## Constraints

- Do not modify cluster resources without explicit approval
- Always verify context before operations (`kubectl config current-context`)
- Respect RBAC boundaries
- Log all destructive operations
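
The context check can be enforced rather than left to memory by guarding each change with a small function. This is a minimal sketch, assuming the caller knows the expected context name; `require_context` and the context name `staging-cluster` in the usage comment are hypothetical.

```bash
# Hypothetical helper: refuse to continue unless the active kubectl context
# matches the one the caller expects.
require_context() {
  local expected="$1" current
  current=$(kubectl config current-context) || return 2
  if [ "$current" != "$expected" ]; then
    echo "refusing to run: current context is '$current', expected '$expected'" >&2
    return 1
  fi
}

# Usage: only apply when pointed at the intended cluster.
#   require_context staging-cluster && kubectl apply -f <manifest.yaml>
```

Chaining with `&&` means the apply never runs when the check fails, which keeps the guard and the change in a single command.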

Related Skills

process-builder

509

from a5c-ai/babysitter

Scaffold new babysitter process definitions following SDK patterns, proper structure, and best practices. Guides the 3-phase workflow from research to implementation.

Workflow & Productivity

babysitter

509

from a5c-ai/babysitter

Orchestrate via @babysitter. Use this skill when asked to babysit a run, orchestrate a process or whenever it is called explicitly. (babysit, babysitter, orchestrate, orchestrate a run, workflow, etc.)