k8s-incident

Respond to Kubernetes incidents with runbooks and diagnostics. Use for outages, pod failures, node issues, network problems, and emergency response.

859 stars

Best use case

k8s-incident is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Respond to Kubernetes incidents with runbooks and diagnostics. Use for outages, pod failures, node issues, network problems, and emergency response.

Teams using k8s-incident should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/k8s-incident/SKILL.md --create-dirs "https://raw.githubusercontent.com/rohitg00/kubectl-mcp-server/main/kubernetes-skills/claude/k8s-incident/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/k8s-incident/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How k8s-incident Compares

Feature / Agentk8s-incidentStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Respond to Kubernetes incidents with runbooks and diagnostics. Use for outages, pod failures, node issues, network problems, and emergency response.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Kubernetes Incident Response

Runbooks and diagnostic workflows for common Kubernetes incidents.

## When to Apply

Use this skill when:
- User mentions: "incident", "outage", "emergency", "down", "not working"
- Operations: emergency response, production issues, service degradation
- Keywords: "urgent", "broken", "fix", "restore", "recover"

## Priority Rules

| Priority | Rule | Impact | Tools |
|----------|------|--------|-------|
| 1 | Check control plane first | CRITICAL | `get_pods(namespace="kube-system")` |
| 2 | Assess node health | CRITICAL | `get_nodes` |
| 3 | Gather events before changes | HIGH | `get_events` |
| 4 | Document timeline | HIGH | Manual notes |
| 5 | Rollback if safe | MEDIUM | `rollback_deployment` |

## Quick Reference

| Incident | First Tool | Next Steps |
|----------|------------|------------|
| Pod failure | `get_pod_logs(previous=True)` | `describe_pod`, `get_events` |
| Node down | `describe_node` | Check kubelet logs |
| Service unreachable | `get_endpoints` | `get_network_policies` |
| Control plane | `get_pods(namespace="kube-system")` | Check API server logs |

## Incident Triage

### Quick Health Check

```python
get_nodes()
get_pods(namespace="kube-system")
get_events(namespace)
```

### Severity Assessment

| Indicator | Severity | Action |
|-----------|----------|--------|
| Multiple nodes NotReady | Critical | Escalate immediately |
| kube-system pods failing | Critical | Control plane issue |
| Single pod CrashLoop | Medium | Debug pod |
| High latency | Medium | Check resources |

## Runbook: Pod Failures

### CrashLoopBackOff

```python
get_pod_logs(name, namespace, previous=True)
describe_pod(name, namespace)
get_events(namespace, field_selector="involvedObject.name=<pod>")
get_pod_metrics(name, namespace)
```

**Common Causes:**
- OOMKilled → Increase memory limits
- Exit code 1 → Application error in logs
- Exit code 137 → Killed by OOM or SIGKILL
- Exit code 143 → Graceful SIGTERM

### ImagePullBackOff

```python
describe_pod(name, namespace)
get_secrets(namespace)
```

### Pending Pod

```python
describe_pod(name, namespace)
get_nodes()
get_events(namespace)
```

## Runbook: Node Issues

### Node NotReady

```python
describe_node(name)
get_events(namespace="", field_selector="involvedObject.name=<node>")
node_logs_tool(name, "kubelet")
```

### Node DiskPressure

```python
describe_node(name)
get_pods(field_selector="spec.nodeName=<node>")
```

## Runbook: Network Issues

### Service Not Accessible

```python
get_services(namespace)
get_endpoints(namespace)
get_pods(namespace, label_selector="<service-selector>")
get_network_policies(namespace)
```

### DNS Resolution Failures

```python
get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
get_pod_logs("coredns-xxx", "kube-system")
```

### With Cilium

```python
cilium_status_tool()
cilium_endpoints_list_tool(namespace)
hubble_flows_query_tool(namespace)
```

### With Istio

```python
istio_analyze_tool(namespace)
istio_proxy_status_tool()
```

## Runbook: Storage Issues

### PVC Pending

```python
describe_pvc(name, namespace)
get_storage_classes()
get_events(namespace)
```

### Pod Stuck in ContainerCreating

```python
describe_pod(name, namespace)
get_pvc(namespace)
get_events(namespace)
```

## Runbook: Control Plane Issues

### API Server Unavailable

```python
get_pods(namespace="kube-system", label_selector="component=kube-apiserver")
get_events(namespace="kube-system")
```

### etcd Issues

```python
get_pods(namespace="kube-system", label_selector="component=etcd")
get_pod_logs("etcd-xxx", "kube-system")
```

## Emergency Actions

### Force Delete Pod

```python
delete_pod(name, namespace, grace_period=0, force=True)
```

### Rollback Deployment

```python
rollback_deployment(name, namespace, revision=0)
```

### Helm Rollback

```python
rollback_helm_release(name, namespace, revision=1)
```

## Diagnostic Collection Script

For comprehensive incident diagnostics, see [scripts/collect-diagnostics.py](scripts/collect-diagnostics.py).

## Multi-Cluster Incident Response

Check all clusters:

```python
for context in ["prod-1", "prod-2", "staging"]:
    get_nodes(context=context)
    get_pods(namespace="kube-system", context=context)
    get_events(namespace="kube-system", context=context)
```

## Post-Incident

### Document Timeline

1. When did the incident start?
2. What was the impact?
3. What was the root cause?
4. What fixed it?

### Prevent Recurrence

- Add monitoring/alerting
- Improve resource limits
- Add readiness probes
- Document runbook

## Related Skills

- [k8s-troubleshoot](../k8s-troubleshoot/SKILL.md) - Detailed debugging
- [k8s-security](../k8s-security/SKILL.md) - Security incidents

Related Skills

k8s-vind

859
from rohitg00/kubectl-mcp-server

Manage vCluster (virtual Kubernetes clusters) instances using vind. Use when creating, managing, or operating lightweight virtual clusters for development, testing, or multi-tenancy.

k8s-troubleshoot

859
from rohitg00/kubectl-mcp-server

Debug Kubernetes pods, nodes, and workloads. Use when pods are failing, containers crash, nodes are unhealthy, or users mention debugging, troubleshooting, or diagnosing Kubernetes issues.

k8s-storage

859
from rohitg00/kubectl-mcp-server

Kubernetes storage management for PVCs, storage classes, and persistent volumes. Use when provisioning storage, managing volumes, or troubleshooting storage issues.

k8s-service-mesh

859
from rohitg00/kubectl-mcp-server

Manage Istio service mesh for traffic management, security, and observability. Use for traffic shifting, canary releases, mTLS, and service mesh troubleshooting.

k8s-security

859
from rohitg00/kubectl-mcp-server

Audit Kubernetes RBAC, enforce policies, and manage secrets. Use for security reviews, permission audits, policy enforcement with Kyverno/Gatekeeper, and secret management.

k8s-rollouts

859
from rohitg00/kubectl-mcp-server

Progressive delivery with Argo Rollouts and Flagger. Use when implementing canary deployments, blue-green deployments, or traffic shifting strategies.

k8s-policy

859
from rohitg00/kubectl-mcp-server

Kubernetes policy management with Kyverno and Gatekeeper. Use when enforcing security policies, validating resources, or auditing policy compliance.

k8s-operations

859
from rohitg00/kubectl-mcp-server

kubectl operations for applying, patching, deleting, and executing commands on Kubernetes resources. Use when modifying resources, running commands in pods, or managing resource lifecycle.

k8s-networking

859
from rohitg00/kubectl-mcp-server

Kubernetes networking management for services, ingresses, endpoints, and network policies. Use when configuring connectivity, load balancing, or network isolation.

k8s-multicluster

859
from rohitg00/kubectl-mcp-server

Manage multiple Kubernetes clusters, switch contexts, and perform cross-cluster operations. Use when working with multiple clusters, comparing environments, or managing cluster lifecycle.

k8s-kubevirt

859
from rohitg00/kubectl-mcp-server

Virtual machine management with KubeVirt on Kubernetes. Use when creating, managing, or troubleshooting VMs running on Kubernetes clusters.

k8s-kind

859
from rohitg00/kubectl-mcp-server

Manage kind (Kubernetes IN Docker) local clusters. Use when creating, testing, or developing with local Kubernetes clusters in Docker containers.