k8s-troubleshoot

Debug Kubernetes pods, nodes, and workloads. Use when pods are failing, containers crash, nodes are unhealthy, or users mention debugging, troubleshooting, or diagnosing Kubernetes issues.

859 stars

Best use case

k8s-troubleshoot is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Debug Kubernetes pods, nodes, and workloads. Use when pods are failing, containers crash, nodes are unhealthy, or users mention debugging, troubleshooting, or diagnosing Kubernetes issues.

Teams using k8s-troubleshoot should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/k8s-troubleshoot/SKILL.md --create-dirs "https://raw.githubusercontent.com/rohitg00/kubectl-mcp-server/main/kubernetes-skills/claude/k8s-troubleshoot/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/k8s-troubleshoot/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How k8s-troubleshoot Compares

Feature / Agentk8s-troubleshootStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Debug Kubernetes pods, nodes, and workloads. Use when pods are failing, containers crash, nodes are unhealthy, or users mention debugging, troubleshooting, or diagnosing Kubernetes issues.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Kubernetes Troubleshooting

Expert debugging and diagnostics for Kubernetes clusters using kubectl-mcp-server tools.

## When to Apply

Use this skill when:
- User mentions: "debug", "troubleshoot", "diagnose", "failing", "crash", "not starting", "broken"
- Pod states: Pending, CrashLoopBackOff, ImagePullBackOff, OOMKilled, Error, Unknown
- Node issues: NotReady, MemoryPressure, DiskPressure, NetworkUnavailable, PIDPressure
- Keywords: "logs", "events", "describe", "why isn't working", "stuck", "not responding"

## Priority Rules

| Priority | Rule | Impact | Tools |
|----------|------|--------|-------|
| 1 | Check pod status first | CRITICAL | `get_pods`, `describe_pod` |
| 2 | View recent events | CRITICAL | `get_events` |
| 3 | Inspect logs (including previous) | HIGH | `get_pod_logs` |
| 4 | Check resource metrics | HIGH | `get_pod_metrics` |
| 5 | Verify endpoints | MEDIUM | `get_endpoints` |
| 6 | Review network policies | MEDIUM | `get_network_policies` |
| 7 | Examine node status | LOW | `get_nodes`, `describe_node` |

## Quick Reference

| Symptom | First Tool | Next Steps |
|---------|------------|------------|
| Pod Pending | `describe_pod` | Check events, node capacity, resource requests |
| CrashLoopBackOff | `get_pod_logs(previous=True)` | Check exit code, resources, liveness probes |
| ImagePullBackOff | `describe_pod` | Verify image name, registry auth, network |
| OOMKilled | `get_pod_metrics` | Increase memory limits, check for memory leaks |
| ContainerCreating | `describe_pod` | Check PVC binding, secrets, configmaps |
| Terminating (stuck) | `describe_pod` | Check finalizers, PDBs, preStop hooks |

## Diagnostic Workflows

### Pod Not Starting

```
1. get_pods(namespace, label_selector) - Get pod status
2. describe_pod(name, namespace) - See events and conditions
3. get_events(namespace, field_selector="involvedObject.name=<pod>") - Check events
4. get_pod_logs(name, namespace, previous=True) - For crash loops
```

### Common Pod States

| State | Likely Cause | Tools to Use |
|-------|-------------|--------------|
| Pending | Scheduling issues | `describe_pod`, `get_nodes`, `get_events` |
| ImagePullBackOff | Registry/auth | `describe_pod`, check image name |
| CrashLoopBackOff | App crash | `get_pod_logs(previous=True)` |
| OOMKilled | Memory limit | `get_pod_metrics`, adjust limits |
| ContainerCreating | Volume/network | `describe_pod`, `get_pvc` |

### Node Issues

```
1. get_nodes() - List nodes and status
2. describe_node(name) - See conditions and capacity
3. Check: Ready, MemoryPressure, DiskPressure, PIDPressure
4. node_logs_tool(name, "kubelet") - Kubelet logs
```

## Deep Debugging Workflows

### CrashLoopBackOff Investigation

```
1. get_pod_logs(name, namespace, previous=True) - See why it crashed
2. describe_pod(name, namespace) - Check resource limits, probes
3. get_pod_metrics(name, namespace) - Memory/CPU at crash time
4. If OOM: compare requests/limits to actual usage
5. If app error: check logs for stack trace
```

### Networking Issues

```
1. get_services(namespace) - Verify service exists
2. get_endpoints(namespace) - Check endpoint backends
3. If empty endpoints: pods don't match selector
4. get_network_policies(namespace) - Check traffic rules
5. For Cilium: cilium_endpoints_list_tool(), hubble_flows_query_tool()
```

### Storage Problems

```
1. get_pvc(namespace) - Check PVC status
2. describe_pvc(name, namespace) - See binding issues
3. get_storage_classes() - Verify provisioner exists
4. If Pending: check storage class, access modes
```

### DNS Resolution

```
1. kubectl_exec(pod, namespace, "nslookup kubernetes.default") - Test DNS
2. If fails: check coredns pods in kube-system
3. get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
4. get_pod_logs(name="coredns-*", namespace="kube-system")
```

## Multi-Cluster Debugging

All tools support `context` parameter for targeting different clusters:

```python
get_pods(namespace="kube-system", context="production-cluster")
get_events(namespace="default", context="staging-cluster")
describe_pod(name="myapp-xyz", namespace="prod", context="prod-east")
```

## Diagnostic Scripts

For comprehensive diagnostics, run the bundled scripts:
- See [scripts/diagnose-pod.py](scripts/diagnose-pod.py) for automated pod analysis
- See [scripts/health-check.sh](scripts/health-check.sh) for cluster health checks

## Decision Tree

See [references/DECISION-TREE.md](references/DECISION-TREE.md) for visual troubleshooting flowcharts.

## Common Errors Reference

See [references/COMMON-ERRORS.md](references/COMMON-ERRORS.md) for error message explanations and fixes.

## Related Tools

### Core Diagnostics
- `get_pods`, `describe_pod`, `get_pod_logs`, `get_pod_metrics`
- `get_events`, `get_nodes`, `describe_node`
- `get_resource_usage`, `compare_namespaces`

### Advanced (Ecosystem)
- Cilium: `cilium_endpoints_list_tool`, `hubble_flows_query_tool`
- Istio: `istio_proxy_status_tool`, `istio_analyze_tool`

## Related Skills

- [k8s-diagnostics](../k8s-diagnostics/SKILL.md) - Metrics and health checks
- [k8s-incident](../k8s-incident/SKILL.md) - Emergency runbooks
- [k8s-networking](../k8s-networking/SKILL.md) - Network troubleshooting

Related Skills

k8s-vind

859
from rohitg00/kubectl-mcp-server

Manage vCluster (virtual Kubernetes clusters) instances using vind. Use when creating, managing, or operating lightweight virtual clusters for development, testing, or multi-tenancy.

k8s-storage

859
from rohitg00/kubectl-mcp-server

Kubernetes storage management for PVCs, storage classes, and persistent volumes. Use when provisioning storage, managing volumes, or troubleshooting storage issues.

k8s-service-mesh

859
from rohitg00/kubectl-mcp-server

Manage Istio service mesh for traffic management, security, and observability. Use for traffic shifting, canary releases, mTLS, and service mesh troubleshooting.

k8s-security

859
from rohitg00/kubectl-mcp-server

Audit Kubernetes RBAC, enforce policies, and manage secrets. Use for security reviews, permission audits, policy enforcement with Kyverno/Gatekeeper, and secret management.

k8s-rollouts

859
from rohitg00/kubectl-mcp-server

Progressive delivery with Argo Rollouts and Flagger. Use when implementing canary deployments, blue-green deployments, or traffic shifting strategies.

k8s-policy

859
from rohitg00/kubectl-mcp-server

Kubernetes policy management with Kyverno and Gatekeeper. Use when enforcing security policies, validating resources, or auditing policy compliance.

k8s-operations

859
from rohitg00/kubectl-mcp-server

kubectl operations for applying, patching, deleting, and executing commands on Kubernetes resources. Use when modifying resources, running commands in pods, or managing resource lifecycle.

k8s-networking

859
from rohitg00/kubectl-mcp-server

Kubernetes networking management for services, ingresses, endpoints, and network policies. Use when configuring connectivity, load balancing, or network isolation.

k8s-multicluster

859
from rohitg00/kubectl-mcp-server

Manage multiple Kubernetes clusters, switch contexts, and perform cross-cluster operations. Use when working with multiple clusters, comparing environments, or managing cluster lifecycle.

k8s-kubevirt

859
from rohitg00/kubectl-mcp-server

Virtual machine management with KubeVirt on Kubernetes. Use when creating, managing, or troubleshooting VMs running on Kubernetes clusters.

k8s-kind

859
from rohitg00/kubectl-mcp-server

Manage kind (Kubernetes IN Docker) local clusters. Use when creating, testing, or developing with local Kubernetes clusters in Docker containers.

k8s-incident

859
from rohitg00/kubectl-mcp-server

Respond to Kubernetes incidents with runbooks and diagnostics. Use for outages, pod failures, node issues, network problems, and emergency response.