kubernetes-troubleshooting

Debug Kubernetes pods, services, networking, and scaling issues. Use this skill when troubleshooting K8s deployments, investigating pod failures, or diagnosing cluster problems.

Best use case

kubernetes-troubleshooting is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using kubernetes-troubleshooting should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/kubernetes-troubleshooting/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/devops/kubernetes-troubleshooting/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/kubernetes-troubleshooting/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How kubernetes-troubleshooting Compares

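| Feature / Agent | kubernetes-troubleshooting | Standard Approach |
|-----------------|----------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |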

Frequently Asked Questions

What does this skill do?

Debug Kubernetes pods, services, networking, and scaling issues. Use this skill when troubleshooting K8s deployments, investigating pod failures, or diagnosing cluster problems.

Where can I find the source code?

The source is on GitHub in the diegosouzapw/awesome-omni-skill repository, at skills/devops/kubernetes-troubleshooting/SKILL.md.

SKILL.md Source

# Kubernetes Troubleshooting

You are a Kubernetes expert. Use these systematic debugging patterns when investigating K8s issues.

## Diagnostic Decision Tree

```
Pod not running?
├── Pending → Resource constraints or scheduling issues
│   ├── kubectl describe pod <name> → check Events
│   ├── Insufficient CPU/memory → scale cluster or reduce requests
│   ├── Node selector/affinity not matching → check node labels
│   └── PVC not bound → check storage class and PV availability
├── CrashLoopBackOff → Application crashing on startup
│   ├── kubectl logs <pod> → check application logs
│   ├── kubectl logs <pod> --previous → check last crash logs
│   ├── OOMKilled → increase memory limits
│   ├── Exit code 1 → application error (bad config, missing env)
│   └── Exit code 137 → SIGKILL (128 + 9): OOM kill, or forced kill after a failed liveness probe
├── ImagePullBackOff → Can't pull container image
│   ├── Image name typo → verify image:tag exists
│   ├── Private registry → check imagePullSecrets
│   └── Rate limited → Docker Hub pull limit, use mirror
├── Running but not Ready → Readiness probe failing
│   ├── Check readiness probe config
│   ├── Application not listening on expected port
│   └── Dependency not available (database, cache)
└── Evicted → Node pressure
    ├── Disk pressure → clean up images, expand disk
    └── Memory pressure → reduce workload or add nodes
```
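
The top level of the tree can be sketched as a small lookup. This is only an illustration: `first_check` is a hypothetical helper, not a kubectl feature, and the pod name `web-0` is made up. It prints a reasonable first command for a given pod status and fails on a status it does not recognize.

```bash
# Hypothetical helper: suggest a first command for a pod status.
# Usage: first_check <status> <pod>
first_check() {
  status=$1
  pod=$2
  case "$status" in
    CrashLoopBackOff)
      # The previous container instance holds the crash output.
      echo "kubectl logs $pod --previous" ;;
    Pending|ImagePullBackOff|ErrImagePull|Evicted|Running)
      # Scheduling, image pull, eviction, and probe details
      # all surface in the describe output.
      echo "kubectl describe pod $pod" ;;
    *)
      echo "unknown status: $status" >&2
      return 1 ;;
  esac
}

first_check CrashLoopBackOff web-0
first_check Pending web-0
```

The `Running` branch covers the "Running but not Ready" case, where the describe output shows the readiness probe configuration and its recent failures.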

## Essential Debug Commands

### Pod Investigation
```bash
# Overview
kubectl get pods -A                          # All pods, all namespaces
kubectl get pods -o wide                     # With node and IP info
kubectl get pods --sort-by='.status.startTime' # Sorted by age

# Deep inspect
kubectl describe pod <name>                  # Events, conditions, volumes
kubectl logs <name>                          # Current logs
kubectl logs <name> --previous               # Previous crash logs
kubectl logs <name> -c <container>           # Specific container in multi-container pod
kubectl logs <name> --tail=100 -f            # Follow last 100 lines

# Interactive debug
kubectl exec -it <name> -- /bin/sh           # Shell into pod
kubectl exec -it <name> -- env               # Check environment
kubectl exec -it <name> -- cat /etc/resolv.conf  # Check DNS config

# Resource usage (requires metrics-server)
kubectl top pods                             # CPU/memory per pod
kubectl top nodes                            # CPU/memory per node
```

### Service & Networking
```bash
# Check service endpoints
kubectl get endpoints <service>              # Are pods registered?
kubectl get svc <service> -o yaml            # Service config

# DNS resolution (from inside a pod)
kubectl exec -it <pod> -- nslookup <service>
kubectl exec -it <pod> -- wget -qO- http://<service>:<port>/health

# Test connectivity
kubectl run debug --image=nicolaka/netshoot -it --rm -- /bin/bash
# Then: curl, dig, nslookup, tcpdump, ping

# Ingress
kubectl get ingress -A
kubectl describe ingress <name>
```

### Cluster Health
```bash
kubectl get nodes                            # Node status
kubectl describe node <name>                 # Node conditions, allocatable resources
kubectl get events --sort-by='.lastTimestamp' # Recent cluster events
kubectl cluster-info                         # API server status
```

## Common Issues and Fixes

### CrashLoopBackOff
```bash
# 1. Check logs
kubectl logs <pod> --previous

# 2. Common causes:
# - Missing environment variable → check deployment env/configmap/secret
# - Database not reachable → check network policy, service DNS
# - Port conflict → check containerPort in deployment
# - Permissions → check SecurityContext, ServiceAccount

# 3. Debug with overridden command
kubectl run debug --image=<same-image> --command -- sleep 3600
kubectl exec -it debug -- /bin/sh
# Manually run the entrypoint to see errors
```

### OOMKilled (Exit Code 137)
```bash
# Check current limits
kubectl describe pod <name> | grep -A 5 "Limits"

# Monitor actual usage first (requires metrics-server)
kubectl top pod <name>

# Fix: increase the memory limit in the deployment spec (YAML):
#   resources:
#     requests:
#       memory: "256Mi"
#     limits:
#       memory: "512Mi"  # Increase this
```
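
Exit codes above 128 follow the shell convention of 128 plus the signal number, which is why 137 means SIGKILL. The sketch below decodes a code on that basis; `explain_exit_code` is a hypothetical helper name, and only the two signals most relevant here are named explicitly.

```bash
# Hypothetical helper: explain a container exit code.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exit 0: completed successfully"
  elif [ "$code" -gt 128 ]; then
    # Codes above 128 mean "terminated by signal (code - 128)".
    sig=$((code - 128))
    case "$sig" in
      9)  name=SIGKILL ;;
      15) name=SIGTERM ;;
      *)  name="signal $sig" ;;
    esac
    echo "exit $code: terminated by $name"
  else
    # Anything else was chosen by the application itself.
    echo "exit $code: application error"
  fi
}

explain_exit_code 137
explain_exit_code 143
explain_exit_code 1
```

An exit code of 137 alone does not prove an OOM kill. The container's last termination reason is more specific; assuming the pod has restarted at least once, `kubectl get pod <name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'` should print `OOMKilled` when the memory limit was the cause.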

### Service Not Reachable
```bash
# Checklist:
# 1. Pod is Running and Ready?
kubectl get pods -l app=<name>

# 2. Service has endpoints?
kubectl get endpoints <service>
# If empty → Service selector doesn't match the pod labels, or no matching pod is Ready

# 3. Port correct?
kubectl get svc <service> -o jsonpath='{.spec.ports[*]}'
# targetPort must match the port the container actually listens on (normally its containerPort)

# 4. NetworkPolicy blocking?
kubectl get networkpolicy -A
```
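
The label check in step 2 can be illustrated without a cluster. A Service selects a pod only if every `key=value` pair in its selector is present in the pod's labels. The sketch below assumes both are given as comma-separated strings; `selector_matches` and the sample labels are hypothetical.

```bash
# Hypothetical helper: succeed if every selector pair appears in the pod labels.
# Usage: selector_matches "<k=v,k=v>" "<k=v,k=v,...>"
selector_matches() {
  selector=$1
  labels=$2
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    # Wrap in commas so "app=web" does not match "app=webserver".
    case ",$labels," in
      *",$pair,"*) ;;
      *) IFS=$old_ifs; return 1 ;;
    esac
  done
  IFS=$old_ifs
  return 0
}

selector_matches "app=web,tier=api" "app=web,tier=api,version=v2" && echo "match"
selector_matches "app=web" "app=webserver" || echo "mismatch"
```

On a real cluster, the two inputs would come from `kubectl get svc <service> -o jsonpath='{.spec.selector}'` and `kubectl get pods --show-labels`, which print the selector and labels in their own formats rather than the comma-separated form assumed here.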

### Persistent Volume Issues
```bash
# PVC stuck in Pending
kubectl describe pvc <name>
# Common: no matching PV, storage class missing, capacity insufficient

# Check storage classes
kubectl get storageclass

# Check PVs
kubectl get pv
```

## Resource Right-Sizing

### Requests vs Limits
```yaml
resources:
  requests:          # Guaranteed minimum — scheduler uses this
    cpu: "100m"      # 0.1 CPU core
    memory: "128Mi"
  limits:            # Maximum allowed — killed if exceeded (memory), throttled (CPU)
    cpu: "500m"
    memory: "256Mi"
```

**Rules of thumb:**
- `requests` = average usage + 20% buffer
- `limits` = peak usage + 30% buffer
- Set `requests` explicitly whenever you set `limits`; if `requests` is omitted, Kubernetes sets it equal to the limit
- CPU limits cause throttling — some teams only set requests for CPU
- Memory limits are hard — OOMKilled if exceeded
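
As a worked example of the first two rules, the sketch below applies the 20% and 30% buffers to measured usage. `suggest_resources` is a hypothetical helper; it takes integer values in a single unit (for example Mi or millicores) and rounds up.

```bash
# Hypothetical helper: apply the rule-of-thumb buffers to measured usage.
# Usage: suggest_resources <average> <peak>   (integers, same unit)
suggest_resources() {
  avg=$1
  peak=$2
  # Integer ceiling of avg * 1.2 and peak * 1.3.
  request=$(( (avg * 120 + 99) / 100 ))
  limit=$(( (peak * 130 + 99) / 100 ))
  echo "request=$request limit=$limit"
}

suggest_resources 200 400   # average 200Mi, peak 400Mi
```

For an average of 200Mi and a peak of 400Mi this suggests a 240Mi request and a 520Mi limit. The averages and peaks themselves should come from observed usage over a representative period, not a single `kubectl top` sample.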

### HPA (Horizontal Pod Autoscaler)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
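
The replica count the HPA aims for follows the documented formula `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, clamped to `minReplicas` and `maxReplicas`. The sketch below reproduces only that arithmetic; `hpa_desired_replicas` is a hypothetical helper, and it ignores the controller's tolerance band, stabilization windows, and handling of not-yet-ready pods.

```bash
# Hypothetical helper: core HPA arithmetic, without tolerance or stabilization.
# Usage: hpa_desired_replicas <current> <utilization%> <target%> <min> <max>
hpa_desired_replicas() {
  current=$1
  util=$2
  target=$3
  min=$4
  max=$5
  # Integer ceiling of current * util / target.
  desired=$(( (current * util + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

hpa_desired_replicas 4 90 70 2 10    # scale up under load
hpa_desired_replicas 2 30 70 2 10    # held at minReplicas
hpa_desired_replicas 8 140 70 2 10   # capped at maxReplicas
```

With the manifest above (target 70%, 2 to 10 replicas), four pods at 90% average CPU give 6 replicas. Utilization is measured against each pod's CPU request, so the target Deployment needs `resources.requests.cpu` set for this metric to work.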

## Quick Reference

| Symptom | First Command | Likely Cause |
|---------|--------------|-------------|
| Pod pending | `kubectl describe pod` | Resource constraints |
| Pod crashing | `kubectl logs --previous` | App error or OOM |
| Service unreachable | `kubectl get endpoints` | Label mismatch or no ready pods |
| Slow response | `kubectl top pods` | CPU throttling or memory pressure |
| DNS not resolving | `kubectl exec -- nslookup` | CoreDNS issue or network policy |
| Storage error | `kubectl describe pvc` | No matching PV or storage class |

Related Skills

All of the following are from diegosouzapw/awesome-omni-skill.

  • opentofu-kubernetes-explorer: Explore and manage Kubernetes clusters and resources using OpenTofu/Terraform.
  • linux-troubleshooting: Linux system troubleshooting workflow for diagnosing and resolving system issues, performance problems, and service failures.
  • learn-kubernetes-space-station-intermediate: Interactive narrative learning session that teaches Kubernetes through a Space Station adventure at intermediate level. Use this session when you want to learn Kubernetes through immersive story-driven chapters, hands-on exercises, and tasks grounded in real, up-to-date documentation.
  • kubernetes-orchestration: Kubernetes container orchestration. Use when deploying to Kubernetes, writing manifests, configuring Helm charts, or troubleshooting cluster issues.
  • kubernetes-ops: Kubernetes cluster operations: kubectl commands, manifest generation, Helm charts, RBAC, debugging, and deployment strategies.
  • kubernetes-operators: Kubernetes infrastructure patterns including operators, Helm, GitOps, and component provisioning.
  • kubernetes-deployment: Deploy, manage, and scale applications on Kubernetes clusters using manifests, Helm charts, and autoscaling configurations.
  • kubernetes-deployer: Package and deploy applications to Kubernetes with Dockerfiles, Helm charts, and local Minikube deployment. Use when containerizing applications, creating Kubernetes manifests, setting up Helm charts, deploying to Minikube, or preparing cloud-ready configurations. Focuses on local-first deployment with stateless services.
  • kubernetes-architect: Expert Kubernetes architect specializing in cloud-native infrastructure, advanced GitOps workflows (ArgoCD/Flux), and enterprise container orchestration.
  • Kind Local Kubernetes: This skill should be used when the user asks to "setup Kind", "local Kubernetes", "Kind cluster", "multi-node cluster", "Kubernetes development", "k8s local environment", or works with local Kubernetes clusters using Kind.
  • flux-troubleshooting: Use when Flux resources show Ready False, reconciliation errors appear in logs, deployments fail to sync from Git, HelmRelease installations fail, source artifacts are not being fetched, or image automation is not updating tags.
  • featbit-deployment-kubernetes: Deploys FeatBit to Kubernetes using Helm Charts. Use when user mentions "Kubernetes", "Helm", "K8s", "kubectl", works with values.yaml files, asks about "cloud deployment", "AKS", "EKS", "GKE", "ingress", or needs production-grade container orchestration setup.