k8s-debug
Comprehensive Kubernetes debugging and troubleshooting toolkit. Use this skill when diagnosing Kubernetes cluster issues, debugging failing pods, investigating network connectivity problems, analyzing resource usage, troubleshooting deployments, or performing cluster health checks.
Best use case
k8s-debug is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using k8s-debug should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/k8s-debug/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How k8s-debug Compares
| Feature / Agent | k8s-debug | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Comprehensive Kubernetes debugging and troubleshooting toolkit. Use this skill when diagnosing Kubernetes cluster issues, debugging failing pods, investigating network connectivity problems, analyzing resource usage, troubleshooting deployments, or performing cluster health checks.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Kubernetes Debugging Skill

## Overview

Systematic toolkit for debugging and troubleshooting Kubernetes clusters, pods, services, and deployments. Provides scripts, workflows, and reference guides for identifying and resolving common Kubernetes issues efficiently.

## When to Use This Skill

Invoke this skill when encountering:

- Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
- Service connectivity or DNS resolution issues
- Network policy or ingress problems
- Volume and storage mount failures
- Deployment rollout issues
- Cluster health or performance degradation
- Resource exhaustion (CPU/memory)
- Configuration problems (ConfigMaps, Secrets, RBAC)

## Debugging Workflow

Follow this systematic approach for any Kubernetes issue:

### 1. Identify the Problem Layer

Categorize the issue:

- **Application Layer**: Application crashes, errors, bugs
- **Pod Layer**: Pod not starting, restarting, or pending
- **Service Layer**: Network connectivity, DNS issues
- **Node Layer**: Node not ready, resource exhaustion
- **Cluster Layer**: Control plane issues, API problems
- **Storage Layer**: Volume mount failures, PVC issues
- **Configuration Layer**: ConfigMap, Secret, RBAC issues

### 2. Gather Diagnostic Information

Use the appropriate diagnostic script based on scope:

#### Pod-Level Diagnostics

Use `scripts/pod_diagnostics.py` for comprehensive pod analysis:

```bash
python3 scripts/pod_diagnostics.py <pod-name> -n <namespace>
```

This script gathers:

- Pod status and description
- Pod events
- Container logs (current and previous)
- Resource usage
- Node information
- YAML configuration

Output can be saved for analysis: `python3 scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt`

#### Cluster-Level Health Check

Use `scripts/cluster_health.sh` for overall cluster diagnostics:

```bash
./scripts/cluster_health.sh
```

This script checks:

- Cluster info and version
- Node status and resources
- Pods across all namespaces
- Failed/pending pods
- Recent events
- Deployments, services, statefulsets, daemonsets
- PVCs and PVs
- Component health
- Common error states (CrashLoopBackOff, ImagePullBackOff)

#### Network Diagnostics

Use `scripts/network_debug.sh` for connectivity issues:

```bash
./scripts/network_debug.sh <namespace> <pod-name>
```

This script analyzes:

- Pod network configuration
- DNS setup and resolution
- Service endpoints
- Network policies
- Connectivity tests
- CoreDNS logs

### 3. Follow Issue-Specific Workflow

Based on the identified issue, consult `references/troubleshooting_workflow.md` for detailed workflows:

- **Pod Pending**: Resource/scheduling workflow
- **CrashLoopBackOff**: Application crash workflow
- **ImagePullBackOff**: Image pull workflow
- **Service issues**: Network connectivity workflow
- **DNS failures**: DNS troubleshooting workflow
- **Resource exhaustion**: Performance investigation workflow
- **Storage issues**: PVC binding workflow
- **Deployment stuck**: Rollout workflow

### 4. Apply Targeted Fixes

Refer to `references/common_issues.md` for specific solutions to common problems.
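
As a rough illustration of how steps 1 and 3 connect, the sketch below maps a pod's reported status to one of the pod-level workflow names listed above. It is not part of the skill's scripts: `suggest_workflow` and the `WORKFLOWS` table are hypothetical names, and the input is assumed to be the object printed by `kubectl get pod <pod-name> -n <namespace> -o json`.

```python
from typing import Optional

# Hypothetical lookup table: pod symptom -> workflow name from the list above.
WORKFLOWS = {
    "Pending": "Resource/scheduling workflow",
    "CrashLoopBackOff": "Application crash workflow",
    "ImagePullBackOff": "Image pull workflow",
    "ErrImagePull": "Image pull workflow",
}


def suggest_workflow(pod: dict) -> Optional[str]:
    """Return a workflow name for a pod object, or None if no known symptom matches."""
    status = pod.get("status", {})
    # Check container waiting reasons first: a pod that cannot pull its image
    # is also in phase Pending, and the waiting reason is the more specific signal.
    for container in status.get("containerStatuses", []):
        reason = container.get("state", {}).get("waiting", {}).get("reason")
        if reason in WORKFLOWS:
            return WORKFLOWS[reason]
    if status.get("phase") == "Pending":
        return WORKFLOWS["Pending"]
    return None


# Trimmed, made-up pod object for illustration only.
sample_pod = {
    "status": {
        "phase": "Pending",
        "containerStatuses": [
            {"name": "app", "state": {"waiting": {"reason": "ImagePullBackOff"}}}
        ],
    }
}
print(suggest_workflow(sample_pod))  # → Image pull workflow
```

A lookup like this only narrows down where to start; the events and logs gathered in step 2 are still needed to confirm the cause.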

## Common Debugging Patterns

### Pattern 1: Pod Not Starting

```bash
# Quick assessment
kubectl get pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

# Detailed diagnostics
python3 scripts/pod_diagnostics.py <pod-name> -n <namespace>

# Check common causes:
# - ImagePullBackOff: Verify image exists and credentials
# - CrashLoopBackOff: Check logs with --previous flag
# - Pending: Check node resources and scheduling
```

### Pattern 2: Service Connectivity Issues

```bash
# Verify service and endpoints
kubectl get svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>

# Network diagnostics
./scripts/network_debug.sh <namespace> <pod-name>

# Test connectivity from debug pod
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
# Inside: curl <service-name>.<namespace>.svc.cluster.local:<port>

# Check network policies
kubectl get networkpolicies -n <namespace>
```

### Pattern 3: Application Performance Issues

```bash
# Check resource usage
kubectl top nodes
kubectl top pods -n <namespace> --containers

# Get pod metrics
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 resources

# Check for OOMKilled
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 lastState

# Review application logs
kubectl logs <pod-name> -n <namespace> --tail=100
```

### Pattern 4: Cluster Health Assessment

```bash
# Run comprehensive health check
./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt

# Review output for:
# - Node conditions and resource pressure
# - Failed or pending pods
# - Recent error events
# - Component health status
# - Resource quota usage
```

## Essential Manual Commands

While scripts automate diagnostics, understand these core commands:

### Pod Debugging

```bash
# View pod status
kubectl get pods -n <namespace> -o wide

# Detailed pod information
kubectl describe pod <pod-name> -n <namespace>

# View logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # Previous container
kubectl logs <pod-name> -n <namespace> -c <container>  # Specific container

# Execute commands in pod
kubectl exec <pod-name> -n <namespace> -it -- /bin/sh

# Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml
```

### Service and Network Debugging

```bash
# Check services
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>

# Check endpoints
kubectl get endpoints -n <namespace>

# Test DNS
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default

# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```

### Resource Monitoring

```bash
# Node resources
kubectl top nodes
kubectl describe nodes

# Pod resources
kubectl top pods -n <namespace>
kubectl top pod <pod-name> -n <namespace> --containers
```

### Emergency Operations

```bash
# Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>

# Rollback deployment
kubectl rollout undo deployment/<name> -n <namespace>

# Force delete stuck pod
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

# Drain node (maintenance)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Cordon node (prevent scheduling)
kubectl cordon <node-name>
```

## Reference Documentation

### Detailed Troubleshooting Guides

Consult `references/troubleshooting_workflow.md` for:

- Step-by-step workflows for each issue type
- Decision trees for diagnosis
- Command sequences for systematic debugging
- Quick reference command cheat sheet

### Common Issues Database

Consult `references/common_issues.md` for:

- Detailed explanations of each common issue
- Symptoms and causes
- Specific debugging steps
- Solutions and fixes
- Prevention strategies

## Best Practices

### Systematic Approach

1. **Observe**: Gather facts before making changes
2. **Analyze**: Use diagnostic scripts to collect comprehensive data
3. **Hypothesize**: Form theory about root cause
4. **Test**: Verify hypothesis with targeted commands
5. **Fix**: Apply appropriate solution
6. **Verify**: Confirm issue is resolved
7. **Document**: Record findings for future reference

### Data Collection

- Save diagnostic output to files for analysis
- Capture logs before restarting failing pods
- Record events timeline for incident reports
- Export resource metrics for trend analysis

### Prevention

- Set appropriate resource requests and limits
- Implement health checks (liveness/readiness probes)
- Use proper logging and monitoring
- Apply network policies incrementally
- Test changes in non-production environments
- Maintain documentation of cluster architecture

## Advanced Debugging Techniques

### Debug Containers (Kubernetes 1.23+)

```bash
# Attach ephemeral debug container
kubectl debug <pod-name> -n <namespace> -it --image=nicolaka/netshoot

# Create debug copy of pod
kubectl debug <pod-name> -n <namespace> -it --copy-to=<debug-pod-name> --container=<container>
```

### Port Forwarding for Testing

```bash
# Forward pod port to local machine
kubectl port-forward pod/<pod-name> -n <namespace> <local-port>:<pod-port>

# Forward service port
kubectl port-forward svc/<service-name> -n <namespace> <local-port>:<service-port>
```

### Proxy for API Access

```bash
# Start kubectl proxy
kubectl proxy --port=8080

# Access API
curl http://localhost:8080/api/v1/namespaces/<namespace>/pods/<pod-name>
```

### Custom Column Output

```bash
# Custom pod info
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP

# Node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```

## Troubleshooting Checklist

Before escalating issues, verify:

- [ ] Reviewed pod events: `kubectl describe pod`
- [ ] Checked pod logs (current and previous)
- [ ] Verified resource availability on nodes
- [ ] Confirmed image exists and is accessible
- [ ] Validated service selectors match pod labels
- [ ] Tested DNS resolution from pods
- [ ] Checked network policies
- [ ] Reviewed recent cluster events
- [ ] Confirmed ConfigMaps/Secrets exist
- [ ] Validated RBAC permissions
- [ ] Checked for resource quotas/limits
- [ ] Reviewed cluster component health

## Related Tools

Useful additional tools for Kubernetes debugging:

- **kubectl-debug**: Advanced debugging plugin
- **stern**: Multi-pod log tailing
- **kubectx/kubens**: Context and namespace switching
- **k9s**: Terminal UI for Kubernetes
- **lens**: Desktop IDE for Kubernetes
- **Prometheus/Grafana**: Monitoring and alerting
- **Jaeger/Zipkin**: Distributed tracing
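
The troubleshooting checklist item "Validated service selectors match pod labels" lends itself to a mechanical check. The following is a minimal sketch, assuming the Service object comes from `kubectl get svc <service-name> -n <namespace> -o json` and the pod objects from the `items` array of `kubectl get pods -n <namespace> -o json`. The function names `selector_matches` and `matching_pods` are hypothetical and do not appear in the skill's scripts.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """True if every key/value pair in the Service selector is present in the pod labels."""
    if not selector:
        # A Service without a selector does not select any pods by label.
        return False
    return all(labels.get(key) == value for key, value in selector.items())


def matching_pods(service: dict, pods: list) -> list:
    """Return the names of pods whose labels satisfy the Service's spec.selector."""
    selector = service.get("spec", {}).get("selector") or {}
    return [
        pod["metadata"]["name"]
        for pod in pods
        if selector_matches(selector, pod["metadata"].get("labels") or {})
    ]


# Trimmed, made-up objects for illustration only.
service = {"spec": {"selector": {"app": "web"}}}
pods = [
    {"metadata": {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}}},
    {"metadata": {"name": "db-1", "labels": {"app": "db"}}},
]
print(matching_pods(service, pods))  # → ['web-1']
```

An empty result here usually corresponds to an empty endpoints list in `kubectl get endpoints <service-name> -n <namespace>`, which is a quicker first check when cluster access is available.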
Related Skills
error-debugging-error-trace
You are an error tracking and observability expert specializing in implementing comprehensive error monitoring solutions. Set up error tracking systems, configure alerts, implement structured loggi...
debugging-dags
Comprehensive DAG failure diagnosis and root cause analysis. Use for complex debugging requests requiring deep investigation like "diagnose and fix the pipeline", "full root cause analysis", "why is this failing and how to prevent it". For simple debugging ("why did dag fail", "show logs"), the airflow entrypoint skill handles it directly. This skill provides structured investigation and prevention recommendations.
debug:kubernetes
Debug Kubernetes clusters and workloads systematically with this comprehensive troubleshooting skill. Covers CrashLoopBackOff, ImagePullBackOff, OOMKilled, pending pods, service connectivity issues, PVC binding failures, and RBAC permission errors. Provides structured four-phase debugging methodology with kubectl commands, ephemeral debug containers, and essential one-liners for diagnosing pod, service, network, and storage problems across namespaces.
user-state-debugging
Expert knowledge on debugging user account issues, diagnostic scripts (inspect-user-state.js), fix scripts (fix-user-billing-state.js, reset-user-onboarding.js), onboarding problems, billing sync issues, and Clerk vs database mismatches. Use this skill when user asks about "user stuck", "onboarding broken", "billing out of sync", "debug user", "reset user", or "user state".
systematic-debugging
Use when encountering any bug, test failure, or unexpected behavior, before proposing fixes
sentry-debugger
Debug production issues using Sentry error tracking API. Use when Claude needs to investigate production errors, crashes, or issues by fetching error data from Sentry - including stack traces, user context, breadcrumbs, and error frequency. Triggers on requests like "check Sentry for errors", "debug this production issue", "what's causing crashes", "investigate errors in [project]", or when users share Sentry issue URLs.
qa-debugging
Systematic debugging methodologies, troubleshooting workflows, logging strategies, error tracking, performance profiling, stack trace analysis, and debugging tools across languages and environments. Covers local debugging, distributed systems, production issues, and root cause analysis.
python-automated-debugging
Use when fixing a Python test that has failed multiple attempts, when print-debugging hasn't revealed the issue, or when you need to investigate runtime state systematically
nextjs-production-debugger
Advanced debugging guide for Next.js App Router production issues including SSR/CSR bugs, hydration errors, runtime mismatches, performance, and caching.
methodical-debugging
Systematic debugging approach using parallel investigation and test-driven validation. Use when debugging issues, when stuck in a loop of trying different fixes, or when facing complex bugs that resist standard debugging approaches.
juliaz-debug
Cross-system diagnostics and troubleshooting for the juliaz_agents multi-agent system. Trigger when Raphael reports something is broken, not working, or behaving unexpectedly — messages not arriving, Julia not responding, bridge errors, orchestrator crashes, rate limits, silent failures, queue issues, or any 'why isn't X working' question. Also trigger for: 'debug', 'broken', 'not working', 'error', 'Julia isn't responding', 'messages stuck', 'bridge down', 'check logs', 'what went wrong', or any troubleshooting request. If something in the multi-agent system is misbehaving, this is the skill to reach for.
js-reverse-automation-page-redirect-debugger
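A catch-all approach for locating the JavaScript code behind a page redirect: trigger a debugger breakpoint before the redirect to locate the calling source. Enable only when you have confirmed a need to trace a redirect.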