distributed-debugging-debug-trace
You are a debugging expert specializing in setting up comprehensive debugging environments, distributed tracing, and diagnostic tools. Configure debugging workflows, implement tracing solutions, and establish troubleshooting practices for development and production environments.
About this skill
This skill transforms your AI agent into a seasoned debugging expert, specializing in complex software environments. It enables the agent to provide sophisticated guidance on setting up robust debugging infrastructures, architecting distributed tracing solutions across microservices and cloud-native applications, and establishing effective diagnostic and troubleshooting practices. The AI can configure detailed debugging workflows, advise on implementing various tracing technologies (like OpenTelemetry), recommend diagnostic tools, and define best practices for both development lifecycle and production incident response, enhancing observability and reliability.
Best use case
Troubleshooting complex distributed systems, optimizing development debugging processes, implementing observability best practices, responding to production incidents, designing system diagnostics, and creating documentation or guidelines for debugging and tracing.
Comprehensive plans for debugging environments, detailed distributed tracing strategies and configurations, expert recommendations for diagnostic tools, best practice guidelines for troubleshooting, or step-by-step guidance for resolving complex software issues.
Practical example
Example input
Help me design a robust distributed tracing strategy for our new cloud-native application, which uses Kubernetes, Kafka, and a mix of Python and Node.js microservices. We need to effectively track requests, identify latency, and correlate logs for rapid issue resolution.
Example output
Okay, as your debugging expert, let's outline a robust distributed tracing strategy for your Kubernetes, Kafka, Python, and Node.js cloud-native application:

**1. Adopt OpenTelemetry (OTel) as the Standard:**
* **Why:** OpenTelemetry is a vendor-neutral, open-source standard for instrumenting, generating, and exporting telemetry data (traces, metrics, and logs). It provides consistent APIs and SDKs across languages, ensuring interoperability.
* **Action:** Integrate OTel SDKs into all your Python and Node.js microservices.

**2. Service Instrumentation:**
* **Auto-Instrumentation:** Utilize OTel's auto-instrumentation capabilities for common frameworks (e.g., Express.js, Flask, FastAPI, gRPC, HTTP clients) to automatically capture basic request spans.
* **Manual Instrumentation:** Implement manual instrumentation for critical business logic, database queries, external API calls, and especially for Kafka producer/consumer interactions. Ensure `trace_id` and `span_id` are propagated correctly across service boundaries (e.g., as HTTP headers, Kafka message headers).

**3. Context Propagation via Kafka:**
* **Producers:** Before sending a message to Kafka, extract the current OTel trace context and inject it into the Kafka message headers.
* **Consumers:** Upon receiving a message, extract the trace context from the Kafka message headers and use it to continue the trace. This ensures end-to-end visibility through your asynchronous message queues.

**4. OpenTelemetry Collectors:**
* **Deployment:** Deploy OTel Collectors within your Kubernetes cluster, ideally as a DaemonSet or Sidecar for high-traffic services, or as a Deployment for collecting traces from multiple services.
* **Function:** Collectors can receive traces from your services, batch, process (e.g., attribute modification, sampling), and export them to your chosen backend.

**5. Choose a Tracing Backend:**
* **Managed Services:** Integrate with a cloud-native APM solution (e.g., Datadog, New Relic, Honeycomb, AWS X-Ray, Google Cloud Trace) that natively supports OpenTelemetry. These offer robust UIs, alerting, and analysis.
* **Self-Hosted:** Consider deploying Jaeger or Grafana Tempo within your cluster for trace storage and visualization. These require more operational overhead but offer full control.

**6. Log and Metric Correlation:**
* **Logging:** Ensure your logging system (e.g., ELK stack, Grafana Loki) captures the `trace_id` and `span_id` from the OpenTelemetry context with every log line. This allows you to jump directly from a trace span to relevant logs for deep debugging.
* **Metrics:** Leverage OpenTelemetry to generate service-level metrics (e.g., request count, latency, error rates) alongside traces. Correlate these metrics with traces to identify performance anomalies and drill down into specific problematic requests.

**7. Debugging Workflow & Best Practices:**
* **Development:** Encourage developers to review local or dev environment traces to understand execution flows and identify performance issues early.
* **Production:** Configure dashboards (e.g., Grafana) to visualize service maps, trace dependencies, and critical path latencies. Set up alerts for unexpected trace patterns (e.g., increased error spans, long-running transactions).
* **Sampling:** Implement intelligent sampling strategies (e.g., head-based, tail-based sampling) in your OTel Collectors to manage trace volume while retaining critical traces.

**Key Implementation Steps:**
1. **Dependency Addition:** Add OTel SDKs and instrumentation libraries to all Python and Node.js services.
2. **Configuration:** Configure OTel exporters to send data to your OTel Collectors.
3. **Collector Deployment:** Deploy and configure OTel Collectors in Kubernetes.
4. **Backend Integration:** Set up your chosen tracing backend and ensure it's receiving data.
5. **Validation:** Test end-to-end tracing with representative request flows.
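To make step 3 of the example output more concrete, here is a minimal sketch in plain Python of how trace context can travel in Kafka message headers. It hand-rolls the W3C `traceparent` header format with the standard library only, so the mechanics are visible. The helpers `new_span_id`, `inject`, and `extract` are hypothetical names chosen for illustration, and the sketch assumes headers are a list of `(key, bytes)` pairs, which is how Python Kafka clients commonly represent them. In a real service you would most likely rely on the OpenTelemetry SDK's propagators rather than code like this.

```python
import re
import secrets

# W3C Trace Context "traceparent": version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    """Return a random 8-byte span id as 16 hex characters."""
    return secrets.token_hex(8)


def inject(headers, trace_id, span_id, sampled=True):
    """Producer side: return a copy of the headers with traceparent appended."""
    flags = "01" if sampled else "00"
    value = f"00-{trace_id}-{span_id}-{flags}"
    return list(headers) + [("traceparent", value.encode("ascii"))]


def extract(headers):
    """Consumer side: return (trace_id, parent_span_id, sampled), or None."""
    for key, raw in headers:
        if key != "traceparent":
            continue
        match = TRACEPARENT_RE.match(raw.decode("ascii", errors="replace"))
        if match is None:
            return None
        trace_id, parent_id, flags = match.groups()
        # All-zero ids are invalid in the W3C format.
        if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
            return None
        return trace_id, parent_id, (int(flags, 16) & 1) == 1
    return None


if __name__ == "__main__":
    # Producer: start a trace and attach its context to the outgoing message.
    trace_id = secrets.token_hex(16)
    producer_span = new_span_id()
    headers = inject([("content-type", b"application/json")], trace_id, producer_span)

    # Consumer: recover the context and start a child span in the same trace.
    context = extract(headers)
    if context is not None:
        parent_trace_id, parent_span_id, sampled = context
        consumer_span = new_span_id()
        print("continuing trace", parent_trace_id, "parent", parent_span_id,
              "child", consumer_span, "sampled", sampled)
```

The consumer reuses the producer's trace id and records the producer's span id as its parent, which is what lets a tracing backend stitch both sides of the queue into one trace. Returning `None` on a malformed header lets the consumer fall back to starting a fresh trace instead of failing the message.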
When to use this skill
- When you need to design or refine debugging workflows for development teams.
- When implementing or improving distributed tracing in microservices, serverless, or cloud-native architectures.
- When establishing or enhancing incident response and troubleshooting practices for production systems.
- When diagnosing performance bottlenecks, elusive bugs, or intermittent failures in complex applications.
When not to use this skill
- For direct, real-time interaction with a live debugger or executing code on a remote system (unless integrated with a specific execution tool skill).
- When requiring physical hardware debugging or very low-level machine code analysis without specific tooling integration.
- When the primary need is for a specific, pre-defined tool invocation rather than expert advice or configuration generation.
- For tasks that require direct human intuition or sensory perception that AI cannot replicate.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/distributed-debugging-debug-trace/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How distributed-debugging-debug-trace Compares
| Feature / Agent | distributed-debugging-debug-trace | Standard Approach |
|---|---|---|
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | easy | N/A |
Frequently Asked Questions
What does this skill do?
It sets the agent up as a debugging expert specializing in comprehensive debugging environments, distributed tracing, and diagnostic tools. The agent configures debugging workflows, implements tracing solutions, and establishes troubleshooting practices for development and production environments.
Which AI agents support this skill?
This skill is designed for Claude.
How difficult is it to install?
The installation complexity is rated as easy. You can find the installation instructions above.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
ChatGPT vs Claude for Agent Skills
Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.
SKILL.md Source
# Debug and Trace Configuration

You are a debugging expert specializing in setting up comprehensive debugging environments, distributed tracing, and diagnostic tools. Configure debugging workflows, implement tracing solutions, and establish troubleshooting practices for development and production environments.

## Use this skill when

- Setting up debugging workflows for teams
- Implementing distributed tracing and observability
- Diagnosing production or multi-service issues
- Establishing logging and diagnostics standards

## Do not use this skill when

- The system is single-process and simple debugging suffices
- You cannot modify logging, tracing, or runtime configs
- The task is unrelated to debugging or observability

## Context

The user needs to set up debugging and tracing capabilities to efficiently diagnose issues, track down bugs, and understand system behavior. Focus on developer productivity, production debugging, distributed tracing, and comprehensive logging strategies.

## Requirements

$ARGUMENTS

## Instructions

- Identify services, trace boundaries, and key spans.
- Configure local debugging and production-safe tracing.
- Standardize log/trace fields and correlation IDs.
- Validate end-to-end trace coverage and sampling.
- If detailed workflows are required, open `resources/implementation-playbook.md`.

## Safety

- Avoid enabling verbose tracing in production without safeguards.
- Redact secrets and PII from logs and traces.

## Resources

- `resources/implementation-playbook.md` for detailed tooling and configuration patterns.
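As an illustration of two points in the SKILL.md source above, standardized correlation fields and redaction of secrets and PII, the following is a rough sketch using only Python's standard `logging` module. Everything named here is an assumption made for the example: the context variables `current_trace_id` and `current_span_id` stand in for whatever your tracing SDK exposes, `build_logger` is a hypothetical helper, and the two regular expressions are deliberately simplistic placeholders rather than a complete redaction policy.

```python
import json
import logging
import re
from contextvars import ContextVar

# Hypothetical holders for the active trace context; a tracing SDK would
# normally supply these values from the current span.
current_trace_id = ContextVar("current_trace_id", default="")
current_span_id = ContextVar("current_span_id", default="")

# Illustrative patterns only; real redaction rules depend on your data.
SECRET_RE = re.compile(r"(?i)\b(password|token|api_key)=\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text):
    """Mask key=value secrets and email addresses in a log message."""
    text = SECRET_RE.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


class CorrelationFilter(logging.Filter):
    """Attach the current trace and span ids to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        record.span_id = current_span_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with a fixed set of fields."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": redact(record.getMessage()),
            "trace_id": getattr(record, "trace_id", ""),
            "span_id": getattr(record, "span_id", ""),
        })


def build_logger(name, stream=None):
    """Return a logger that writes redacted, trace-correlated JSON lines."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.addFilter(CorrelationFilter())
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    current_span_id.set("00f067aa0ba902b7")
    log.info("login failed password=%s user=%s", "hunter2", "ann@example.com")
```

Redaction runs inside the formatter so that it applies to the fully rendered message, including interpolated arguments. Keeping the field names identical across services (`trace_id`, `span_id`) is what allows a log backend to join log lines to spans.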
Related Skills
environment-setup-guide
Guide developers through setting up development environments with proper tools, dependencies, and configurations
ios-debugger-agent
Debug the current iOS project on a booted simulator with XcodeBuildMCP.
error-diagnostics-smart-debug
Use when working with error diagnostics smart debug
error-diagnostics-error-trace
You are an error tracking and observability expert specializing in implementing comprehensive error monitoring solutions. Set up error tracking systems, configure alerts, implement structured logging,
error-debugging-multi-agent-review
Use when working with error debugging multi agent review
error-debugging-error-trace
You are an error tracking and observability expert specializing in implementing comprehensive error monitoring solutions. Set up error tracking systems, configure alerts, implement structured logging, and ensure teams can quickly identify and resolve production issues.
error-debugging-error-analysis
You are an expert error analysis specialist with deep expertise in debugging distributed systems, analyzing production incidents, and implementing comprehensive observability solutions.
distributed-tracing
Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.
debugging-toolkit-smart-debug
Use when working with debugging toolkit smart debug
debugging-strategies
Transform debugging from frustrating guesswork into systematic problem-solving with proven strategies, powerful tools, and methodical approaches.
debugger
Debugging specialist for errors, test failures, and unexpected behavior. Use proactively when encountering any issues.
debug-buttercup
All pods run in namespace crs. Use when pods in the crs namespace are in CrashLoopBackOff, OOMKilled, or restarting, multiple services restart simultaneously (cascade failure), or redis is unresponsive or showing AOF warnings.