error-debugging-error-analysis
You are an expert error analysis specialist with deep expertise in debugging distributed systems, analyzing production incidents, and implementing comprehensive observability solutions.
About this skill
This skill sets your AI agent up as an error analysis specialist for diagnosing complex issues in distributed systems. It guides the agent to analyze production incidents, pinpoint root causes, and recommend fixes and preventive measures. It also guides the agent in proposing observability improvements, such as monitoring, alerting, and tracing, so that problems are detected earlier and system stability and performance are easier to maintain.
Best use case
Investigating production incidents, performing root-cause analysis across services, and designing observability and error handling improvements.
Users can expect detailed root-cause analysis reports, actionable debugging steps, and strategic recommendations for improving system stability, error handling, and observability, leading to faster incident resolution and reduced downtime.
Practical example
Example input
Our `AuthService` is experiencing intermittent `500` errors, specifically `java.net.SocketTimeoutException` when calling the `UserService`. This happens primarily during peak load. Attached are recent logs from both services, relevant system metrics, and the microservice architecture diagram. Please analyze and provide a root cause, potential solutions, and recommendations for improved monitoring.
Example output
**Incident Analysis: AuthService Intermittent `500` Errors (`java.net.SocketTimeoutException`)**
**Root Cause Hypothesis:**
The intermittent `java.net.SocketTimeoutException` in `AuthService` when calling `UserService` during peak load strongly suggests resource contention or a performance bottleneck within `UserService` or its underlying data stores. The timeout means `AuthService` waited longer than its configured limit for a response, which implies that `UserService` is either slow to process requests or overwhelmed.
**Potential Solutions:**
1. **Scaling `UserService`:** Increase the number of instances or allocated resources (CPU, memory) for `UserService` to handle peak load efficiently.
2. **Optimize `UserService` Database Queries:** Analyze and optimize slow queries, add appropriate indexes, or consider read replicas if database access is a bottleneck.
3. **Implement Circuit Breaker:** Introduce a circuit breaker pattern in `AuthService` for calls to `UserService` to prevent cascading failures and provide graceful degradation.
4. **Asynchronous Communication:** Explore making certain interactions between `AuthService` and `UserService` asynchronous where feasible to reduce direct request coupling and improve resilience.
5. **Increase Timeout (Temporary):** As a short-term measure, slightly increasing the `AuthService` timeout for `UserService` calls might alleviate *some* symptoms, but does not address the underlying performance issue.
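To make the circuit breaker option (solution 3) more concrete, here is a minimal sketch of the pattern. It is written in Python for brevity, although the services in this example appear to be JVM-based, and it is an illustration under assumptions rather than a recommended implementation. The names `CircuitBreaker` and `CircuitOpenError`, the failure threshold, and the reset timeout are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.reset_timeout = reset_timeout          # seconds; hypothetical default
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a struggling dependency.
                raise CircuitOpenError("circuit open; skipping call")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In `AuthService`, each outbound request to `UserService` would be wrapped in `breaker.call(...)`, and a `CircuitOpenError` would be translated into a degraded response rather than a blocked thread.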
**Observability Recommendations:**
1. **Distributed Tracing:** Ensure end-to-end distributed tracing is fully implemented and correctly configured for both `AuthService` and `UserService` to visualize request flows and pinpoint latency hotspots.
2. **Granular Metrics:**
* **`UserService` Latency:** Track request processing time, queue depth, and resource utilization (CPU, memory, database connections).
* **`AuthService` Error Rates:** Monitor `5xx` error rates and `SocketTimeoutException` frequency specifically.
3. **Alerting:** Set up proactive alerts for:
* Elevated `UserService` latency (e.g., p95 exceeding 500ms).
* High `AuthService` `SocketTimeoutException` rates.
* Resource exhaustion warnings on `UserService` hosts.
4. **Logging Context:** Enhance logging in both services to include unique request IDs and relevant contextual information (e.g., user ID, specific endpoint called) to aid in debugging.
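The p95 latency alert mentioned above can be sketched as a check over one evaluation window of latency samples. This is an illustration only: the function names are hypothetical, the nearest-rank percentile definition is an assumption, and in practice the monitoring system would normally compute percentiles itself.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alert(samples_ms, threshold_ms=500.0, pct=95):
    """Return (should_alert, observed_percentile) for one evaluation window."""
    observed = percentile(samples_ms, pct)
    return observed > threshold_ms, observed


# Hypothetical window: 18 fast requests and 2 slow ones.
window = [100.0] * 18 + [2000.0] * 2
should_alert, p95 = latency_alert(window)
```

With only one slow request in twenty, the p95 would stay at the fast value and the alert would not fire, which is why a percentile threshold is less noisy than alerting on the maximum.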
**Next Steps:**
* Prioritize a deeper dive into `UserService` performance metrics during peak load.
* Review `UserService` database query performance.
* Consider a phased implementation of solutions, starting with scaling and optimization, followed by resilience patterns.
When to use this skill
- Investigating production incidents or recurring errors
- Performing root-cause analysis across services
- Designing observability and error handling improvements
When not to use this skill
- The task is purely feature development
- You cannot access relevant logs, metrics, or system data
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/error-debugging-error-analysis/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How error-debugging-error-analysis Compares
| Feature / Agent | error-debugging-error-analysis | Standard Approach |
|---|---|---|
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Easy | N/A |
Frequently Asked Questions
What does this skill do?
It configures the agent as an error analysis specialist for debugging distributed systems, analyzing production incidents, and implementing comprehensive observability solutions.
Which AI agents support this skill?
This skill is designed for Claude.
How difficult is it to install?
The installation complexity is rated as easy. You can find the installation instructions above.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Top AI Agents for Productivity
See the top AI agent skills for productivity, workflow automation, operational systems, documentation, and everyday task execution.
Cursor vs Codex for AI Workflows
Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.
SKILL.md Source
# Error Analysis and Resolution

You are an expert error analysis specialist with deep expertise in debugging distributed systems, analyzing production incidents, and implementing comprehensive observability solutions.

## Use this skill when

- Investigating production incidents or recurring errors
- Performing root-cause analysis across services
- Designing observability and error handling improvements

## Do not use this skill when

- The task is purely feature development
- You cannot access error reports, logs, or traces
- The issue is unrelated to system reliability

## Context

This tool provides systematic error analysis and resolution capabilities for modern applications. You will analyze errors across the full application lifecycle, from local development to production incidents, using industry-standard observability tools, structured logging, distributed tracing, and advanced debugging techniques. Your goal is to identify root causes, implement fixes, establish preventive measures, and build robust error handling that improves system reliability.

## Requirements

Analyze and resolve errors in: $ARGUMENTS

The analysis scope may include specific error messages, stack traces, log files, failing services, or general error patterns. Adapt your approach based on the provided context.

## Instructions

- Gather error context, timestamps, and affected services.
- Reproduce or narrow the issue with targeted experiments.
- Identify root cause and validate with evidence.
- Propose fixes, tests, and preventive measures.
- If detailed playbooks are required, open `resources/implementation-playbook.md`.

## Safety

- Avoid making changes in production without approval and rollback plans.
- Redact secrets and PII from shared diagnostics.

## Resources

- `resources/implementation-playbook.md` for detailed analysis frameworks and checklists.
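The Safety rule in the SKILL.md source above ("Redact secrets and PII from shared diagnostics") can be illustrated with a small sketch. It is not part of the skill itself. The `redact` function and its three patterns are hypothetical and deliberately narrow; real diagnostics would need patterns tuned to the data they actually contain.

```python
import re

# Illustrative patterns only (assumptions, not an exhaustive PII or secret list).
_PATTERNS = [
    # Email addresses.
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<email>"),
    # HTTP bearer tokens.
    (re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer <token>"),
    # key=value or key: value pairs with sensitive-looking keys.
    (re.compile(r"(?i)\b(password|secret|api[_-]?key|token)(\s*[=:]\s*)\S+"), r"\1\2<redacted>"),
]


def redact(line):
    """Apply each pattern in order and return the scrubbed log line."""
    for pattern, replacement in _PATTERNS:
        line = pattern.sub(replacement, line)
    return line


scrubbed = redact("user=alice@example.com password=hunter2")
```

Running logs through a function like this before attaching them to an incident report keeps the timestamps and error context intact while removing the values that should not be shared.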
Related Skills
error-diagnostics-error-trace
You are an error tracking and observability expert specializing in implementing comprehensive error monitoring solutions. Set up error tracking systems, configure alerts, implement structured logging,
error-debugging-error-trace
You are an error tracking and observability expert specializing in implementing comprehensive error monitoring solutions. Set up error tracking systems, configure alerts, implement structured logging, and ensure teams can quickly identify and resolve production issues.
linux-shell-scripting
Provide production-ready shell script templates for common Linux system administration tasks including backups, monitoring, user management, log analysis, and automation. These scripts serve as building blocks for security operations and penetration testing environments.
iterate-pr
Iterate on a PR until CI passes. Use when you need to fix CI failures, address review feedback, or continuously push fixes until all checks are green. Automates the feedback-fix-push-wait cycle.
istio-traffic-management
Comprehensive guide to Istio traffic management for production service mesh deployments.
incident-runbook-templates
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
incident-response-smart-fix
[Extended thinking: This workflow implements a sophisticated debugging and resolution pipeline that leverages AI-assisted debugging tools and observability platforms to systematically diagnose and res
incident-responder
Expert SRE incident responder specializing in rapid problem resolution, modern observability, and comprehensive incident management.
expo-cicd-workflows
Helps understand and write EAS workflow YAML files for Expo projects. Use this skill when the user asks about CI/CD or workflows in an Expo or EAS context, mentions .eas/workflows/, or wants help with EAS build pipelines or deployment automation.
docker-expert
You are an advanced Docker containerization expert with comprehensive, practical knowledge of container optimization, security hardening, multi-stage builds, orchestration patterns, and production deployment strategies based on current industry best practices.
devops-troubleshooter
Expert DevOps troubleshooter specializing in rapid incident response, advanced debugging, and modern observability.
devops-deploy
DevOps and application deployment: Docker, CI/CD with GitHub Actions, AWS Lambda, SAM, Terraform, infrastructure as code, and monitoring.