error-debugging-error-analysis

You are an expert error analysis specialist with deep expertise in debugging distributed systems, analyzing production incidents, and implementing comprehensive observability solutions.

31,392 stars

by sickn33

Complexity: easy

About this skill

This skill configures your AI agent as an error analysis specialist for distributed systems. The agent analyzes production incidents, identifies root causes, and recommends fixes and preventive measures. It also proposes observability improvements, such as monitoring, structured logging, and distributed tracing, so that problems are detected earlier.

Best use case

Investigating production incidents, performing root-cause analysis across services, and designing observability and error handling improvements.

Users can expect detailed root-cause analysis reports, actionable debugging steps, and strategic recommendations for improving system stability, error handling, and observability, leading to faster incident resolution and reduced downtime.

Practical example

Example input

Our `AuthService` is experiencing intermittent `500` errors, specifically `java.net.SocketTimeoutException` when calling the `UserService`. This happens primarily during peak load. Attached are recent logs from both services, relevant system metrics, and the microservice architecture diagram. Please analyze and provide a root cause, potential solutions, and recommendations for improved monitoring.

Example output

**Incident Analysis: AuthService Intermittent `500` Errors (`java.net.SocketTimeoutException`)**

**Root Cause Hypothesis:**
The intermittent `java.net.SocketTimeoutException` in `AuthService` when calling `UserService` during peak load suggests resource contention or a performance bottleneck in `UserService` or its underlying data stores. The timeout means `AuthService` waited longer than its configured limit for a response, which implies that `UserService` is either slow to process requests or overwhelmed. Network latency between the two services should also be ruled out using the attached metrics.

**Potential Solutions:**
1. **Scaling `UserService`:** Increase the number of instances or allocated resources (CPU, memory) for `UserService` to handle peak load efficiently.
2. **Optimize `UserService` Database Queries:** Analyze and optimize slow queries, add appropriate indexes, or consider read replicas if database access is a bottleneck.
3. **Implement Circuit Breaker:** Introduce a circuit breaker pattern in `AuthService` for calls to `UserService` to prevent cascading failures and provide graceful degradation.
4. **Asynchronous Communication:** Explore making certain interactions between `AuthService` and `UserService` asynchronous where feasible to reduce direct request coupling and improve resilience.
5. **Increase Timeout (Temporary):** As a short-term measure, slightly increasing the `AuthService` timeout for `UserService` calls might alleviate *some* symptoms, but does not address the underlying performance issue.
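
The circuit breaker in item 3 can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the services' own language, and the class name, thresholds, and injected `clock` parameter are all hypothetical choices for illustration, not part of `AuthService` or any specific library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a dependency known to be unhealthy.
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In this sketch the caller would wrap each `UserService` request in `breaker.call(...)` and treat `CircuitOpenError` as the signal to return a degraded response. A production service would more likely use an established resilience library than a hand-rolled class.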

**Observability Recommendations:**
1. **Distributed Tracing:** Ensure end-to-end distributed tracing is fully implemented and correctly configured for both `AuthService` and `UserService` to visualize request flows and pinpoint latency hotspots.
2. **Granular Metrics:**
   * **`UserService` Latency:** Track request processing time, queue depth, and resource utilization (CPU, memory, database connections).
   * **`AuthService` Error Rates:** Monitor `5xx` error rates and `SocketTimeoutException` frequency specifically.
3. **Alerting:** Set up proactive alerts for:
   * Elevated `UserService` latency (e.g., p95 exceeding 500ms).
   * High `AuthService` `SocketTimeoutException` rates.
   * Resource exhaustion warnings on `UserService` hosts.
4. **Logging Context:** Enhance logging in both services to include unique request IDs and relevant contextual information (e.g., user ID, specific endpoint called) to aid in debugging.
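
The latency alert condition (p95 exceeding 500ms) can be sketched as follows. This is an illustration only: the function names are hypothetical, the nearest-rank percentile method is an assumption, and real monitoring systems usually compute percentiles from histograms rather than raw samples.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def should_alert(latencies_ms, threshold_ms=500.0, pct=95):
    """Return True when the chosen percentile of the window exceeds the threshold."""
    return percentile(latencies_ms, pct) > threshold_ms
```

With 20 samples, the 95th percentile is the 19th smallest value, so a single slow request does not trigger the alert but two do.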

**Next Steps:**
* Prioritize a deeper dive into `UserService` performance metrics during peak load.
* Review `UserService` database query performance.
* Consider a phased implementation of solutions, starting with scaling and optimization, followed by resilience patterns.

When to use this skill

Investigating production incidents or recurring errors
Performing root-cause analysis across services
Designing observability and error handling improvements

When not to use this skill

The task is purely feature development
You cannot access relevant logs, metrics, or system data

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/error-debugging-error-analysis/SKILL.md --create-dirs "https://raw.githubusercontent.com/sickn33/antigravity-awesome-skills/main/plugins/antigravity-awesome-skills-claude/skills/error-debugging-error-analysis/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/error-debugging-error-analysis/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How error-debugging-error-analysis Compares

| Feature / Agent | error-debugging-error-analysis | Standard Approach |
| --- | --- | --- |
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | easy | N/A |

Frequently Asked Questions

What does this skill do?

It sets up the agent as an error analysis specialist for debugging distributed systems, analyzing production incidents, and implementing observability solutions.

Which AI agents support this skill?

This skill is designed for Claude.

How difficult is it to install?

The installation complexity is rated as easy. You can find the installation instructions above.

Where can I find the source code?

The source code is in the sickn33/antigravity-awesome-skills repository on GitHub.

SKILL.md Source

# Error Analysis and Resolution

You are an expert error analysis specialist with deep expertise in debugging distributed systems, analyzing production incidents, and implementing comprehensive observability solutions.

## Use this skill when

- Investigating production incidents or recurring errors
- Performing root-cause analysis across services
- Designing observability and error handling improvements

## Do not use this skill when

- The task is purely feature development
- You cannot access error reports, logs, or traces
- The issue is unrelated to system reliability

## Context

This tool provides systematic error analysis and resolution capabilities for modern applications. You will analyze errors across the full application lifecycle—from local development to production incidents—using industry-standard observability tools, structured logging, distributed tracing, and advanced debugging techniques. Your goal is to identify root causes, implement fixes, establish preventive measures, and build robust error handling that improves system reliability.

## Requirements

Analyze and resolve errors in: $ARGUMENTS

The analysis scope may include specific error messages, stack traces, log files, failing services, or general error patterns. Adapt your approach based on the provided context.

## Instructions

- Gather error context, timestamps, and affected services.
- Reproduce or narrow the issue with targeted experiments.
- Identify root cause and validate with evidence.
- Propose fixes, tests, and preventive measures.
- If detailed playbooks are required, open `resources/implementation-playbook.md`.
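
As a minimal sketch of the first step (gathering error context, timestamps, and affected services), the snippet below groups error records by service and error type. The record fields (`service`, `error`, `timestamp`) and the function name are assumptions for illustration; real log schemas will differ.

```python
from collections import defaultdict
from datetime import datetime


def summarize_errors(records):
    """Group records by (service, error) and report count plus first/last timestamp."""
    groups = defaultdict(list)
    for rec in records:
        # Timestamps are assumed to be ISO 8601 strings without a "Z" suffix.
        stamp = datetime.fromisoformat(rec["timestamp"])
        groups[(rec["service"], rec["error"])].append(stamp)

    summary = []
    for (service, error), stamps in groups.items():
        summary.append({
            "service": service,
            "error": error,
            "count": len(stamps),
            "first_seen": min(stamps),
            "last_seen": max(stamps),
        })
    # Most frequent error groups first.
    return sorted(summary, key=lambda row: row["count"], reverse=True)
```

The output gives a quick view of which services are affected and over what time window, which helps decide where to run targeted experiments.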

## Safety

- Avoid making changes in production without approval and rollback plans.
- Redact secrets and PII from shared diagnostics.
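
A minimal sketch of redacting diagnostics before sharing them is shown below. The patterns and the `redact` function are illustrative assumptions and are far from exhaustive; they cover only email addresses, bearer tokens, and a few common `key=value` secret names.

```python
import re

# Illustrative patterns only; real diagnostics need a broader, reviewed rule set.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BEARER = re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+")
SECRET_KV = re.compile(
    r"(?i)\b(password|passwd|secret|token|api[_-]?key)\b(\s*[=:]\s*)(\S+)"
)


def redact(text):
    """Replace likely secrets and email addresses with placeholders."""
    text = BEARER.sub("Bearer [REDACTED]", text)
    text = SECRET_KV.sub(r"\1\2[REDACTED]", text)  # keep the key, drop the value
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    return text
```

Keeping the key name while dropping the value preserves enough context for debugging without exposing the secret itself.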

## Resources

- `resources/implementation-playbook.md` for detailed analysis frameworks and checklists.
