error-debugging-error-analysis

You are an expert error analysis specialist with deep expertise in debugging distributed systems, analyzing production incidents, and implementing comprehensive observability solutions.

31,392 stars
Complexity: easy

About this skill

This skill configures your AI agent to act as an error analysis specialist for distributed systems. It directs the agent to analyze production incidents, identify root causes, and recommend fixes and preventive measures. It also guides the agent in proposing observability improvements, such as monitoring and alerting, so that problems are detected earlier.

Best use case

Investigating production incidents, performing root-cause analysis across services, and designing observability and error handling improvements.

Users can expect detailed root-cause analysis reports, actionable debugging steps, and strategic recommendations for improving system stability, error handling, and observability, leading to faster incident resolution and reduced downtime.

Practical example

Example input

Our `AuthService` is experiencing intermittent `500` errors, specifically `java.net.SocketTimeoutException` when calling the `UserService`. This happens primarily during peak load. Attached are recent logs from both services, relevant system metrics, and the microservice architecture diagram. Please analyze and provide a root cause, potential solutions, and recommendations for improved monitoring.

Example output

**Incident Analysis: AuthService Intermittent `500` Errors (`java.net.SocketTimeoutException`)**

**Root Cause Hypothesis:**
The intermittent `java.net.SocketTimeoutException` in `AuthService` when calling `UserService` during peak load strongly suggests resource contention or a performance bottleneck within `UserService` or its underlying data stores. The timeout means `AuthService` waited longer than its configured limit for a response, which implies `UserService` is either slow to process requests or overwhelmed.

**Potential Solutions:**
1.  **Scaling `UserService`:** Increase the number of instances or allocated resources (CPU, memory) for `UserService` to handle peak load efficiently.
2.  **Optimize `UserService` Database Queries:** Analyze and optimize slow queries, add appropriate indexes, or consider read replicas if database access is a bottleneck.
3.  **Implement Circuit Breaker:** Introduce a circuit breaker pattern in `AuthService` for calls to `UserService` to prevent cascading failures and provide graceful degradation.
4.  **Asynchronous Communication:** Explore making certain interactions between `AuthService` and `UserService` asynchronous where feasible to reduce direct request coupling and improve resilience.
5.  **Increase Timeout (Temporary):** As a short-term measure, slightly increasing the `AuthService` timeout for `UserService` calls might alleviate *some* symptoms, but does not address the underlying performance issue.
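
The circuit breaker in item 3 can be sketched as follows. This is a minimal illustration in Python rather than the Java stack involved in the incident. The names `CircuitBreaker` and `CircuitOpenError` and the threshold values are hypothetical, and a production service would more likely rely on an established resilience library than on hand-rolled code.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the cooldown can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Cooldown elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In this sketch the caller wraps each `UserService` request in `breaker.call(...)` and treats `CircuitOpenError` as the signal to return a degraded response instead of waiting for another timeout.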

**Observability Recommendations:**
1.  **Distributed Tracing:** Ensure end-to-end distributed tracing is fully implemented and correctly configured for both `AuthService` and `UserService` to visualize request flows and pinpoint latency hotspots.
2.  **Granular Metrics:**
    *   **`UserService` Latency:** Track request processing time, queue depth, and resource utilization (CPU, memory, database connections).
    *   **`AuthService` Error Rates:** Monitor `5xx` error rates and `SocketTimeoutException` frequency specifically.
3.  **Alerting:** Set up proactive alerts for:
    *   Elevated `UserService` latency (e.g., p95 exceeding 500ms).
    *   High `AuthService` `SocketTimeoutException` rates.
    *   Resource exhaustion warnings on `UserService` hosts.
4.  **Logging Context:** Enhance logging in both services to include unique request IDs and relevant contextual information (e.g., user ID, specific endpoint called) to aid in debugging.
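
The p95 latency alert in item 3 can be illustrated with a small sketch. It assumes latency samples in milliseconds have already been collected for one evaluation window and uses the nearest-rank percentile definition. The function names and the result shape are hypothetical; a real deployment would normally express this rule in its monitoring system rather than in application code.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alert(samples_ms, threshold_ms=500.0, pct=95):
    """Report whether the chosen percentile of the window exceeds the threshold."""
    value = percentile(samples_ms, pct)
    return {"percentile": pct, "value_ms": value, "firing": value > threshold_ms}
```

Alerting on a high percentile rather than the mean keeps a slow tail visible even when most requests are fast, which matches the peak-load pattern described in the incident.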

**Next Steps:**
*   Prioritize a deeper dive into `UserService` performance metrics during peak load.
*   Review `UserService` database query performance.
*   Consider a phased implementation of solutions, starting with scaling and optimization, followed by resilience patterns.

When to use this skill

  • Investigating production incidents or recurring errors
  • Performing root-cause analysis across services
  • Designing observability and error handling improvements

When not to use this skill

  • The task is purely feature development
  • You cannot access relevant logs, metrics, or system data

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/error-debugging-error-analysis/SKILL.md --create-dirs "https://raw.githubusercontent.com/sickn33/antigravity-awesome-skills/main/plugins/antigravity-awesome-skills-claude/skills/error-debugging-error-analysis/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/error-debugging-error-analysis/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How error-debugging-error-analysis Compares

| Feature / Agent | error-debugging-error-analysis | Standard Approach |
| --- | --- | --- |
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | easy | N/A |

Frequently Asked Questions

What does this skill do?

It sets the agent up as an error analysis specialist that debugs distributed systems, analyzes production incidents, and proposes observability improvements.

Which AI agents support this skill?

This skill is designed for Claude.

How difficult is it to install?

The installation complexity is rated as easy. You can find the installation instructions above.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Error Analysis and Resolution

You are an expert error analysis specialist with deep expertise in debugging distributed systems, analyzing production incidents, and implementing comprehensive observability solutions.

## Use this skill when

- Investigating production incidents or recurring errors
- Performing root-cause analysis across services
- Designing observability and error handling improvements

## Do not use this skill when

- The task is purely feature development
- You cannot access error reports, logs, or traces
- The issue is unrelated to system reliability

## Context

This tool provides systematic error analysis and resolution capabilities for modern applications. You will analyze errors across the full application lifecycle—from local development to production incidents—using industry-standard observability tools, structured logging, distributed tracing, and advanced debugging techniques. Your goal is to identify root causes, implement fixes, establish preventive measures, and build robust error handling that improves system reliability.

## Requirements

Analyze and resolve errors in: $ARGUMENTS

The analysis scope may include specific error messages, stack traces, log files, failing services, or general error patterns. Adapt your approach based on the provided context.

## Instructions

- Gather error context, timestamps, and affected services.
- Reproduce or narrow the issue with targeted experiments.
- Identify root cause and validate with evidence.
- Propose fixes, tests, and preventive measures.
- If detailed playbooks are required, open `resources/implementation-playbook.md`.
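
The first step above, gathering error context, timestamps, and affected services, might look like the following minimal sketch. It assumes a hypothetical log line layout of `<ISO-8601 timestamp> <service> <level> <message>`, which is not specified by this skill, so the parsing would need adapting to the real log format.

```python
from collections import defaultdict
from datetime import datetime


def summarize_errors(lines):
    """Group ERROR lines by (service, error type) with a count and first/last timestamps."""
    summary = defaultdict(lambda: {"count": 0, "first": None, "last": None})
    for line in lines:
        parts = line.split(" ", 3)
        if len(parts) < 4 or parts[2] != "ERROR":
            continue  # skip non-error lines and lines that do not fit the assumed layout
        ts = datetime.fromisoformat(parts[0].replace("Z", "+00:00"))
        service = parts[1]
        # Assumed convention: the error type is the text before the first colon.
        error_type = parts[3].split(":", 1)[0].strip()
        entry = summary[(service, error_type)]
        entry["count"] += 1
        entry["first"] = ts if entry["first"] is None else min(entry["first"], ts)
        entry["last"] = ts if entry["last"] is None else max(entry["last"], ts)
    return dict(summary)
```

Grouping by service and error type with a time range makes it easier to check whether failures cluster in a particular window, such as peak load, before forming a root-cause hypothesis.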

## Safety

- Avoid making changes in production without approval and rollback plans.
- Redact secrets and PII from shared diagnostics.
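
Redaction before sharing diagnostics can be sketched as below. The patterns are illustrative assumptions covering only email addresses, bearer tokens, and a few common `key=value` secret names. They are not exhaustive, so the output should still be reviewed before it leaves a trusted environment.

```python
import re

# Illustrative patterns only; real diagnostics may need many more.
_PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<EMAIL>"),
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer <TOKEN>"),
    (
        re.compile(r"(?i)\b(password|passwd|secret|api[_-]?key|token)\s*[=:]\s*\S+"),
        r"\1=<REDACTED>",  # keeps the key name, normalizes the separator to '='
    ),
]


def redact(text):
    """Apply each pattern in order and return the scrubbed text."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```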

## Resources

- `resources/implementation-playbook.md` for detailed analysis frameworks and checklists.
