azure-resource-health-diagnose
Analyze Azure resource health, diagnose issues from logs and telemetry, and create a remediation plan for identified problems.
Best use case
azure-resource-health-diagnose is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Analyze Azure resource health, diagnose issues from logs and telemetry, and create a remediation plan for identified problems.
Teams using azure-resource-health-diagnose should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/azure-resource-health-diagnose/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How azure-resource-health-diagnose Compares
| Feature / Agent | azure-resource-health-diagnose | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Analyze Azure resource health, diagnose issues from logs and telemetry, and create a remediation plan for identified problems.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Azure Resource Health & Issue Diagnosis
This workflow analyzes a specific Azure resource to assess its health status, diagnose potential issues using logs and telemetry data, and develop a comprehensive remediation plan for any problems discovered.
## Prerequisites
- Azure MCP server configured and authenticated
- Target Azure resource identified (name and optionally resource group/subscription)
- Resource must be deployed and running to generate logs/telemetry
- Prefer Azure MCP tools (`azmcp-*`) over direct Azure CLI when available
## Workflow Steps
### Step 1: Get Azure Best Practices
**Action**: Retrieve diagnostic and troubleshooting best practices
**Tools**: Azure MCP best practices tool
**Process**:
1. **Load Best Practices**:
- Execute Azure best practices tool to get diagnostic guidelines
- Focus on health monitoring, log analysis, and issue resolution patterns
- Use these practices to inform diagnostic approach and remediation recommendations
### Step 2: Resource Discovery & Identification
**Action**: Locate and identify the target Azure resource
**Tools**: Azure MCP tools + Azure CLI fallback
**Process**:
1. **Resource Lookup**:
- If only resource name provided: Search across subscriptions using `azmcp-subscription-list`
- Use `az resource list --name <resource-name>` to find matching resources
- If multiple matches found, prompt user to specify subscription/resource group
- Gather detailed resource information:
- Resource type and current status
- Location, tags, and configuration
- Associated services and dependencies
2. **Resource Type Detection**:
- Identify resource type to determine appropriate diagnostic approach:
- **Web Apps/Function Apps**: Application logs, performance metrics, dependency tracking
- **Virtual Machines**: System logs, performance counters, boot diagnostics
- **Cosmos DB**: Request metrics, throttling, partition statistics
- **Storage Accounts**: Access logs, performance metrics, availability
- **SQL Database**: Query performance, connection logs, resource utilization
- **Application Insights**: Application telemetry, exceptions, dependencies
- **Key Vault**: Access logs, certificate status, secret usage
- **Service Bus**: Message metrics, dead letter queues, throughput
### Step 3: Health Status Assessment
**Action**: Evaluate current resource health and availability
**Tools**: Azure MCP monitoring tools + Azure CLI
**Process**:
1. **Basic Health Check**:
- Check resource provisioning state and operational status
- Verify service availability and responsiveness
- Review recent deployment or configuration changes
- Assess current resource utilization (CPU, memory, storage, etc.)
2. **Service-Specific Health Indicators**:
- **Web Apps**: HTTP response codes, response times, uptime
- **Databases**: Connection success rate, query performance, deadlocks
- **Storage**: Availability percentage, request success rate, latency
- **VMs**: Boot diagnostics, guest OS metrics, network connectivity
- **Functions**: Execution success rate, duration, error frequency
### Step 4: Log & Telemetry Analysis
**Action**: Analyze logs and telemetry to identify issues and patterns
**Tools**: Azure MCP monitoring tools for Log Analytics queries
**Process**:
1. **Find Monitoring Sources**:
- Use `azmcp-monitor-workspace-list` to identify Log Analytics workspaces
- Locate Application Insights instances associated with the resource
- Identify relevant log tables using `azmcp-monitor-table-list`
2. **Execute Diagnostic Queries**:
Use `azmcp-monitor-log-query` with targeted KQL queries based on resource type:
**General Error Analysis**:
```kql
// Recent errors and exceptions
union isfuzzy=true
AzureDiagnostics,
AppServiceHTTPLogs,
AppServiceAppLogs,
AzureActivity
| where TimeGenerated > ago(24h)
| where Level == "Error" or ResultType != "Success"
| summarize ErrorCount=count() by Resource, ResultType, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
```
**Performance Analysis**:
```kql
// Performance degradation patterns
Perf
| where TimeGenerated > ago(7d)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h)
| where avg_CounterValue > 80
```
**Application-Specific Queries**:
```kql
// Application Insights - Failed requests
requests
| where timestamp > ago(24h)
| where success == false
| summarize FailureCount=count() by resultCode, bin(timestamp, 1h)
| order by timestamp desc
// Database - Connection failures
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.SQL"
| where Category == "SQLSecurityAuditEvents"
| where action_name_s == "CONNECTION_FAILED"
| summarize ConnectionFailures=count() by bin(TimeGenerated, 1h)
```
3. **Pattern Recognition**:
- Identify recurring error patterns or anomalies
- Correlate errors with deployment times or configuration changes
- Analyze performance trends and degradation patterns
- Look for dependency failures or external service issues
### Step 5: Issue Classification & Root Cause Analysis
**Action**: Categorize identified issues and determine root causes
**Process**:
1. **Issue Classification**:
- **Critical**: Service unavailable, data loss, security breaches
- **High**: Performance degradation, intermittent failures, high error rates
- **Medium**: Warnings, suboptimal configuration, minor performance issues
- **Low**: Informational alerts, optimization opportunities
2. **Root Cause Analysis**:
- **Configuration Issues**: Incorrect settings, missing dependencies
- **Resource Constraints**: CPU/memory/disk limitations, throttling
- **Network Issues**: Connectivity problems, DNS resolution, firewall rules
- **Application Issues**: Code bugs, memory leaks, inefficient queries
- **External Dependencies**: Third-party service failures, API limits
- **Security Issues**: Authentication failures, certificate expiration
3. **Impact Assessment**:
- Determine business impact and affected users/systems
- Evaluate data integrity and security implications
- Assess recovery time objectives and priorities
### Step 6: Generate Remediation Plan
**Action**: Create a comprehensive plan to address identified issues
**Process**:
1. **Immediate Actions** (Critical issues):
- Emergency fixes to restore service availability
- Temporary workarounds to mitigate impact
- Escalation procedures for complex issues
2. **Short-term Fixes** (High/Medium issues):
- Configuration adjustments and resource scaling
- Application updates and patches
- Monitoring and alerting improvements
3. **Long-term Improvements** (All issues):
- Architectural changes for better resilience
- Preventive measures and monitoring enhancements
- Documentation and process improvements
4. **Implementation Steps**:
- Prioritized action items with specific Azure CLI commands
- Testing and validation procedures
- Rollback plans for each change
- Monitoring to verify issue resolution
### Step 7: User Confirmation & Report Generation
**Action**: Present findings and get approval for remediation actions
**Process**:
1. **Display Health Assessment Summary**:
```
🏥 Azure Resource Health Assessment
📊 Resource Overview:
• Resource: [Name] ([Type])
• Status: [Healthy/Warning/Critical]
• Location: [Region]
• Last Analyzed: [Timestamp]
🚨 Issues Identified:
• Critical: X issues requiring immediate attention
• High: Y issues affecting performance/reliability
• Medium: Z issues for optimization
• Low: N informational items
🔍 Top Issues:
1. [Issue Type]: [Description] - Impact: [High/Medium/Low]
2. [Issue Type]: [Description] - Impact: [High/Medium/Low]
3. [Issue Type]: [Description] - Impact: [High/Medium/Low]
🛠️ Remediation Plan:
• Immediate Actions: X items
• Short-term Fixes: Y items
• Long-term Improvements: Z items
• Estimated Resolution Time: [Timeline]
❓ Proceed with detailed remediation plan? (y/n)
```
2. **Generate Detailed Report**:
```markdown
# Azure Resource Health Report: [Resource Name]
**Generated**: [Timestamp]
**Resource**: [Full Resource ID]
**Overall Health**: [Status with color indicator]
## 🔍 Executive Summary
[Brief overview of health status and key findings]
## 📊 Health Metrics
- **Availability**: X% over last 24h
- **Performance**: [Average response time/throughput]
- **Error Rate**: X% over last 24h
- **Resource Utilization**: [CPU/Memory/Storage percentages]
## 🚨 Issues Identified
### Critical Issues
- **[Issue 1]**: [Description]
- **Root Cause**: [Analysis]
- **Impact**: [Business impact]
- **Immediate Action**: [Required steps]
### High Priority Issues
- **[Issue 2]**: [Description]
- **Root Cause**: [Analysis]
- **Impact**: [Performance/reliability impact]
- **Recommended Fix**: [Solution steps]
## 🛠️ Remediation Plan
### Phase 1: Immediate Actions (0-2 hours)
```bash
# Critical fixes to restore service
[Azure CLI commands with explanations]
```
### Phase 2: Short-term Fixes (2-24 hours)
```bash
# Performance and reliability improvements
[Azure CLI commands with explanations]
```
### Phase 3: Long-term Improvements (1-4 weeks)
```bash
# Architectural and preventive measures
[Azure CLI commands and configuration changes]
```
## 📈 Monitoring Recommendations
- **Alerts to Configure**: [List of recommended alerts]
- **Dashboards to Create**: [Monitoring dashboard suggestions]
- **Regular Health Checks**: [Recommended frequency and scope]
## ✅ Validation Steps
- [ ] Verify issue resolution through logs
- [ ] Confirm performance improvements
- [ ] Test application functionality
- [ ] Update monitoring and alerting
- [ ] Document lessons learned
## 📝 Prevention Measures
- [Recommendations to prevent similar issues]
- [Process improvements]
- [Monitoring enhancements]
```
## Error Handling
- **Resource Not Found**: Provide guidance on resource name/location specification
- **Authentication Issues**: Guide user through Azure authentication setup
- **Insufficient Permissions**: List required RBAC roles for resource access
- **No Logs Available**: Suggest enabling diagnostic settings and waiting for data
- **Query Timeouts**: Break down analysis into smaller time windows
- **Service-Specific Issues**: Provide generic health assessment with limitations noted
## Success Criteria
- ✅ Resource health status accurately assessed
- ✅ All significant issues identified and categorized
- ✅ Root cause analysis completed for major problems
- ✅ Actionable remediation plan with specific steps provided
- ✅ Monitoring and prevention recommendations included
- ✅ Clear prioritization of issues by business impact
- ✅ Implementation steps include validation and rollback proceduresRelated Skills
terraform-azurerm-set-diff-analyzer
Analyze Terraform plan JSON output for AzureRM Provider to distinguish between false-positive diffs (order-only changes in Set-type attributes) and actual resource changes. Use when reviewing terraform plan output for Azure resources like Application Gateway, Load Balancer, Firewall, Front Door, NSG, and other resources with Set-type attributes that cause spurious diffs due to internal ordering changes.
azure-servicebus-dotnet
Azure Service Bus SDK for .NET. Enterprise messaging with queues, topics, subscriptions, and sessions.
azure-search-documents-ts
Build search applications using Azure AI Search SDK for JavaScript (@azure/search-documents). Use when creating/managing indexes, implementing vector/hybrid search, semantic ranking, or building ag...
azure-search-documents-py
Azure AI Search SDK for Python. Use for vector search, hybrid search, semantic ranking, indexing, and skillsets.
azure-search-documents-dotnet
Azure AI Search SDK for .NET (Azure.Search.Documents). Use for building search applications with full-text, vector, semantic, and hybrid search.
azure-role-selector
When user is asking for guidance for which role to assign to an identity given desired permissions, this agent helps them understand the role that will meet the requirements with least privilege access and how to apply that role.
azure-resource-visualizer
Analyze Azure resource groups and generate detailed Mermaid architecture diagrams showing the relationships between individual resources. Use this skill when the user asks for a diagram of their Azure resources or help in understanding how the resources relate to each other.
azure-resource-manager-sql-dotnet
Azure Resource Manager SDK for Azure SQL in .NET.
azure-resource-manager-redis-dotnet
Azure Resource Manager SDK for Redis in .NET.
azure-resource-manager-postgresql-dotnet
Azure PostgreSQL Flexible Server SDK for .NET. Database management for PostgreSQL Flexible Server deployments.
azure-resource-manager-playwright-dotnet
Azure Resource Manager SDK for Microsoft Playwright Testing in .NET.
azure-resource-manager-mysql-dotnet
Azure MySQL Flexible Server SDK for .NET. Database management for MySQL Flexible Server deployments.