grafana-dashboards

Create and manage production-ready Grafana dashboards for comprehensive system observability.

31,392 stars
Complexity: easy

About this skill

The `grafana-dashboards` AI agent skill empowers users to efficiently design, generate, and manage production-ready Grafana dashboards. Leveraging the agent's capabilities, this skill guides users through creating comprehensive system observability solutions, visualizing metrics from various data sources (e.g., Prometheus, InfluxDB, PostgreSQL, cloud monitoring services). It focuses on applying best practices for dashboard design, ensuring clarity, performance, and actionable insights for monitoring applications, infrastructure, and other key systems. The agent can assist in defining goals, selecting appropriate metrics, suggesting visualization types, and even generating dashboard configuration snippets, making it an invaluable tool for DevOps engineers, SREs, and developers.

Best use case

Setting up new monitoring dashboards for applications or infrastructure components. Optimizing existing Grafana dashboards for better performance, clarity, or new metrics. Troubleshooting system performance issues by generating specific diagnostic dashboards. Consolidating various metrics into a unified view of system health. Automating the creation of standard dashboard templates for new services. Getting guidance on best practices for Grafana dashboard design and configuration.

Create and manage production-ready Grafana dashboards for comprehensive system observability.

A well-structured and functional Grafana dashboard configuration (e.g., JSON model). Actionable insights derived from clearly visualized application and infrastructure metrics. Improved system observability, allowing for quicker identification and resolution of issues. Guidance on Grafana dashboard best practices, panel types, and query construction. Verified steps or recommendations for deploying and validating the new or updated dashboard.

Practical example

Example input

I need a new Grafana dashboard for monitoring my Kubernetes cluster. Specifically, I want to track CPU utilization, memory usage, network I/O, and pod restarts across namespaces. My metrics are exposed via Prometheus.

Example output

Okay, here's a structured approach for your Kubernetes monitoring dashboard:
1.  **Define Goals:** What specific issues are you trying to identify (e.g., resource hogs, unstable pods)?
2.  **Key Metrics & Queries (Prometheus examples):**
    *   CPU Usage: `sum(rate(container_cpu_usage_seconds_total{image!=""}[5m])) by (pod, namespace)`
    *   Memory Usage: `sum(container_memory_working_set_bytes{image!=""}) by (pod, namespace)`
    *   Network I/O: `sum(rate(container_network_receive_bytes_total{image!=""}[5m])) by (pod, namespace)`
    *   Pod Restarts: `sum(kube_pod_container_status_restarts_total) by (pod, namespace)`
3.  **Suggested Panels:** Graph for time-series trends, Stat for current values (e.g., total restarts), Table for detailed pod/namespace views.
4.  **Best Practices:** Use template variables for namespaces and pods, set appropriate refresh rates, configure alerts for critical thresholds.
I can provide a JSON model snippet based on these specifications, or guide you through creating it step-by-step in Grafana.

When to use this skill

  • When you need to visualize metrics from diverse data sources in a single, coherent view.
  • When deploying a new service and requiring a dedicated monitoring dashboard.
  • When existing dashboards are unclear, slow, or lack critical metrics.
  • When seeking expert guidance on Grafana features, panels, or query optimization.

When not to use this skill

  • The task is entirely unrelated to Grafana dashboards or system observability.
  • You need to interact with a different monitoring or data visualization tool outside the scope of Grafana.
  • The primary goal is simply data extraction or manipulation without a visualization requirement in Grafana.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/grafana-dashboards/SKILL.md --create-dirs "https://raw.githubusercontent.com/sickn33/antigravity-awesome-skills/main/plugins/antigravity-awesome-skills-claude/skills/grafana-dashboards/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/grafana-dashboards/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How grafana-dashboards Compares

Feature / Agentgrafana-dashboardsStandard Approach
Platform SupportClaudeLimited / Varies
Context Awareness High Baseline
Installation ComplexityeasyN/A

Frequently Asked Questions

What does this skill do?

Create and manage production-ready Grafana dashboards for comprehensive system observability.

Which AI agents support this skill?

This skill is designed for Claude.

How difficult is it to install?

The installation complexity is rated as easy. You can find the installation instructions above.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# Grafana Dashboards

Create and manage production-ready Grafana dashboards for comprehensive system observability.

## Do not use this skill when

- The task is unrelated to grafana dashboards
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Purpose

Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.

## Use this skill when

- Visualize Prometheus metrics
- Create custom dashboards
- Implement SLO dashboards
- Monitor infrastructure
- Track business KPIs

## Dashboard Design Principles

### 1. Hierarchy of Information
```
┌─────────────────────────────────────┐
│  Critical Metrics (Big Numbers)     │
├─────────────────────────────────────┤
│  Key Trends (Time Series)           │
├─────────────────────────────────────┤
│  Detailed Metrics (Tables/Heatmaps) │
└─────────────────────────────────────┘
```

### 2. RED Method (Services)
- **Rate** - Requests per second
- **Errors** - Error rate
- **Duration** - Latency/response time

### 3. USE Method (Resources)
- **Utilization** - % time resource is busy
- **Saturation** - Queue length/wait time
- **Errors** - Error count

## Dashboard Structure

### API Monitoring Dashboard

```json
{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Error Rate %",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error Rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {"params": [5], "type": "gt"},
              "operator": {"type": "and"},
              "query": {"params": ["A", "5m", "now"]},
              "type": "query"
            }
          ]
        },
        "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": {"x": 0, "y": 8, "w": 24, "h": 8}
      }
    ]
  }
}
```

**Reference:** See `assets/api-dashboard.json`

## Panel Types

### 1. Stat Panel (Single Value)
```json
{
  "type": "stat",
  "title": "Total Requests",
  "targets": [{
    "expr": "sum(http_requests_total)"
  }],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 80, "color": "yellow"},
          {"value": 90, "color": "red"}
        ]
      }
    }
  }
}
```

### 2. Time Series Graph
```json
{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [{
    "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
  }],
  "yaxes": [
    {"format": "percent", "max": 100, "min": 0},
    {"format": "short"}
  ]
}
```

### 3. Table Panel
```json
{
  "type": "table",
  "title": "Service Status",
  "targets": [{
    "expr": "up",
    "format": "table",
    "instant": true
  }],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": {"Time": true},
        "indexByName": {},
        "renameByName": {
          "instance": "Instance",
          "job": "Service",
          "Value": "Status"
        }
      }
    }
  ]
}
```

### 4. Heatmap
```json
{
  "type": "heatmap",
  "title": "Latency Heatmap",
  "targets": [{
    "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
    "format": "heatmap"
  }],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}
```

## Variables

### Query Variables
```json
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 1,
        "multi": true
      }
    ]
  }
}
```

### Use Variables in Queries
```
sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))
```

## Alerts in Dashboards

```json
{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "operator": {"type": "and"},
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": {"type": "avg"},
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate is above 5%",
    "noDataState": "no_data",
    "notifications": [
      {"uid": "slack-channel"}
    ]
  }
}
```

## Dashboard Provisioning

**dashboards.yml:**
```yaml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'General'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards
```

## Common Dashboard Patterns

### Infrastructure Dashboard

**Key Panels:**
- CPU utilization per node
- Memory usage per node
- Disk I/O
- Network traffic
- Pod count by namespace
- Node status

**Reference:** See `assets/infrastructure-dashboard.json`

### Database Dashboard

**Key Panels:**
- Queries per second
- Connection pool usage
- Query latency (P50, P95, P99)
- Active connections
- Database size
- Replication lag
- Slow queries

**Reference:** See `assets/database-dashboard.json`

### Application Dashboard

**Key Panels:**
- Request rate
- Error rate
- Response time (percentiles)
- Active users/sessions
- Cache hit rate
- Queue length

## Best Practices

1. **Start with templates** (Grafana community dashboards)
2. **Use consistent naming** for panels and variables
3. **Group related metrics** in rows
4. **Set appropriate time ranges** (default: Last 6 hours)
5. **Use variables** for flexibility
6. **Add panel descriptions** for context
7. **Configure units** correctly
8. **Set meaningful thresholds** for colors
9. **Use consistent colors** across dashboards
10. **Test with different time ranges**

## Dashboard as Code

### Terraform Provisioning

```hcl
resource "grafana_dashboard" "api_monitoring" {
  config_json = file("${path.module}/dashboards/api-monitoring.json")
  folder      = grafana_folder.monitoring.id
}

resource "grafana_folder" "monitoring" {
  title = "Production Monitoring"
}
```

### Ansible Provisioning

```yaml
- name: Deploy Grafana dashboards
  copy:
    src: "{{ item }}"
    dest: /etc/grafana/dashboards/
  with_fileglob:
    - "dashboards/*.json"
  notify: restart grafana
```

## Reference Files

- `assets/api-dashboard.json` - API monitoring dashboard
- `assets/infrastructure-dashboard.json` - Infrastructure dashboard
- `assets/database-dashboard.json` - Database monitoring dashboard
- `references/dashboard-design.md` - Dashboard design guide

## Related Skills

- `prometheus-configuration` - For metric collection
- `slo-implementation` - For SLO dashboards

Related Skills

manifest

31392
from sickn33/antigravity-awesome-skills

Install and configure the Manifest observability plugin for your agents. Use when setting up telemetry, configuring API keys, or troubleshooting the plugin.

Observability & MonitoringClaude

distributed-tracing

31392
from sickn33/antigravity-awesome-skills

Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.

Observability & MonitoringClaude

azure-monitor-opentelemetry-py

31392
from sickn33/antigravity-awesome-skills

Azure Monitor OpenTelemetry Distro for Python. Use for one-line Application Insights setup with auto-instrumentation.

Observability & MonitoringClaude

azure-monitor-opentelemetry-exporter-py

31392
from sickn33/antigravity-awesome-skills

Azure Monitor OpenTelemetry Exporter for Python. Use for low-level OpenTelemetry export to Application Insights.

Observability & MonitoringClaude

azure-monitor-ingestion-py

31392
from sickn33/antigravity-awesome-skills

Azure Monitor Ingestion SDK for Python. Use for sending custom logs to Log Analytics workspace via Logs Ingestion API.

Observability & MonitoringClaude

nft-standards

31392
from sickn33/antigravity-awesome-skills

Master ERC-721 and ERC-1155 NFT standards, metadata best practices, and advanced NFT features.

Web3 & BlockchainClaude

nextjs-app-router-patterns

31392
from sickn33/antigravity-awesome-skills

Comprehensive patterns for Next.js 14+ App Router architecture, Server Components, and modern full-stack React development.

Web FrameworksClaude

new-rails-project

31392
from sickn33/antigravity-awesome-skills

Create a new Rails project

Code GenerationClaude

networkx

31392
from sickn33/antigravity-awesome-skills

NetworkX is a Python package for creating, manipulating, and analyzing complex networks and graphs.

Network AnalysisClaude

network-engineer

31392
from sickn33/antigravity-awesome-skills

Expert network engineer specializing in modern cloud networking, security architectures, and performance optimization.

Network EngineeringClaude

nestjs-expert

31392
from sickn33/antigravity-awesome-skills

You are an expert in Nest.js with deep knowledge of enterprise-grade Node.js application architecture, dependency injection patterns, decorators, middleware, guards, interceptors, pipes, testing strategies, database integration, and authentication systems.

Frameworks & LibrariesClaude

nerdzao-elite

31392
from sickn33/antigravity-awesome-skills

Senior Elite Software Engineer (15+) and Senior Product Designer. Full workflow with planning, architecture, TDD, clean code, and pixel-perfect UX validation.

Software DevelopmentClaude