grafana-dashboards
Create and manage production-ready Grafana dashboards for comprehensive system observability.
About this skill
The `grafana-dashboards` AI agent skill empowers users to efficiently design, generate, and manage production-ready Grafana dashboards. Leveraging the agent's capabilities, this skill guides users through creating comprehensive system observability solutions, visualizing metrics from various data sources (e.g., Prometheus, InfluxDB, PostgreSQL, cloud monitoring services). It focuses on applying best practices for dashboard design, ensuring clarity, performance, and actionable insights for monitoring applications, infrastructure, and other key systems. The agent can assist in defining goals, selecting appropriate metrics, suggesting visualization types, and even generating dashboard configuration snippets, making it an invaluable tool for DevOps engineers, SREs, and developers.
Best use case
Setting up new monitoring dashboards for applications or infrastructure components. Optimizing existing Grafana dashboards for better performance, clarity, or new metrics. Troubleshooting system performance issues by generating specific diagnostic dashboards. Consolidating various metrics into a unified view of system health. Automating the creation of standard dashboard templates for new services. Getting guidance on best practices for Grafana dashboard design and configuration.
Create and manage production-ready Grafana dashboards for comprehensive system observability.
A well-structured and functional Grafana dashboard configuration (e.g., JSON model). Actionable insights derived from clearly visualized application and infrastructure metrics. Improved system observability, allowing for quicker identification and resolution of issues. Guidance on Grafana dashboard best practices, panel types, and query construction. Verified steps or recommendations for deploying and validating the new or updated dashboard.
Practical example
Example input
I need a new Grafana dashboard for monitoring my Kubernetes cluster. Specifically, I want to track CPU utilization, memory usage, network I/O, and pod restarts across namespaces. My metrics are exposed via Prometheus.
Example output
Okay, here's a structured approach for your Kubernetes monitoring dashboard:
1. **Define Goals:** What specific issues are you trying to identify (e.g., resource hogs, unstable pods)?
2. **Key Metrics & Queries (Prometheus examples):**
* CPU Usage: `sum(rate(container_cpu_usage_seconds_total{image!=""}[5m])) by (pod, namespace)`
* Memory Usage: `sum(container_memory_working_set_bytes{image!=""}) by (pod, namespace)`
* Network I/O: `sum(rate(container_network_receive_bytes_total{image!=""}[5m])) by (pod, namespace)`
* Pod Restarts: `sum(kube_pod_container_status_restarts_total) by (pod, namespace)`
3. **Suggested Panels:** Graph for time-series trends, Stat for current values (e.g., total restarts), Table for detailed pod/namespace views.
4. **Best Practices:** Use template variables for namespaces and pods, set appropriate refresh rates, configure alerts for critical thresholds.
I can provide a JSON model snippet based on these specifications, or guide you through creating it step-by-step in Grafana.When to use this skill
- When you need to visualize metrics from diverse data sources in a single, coherent view.
- When deploying a new service and requiring a dedicated monitoring dashboard.
- When existing dashboards are unclear, slow, or lack critical metrics.
- When seeking expert guidance on Grafana features, panels, or query optimization.
When not to use this skill
- The task is entirely unrelated to Grafana dashboards or system observability.
- You need to interact with a different monitoring or data visualization tool outside the scope of Grafana.
- The primary goal is simply data extraction or manipulation without a visualization requirement in Grafana.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/grafana-dashboards/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How grafana-dashboards Compares
| Feature / Agent | grafana-dashboards | Standard Approach |
|---|---|---|
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | easy | N/A |
Frequently Asked Questions
What does this skill do?
Create and manage production-ready Grafana dashboards for comprehensive system observability.
Which AI agents support this skill?
This skill is designed for Claude.
How difficult is it to install?
The installation complexity is rated as easy. You can find the installation instructions above.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
AI Agents for Startups
Explore AI agent skills for startup validation, product research, growth experiments, documentation, and fast execution with small teams.
SKILL.md Source
# Grafana Dashboards
Create and manage production-ready Grafana dashboards for comprehensive system observability.
## Do not use this skill when
- The task is unrelated to grafana dashboards
- You need a different domain or tool outside this scope
## Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.
## Purpose
Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.
## Use this skill when
- Visualize Prometheus metrics
- Create custom dashboards
- Implement SLO dashboards
- Monitor infrastructure
- Track business KPIs
## Dashboard Design Principles
### 1. Hierarchy of Information
```
┌─────────────────────────────────────┐
│ Critical Metrics (Big Numbers) │
├─────────────────────────────────────┤
│ Key Trends (Time Series) │
├─────────────────────────────────────┤
│ Detailed Metrics (Tables/Heatmaps) │
└─────────────────────────────────────┘
```
### 2. RED Method (Services)
- **Rate** - Requests per second
- **Errors** - Error rate
- **Duration** - Latency/response time
### 3. USE Method (Resources)
- **Utilization** - % time resource is busy
- **Saturation** - Queue length/wait time
- **Errors** - Error count
## Dashboard Structure
### API Monitoring Dashboard
```json
{
"dashboard": {
"title": "API Monitoring",
"tags": ["api", "production"],
"timezone": "browser",
"refresh": "30s",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "{{service}}"
}
],
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
},
{
"title": "Error Rate %",
"type": "graph",
"targets": [
{
"expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
"legendFormat": "Error Rate"
}
],
"alert": {
"conditions": [
{
"evaluator": {"params": [5], "type": "gt"},
"operator": {"type": "and"},
"query": {"params": ["A", "5m", "now"]},
"type": "query"
}
]
},
"gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}
},
{
"title": "P95 Latency",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
"legendFormat": "{{service}}"
}
],
"gridPos": {"x": 0, "y": 8, "w": 24, "h": 8}
}
]
}
}
```
**Reference:** See `assets/api-dashboard.json`
## Panel Types
### 1. Stat Panel (Single Value)
```json
{
"type": "stat",
"title": "Total Requests",
"targets": [{
"expr": "sum(http_requests_total)"
}],
"options": {
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"]
},
"orientation": "auto",
"textMode": "auto",
"colorMode": "value"
},
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"value": 0, "color": "green"},
{"value": 80, "color": "yellow"},
{"value": 90, "color": "red"}
]
}
}
}
}
```
### 2. Time Series Graph
```json
{
"type": "graph",
"title": "CPU Usage",
"targets": [{
"expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
}],
"yaxes": [
{"format": "percent", "max": 100, "min": 0},
{"format": "short"}
]
}
```
### 3. Table Panel
```json
{
"type": "table",
"title": "Service Status",
"targets": [{
"expr": "up",
"format": "table",
"instant": true
}],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {"Time": true},
"indexByName": {},
"renameByName": {
"instance": "Instance",
"job": "Service",
"Value": "Status"
}
}
}
]
}
```
### 4. Heatmap
```json
{
"type": "heatmap",
"title": "Latency Heatmap",
"targets": [{
"expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
"format": "heatmap"
}],
"dataFormat": "tsbuckets",
"yAxis": {
"format": "s"
}
}
```
## Variables
### Query Variables
```json
{
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_pod_info, namespace)",
"refresh": 1,
"multi": false
},
{
"name": "service",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
"refresh": 1,
"multi": true
}
]
}
}
```
### Use Variables in Queries
```
sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))
```
## Alerts in Dashboards
```json
{
"alert": {
"name": "High Error Rate",
"conditions": [
{
"evaluator": {
"params": [5],
"type": "gt"
},
"operator": {"type": "and"},
"query": {
"params": ["A", "5m", "now"]
},
"reducer": {"type": "avg"},
"type": "query"
}
],
"executionErrorState": "alerting",
"for": "5m",
"frequency": "1m",
"message": "Error rate is above 5%",
"noDataState": "no_data",
"notifications": [
{"uid": "slack-channel"}
]
}
}
```
## Dashboard Provisioning
**dashboards.yml:**
```yaml
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: 'General'
type: file
disableDeletion: false
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /etc/grafana/dashboards
```
## Common Dashboard Patterns
### Infrastructure Dashboard
**Key Panels:**
- CPU utilization per node
- Memory usage per node
- Disk I/O
- Network traffic
- Pod count by namespace
- Node status
**Reference:** See `assets/infrastructure-dashboard.json`
### Database Dashboard
**Key Panels:**
- Queries per second
- Connection pool usage
- Query latency (P50, P95, P99)
- Active connections
- Database size
- Replication lag
- Slow queries
**Reference:** See `assets/database-dashboard.json`
### Application Dashboard
**Key Panels:**
- Request rate
- Error rate
- Response time (percentiles)
- Active users/sessions
- Cache hit rate
- Queue length
## Best Practices
1. **Start with templates** (Grafana community dashboards)
2. **Use consistent naming** for panels and variables
3. **Group related metrics** in rows
4. **Set appropriate time ranges** (default: Last 6 hours)
5. **Use variables** for flexibility
6. **Add panel descriptions** for context
7. **Configure units** correctly
8. **Set meaningful thresholds** for colors
9. **Use consistent colors** across dashboards
10. **Test with different time ranges**
## Dashboard as Code
### Terraform Provisioning
```hcl
resource "grafana_dashboard" "api_monitoring" {
config_json = file("${path.module}/dashboards/api-monitoring.json")
folder = grafana_folder.monitoring.id
}
resource "grafana_folder" "monitoring" {
title = "Production Monitoring"
}
```
### Ansible Provisioning
```yaml
- name: Deploy Grafana dashboards
copy:
src: "{{ item }}"
dest: /etc/grafana/dashboards/
with_fileglob:
- "dashboards/*.json"
notify: restart grafana
```
## Reference Files
- `assets/api-dashboard.json` - API monitoring dashboard
- `assets/infrastructure-dashboard.json` - Infrastructure dashboard
- `assets/database-dashboard.json` - Database monitoring dashboard
- `references/dashboard-design.md` - Dashboard design guide
## Related Skills
- `prometheus-configuration` - For metric collection
- `slo-implementation` - For SLO dashboardsRelated Skills
manifest
Install and configure the Manifest observability plugin for your agents. Use when setting up telemetry, configuring API keys, or troubleshooting the plugin.
distributed-tracing
Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.
azure-monitor-opentelemetry-py
Azure Monitor OpenTelemetry Distro for Python. Use for one-line Application Insights setup with auto-instrumentation.
azure-monitor-opentelemetry-exporter-py
Azure Monitor OpenTelemetry Exporter for Python. Use for low-level OpenTelemetry export to Application Insights.
azure-monitor-ingestion-py
Azure Monitor Ingestion SDK for Python. Use for sending custom logs to Log Analytics workspace via Logs Ingestion API.
nft-standards
Master ERC-721 and ERC-1155 NFT standards, metadata best practices, and advanced NFT features.
nextjs-app-router-patterns
Comprehensive patterns for Next.js 14+ App Router architecture, Server Components, and modern full-stack React development.
new-rails-project
Create a new Rails project
networkx
NetworkX is a Python package for creating, manipulating, and analyzing complex networks and graphs.
network-engineer
Expert network engineer specializing in modern cloud networking, security architectures, and performance optimization.
nestjs-expert
You are an expert in Nest.js with deep knowledge of enterprise-grade Node.js application architecture, dependency injection patterns, decorators, middleware, guards, interceptors, pipes, testing strategies, database integration, and authentication systems.
nerdzao-elite
Senior Elite Software Engineer (15+) and Senior Product Designer. Full workflow with planning, architecture, TDD, clean code, and pixel-perfect UX validation.