deploying-monitoring-stacks

Use when deploying monitoring stacks including Prometheus, Grafana, and Datadog. Trigger with phrases like "deploy monitoring stack", "setup prometheus", "configure grafana", or "install datadog agent". Generates production-ready configurations with metric collection, visualization dashboards, and alerting rules.

1,868 stars

by jeremylongshore

Best use case

deploying-monitoring-stacks is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using deploying-monitoring-stacks should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/deploying-monitoring-stacks/SKILL.md --create-dirs "https://raw.githubusercontent.com/jeremylongshore/claude-code-plugins-plus-skills/main/plugins/devops/monitoring-stack-deployer/skills/deploying-monitoring-stacks/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/deploying-monitoring-stacks/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How deploying-monitoring-stacks Compares

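| Feature / Agent | deploying-monitoring-stacks | Standard Approach |
|-----------------|-----------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |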

Frequently Asked Questions

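What does this skill do?

It generates configurations for deploying monitoring stacks (Prometheus + Grafana, Datadog, or VictoriaMetrics), covering metric collection, dashboards, and alerting rules.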

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Deploying Monitoring Stacks

## Overview

Deploy production monitoring stacks (Prometheus + Grafana, Datadog, or VictoriaMetrics) with metric collection, custom dashboards, and alerting rules. Configure exporters, scrape targets, recording rules, and notification channels for infrastructure and application observability.

## Prerequisites

- Target infrastructure identified: Kubernetes cluster, Docker hosts, or bare-metal servers
- Metric endpoints accessible from the monitoring platform (application `/metrics`, node exporters)
- Storage backend capacity planned for time-series data (Prometheus TSDB, Thanos, or Cortex for long-term)
- Alert notification channels defined: Slack webhook, PagerDuty integration key, or email SMTP
- Helm 3+ for Kubernetes deployments using kube-prometheus-stack or similar charts
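
For the storage-capacity prerequisite, a rough sizing sketch may help. It applies the rule of thumb from the Prometheus storage documentation (disk is roughly retention seconds × ingested samples per second × bytes per sample, at about 1 to 2 bytes per sample). The helper name `estimate_tsdb_bytes`, the workload figures, and the 2-bytes-per-sample default are illustrative assumptions, not measured values.

```python
def estimate_tsdb_bytes(active_series: int,
                        scrape_interval_s: float,
                        retention_days: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough local TSDB disk estimate (hypothetical helper, not a Prometheus API)."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape_interval_s must be positive")
    # Each active series yields one sample per scrape interval.
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Assumed workload: 100 targets exposing about 1,000 series each, scraped every 15s.
    size = estimate_tsdb_bytes(active_series=100 * 1000,
                               scrape_interval_s=15,
                               retention_days=15)
    print(f"~{size / 1e9:.1f} GB for 15 days of local retention")
```

Treat the result as a lower bound for planning; real usage also depends on churn and compaction overhead.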

## Instructions

1. Select the monitoring platform: Prometheus + Grafana for open-source self-hosted, Datadog for managed SaaS, VictoriaMetrics for high-cardinality workloads
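2. Deploy the monitoring stack: on Kubernetes, add the chart repository with `helm repo add prometheus-community https://prometheus-community.github.io/helm-charts`, then run `helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack`; use Docker Compose for non-Kubernetes hosts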
3. Install exporters on monitored systems: node-exporter for host metrics, kube-state-metrics for Kubernetes object states, application-specific exporters
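4. Configure scrape targets in `prometheus.yml`: define job names, scrape intervals, and relabeling rules for service discovery (with kube-prometheus-stack, declare targets through ServiceMonitor or PodMonitor resources instead)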
5. Create recording rules for frequently queried aggregations to reduce dashboard query load
6. Define alerting rules with meaningful thresholds: high CPU (>80% for 5m), high memory (>90%), error rate (>1%), latency P99 (>500ms)
7. Configure Alertmanager with routing, grouping, and notification channels (Slack, PagerDuty, email)
8. Build Grafana dashboards: RED metrics (Rate, Errors, Duration) for services, USE metrics (Utilization, Saturation, Errors) for resources
9. Set up data retention: configure TSDB retention period (15-30 days local), set up Thanos/Cortex for long-term storage if needed
10. Test the full pipeline: trigger a test alert and verify notification delivery
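
Step 6 can be sketched as a small rule-group generator. The layout (`groups`, each holding `rules` with `alert`, `expr`, `for`, `labels`, `annotations`) follows the Prometheus rule-file format. The CPU and memory expressions assume node-exporter metrics; `http_requests_total` (with a `status` label) and `http_request_duration_seconds_bucket` are hypothetical application metrics. The 5m hold is stated above only for CPU, so applying it to the other alerts is an assumption. The sketch prints JSON because that needs only the standard library; it would still need converting to the YAML your deployment expects.

```python
import json

# (alert name, PromQL expression, severity); thresholds taken from step 6.
THRESHOLDS = [
    ("HighCPU",
     '100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80',
     "warning"),
    ("HighMemory",
     "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90",
     "warning"),
    # Assumed application metrics below; rename to match your instrumentation.
    ("HighErrorRate",
     'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01',
     "critical"),
    ("HighLatencyP99",
     "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5",
     "critical"),
]


def build_rule_group(name: str = "baseline-alerts", hold: str = "5m") -> dict:
    """Build a rule-file structure as a plain dict (illustrative helper)."""
    rules = [
        {
            "alert": alert,
            "expr": expr,
            "for": hold,  # condition must hold this long before the alert fires
            "labels": {"severity": severity},
            "annotations": {"summary": f"{alert} threshold exceeded"},
        }
        for alert, expr, severity in THRESHOLDS
    ]
    return {"groups": [{"name": name, "rules": rules}]}


if __name__ == "__main__":
    print(json.dumps(build_rule_group(), indent=2))
```

Keeping thresholds in one table makes them easy to review and tune before they are rendered into a rule file.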

## Output

- Helm values file or Docker Compose for the monitoring stack
- Prometheus configuration with scrape targets, recording rules, and alerting rules
- Alertmanager configuration with routing tree and notification receivers
- Grafana dashboard JSON files for infrastructure and application metrics
- Exporter deployment manifests (node-exporter DaemonSet, application ServiceMonitor)
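
How the routing tree in the Alertmanager configuration picks a receiver can be illustrated with a simplified model: an alert descends into the first child route whose matchers all match, and falls back to the parent's receiver when no child matches. This sketch supports only equality matchers and ignores `continue`, grouping, and timing options; the receiver names and labels are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)   # label name -> required value
    routes: list = field(default_factory=list)  # child routes, checked in order


def resolve_receiver(route: Route, labels: dict) -> str:
    """Return the receiver of the deepest matching route (first matching child wins)."""
    for child in route.routes:
        if all(labels.get(key) == value for key, value in child.match.items()):
            return resolve_receiver(child, labels)
    # No child matched, so this node handles the alert.
    return route.receiver


# Hypothetical tree: critical alerts page on-call, platform alerts go to Slack.
ROOT = Route(
    receiver="email-default",
    routes=[
        Route(receiver="pagerduty-oncall", match={"severity": "critical"}),
        Route(
            receiver="slack-platform",
            match={"team": "platform"},
            routes=[Route(receiver="slack-db", match={"service": "database"})],
        ),
    ],
)

if __name__ == "__main__":
    print(resolve_receiver(ROOT, {"team": "platform", "service": "database"}))
```

Route order matters in this model: a critical platform alert reaches `pagerduty-oncall` because that child is listed first. For a real configuration, `amtool config routes test` evaluates labels against the actual tree.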

## Error Handling

| Error | Cause | Solution |
|-------|-------|---------|
| `No data points in dashboard` | Scrape target not reachable or metric name wrong | Check `Targets` page in Prometheus UI; verify service discovery and metric name |
| `Too many time series (high cardinality)` | Labels with unbounded values (user IDs, request IDs) | Remove high-cardinality labels with `metric_relabel_configs`; use recording rules for aggregation |
| `Alert condition met but no notification` | Alertmanager routing or receiver misconfigured | Verify Alertmanager config with `amtool check-config`; check routing with `amtool config routes test`; send a test alert with `amtool alert add` |
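| `Prometheus OOMKilled` | Insufficient memory for series count | Increase memory limits; reduce active series by removing scrape targets or dropping unused metrics with `metric_relabel_configs`; split the load across multiple Prometheus instances if needed |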
| `Grafana datasource connection failed` | Wrong Prometheus URL or network policy blocking access | Verify datasource URL in Grafana; check Kubernetes service name and port; review network policies |
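
The high-cardinality row is easier to reason about with a worst-case estimate: the number of series for one metric is bounded by the product of the distinct values of each label. The label names and counts below are made-up illustrations.

```python
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request counter.
bounded = {"method": 5, "status": 6, "instance": 20}
unbounded = {**bounded, "user_id": 50_000}

if __name__ == "__main__":
    print(max_series(bounded))    # 600
    print(max_series(unbounded))  # 30000000
```

One unbounded label multiplies the bound by its own cardinality, which is why removing `user_id` from the instrumentation brings the estimate back to 600.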

## Examples

- "Deploy kube-prometheus-stack on Kubernetes with alerts for node CPU > 80%, pod restart count > 5, and API error rate > 1%, sending to Slack."
- "Set up Prometheus + Grafana on Docker Compose for monitoring 10 application servers with node-exporter and custom application metrics."
- "Create Grafana dashboards for the four golden signals (latency, traffic, errors, saturation) for a microservices application."

## Resources

- Prometheus documentation: https://prometheus.io/docs/
- Grafana documentation: https://grafana.com/docs/grafana/latest/
- kube-prometheus-stack: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
- Alerting best practices: https://prometheus.io/docs/practices/alerting/
- Datadog documentation: https://docs.datadoghq.com/
