prometheus-patterns

PromQL queries, alerting rules, recording rules, Grafana dashboard JSON, SLO

422 stars

Best use case

prometheus-patterns is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

PromQL queries, alerting rules, recording rules, Grafana dashboard JSON, SLO

Teams using prometheus-patterns should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/prometheus-patterns/SKILL.md --create-dirs "https://raw.githubusercontent.com/vibeeval/vibecosystem/main/skills/prometheus-patterns/skill.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/prometheus-patterns/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How prometheus-patterns Compares

Feature / Agentprometheus-patternsStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

PromQL queries, alerting rules, recording rules, Grafana dashboard JSON, SLO

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Prometheus Patterns

## PromQL Essentials

### Rate and Error Calculations

```promql
# Request rate (per second, 5m window)
rate(http_requests_total[5m])

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# P99 latency from histogram
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# P50 latency by endpoint
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler)
)

# Saturation: CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
/ sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod) * 100
```

### SLO: Error Budget

```promql
# SLO: 99.9% availability over 30 days
# Error budget = 0.1% = 43.2 minutes/month

# Current burn rate (how fast consuming budget)
1 - (
  sum(rate(http_requests_total{status!~"5.."}[1h]))
  / sum(rate(http_requests_total[1h]))
) / (1 - 0.999)

# Remaining error budget (percentage)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  / (sum(increase(http_requests_total[30d])) * 0.001)
)
```

## Recording Rules

```yaml
groups:
  - name: sli_rules
    interval: 30s
    rules:
      - record: job:http_request_rate:5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      - record: job:http_error_rate:5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          / sum(rate(http_requests_total[5m])) by (job)

      - record: job:http_latency_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )
```

## Alerting Rules

```yaml
groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_error_rate:5m > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for {{ $labels.job }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

      - alert: HighLatency
        expr: job:http_latency_p99:5m > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 500ms for {{ $labels.job }}"

      - alert: ErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > 14.4 * 0.001
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Burning error budget 14.4x faster than allowed"
```

## Instrumentation (Go)

```go
import "github.com/prometheus/client_golang/prometheus"

var (
    httpRequests = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "handler", "status"},
    )
    httpDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5},
        },
        []string{"method", "handler"},
    )
)
```

## Checklist

- [ ] RED metrics (Rate, Errors, Duration) for every service
- [ ] Recording rules for expensive queries
- [ ] Alerts have runbook links
- [ ] Error budget alerts with multi-window burn rate
- [ ] Histogram buckets match expected latency distribution
- [ ] Labels have low cardinality (no user IDs, request IDs)
- [ ] Grafana dashboards use recording rules, not raw queries
- [ ] Alert severity matches response SLA

## Anti-Patterns

- High cardinality labels (user_id, trace_id) causing metric explosion
- Using `avg()` for latency instead of histograms/quantiles
- Missing `for` clause in alerts causing alert storms
- Recording rules with too short intervals wasting resources
- Alerting on symptoms without linking to causes
- Not setting meaningful histogram bucket boundaries

Related Skills

websocket-patterns

422
from vibeeval/vibecosystem

Connection management, room patterns, reconnection strategies, message buffering, and binary protocol design.

vector-db-patterns

422
from vibeeval/vibecosystem

Embedding strategies, ANN algorithms, hybrid search, RAG chunking strategies, and reranking for semantic search and retrieval.

tracing-patterns

422
from vibeeval/vibecosystem

OpenTelemetry setup, span context propagation, sampling strategies, Jaeger queries

terraform-patterns

422
from vibeeval/vibecosystem

Module composition, state management, workspace strategy, provider versioning, and infrastructure-as-code best practices.

swift-patterns

422
from vibeeval/vibecosystem

SwiftUI view composition, @Observable patterns, async/await concurrency, TCA architecture, and Combine reactive streams.

springboot-patterns

422
from vibeeval/vibecosystem

Spring Boot architecture patterns, REST API design, layered services, data access, caching, async processing, and logging. Use for Java Spring Boot backend work.

seo-patterns

422
from vibeeval/vibecosystem

Meta tag patterns, structured data (JSON-LD), Core Web Vitals optimization, and SSR/SSG strategies for search visibility.

secret-patterns

422
from vibeeval/vibecosystem

30+ service-specific secret detection regex patterns, entropy-based detection, PEM/JWT/Base64 identification, and false positive filtering.

saas-payment-patterns

422
from vibeeval/vibecosystem

Payment provider abstraction, webhook security, subscription lifecycle, dunning flows, pricing models, invoicing, tax handling, and refund patterns for SaaS applications.

saas-auth-patterns

422
from vibeeval/vibecosystem

SaaS authentication and authorization patterns including JWT vs session strategies, multi-tenant isolation, RBAC, API key management, passwordless flows, MFA, and secure session handling.

saas-analytics-patterns

422
from vibeeval/vibecosystem

SaaS analytics event taxonomy, metric formulas (MRR, churn, LTV), provider-agnostic tracking, funnel analysis, cohort setup, and privacy-respecting instrumentation.

revenuecat-patterns

422
from vibeeval/vibecosystem

RevenueCat SDK entegrasyon pattern'leri. iOS (Swift), Android (Kotlin), React Native ve Flutter icin setup, offerings, entitlement checking, webhook integration, StoreKit 2 migration ve sandbox testing.