alerting-and-monitoring

Define alerts, escalation, and incident response.

16 stars

Best use case

alerting-and-monitoring is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using alerting-and-monitoring should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/alerting-and-monitoring/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/devops/alerting-and-monitoring/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/alerting-and-monitoring/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How alerting-and-monitoring Compares

| Feature / Agent | alerting-and-monitoring | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Define alerts, escalation, and incident response.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Alerting And Monitoring

## Purpose
- Define alerts, escalation, and incident response.

## Preconditions
- Access to system context (repos, infra, environments)
- Confirmed requirements and constraints
- Required approvals for security, compliance, or governance

## Inputs
- Problem statement and scope
- Current architecture or system constraints
- Non-functional requirements (performance, security, compliance)
- Target stack and environment

## Outputs
- Design or implementation plan
- Required artifacts (diagrams, configs, specs, checklists)
- Validation steps and acceptance criteria

## Detailed Step-by-Step Procedures
1. Clarify scope, constraints, and success metrics.
2. Review current system state, dependencies, and integration points.
3. Select patterns, tools, and architecture options that match constraints.
4. Produce primary artifacts (docs/specs/configs/code stubs).
5. Validate against requirements and known risks.
6. Provide rollout and rollback guidance.

## Decision Trees and Conditional Logic
- If compliance or regulatory scope applies -> add required controls and audit steps.
- If latency budget is strict -> choose low-latency storage and caching; else -> prefer cost-optimized storage and tiering.
- If data consistency is critical -> prefer transactional boundaries and strong consistency; else -> evaluate eventual consistency or async processing.
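
The conditional logic above can be sketched as a small function. This is only an illustration: the `Constraints` class, its field names, and `choose_options` are hypothetical and do not come from the skill; the returned strings simply repeat the wording of the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraints:
    """Hypothetical inputs; the field names are illustrative, not from the skill."""
    regulated: bool = False
    strict_latency: bool = False
    consistency_critical: bool = False


def choose_options(c: Constraints) -> dict[str, object]:
    """Map each constraint to the choice named in the decision list."""
    return {
        # Compliance scope adds controls; there is no "else" branch for it.
        "extra_controls": ["required controls", "audit steps"] if c.regulated else [],
        "storage": (
            "low-latency storage and caching"
            if c.strict_latency
            else "cost-optimized storage and tiering"
        ),
        "consistency": (
            "transactional boundaries and strong consistency"
            if c.consistency_critical
            else "eventual consistency or async processing"
        ),
    }
```

Each branch is independent, so a regulated system with a strict latency budget gets both the extra controls and the low-latency storage choice.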

## Error Handling and Edge Cases
- Partial failures across dependencies -> isolate blast radius and retry with backoff.
- Data corruption or loss risk -> enable backups and verify restore path.
- Limited access to systems -> document gaps and request access early.
- Legacy dependencies with limited change tolerance -> use adapters and phased rollout.
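
As a rough sketch of "retry with backoff" from the first item above, the helper below retries a failing call with exponential backoff and full jitter. The function name `retry_with_backoff`, its parameters, and the default values are assumptions made for illustration; the skill does not prescribe a specific retry policy.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` until it succeeds or the attempts run out.

    The delay cap doubles after each failure (bounded by max_delay), and the
    actual wait is drawn uniformly from [0, cap] ("full jitter").
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

The `sleep` parameter is injectable so the helper can be exercised in tests without real waiting. In practice you would likely narrow `except Exception` to the transient error types of the dependency being called.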

## Tool Requirements and Dependencies
- CLI and SDK tooling for the target stack
- Credentials or access tokens for required environments
- Diagramming or spec tooling when producing docs

## Stack Profiles
- Use Profile A, B, or C from `skills/STACK_PROFILES.md`.
- Note selected profile in outputs for traceability.

## Validation
- Requirements coverage check
- Security and compliance review
- Performance and reliability review
- Peer or stakeholder sign-off

## Rollback Procedures
- Revert config or deployment to last known good state.
- Roll back database migrations if applicable.
- Verify service health, data integrity, and error rates after rollback.

## Success Metrics
- Measurable outcomes (latency, error rate, uptime, cost)
- Acceptance thresholds defined with stakeholders
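
One way to make acceptance thresholds checkable is to compare observed metrics against agreed limits. The sketch below assumes a hypothetical `Threshold` type and `failed_thresholds` function; metric names and limits are placeholders to be agreed with stakeholders, not values from the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str                     # hypothetical metric name
    limit: float                    # agreed acceptance limit
    higher_is_better: bool = False  # True for metrics such as uptime


def failed_thresholds(
    observed: dict[str, float], thresholds: list[Threshold]
) -> list[str]:
    """Return the metrics that miss their limit or were not reported at all."""
    failures = []
    for t in thresholds:
        value = observed.get(t.metric)
        if value is None:
            failures.append(t.metric)  # a missing metric counts as a failure
        elif t.higher_is_better and value < t.limit:
            failures.append(t.metric)
        elif not t.higher_is_better and value > t.limit:
            failures.append(t.metric)
    return failures
```

Treating an unreported metric as a failure is a deliberate choice here: a silent gap in monitoring should not pass as success.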

## Example Workflows and Use Cases
- Minimal: apply the skill to a small service or single module.
- Production: apply the skill to a multi-service or multi-tenant system.
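
For the minimal case, an alert with an escalation path could be modeled roughly as follows. Everything here is hypothetical: the `AlertRule` and `EscalationStep` types, the rule name, the threshold, and the notification targets are illustrative placeholders, not artifacts defined by the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes unacknowledged before this step applies
    notify: str         # hypothetical notification target


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    threshold: float
    escalation: tuple[EscalationStep, ...]

    def is_firing(self, value: float) -> bool:
        """The alert fires when the metric exceeds its threshold."""
        return value > self.threshold

    def targets(self, unacked_minutes: int) -> list[str]:
        """Everyone who should have been notified by now, in escalation order."""
        steps = sorted(self.escalation, key=lambda s: s.after_minutes)
        return [s.notify for s in steps if s.after_minutes <= unacked_minutes]


# Example rule with placeholder values.
rule = AlertRule(
    name="high-error-rate",
    metric="error_rate",
    threshold=0.05,
    escalation=(
        EscalationStep(0, "on-call-primary"),
        EscalationStep(15, "on-call-secondary"),
        EscalationStep(30, "engineering-manager"),
    ),
)
```

With this rule, an error rate of 0.08 fires the alert, and after 20 unacknowledged minutes both the primary and secondary on-call targets are in scope.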

Related Skills

alerting

16
from diegosouzapw/awesome-omni-skill

Real-time alerting and notification system for Univers infrastructure. Use this when you need to monitor system health, service status, and send proactive alerts when thresholds are exceeded or services fail.

alerting-rules-agent

16
from diegosouzapw/awesome-omni-skill

Designs and configures alerting rules for monitoring systems

observability-monitoring-performance-engineer

16
from diegosouzapw/awesome-omni-skill

Expert performance engineer specializing in modern observability, application optimization, and scalable system performance. Masters OpenTelemetry, distributed tracing, load testing, multi-tier caching, Core Web Vitals, and performance monitoring. Handles end-to-end optimization, real user monitoring, and scalability patterns. Use PROACTIVELY for performance optimization, observability, or scalability challenges. Use when: the task directly matches performance engineer responsibilities within plugin observability-monitoring. Do not use when: a more specific framework or task-focused skill is clearly a better match.

monitoring

16
from diegosouzapw/awesome-omni-skill

Production health checks, uptime monitoring, and performance metrics. Monitoring best practices for the DevOps engineer agent.

blazemeter-api-monitoring

16
from diegosouzapw/awesome-omni-skill

Comprehensive guide for BlazeMeter API Monitoring, including test creation, configuration, scripting, integrations, notifications, and management. Use when working with API Monitoring tests for (1) Creating and configuring API tests, (2) Writing custom scripts (Initial, Pre-request, Post-response), (3) Integrating with third-party services (Slack, PagerDuty, Datadog, etc.), (4) Managing teams, buckets, and RBAC, (5) Configuring notifications and sharing results, (6) Using test data (CSV, Data Entities), (7) Advanced features (GraphQL, SOAP, file uploads, environments), or any other API Monitoring tasks.

sentry-setup-ai-monitoring

16
from diegosouzapw/awesome-omni-skill

Setup Sentry AI Agent Monitoring in any project. Use this when asked to add AI monitoring, track LLM calls, monitor AI agents, or instrument OpenAI/Anthropic/Vercel AI/LangChain/Google GenAI. Automatically detects installed AI SDKs and configures the appropriate Sentry integration.

Data Quality Monitoring

16
from diegosouzapw/awesome-omni-skill

Data Quality (DQ) Monitoring is the continuous process of validating data against predefined rules and expectations. In a modern data stack, monitoring must happen at every stage: **Ingestion**, **Tra

apify-brand-reputation-monitoring

16
from diegosouzapw/awesome-omni-skill

Track reviews, ratings, sentiment, and brand mentions across Google Maps, Booking.com, TripAdvisor, Facebook, Instagram, YouTube, and TikTok. Use when user asks to monitor brand reputation, analyze...

bgo

10
from diegosouzapw/awesome-omni-skill

Automates the complete Blender build-go workflow, from building and packaging your extension/add-on to removing old versions, installing, enabling, and launching Blender for quick testing and iteration.

go-microservices

16
from diegosouzapw/awesome-omni-skill

Production-ready Go microservices patterns including Gin, Echo, gRPC, clean architecture, dependency injection, error handling, middleware, testing, Docker containerization, Kubernetes deployment, distributed tracing, observability with Prometheus, high-performance APIs, concurrent processing, database integration with GORM, Redis caching, message queues, and cloud-native best practices.

go-expert

16
from diegosouzapw/awesome-omni-skill

Expert guidance for Go (Golang) development following industry best practices. Use when writing Go code, reviewing PRs, bootstrapping new services, configuring linters, implementing observability, or troubleshooting Go applications. Covers SOLID principles, Gang of Four design patterns, domain-driven structure, error handling, context patterns, concurrency, testing, structured logging, health endpoints, and CI gates.

gke-deployment

16
from diegosouzapw/awesome-omni-skill

Deploy, configure, and manage Kubernetes workloads on GKE with Deployments, Services, Ingress, HPA, health probes, ConfigMaps, and Secrets. Use when deploying containers to GKE, configuring load balancers, setting up autoscaling, writing health checks, managing environment configs, or troubleshooting pod issues.