alerting

Real-time alerting and notification system for Univers infrastructure. Use this when you need to monitor system health and service status, and to send proactive alerts when thresholds are exceeded or services fail.


by diegosouzapw


Best use case

alerting is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using alerting should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/alerting/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/devops/alerting/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/alerting/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How alerting Compares

Feature / Agent	alerting	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

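What does this skill do?

It provides real-time monitoring and alerting for the Univers infrastructure: it watches system health and service status and sends alerts when thresholds are exceeded or services fail.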

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Alerting Skill

This skill provides comprehensive monitoring and alerting capabilities for the Univers infrastructure ecosystem.

## Capabilities

### 1. Real-time Monitoring
- System resource monitoring (CPU, Memory, Disk, Network)
- Service health checks (HTTP endpoints, ports, processes)
- Application-specific metrics (response times, error rates)
- Custom metric collection and aggregation

### 2. Alert Engine
- Threshold-based alerting
- Rate limiting and alert suppression
- Alert escalation policies
- Multi-condition alert rules

### 3. Notification Channels
- Email notifications with rich formatting
- Slack/Teams integration with actionable messages
- Webhook support for custom integrations
- In-app notifications and banners

### 4. Alert Management
- Alert acknowledgment and resolution
- Alert history and analytics
- Scheduled maintenance windows
- Alert rule testing and validation

### 5. Dashboards and Reports
- Real-time alert status dashboard
- Historical alert trends and analytics
- Service health overview
- Performance metrics visualization

## Common Tasks

### Basic Alert Setup
```bash
# Check system for alert conditions
alert check system

# Monitor specific services
alert monitor services

# Test notification channels
alert test channels
```

### Alert Rule Management
```bash
# List all alert rules
alert rules list

# Add new alert rule
alert rules add cpu-high --threshold 80 --duration 5m

# Update existing rule
alert rules update memory-usage --threshold 90

# Remove alert rule
alert rules remove disk-space-low
```
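
The `--threshold` and `--duration` flags above suggest that a rule fires only when a metric stays over its threshold for the whole window. How the `alert` CLI evaluates this internally is not documented here. The sketch below shows one possible interpretation, assuming integer samples taken once per minute, so that `5m` corresponds to five consecutive samples. The function name `sustained_breach` is hypothetical.

```bash
# Hypothetical helper (not part of the alert CLI): print "firing" when the
# most recent REQUIRED_COUNT samples all exceed THRESHOLD, otherwise "ok".
# Usage: sustained_breach THRESHOLD REQUIRED_COUNT SAMPLE...
sustained_breach() {
  threshold=$1
  required=$2
  shift 2
  streak=0
  for sample in "$@"; do
    if [ "$sample" -gt "$threshold" ]; then
      streak=$((streak + 1))
    else
      streak=0   # a sample at or below the threshold resets the window
    fi
  done
  if [ "$streak" -ge "$required" ]; then
    echo firing
  else
    echo ok
  fi
}

# Five one-minute samples, all above 80: prints "firing".
sustained_breach 80 5 85 90 88 92 81
# A dip to 70 resets the streak, leaving only two consecutive breaches: prints "ok".
sustained_breach 80 5 85 90 70 92 81
```

Requiring a sustained breach rather than a single sample is one common way to avoid alerting on short spikes.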

### Notification Configuration
```bash
# Configure email notifications
alert config email --smtp smtp.example.com --from alerts@example.com

# Configure Slack integration
alert config slack --webhook https://hooks.slack.com/... --channel '#alerts'

# Test notification delivery
alert test email --to admin@example.com
alert test slack --message "Test alert"
```

### Alert Operations
```bash
# View active alerts
alert status

# Acknowledge an alert
alert acknowledge CPU_HIGH_001

# Resolve an alert
alert resolve MEMORY_HIGH_003

# View alert history
alert history --last 24h
```

## Alert Rule Examples

### System Resource Alerts
```yaml
# High CPU Usage
name: cpu-high
condition: cpu_usage > 80
duration: 5m
severity: warning
message: "CPU usage is {{cpu_usage}}% on {{hostname}}"
actions:
  - type: email
    to: ops@example.com
  - type: slack
    channel: "#alerts"

---
# Critical Memory Usage
name: memory-critical
condition: memory_usage > 90
duration: 2m
severity: critical
message: "Critical memory usage: {{memory_usage}}%"
actions:
  - type: webhook
    url: https://api.pagerduty.com/incidents
```
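
The `message` fields above use `{{name}}` placeholders. The skill does not say which template engine expands them, so the following is only a sketch of simple placeholder substitution. The helper `render_message` and its `KEY=VALUE` argument style are assumptions made for illustration, and keys are assumed to be plain identifiers.

```bash
# Hypothetical helper (not part of the alert CLI): replace each {{key}}
# placeholder in TEMPLATE with the value from a KEY=VALUE argument.
# Usage: render_message TEMPLATE KEY=VALUE...
render_message() {
  message=$1
  shift
  for pair in "$@"; do
    key=${pair%%=*}      # text before the first "="
    value=${pair#*=}     # text after the first "="
    # Escape characters that are special in a sed replacement: \ & and the | delimiter.
    escaped=$(printf '%s' "$value" | sed 's/[\\&|]/\\&/g')
    message=$(printf '%s' "$message" | sed "s|{{$key}}|$escaped|g")
  done
  printf '%s\n' "$message"
}

# Prints: CPU usage is 93% on web-01
render_message 'CPU usage is {{cpu_usage}}% on {{hostname}}' cpu_usage=93 hostname=web-01
```

Placeholders without a matching argument are left untouched, which makes a missing variable easy to spot in the delivered message.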

### Service Health Alerts
```yaml
# Service Down
name: service-down
condition: service_health == 0
duration: 1m
severity: critical
message: "{{service_name}} is down on {{hostname}}"
actions:
  - type: email
    to: devops@example.com
  - type: restart
    service: "{{service_name}}"

---
# High Response Time
name: slow-response
condition: response_time > 2000
duration: 3m
severity: warning
message: "{{service_name}} response time: {{response_time}}ms"
actions:
  - type: slack
    channel: "#performance"
```

### Application-Specific Alerts
```yaml
# High Error Rate
name: high-error-rate
condition: error_rate > 5
duration: 5m
severity: warning
message: "{{application}} error rate: {{error_rate}}%"
actions:
  - type: email
    to: dev-team@example.com

---
# Database Connection Issues
name: db-connection-failed
condition: db_connection_status != "healthy"
duration: 30s
severity: critical
message: "Database connection failed for {{application}}"
actions:
  - type: webhook
    url: https://hooks.slack.com/...
```

## Integration Examples

### Univers Services Integration
```bash
# Monitor Univers services
alert monitor univers-services

# Check specific Univers endpoints
alert check endpoint http://localhost:3003/health --service univers-server
alert check endpoint http://localhost:6007 --service univers-ui
alert check endpoint http://localhost:5173 --service univers-web

# Monitor tmux sessions
alert monitor tmux-sessions --alert-if-missing univers-developer
```
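
The endpoint checks above are run by the `alert` CLI, whose internals are not shown here. As a rough sketch of what such a check could look like with plain `curl`, the helpers below probe a URL and map the result to a `service_health` value of `1` (healthy) or `0` (down). The function names, the 5-second timeout, and the rule that any 2xx status counts as healthy are assumptions.

```bash
# Hypothetical helper: map an HTTP status code to a service_health value.
classify_status() {
  case $1 in
    2[0-9][0-9]) echo 1 ;;  # any 2xx response counts as healthy (assumption)
    *)           echo 0 ;;  # everything else, including curl's 000 on connection failure
  esac
}

# Hypothetical helper: probe one endpoint and print 1 or 0.
# Usage: probe_endpoint URL
probe_endpoint() {
  # -s silences progress output, -o /dev/null discards the body,
  # -w prints only the status code, --max-time caps the whole request at 5s.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1") || true
  classify_status "$code"
}

# Example (needs the service to be running locally):
# probe_endpoint http://localhost:3003/health
```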

### Container Integration
```bash
# Monitor Docker containers
alert monitor containers --include 'univers-*'

# Check container health
alert check container univers-server
alert check container univers-ui
```

## Configuration Files

### Alert Rules Configuration
```yaml
# ~/.config/univers/alerting/rules.yaml
rules:
  - name: system-cpu-high
    type: system
    metric: cpu_usage
    operator: ">"
    threshold: 80
    duration: 5m
    severity: warning

  - name: service-unavailable
    type: service
    check: http_status
    target: "http://localhost:3003/health"
    operator: "!="
    threshold: 200
    duration: 1m
    severity: critical
```
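
Each rule above pairs a measured value with an `operator` and a `threshold`. The sketch below shows how such a comparison might be evaluated for integer values. The function `rule_matches` is hypothetical, and the set of supported operators is an assumption, since the skill only shows `">"` and `"!="`.

```bash
# Hypothetical helper (not part of the alert CLI): succeed (exit 0) when
# VALUE OPERATOR THRESHOLD holds for integer operands.
# Usage: rule_matches VALUE OPERATOR THRESHOLD
rule_matches() {
  case $2 in
    '>')  [ "$1" -gt "$3" ] ;;
    '<')  [ "$1" -lt "$3" ] ;;
    '==') [ "$1" -eq "$3" ] ;;
    '!=') [ "$1" -ne "$3" ] ;;
    *)    echo "unsupported operator: $2" >&2; return 2 ;;
  esac
}

# cpu_usage of 85 against the system-cpu-high rule (> 80): matches.
rule_matches 85 '>' 80 && echo "system-cpu-high matches"
# http_status of 200 against the service-unavailable rule (!= 200): no match.
rule_matches 200 '!=' 200 || echo "service-unavailable does not match"
```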

### Notification Channels
```yaml
# ~/.config/univers/alerting/channels.yaml
channels:
  email:
    smtp_host: smtp.gmail.com
    smtp_port: 587
    username: alerts@company.com
    password: ${SMTP_PASSWORD}

  slack:
    webhook_url: ${SLACK_WEBHOOK_URL}
    default_channel: "#univers-alerts"

  webhook:
    endpoint: https://api.example.com/alerts
    headers:
      Authorization: "Bearer ${API_TOKEN}"
```
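
The channel configuration above reads secrets from `${SMTP_PASSWORD}`, `${SLACK_WEBHOOK_URL}`, and `${API_TOKEN}`. Whether the tool itself expands these placeholders is not stated, so treat the following as an optional preflight sketch: it lists any of those environment variables that are unset or empty before the configuration is used. The function name `missing_vars` is hypothetical.

```bash
# Hypothetical preflight check: print the name of every listed environment
# variable that is unset or empty.
# Usage: missing_vars NAME...
missing_vars() {
  for name in "$@"; do
    # eval reads the variable whose name is stored in $name; only pass
    # trusted, literal variable names to this function.
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}

# Prints one line per missing secret; prints nothing when all are set.
missing_vars SMTP_PASSWORD SLACK_WEBHOOK_URL API_TOKEN
```

Keeping secrets in the environment rather than in `channels.yaml` means the file can be committed without exposing credentials.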

## Best Practices

1. **Set Meaningful Thresholds**: Avoid alert fatigue by setting realistic thresholds
2. **Use Escalation Policies**: Implement graduated alert escalation
3. **Provide Context**: Include relevant details in alert messages
4. **Test Regularly**: Verify alert rules and notification channels
5. **Document Procedures**: Maintain clear runbooks for common alerts
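
As an illustration of graduated escalation (practice 2), the sketch below picks a notification target from the number of minutes an alert has gone unacknowledged. The tier names and the 10- and 30-minute cutoffs are invented for this example and are not defined by the skill.

```bash
# Hypothetical escalation policy: choose who to notify based on how long
# an alert has been unacknowledged.
# Usage: escalation_target MINUTES_UNACKNOWLEDGED
escalation_target() {
  if [ "$1" -ge 30 ]; then
    echo manager        # third tier: 30 minutes or more
  elif [ "$1" -ge 10 ]; then
    echo on-call        # second tier: 10 to 29 minutes
  else
    echo team-channel   # first tier: under 10 minutes
  fi
}

escalation_target 5    # prints "team-channel"
escalation_target 12   # prints "on-call"
escalation_target 45   # prints "manager"
```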

## Troubleshooting

### Common Issues
- **Missing Notifications**: Check channel configurations and connectivity
- **False Positives**: Review alert thresholds and conditions
- **Alert Storms**: Implement rate limiting and suppression rules
- **Slow Performance**: Optimize alert check intervals and data collection
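
For alert storms, one simple form of rate limiting is a per-rule cooldown: after a notification is sent, further notifications for the same rule are suppressed until the cooldown has passed. The sketch below is not how the `alert` engine is known to work; the function `should_notify`, the one-file-per-rule state layout, and the 300-second cooldown are all assumptions.

```bash
# Hypothetical helper: print "send" if at least COOLDOWN seconds have passed
# since the last notification for RULE, otherwise print "suppress".
# The time of the last notification is kept in STATE_DIR/RULE.last.
# Usage: should_notify STATE_DIR RULE NOW_EPOCH COOLDOWN
should_notify() {
  state_file=$1/$2.last
  if [ -f "$state_file" ] && [ $(( $3 - $(cat "$state_file") )) -lt "$4" ]; then
    echo suppress
  else
    echo "$3" > "$state_file"   # remember when this notification went out
    echo send
  fi
}

state_dir=$(mktemp -d)
should_notify "$state_dir" cpu-high 1000 300   # first alert: prints "send"
should_notify "$state_dir" cpu-high 1120 300   # 120s later: prints "suppress"
should_notify "$state_dir" cpu-high 1300 300   # 300s after the first: prints "send"
rm -r "$state_dir"
```

Timestamps are passed in as arguments rather than read from the clock so the behavior can be checked deterministically; in practice `$(date +%s)` would supply the current time.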

### Debug Commands
```bash
# Check alert engine status
alert status --verbose

# Test specific rule
alert test-rule cpu-high

# Check notification delivery
alert test-notification email --to test@example.com

# View alert engine logs
alert logs --tail 100
```

## Version History

- v1.0 (2025-12-16): Initial alerting system implementation
  - Basic monitoring, email notifications, and alert rules
