monitoring
Production health checks, uptime monitoring, and performance metrics. Monitoring best practices for the DevOps engineer agent.
Best use case
monitoring is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using monitoring should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/monitoring/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How monitoring Compares
| Feature / Agent | monitoring | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It covers production health checks, uptime monitoring, and performance metrics: monitoring best practices for the DevOps engineer agent.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Monitoring Skill
This skill is used by the devops-engineer agent to monitor systems and run health checks.
---
## 🎯 Monitoring Principles
### USE Method (Utilization, Saturation, Errors)
```
┌──────────┬─────┬─────┬───────────┐
│ RESOURCE │ U   │ S   │ E         │
├──────────┼─────┼─────┼───────────┤
│ CPU      │ 75% │ 0.5 │ 0         │
│ Memory   │ 60% │ 0.1 │ 0         │
│ Disk     │ 40% │ 0.0 │ 2 errors  │
│ Network  │ 30% │ 0.0 │ 1 timeout │
└──────────┴─────┴─────┴───────────┘
```
### RED Method (Rate, Errors, Duration)
```
┌─────────────────┬───────┬──────┬───────┐
│ ENDPOINT        │ R     │ E    │ D     │
├─────────────────┼───────┼──────┼───────┤
│ GET /api/users  │ 150/s │ 0%   │ 50ms  │
│ POST /api/auth  │ 20/s  │ 2%   │ 200ms │
│ GET /api/orders │ 80/s  │ 0.5% │ 120ms │
└─────────────────┴───────┴──────┴───────┘
```
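As a rough illustration of how the R, E, and D columns in the RED table could be derived, the sketch below aggregates raw request samples per endpoint. The `RequestSample` shape, the `summarizeRed` name, and the choice of mean duration for D are assumptions made for this example, not part of the skill.

```typescript
// Hypothetical sample shape; the field names are assumptions for illustration.
interface RequestSample {
  endpoint: string   // e.g. "GET /api/users"
  status: number     // HTTP status code
  durationMs: number // time taken to serve the request
}

interface RedSummary {
  ratePerSec: number     // R: requests per second over the window
  errorPercent: number   // E: share of 5xx responses, in percent
  meanDurationMs: number // D: mean duration in milliseconds
}

// Aggregate samples collected over `windowSeconds` into RED values per endpoint.
function summarizeRed(
  samples: RequestSample[],
  windowSeconds: number
): Record<string, RedSummary> {
  // Group samples by endpoint
  const grouped: Record<string, RequestSample[]> = {}
  for (const s of samples) {
    if (!grouped[s.endpoint]) grouped[s.endpoint] = []
    grouped[s.endpoint].push(s)
  }

  const result: Record<string, RedSummary> = {}
  for (const endpoint of Object.keys(grouped)) {
    const list = grouped[endpoint]
    const errors = list.filter(s => s.status >= 500).length
    const totalMs = list.reduce((sum, s) => sum + s.durationMs, 0)
    result[endpoint] = {
      ratePerSec: list.length / windowSeconds,
      errorPercent: (errors / list.length) * 100,
      meanDurationMs: totalMs / list.length
    }
  }
  return result
}
```

In practice D is often reported as a percentile rather than a mean, since a mean hides slow outliers.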
---
## 📊 Critical Metrics
### 1. System Metrics
```bash
# CPU usage
top -bn1 | grep "Cpu(s)"

# Memory usage
free -h

# Disk usage
df -h

# Disk I/O
iostat -x 1 5

# Network
netstat -i
```
---
### 2. Application Metrics
#### Response Time (Latency)
```typescript
// Measure with middleware
app.use((req, res, next) => {
  const start = Date.now()
  res.on('finish', () => {
    const duration = Date.now() - start
    metrics.recordResponseTime(req.path, duration)
    // Warn on any single request slower than 500 ms
    // (a true p95 alert needs aggregated samples)
    if (duration > 500) {
      logger.warn('Slow response', { path: req.path, duration })
    }
  })
  next()
})
```
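The middleware checks each request individually, whereas a p95 threshold applies to a set of recorded durations. A minimal sketch of a nearest-rank percentile over collected samples follows; the `percentile` helper is a hypothetical name introduced here, and nearest-rank is only one of several common percentile definitions.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error('percentile() needs at least one sample')
  }
  // Copy before sorting so the caller's array is left untouched
  const sorted = samples.slice().sort((a, b) => a - b)
  // Integer arithmetic first, to avoid floating-point surprises in p / 100
  const rank = Math.ceil((p * sorted.length) / 100)
  return sorted[Math.max(rank, 1) - 1]
}

// Example: 19 fast requests and one slow outlier
const durations: number[] = []
for (let i = 0; i < 19; i++) durations.push(100)
durations.push(900)

const p95 = percentile(durations, 95)
const shouldAlert = p95 > 500
```

With these samples the single 900 ms outlier does not push p95 over the 500 ms threshold, which is the behaviour a percentile-based alert is meant to provide.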
#### Error Rate
```typescript
// Counters are cumulative since process start
let totalRequests = 0
let errorRequests = 0

app.use((req, res, next) => {
  totalRequests++
  res.on('finish', () => {
    if (res.statusCode >= 500) {
      errorRequests++
    }
    const errorRate = (errorRequests / totalRequests) * 100
    // Critical when the error rate exceeds 1%
    if (errorRate > 1) {
      alerting.sendCritical('High error rate', { rate: errorRate })
    }
  })
  next()
})
```
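Because the counters above never reset, an old burst of errors keeps influencing the rate indefinitely. One possible alternative is a sliding window that only counts recent requests. The sketch below assumes a hypothetical `createErrorRateWindow` helper and takes the current time as a parameter so the behaviour is easy to test.

```typescript
// Sliding-window error rate: only outcomes newer than `windowMs` are counted.
function createErrorRateWindow(windowMs: number) {
  let events: { at: number; isError: boolean }[] = []

  return {
    // Record one finished request and whether it was a server error
    record(isError: boolean, now: number = Date.now()): void {
      events.push({ at: now, isError })
    },

    // Percentage of errors among requests inside the window; 0 when empty
    ratePercent(now: number = Date.now()): number {
      events = events.filter(e => now - e.at < windowMs)
      if (events.length === 0) return 0
      const errors = events.filter(e => e.isError).length
      return (errors / events.length) * 100
    }
  }
}
```

Inside the `finish` handler this could be used as `window.record(res.statusCode >= 500)` followed by a check on `window.ratePercent()`. A production version would likely add a minimum sample size before alerting.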
#### Throughput
```typescript
// Requests per second, averaged over each one-minute window
let requestsThisMinute = 0

app.use((req, res, next) => {
  requestsThisMinute++
  next()
})

setInterval(() => {
  const rps = requestsThisMinute / 60
  metrics.record('requests_per_second', rps)
  requestsThisMinute = 0 // start the next window
}, 60000)
```
---
### 3. Database Metrics
```sql
-- Slow queries (PostgreSQL, requires the pg_stat_statements extension)
-- mean_exec_time is in milliseconds (the column is named mean_time before PostgreSQL 13)
SELECT
  query,
  mean_exec_time,
  calls
FROM pg_stat_statements
WHERE mean_exec_time > 1000
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Connection count
SELECT count(*) FROM pg_stat_activity;

-- Database size
SELECT pg_size_pretty(pg_database_size('mydb'));
```
```bash
# MongoDB metrics (mongosh replaces the legacy mongo shell on current MongoDB versions)
mongosh --eval "db.serverStatus().connections"
mongosh --eval "db.stats()"
```
---
## 🚨 Alerting Strategy
### Alert Levels
| Level | Threshold | Action |
|--------|-----------|---------|
| **INFO** | Normal event | Write to log |
| **WARNING** | Potential problem | Slack notification |
| **ERROR** | Significant error | Email + Slack |
| **CRITICAL** | System about to go down | PagerDuty + Phone call |
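
The level-to-action mapping in the table can be expressed as a small lookup. The sketch below is only an illustration: the `Severity` and `Channel` types and the `channelsFor` function are hypothetical names, and the actual delivery to Slack, email, or PagerDuty is left out.

```typescript
type Severity = 'INFO' | 'WARNING' | 'ERROR' | 'CRITICAL'
type Channel = 'log' | 'slack' | 'email' | 'pagerduty' | 'phone'

// Routing table mirroring the alert levels table
const ROUTES: Record<Severity, Channel[]> = {
  INFO: ['log'],
  WARNING: ['slack'],
  ERROR: ['email', 'slack'],
  CRITICAL: ['pagerduty', 'phone']
}

// Look up which channels an alert of the given severity should reach
function channelsFor(severity: Severity): Channel[] {
  return ROUTES[severity]
}
```

Keeping the routing in one table makes it easier to review who gets notified when a level is added or changed.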
---
### Example Alert Rules
```yaml
# Prometheus alert rules
groups:
  - name: app_alerts
    rules:
      # Ratio of 5xx responses to all responses over 5 minutes
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
      # p95 latency computed from histogram buckets
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 10m
        labels:
          severity: warning
      - alert: HighMemoryUsage
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 5m
        labels:
          severity: critical
```
---
## 🏥 Health Check Endpoints
### Liveness Check
```typescript
// /health/live - Is the service up?
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive', timestamp: Date.now() })
})
```
### Readiness Check
```typescript
// /health/ready - Is the service ready for traffic?
app.get('/health/ready', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    externalAPI: await checkExternalAPI()
  }
  const allHealthy = Object.values(checks).every(check => check.healthy)
  res.status(allHealthy ? 200 : 503).json({
    status: allHealthy ? 'ready' : 'not_ready',
    checks,
    timestamp: Date.now()
  })
})

async function checkDatabase() {
  try {
    await db.query('SELECT 1')
    return { healthy: true }
  } catch (error) {
    return { healthy: false, error: error.message }
  }
}
```
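A readiness probe that waits indefinitely on a hung dependency can itself time out at the orchestrator. One way to bound each check is a timeout wrapper, sketched below. `withTimeout` and `CheckResult` are hypothetical names introduced for this example; the result shape follows the `{ healthy, error }` objects returned by `checkDatabase()`.

```typescript
interface CheckResult {
  healthy: boolean
  error?: string
}

// Run a dependency check, reporting it as unhealthy if it does not
// settle within `timeoutMs`.
function withTimeout(
  check: () => Promise<CheckResult>,
  timeoutMs: number
): Promise<CheckResult> {
  return new Promise<CheckResult>(resolve => {
    const timer = setTimeout(
      () => resolve({ healthy: false, error: `timed out after ${timeoutMs}ms` }),
      timeoutMs
    )
    check()
      .then(result => resolve(result))
      .catch((err: unknown) => resolve({ healthy: false, error: String(err) }))
      // Clear the timer once the check has settled either way
      .then(() => clearTimeout(timer))
  })
}
```

In the readiness handler, `checkDatabase()` could then be called as `withTimeout(checkDatabase, 2000)`; the 2000 ms budget is an arbitrary example value.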
### Startup Check
```typescript
// /health/startup - Has initial startup completed?
let isStartupComplete = false

app.get('/health/startup', (req, res) => {
  if (isStartupComplete) {
    res.status(200).json({ status: 'started' })
  } else {
    res.status(503).json({ status: 'starting' })
  }
})

// Set the flag once startup completes
async function bootstrap() {
  await initializeDatabase()
  await warmupCache()
  await loadConfiguration()
  isStartupComplete = true
}
```
---
## 📈 Performance Monitoring
### Golden Signals
```
1. LATENCY    - How fast?
2. TRAFFIC    - How much demand is there?
3. ERRORS     - How much is failing?
4. SATURATION - Are resources full?
```
### Node.js Specific Metrics
```typescript
import v8 from 'v8'
import process from 'process'

// Event loop lag: how late a 1000 ms timer fires.
// Only the timer updates the value, so reading it never skews the measurement.
let lastCheck = Date.now()
let eventLoopLagMs = 0
setInterval(() => {
  const now = Date.now()
  eventLoopLagMs = Math.max(0, now - lastCheck - 1000) // Expected 1000ms
  lastCheck = now
}, 1000)

function getNodeMetrics() {
  const heapStats = v8.getHeapStatistics()
  const memUsage = process.memoryUsage()
  return {
    // Heap usage
    heap_total: heapStats.total_heap_size,
    heap_used: heapStats.used_heap_size,
    heap_limit: heapStats.heap_size_limit,
    // Memory
    rss: memUsage.rss, // Resident Set Size
    heap_total_mb: Math.round(memUsage.heapTotal / 1024 / 1024),
    heap_used_mb: Math.round(memUsage.heapUsed / 1024 / 1024),
    // Event loop lag (ms), as last measured by the timer
    event_loop_lag: eventLoopLagMs,
    // Uptime
    uptime_seconds: process.uptime(),
    // CPU
    cpu_usage: process.cpuUsage()
  }
}
```
---
## 🔍 Log Monitoring
### Structured Logging
```typescript
// ✅ Structured log (JSON)
logger.info({
  message: 'User login',
  userId: '123',
  ip: req.ip,
  userAgent: req.headers['user-agent'],
  timestamp: new Date().toISOString(),
  duration: 250
})

// Easy to query with log aggregation:
// "Show me all logins from userId=123 in last hour"
```
### Log Levels
```typescript
const logger = createLogger({
  level: process.env.LOG_LEVEL || 'info'
})

logger.error('Critical error', { error }) // Always logged
logger.warn('Warning', { context }) // Production
logger.info('User action', { userId }) // Production
logger.debug('Variable value', { value }) // Development only
logger.trace('Function call', { args }) // Development only
```
### Log Sampling
```typescript
// Don't write every log on high-traffic endpoints
const shouldLog = Math.random() < 0.1 // 10% sample
if (shouldLog) {
  logger.info('Request processed', { path: req.path })
}
```
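Uniform sampling also drops warnings and errors, which are usually the entries worth keeping. A level-aware variant is sketched below; the `shouldLog` function name, the `Level` type, and the injectable `random` parameter (there to make the behaviour testable) are assumptions for this example.

```typescript
type Level = 'error' | 'warn' | 'info' | 'debug'

// Always keep warnings and errors; sample everything else at `sampleRate`.
function shouldLog(
  level: Level,
  sampleRate: number,
  random: () => number = Math.random
): boolean {
  if (level === 'error' || level === 'warn') return true
  return random() < sampleRate
}
```

A call such as `shouldLog('info', 0.1)` keeps roughly one in ten info entries, while `shouldLog('error', 0.1)` always returns `true`.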
---
## 🛠️ Monitoring Tools & Integration
### Sentry Integration (MCP)
```typescript
// Error tracking with Sentry MCP
import { SentryMCP } from '@mcp/sentry'

app.use((err, req, res, next) => {
  // Send the error to Sentry
  SentryMCP.captureException(err, {
    user: { id: req.userId },
    tags: { endpoint: req.path },
    extra: { body: req.body }
  })
  res.status(500).json({ error: 'Internal server error' })
})
```
### Prometheus Metrics
```typescript
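import { register, Counter, Histogram } from 'prom-client'

// Counter
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status']
})

// Histogram (latency)
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'path'],
  buckets: [0.1, 0.5, 1, 2, 5]
})

// Record both metrics for every request
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer()
  res.on('finish', () => {
    // Prefer the route pattern over the raw path to keep label cardinality low
    const path = req.route?.path ?? req.path
    httpRequestsTotal.inc({ method: req.method, path, status: res.statusCode })
    end({ method: req.method, path })
  })
  next()
})

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})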
```
---
## 📊 Dashboard Example
### Minimal Monitoring Dashboard
```typescript
// /dashboard/metrics endpoint
app.get('/dashboard/metrics', async (req, res) => {
  // Guard against division by zero before the first request is counted
  const errorRate = totalRequests > 0 ? (errorRequests / totalRequests) * 100 : 0
  const metrics = {
    system: {
      uptime: process.uptime(),
      memory: process.memoryUsage(),
      cpu: process.cpuUsage()
    },
    application: {
      total_requests: totalRequests,
      error_rate: errorRate.toFixed(2) + '%',
      avg_response_time: calculateAvgResponseTime() + 'ms'
    },
    database: {
      active_connections: await getDbConnections(),
      slow_queries: await getSlowQueries()
    },
    alerts: {
      active: await getActiveAlerts(),
      recent: await getRecentAlerts(24) // Last 24h
    }
  }
  res.json(metrics)
})
```
---
## 🚀 Production Monitoring Checklist
Check at the start of a session:
- [ ] Are the health check endpoints working?
- [ ] Is the log aggregation system active?
- [ ] Is error tracking (Sentry) set up?
- [ ] Are alerting rules defined?
- [ ] Is the metrics endpoint exposed?
- [ ] Is the dashboard accessible?
- [ ] Is backup monitoring running?
- [ ] Is SSL certificate expiry being monitored?
---
## 🔔 Alert Response Playbook
### When a Critical Alert Arrives
```bash
# 1. Verify the status
curl https://myapp.com/health/ready

# 2. Check recent logs
tail -f -n 200 /var/log/app/error.log

# 3. Resource usage
top -bn1
free -h
df -h

# 4. Service status
systemctl status myapp
docker ps

# 5. See recent changes
git log --oneline -10

# 6. Roll back if necessary
git revert HEAD
./deploy.sh

# 7. Write an incident report
# docs/incidents/YYYY-MM-DD-incident.md
```
---
## 📝 Monitoring Best Practices
### DO ✅
- **Establish a baseline** - Know the normal values
- **Trend analysis** - How does it change over time?
- **Prevent alert fatigue** - Too many alerts are bad alerts
- **Define an SLA** - Target 99.9% uptime
- **Regular review** - Review the dashboard once a week
- **Documentation** - Write an alert playbook
### DON'T ❌
- **Reactive monitoring** - Don't look only when something fails
- **Metric overload** - Don't track 100 metrics when 10 critical ones will do
- **Silent failures** - Don't swallow errors
- **Production debugging** - Monitor, don't debug
- **Ignore warnings** - A warning today is a critical tomorrow
---
## 🎯 SLI/SLO/SLA
### Service Level Indicators (SLI)
```
Availability = (Successful Requests / Total Requests) × 100
Latency p95 = 95th percentile response time
Error Rate = (Failed Requests / Total Requests) × 100
```
### Service Level Objectives (SLO)
```
Target Availability: 99.9% (43.2 min downtime/month)
Target p95 Latency: < 200ms
Target Error Rate: < 0.1%
```
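The downtime figure next to the availability target follows directly from the SLO percentage. The sketch below shows the arithmetic, assuming a 30-day month; `allowedDowntimeMinutes` and `errorBudgetRemaining` are hypothetical helper names, not part of the skill.

```typescript
// Minutes of downtime an availability SLO permits over a period.
function allowedDowntimeMinutes(sloPercent: number, periodDays: number): number {
  const totalMinutes = periodDays * 24 * 60
  return (totalMinutes * (100 - sloPercent)) / 100
}

// Failed requests still permitted before the SLO is breached.
// A negative result means the error budget is already spent.
function errorBudgetRemaining(
  sloPercent: number,
  totalRequests: number,
  failedRequests: number
): number {
  const allowedFailures = (totalRequests * (100 - sloPercent)) / 100
  return allowedFailures - failedRequests
}

// 99.9% over 30 days: 43,200 minutes * 0.1% is about 43.2 minutes
const monthlyBudget = allowedDowntimeMinutes(99.9, 30)
```

For request-based SLOs, `errorBudgetRemaining(99.9, 1000000, 400)` leaves roughly 600 failures before the target is missed. Results carry ordinary floating-point rounding, so compare them with a tolerance rather than exact equality.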
### Service Level Agreements (SLA)
```
Guaranteed Availability: 99.5%
If < 99.5%: 10% service credit
If < 99.0%: 25% service credit
```
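The credit tiers above can be read as a lookup evaluated from the lowest tier upward, so that the largest applicable credit wins. A minimal sketch, with `serviceCreditPercent` as a hypothetical function name:

```typescript
// Service credit (in percent) owed for a measured availability percentage,
// following the example SLA tiers: <99.0% gives 25%, <99.5% gives 10%.
function serviceCreditPercent(measuredAvailability: number): number {
  if (measuredAvailability < 99.0) return 25
  if (measuredAvailability < 99.5) return 10
  return 0
}
```

Checking the 99.0% tier first matters: a month at 98% satisfies both conditions, and only the stricter tier should apply.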
---
## 🔗 Monitoring Stack Examples
### Stack 1: Open Source
```
Prometheus → Metric collection
Grafana → Visualization
AlertManager → Alerting
Loki → Log aggregation
Jaeger → Distributed tracing
```
### Stack 2: Cloud Native
```
CloudWatch (AWS) → Metrics + Logs
Datadog → APM + Monitoring
Sentry → Error tracking
PagerDuty → On-call alerting
```
### Stack 3: Minimal (MCP)
```
Sentry MCP → Error tracking
Custom /metrics endpoint → Prometheus scrape
GitHub Actions → Uptime monitoring
Slack → Alerting
```
---
## 📚 Resources
- [SRE Book - Google](https://sre.google/sre-book/table-of-contents/)
- [Prometheus Best Practices](https://prometheus.io/docs/practices/naming/)
- [The Four Golden Signals](https://sre.google/sre-book/monitoring-distributed-systems/)
---
**Last Updated:** 2026-01-26
**User:** devops-engineer agent
**Related skills:** error-recovery, debugging