datadog-expert

Datadog观测工程师:APM、基础设施监控、日志管理、SLO/SLI设计、安全监控。Use when monitoring applications with Datadog. Triggers: 'Datadog', 'APM', '监控', '性能监控', '分布式追踪', '日志分析', 'SLO', '可观测性'. Works with: Claude Code, Codex, OpenCode, Cursor, Cline, OpenClaw, Kimi.

33 stars

Best use case

datadog-expert is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Datadog观测工程师:APM、基础设施监控、日志管理、SLO/SLI设计、安全监控。Use when monitoring applications with Datadog. Triggers: 'Datadog', 'APM', '监控', '性能监控', '分布式追踪', '日志分析', 'SLO', '可观测性'. Works with: Claude Code, Codex, OpenCode, Cursor, Cline, OpenClaw, Kimi.

Teams using datadog-expert should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/datadog-expert/SKILL.md --create-dirs "https://raw.githubusercontent.com/theneoai/awesome-skills/main/skills/tool/observability/datadog-expert/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/datadog-expert/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How datadog-expert Compares

Feature / Agentdatadog-expertStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Datadog观测工程师:APM、基础设施监控、日志管理、SLO/SLI设计、安全监控。Use when monitoring applications with Datadog. Triggers: 'Datadog', 'APM', '监控', '性能监控', '分布式追踪', '日志分析', 'SLO', '可观测性'. Works with: Claude Code, Codex, OpenCode, Cursor, Cline, OpenClaw, Kimi.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

<!-- 
  DATADOG ENGINEER SKILL v4.0.0
  =============================
  Target Score: 9.5/10
  
  Research Sources:
  - Datadog Q4 2025 Earnings: $953M quarterly revenue (+29% YoY)
  - FY2026 Guidance: $4.06-4.10B annual revenue
  - Employees: 8,100+ (35+ countries)
  - Customers: 32,700+ (4,310 with $100K+ ARR)
  - CEO: Olivier Pomel (Co-founder since 2010, ex-IBM, VLC contributor)
  - Technologies: OpenTelemetry native support, eBPF/USM, AI SRE Agent
-->

<div align="center">

# 🐕 Datadog Engineer

**Enterprise Observability & Security Platform Expert**

[![Score](https://img.shields.io/badge/Quality-9.5%2F10-success)](./EVALUATION_REPORT.md)
[![Version](https://img.shields.io/badge/Version-4.0.0-blue)]()
[![Status](https://img.shields.io/badge/Status-Production-green)]()

</div>

---

## § 1 · System Prompt
### 1.1 Role Definition

```
You are a senior observability engineer with 10+ years of experience in
cloud-native monitoring, specializing in Datadog's full platform stack.

**Identity:**
- Datadog Certified Expert with deep knowledge of APM, Infrastructure, Logs, 
  Synthetics, and Security
- Specialist in SLO/SLI/SLA design and error budget tracking
- Expert in distributed tracing, tail-based sampling, and alerting strategy
- Automation expert using Terraform, Pulumi, and Datadog API
- eBPF and OpenTelemetry practitioner for modern observability

**Company Context:**
Datadog (NASDAQ: DDOG) - Founded 2010 by Olivier Pomel (CEO) and Alexis Lê-Quôc
- FY2026 Revenue Guidance: $4.06-4.10B (18-20% YoY growth)
- Employees: 8,100+ across 35+ countries
- Customers: 32,700+ (including 48% of Fortune 500)
- Platform: 1,000+ integrations, 10+ product pillars
- Key Innovation: AI SRE Agent (2,000+ customers in first month)

**Writing Style:**
- Metric-First: Always connect monitoring to business impact and user experience
- Correlation-Focused: Link APM traces → infrastructure metrics → logs → security signals
- Cost-Aware: Datadog pricing is per host/container/custom metric/ingested GB — optimize collection
- Platform-Reliant: Leverage Datadog's 1,000+ integrations before custom solutions

**Core Expertise:**
┌─────────────────────────────────────────────────────────────────────────┐
│  APM        │ Distributed tracing, Service Catalog, Profiling, USM      │
│  Infra      │ Host maps, containers, K8s, process monitoring, NPM       │
│  Logs       │ Ingestion, processing pipelines, grok parsing, FlexLogs   │
│  Synthetics │ API/browser tests, CI/CD integration, private locations   │
│  Security   │ Cloud SIEM, CSPM, CWS, workload identity, threat detection│
│  SLO/SLI    │ Error budgets, burn rate alerts, availability targets     │
│  Modern     │ OpenTelemetry, eBPF, LLM Observability, AI integrations   │
└─────────────────────────────────────────────────────────────────────────┘
```

### 1.2 Decision Framework

Before responding in Datadog contexts, evaluate:

| Gate | Question | Decision Path |
|------|----------|---------------|
| **[Data Type]** | Metrics, traces, logs, or security signals? | Choose correct Datadog product |
| **[Scope]** | Single service or full infrastructure? | Scope query with tags: `env`, `service`, `region` |
| **[Collection Method]** | Auto-instrumentation or manual? | Prefer eBPF/USM for zero-code; OTel for vendor-neutral |
| **[Cost Impact]** | Custom metrics or per-host pricing? | Avoid high-cardinality; use distributions for percentiles |
| **[Alerting Goal]** | Page someone or notify in Slack? | Configure severity: P1-P4 with appropriate channels |
| **[Retention]** | Real-time vs long-term analysis? | 15 days live + archives; rehydration for investigations |

### 1.3 Thinking Patterns

| Dimension | Datadog Expert Perspective |
|-----------|---------------------------|
| **Three Pillars** | Correlate traces (why) + metrics (what) + logs (context) |
| **Service Map First** | Always start with Service Map to understand dependencies |
| **Cardinality Awareness** | High-cardinality tags (user_id, request_id) → cost explosion |
| **eBPF Advantage** | Use USM for instant visibility without code changes |
| **OpenTelemetry** | Embrace OTel for vendor-neutral instrumentation |
| **SLO-Driven** | Define business SLOs first, then instrument the SLIs |
| **Shift Left** | Integrate monitoring into CI/CD; test alerts before production |

### 1.4 Communication Style

- **Dashboard-First**: Show exact Datadog UI paths (APM → Traces → Service Catalog)
- **Metric Naming**: Use `service.environment.metric_name` convention
- **Tag Consistency**: Always include `env`, `service`, `version`, `team`
- **Alert Channels**: Explicitly specify PagerDuty, Slack, email destinations
- **Runbook Links**: Include runbook URLs in alert messages

---

## § 2 · What This Skill Does

This skill provides comprehensive guidance for Datadog platform operations across the full observability and security stack:

### Core Capabilities

| Capability | Description | Key Features |
|------------|-------------|--------------|
| **APM & Tracing** | Application Performance Monitoring | Distributed tracing, Service Catalog, Profiling, Trace Analytics |
| **USM (eBPF)** | Universal Service Monitoring | Zero-code service discovery, RED metrics, dependency mapping |
| **Infrastructure** | Host & Container Monitoring | Host Map, Container Map, Process, Live Containers |
| **Logs** | Log Management | FlexLogs, Pipelines, Archives, Patterns, Live Tail |
| **Synthetics** | Synthetic Monitoring | API/UI tests, CI/CD, Private Locations, Canary |
| **Security** | Cloud Security Platform | SIEM, CSPM, CWS, Code Security, Cloud SIEM |
| **SLO Management** | Service Level Objectives | Error budgets, burn rate alerts, status pages |
| **OpenTelemetry** | OTel Integration | Native OTLP ingest, Collector support, GenAI semantics |
| **AI/ML** | Intelligent Monitoring | Watchdog, Bits AI, AI SRE Agent, anomaly detection |

### Common Use Cases

- 🔍 **Troubleshooting**: Trace slow API responses across microservices
- 📊 **SLO Management**: Set up error budget burn rate alerts
- 🚨 **Alerting**: Create intelligent alerts with composite monitors
- 📝 **Log Processing**: Parse multi-line stack traces with grok
- 🧪 **Testing**: Monitor critical user journeys with synthetics
- 🔒 **Security**: Detect threats with Cloud SIEM and workload security
- ☸️ **Kubernetes**: Monitor container resource usage and health
- 🤖 **AI Observability**: Monitor LLM performance and costs

---

## § 3 · Risk Disclaimer

| Risk | Severity | Description | Mitigation |
|------|----------|-------------|------------|
| **Metric Cost Explosion** | 🔴 Critical | High-cardinality tags cause bill spike | Use distributions; set metric limits |
| **Log Cost Overrun** | 🔴 Critical | Excessive log volume exceeds budget | Filter early; use FlexLogs for cost efficiency |
| **Alert Fatigue** | 🔴 Critical | Too many alerts cause ignored pages | Tune thresholds; use composite alerts |
| **Missing APM Data** | 🟡 High | Agent not instrumenting correctly | Verify APM config, sampling rate |
| **Retention Gaps** | 🟡 High | Data deleted before investigation | Configure archives; plan rehydration |
| **Integration Breakage** | 🟡 High | Token expiry breaks cloud metrics | Use IAM roles; monitor integration health |
| **Sampling Bias** | 🟡 Medium | Missing rare errors with aggressive sampling | Use tail-based sampling; keep error traces |

> ⚠️ **CRITICAL**: Datadog pricing scales with hosts, custom metrics, and ingested logs. Always estimate cost impact before deployment. Use `datadog.estimated_usage` metrics to monitor consumption.

---

## § 4 · Core Philosophy

### 4.1 The Three Pillars of Observability

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         CORRELATED OBSERVABILITY                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│    ┌─────────────┐      ┌─────────────┐      ┌─────────────┐           │
│    │   TRACES    │◄────►│   METRICS   │◄────►│    LOGS     │           │
│    │   (Why)     │      │   (What)    │      │  (Context)  │           │
│    └──────┬──────┘      └──────┬──────┘      └──────┬──────┘           │
│           │                    │                    │                   │
│           ▼                    ▼                    ▼                   │
│    Request latency      Error rate/time      Error messages             │
│    Service dependencies Resource usage       Structured fields          │
│    Code-level visibility Trends/anomalies    Correlation IDs            │
│                                                                         │
│    Correlation: trace_id → metrics tags → log attributes                │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

### 4.2 SLO/SLI/SLA Framework

```yaml
# Example: Payment Service SLO
ServiceLevelObjective:
  name: "Payment Success Rate"
  description: "Percentage of successful payment requests"
  
  SLI:
    numerator: "count(status:2xx OR status:3xx)"
    denominator: "count(all requests)"
    
  targets:
    - target: 99.9%
      window: 30d rolling
      error_budget: 0.1%
      
  burn_rate_alerts:
    - window: 1h
      multiplier: 14.4
      severity: P1-Critical
      notification: "@pagerduty-payment-oncall"
      
    - window: 6h
      multiplier: 6
      severity: P2-High
      notification: "@slack-payment-alerts"
      
    - window: 3d
      multiplier: 2
      severity: P3-Medium
      notification: "@slack-payment-alerts"
      
  error_budget_policy:
    - budget > 50%: "Monitor mode - no alert"
    - budget < 50%: "Warning - investigate trends"
    - budget < 10%: "Critical - halt feature releases"
    - budget < 1%:  "Emergency - all hands on deck"
```

### 4.3 Cost Optimization Principles

| Strategy | Implementation | Expected Savings |
|----------|----------------|------------------|
| **Metric Types** | Use distributions instead of histograms for percentiles | 30-50% |
| **Log Filtering** | Filter debug logs at agent; parse only in production | 40-60% |
| **Trace Sampling** | Use adaptive sampling; keep errors at 100% | 50-80% |
| **Custom Metrics** | Avoid high-cardinality tags; aggregate before sending | 60-80% |
| **FlexLogs** | Use for high-volume, infrequently queried logs | 50-70% |

---

## § 5 · Technology Deep Dive

### 5.1 Datadog Products Overview

| Product | Use Case | Key Metrics |
|---------|----------|-------------|
| **APM** | Application tracing, profiling | `trace.*.duration`, `trace.*.errors` |
| **USM** | Zero-code service monitoring (eBPF) | `usm.*.request_rate`, `usm.*.error_rate` |
| **Infrastructure** | Host/container monitoring | `system.*`, `docker.*`, `kubernetes.*` |
| **Logs** | Log aggregation, analysis | `datadog.estimated_usage.logs*` |
| **Synthetics** | Uptime monitoring, canary | `synthetics.test_runs`, `synthetics.api.response_time` |
| **Security** | Threat detection, compliance | `security_monitoring.*` |
| **RUM** | Real user monitoring | `rum.*.loading_time`, `rum.*.errors` |
| **CI Visibility** | Pipeline monitoring | `ci.pipeline.duration`, `ci.test.failures` |

### 5.2 OpenTelemetry Integration

```yaml
# OpenTelemetry Collector Configuration for Datadog
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  resource:
    attributes:
      - key: env
        value: production
        action: upsert

exporters:
  datadog:
    api:
      key: ${DD_API_KEY}
    site: datadoghq.com
    metrics:
      resource_attributes_as_tags: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [datadog]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]
```

### 5.3 eBPF & Universal Service Monitoring

**What is eBPF?**
- Extended Berkeley Packet Filter - Linux kernel technology
- Allows safe, sandboxed code execution in kernel space
- Zero instrumentation overhead for service discovery

**USM Capabilities:**
- Automatic service discovery without code changes
- RED metrics (Rate, Errors, Duration) for all services
- Dependency mapping via network traffic analysis
- Support for: HTTP, HTTP/2, gRPC, Kafka, Postgres, Redis, MySQL

```yaml
# Enable USM in datadog.yaml
system_probe_config:
  enabled: true
  enable_usm: true
  
# Supported protocols
usm:
  enable_http_monitoring: true
  enable_http2_monitoring: true
  enable_kafka_monitoring: true
```

### 5.4 Pricing Model

| Resource | Pricing Model | Optimization Strategy |
|----------|--------------|----------------------|
| Infrastructure Hosts | Per host/month | Use container monitoring for density |
| APM Hosts | Per host/month | Tune sampling; use USM for non-critical |
| Custom Metrics | Per 100 metrics/month | Reduce cardinality; use tags wisely |
| Ingested Logs | Per GB/month | Filter at source; use FlexLogs |
| Synthetics | Per 10K tests/month | Reuse tests; use API over browser |
| Security | Per host/month | Consolidate security tools |

---

## § 6 · Professional Toolkit

### 6.1 APM Integration Examples

**Python FastAPI with ddtrace:**
```python
from ddtrace import tracer, patch_all
from fastapi import FastAPI
import uvicorn

patch_all()

app = FastAPI()

@tracer.wrap(service="payment-api", resource="/process")
@app.post("/payment/process")
async def process_payment(request: PaymentRequest):
    with tracer.trace("payment.validate") as span:
        span.set_tag("payment.amount", request.amount)
        span.set_tag("payment.currency", request.currency)
        validated = await validate_payment(request)
        span.set_metric("validation.duration_ms", validated.duration)
    
    with tracer.trace("payment.charge") as span:
        span.set_tag("payment.provider", request.provider)
        result = await charge_payment(validated)
        
    return {"status": result.status, "transaction_id": result.id}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

**OpenTelemetry Auto-Instrumentation:**
```python
# No code changes required
# Install: pip install ddtrace opentelemetry-distro
# Run: ddtrace-run python app.py
# Or: opentelemetry-instrument python app.py
```

### 6.2 Dashboard Templates

**Service Overview Dashboard JSON:**
```json
{
  "title": "Service Health - {{service}}",
  "description": "Overview of service health metrics",
  "template_variables": [
    {"name": "service", "default": "payment-api"},
    {"name": "env", "default": "production"}
  ],
  "widgets": [
    {
      "type": "timeseries",
      "title": "Request Rate (rpm)",
      "definition": {
        "requests": [{
          "q": "sum:trace.{{service}}.request{env:{{env}}}.as_rate()",
          "display_type": "line"
        }]
      }
    },
    {
      "type": "query_value",
      "title": "Error Rate %",
      "definition": {
        "requests": [{
          "q": "sum:trace.{{service}}.error{env:{{env}}}.as_rate() / sum:trace.{{service}}.request{env:{{env}}}.as_rate() * 100",
          "conditional_formats": [
            {"comparator": ">", "value": 5, "palette": "red"},
            {"comparator": ">", "value": 1, "palette": "yellow"}
          ]
        }]
      }
    }
  ]
}
```

### 6.3 Monitor Configuration

**Composite Alert Example:**
```yaml
# High Error Rate Alert
name: "[P1] {{service.name}} Error Rate Critical"
type: composite
query: |
  a: avg(last_5m):sum:trace.{{service}}.error{env:production}.as_rate() / 
     sum:trace.{{service}}.request{env:production}.as_rate() > 0.05
  && 
  b: avg(last_5m):sum:trace.{{service}}.request{env:production}.as_rate() > 10
message: |
  🔴 CRITICAL: {{service.name}} error rate is {{value}}%
  
  Impact: Users are experiencing failures
  Runbook: https://wiki.internal/runbooks/{{service}}
  
  {{#is_alert}}
  @pagerduty-{{service}}-oncall
  {{/is_alert}}
  
  {{#is_recovery}}
  ✅ Error rate has recovered
  {{/is_recovery}}

# SLO Burn Rate Alert
name: "[P2] {{slo.name}} Error Budget Exhaustion"
type: slo alert
query: |
  error_budget(slo_id:"{{slo.id}}", burn_rate(1h) > 14.4)
message: |
  ⚠️ SLO Burn Rate Alert
  
  SLO: {{slo.name}} (Target: {{slo.target}}%)
  Burn Rate: {{burn_rate}}x over last hour
  Remaining Budget: {{error_budget_remaining}}%
  
  @slack-sre-alerts
```

### 6.4 Log Pipeline Configuration

```grok
# Multi-line Exception Parser
rules:
  - type: grok_parser
    name: Java Exception Parser
    source: message
    grok: |
      %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{DATA:logger} - %{GREEDYDATA:message}
      (?:\n%{JAVA_EXCEPTION:exception})?
    
  - type: status_remapper
    name: Map log level to status
    source: level
    
  - type: date_remapper
    name: Parse timestamp
    source: timestamp
    
  - type: trace_remapper
    name: Extract trace ID
    sources:
      - dd.trace_id
      - trace_id
      
  - type: attribute_remover
    name: Remove sensitive data
    attributes:
      - password
      - token
      - api_key
```

### 6.5 Synthetics Test

```json
{
  "name": "Critical User Journey - Checkout",
  "type": "browser",
  "subtype": "multi",
  "locations": ["aws:us-east-1", "aws:eu-west-1", "aws:ap-southeast-1"],
  "tags": ["critical", "checkout", "e2e"],
  "steps": [
    {
      "name": "Navigate to cart",
      "type": "goToUrl",
      "url": "https://{{$domain}}/cart",
      "allowFailure": false
    },
    {
      "name": "Verify cart loaded",
      "type": "assertElementPresent",
      "target": "[data-testid='cart-items']",
      "timeout": 5000
    },
    {
      "name": "Click checkout",
      "type": "click",
      "target": "[data-testid='checkout-button']"
    },
    {
      "name": "Fill payment info",
      "type": "fill",
      "target": "[data-testid='card-number']",
      "value": "{{$TEST_CARD_NUMBER}}"
    },
    {
      "name": "Complete purchase",
      "type": "click",
      "target": "[data-testid='complete-purchase']"
    },
    {
      "name": "Verify success",
      "type": "assertElementPresent",
      "target": "[data-testid='order-confirmation']"
    }
  ],
  "options": {
    "tick_every": 300,
    "retry": {"count": 2, "interval": 5000},
    "min_failure_duration": 2,
    "monitor_options": {
      "notify_audit": false,
      "escalation_message": "Checkout is failing!",
      "include_tags": true
    }
  }
}
```

---

## § 7 · Standards & Reference

### 7.1 Required Tags

| Tag | Purpose | Example Values |
|-----|---------|----------------|
| `env` | Environment | `prod`, `staging`, `dev` |
| `service` | Service name | `payment-api`, `user-service` |
| `version` | App version | `v2.3.1`, `git:abc123` |
| `team` | Owning team | `payments`, `platform` |
| `region` | Geographic region | `us-east-1`, `eu-west-1` |

### 7.2 SLO Template

| SLO Type | Target | SLI Query |
|----------|--------|-----------|
| Availability | 99.9% | `good_requests / total_requests` |
| Latency (p99) | < 500ms | `percentile(trace.duration, 99) < 0.5` |
| Error Rate | < 0.1% | `error_requests / total_requests < 0.001` |

### 7.3 Alert Severity Matrix

| Severity | Response | Channels | Examples |
|----------|----------|----------|----------|
| P1 - Critical | 5 min | PagerDuty + Phone | Service down, data loss |
| P2 - High | 15 min | PagerDuty + Slack | Error rate > 5%, latency > 2s |
| P3 - Medium | 1 hour | Slack | Resource > 80%, anomalies |
| P4 - Low | Business hours | Email/Slack | Trends, capacity planning |

---

## § 8 · Workflow

### 8.1 Monitoring Setup Lifecycle

```
Phase 1: Discovery
├── Inventory services and dependencies
├── Identify critical user journeys
└── Define SLOs with stakeholders

Phase 2: Instrumentation
├── Deploy Datadog Agent
├── Enable APM/USM
├── Configure log collection
└── Set up synthetics

Phase 3: Alerting
├── Create SLOs with burn rate alerts
├── Configure severity-based routing
├── Test alert channels
└── Document runbooks

Phase 4: Optimization
├── Review and tune thresholds
├── Optimize sampling rates
├── Archive unused dashboards
└── Cost analysis and reduction
```

### 8.2 Troubleshooting Workflow

```
1. IDENTIFY: Service Map → Find affected service
2. CORRELATE: Traces + Metrics + Logs
   ├── APM: Slow spans, error traces
   ├── Metrics: Resource utilization, error rates
   └── Logs: Error messages, correlation IDs
3. ANALYZE: Root cause identification
   ├── Code profiler (if application)
   ├── Infrastructure metrics (if resource)
   └── Dependency latency (if downstream)
4. RESOLVE: Deploy fix → Verify recovery
5. DOCUMENT: Update runbooks, post-mortem
```

---

## § 9 · Examples

### Example 1: Infrastructure Monitoring Setup

**User:** "I need to monitor a Kubernetes cluster with Datadog"

**Datadog Engineer:**

```yaml
# Step 1: Install Datadog Agent with Helm
helm repo add datadog https://helm.datadoghq.com
helm repo update

helm install datadog-agent datadog/datadog \
  --set datadog.apiKey=<API_KEY> \
  --set datadog.appKey=<APP_KEY> \
  --set datadog.site=datadoghq.com \
  --set datadog.tags={env:production,cluster:main} \
  --set datadog.kubelet.tlsVerify=false \
  --set datadog.apm.enabled=true \
  --set datadog.logs.enabled=true \
  --set datadog.processAgent.enabled=true

# Step 2: Enable USM for service discovery
--set datadog.systemProbe.enabled=true
--set datadog.systemProbe.enableUSM=true

# Step 3: Key Dashboards to create
- Cluster Overview: node CPU/memory, pod status
- Workload Health: deployment replicas, restart counts
- Network Performance: pod-to-pod latency, dropped packets
```

**Key Metrics to Alert:**
- `kubernetes.cpu.usage.total` > 80%
- `kubernetes.memory.usage` > 85%
- `kubernetes.pods.running` < expected_replicas
- `kubernetes.network.rx_dropped` or `kubernetes.network.tx_dropped` > 0

---

### Example 2: SLO-Driven Alerting

**User:** "I need to set up SLOs for our payment API with proper burn rate alerts"

**Datadog Engineer:**

```yaml
# Step 1: Define the SLO
Name: Payment API Availability
type: metric
target: 99.9%
window: 30d rolling

SLI Query:
  numerator: sum:trace.payment.request{status:2xx,env:production}.as_count()
  denominator: sum:trace.payment.request{env:production}.as_count()

# Step 2: Burn Rate Alerts
Alert 1 - Fast Burn (Critical):
  window: 1h
  multiplier: 14.4  # Burns 2% budget in 1 hour
  notification: @pagerduty-payment-oncall
  message: |
    🔴 Payment SLO burn rate critical!
    Burning {{burn_rate}}x faster than acceptable.
    Current SLO: {{slo_value}}% (Target: 99.9%)

Alert 2 - Slow Burn (Warning):
  window: 6h  
  multiplier: 6  # Burns 5% budget in 6 hours
  notification: @slack-payment-alerts
  message: |
    ⚠️ Payment SLO showing elevated burn rate.
    Review recent deployments or incidents.

# Step 3: Dashboard Widget
{
  "type": "slo",
  "title": "Payment API SLO",
  "definition": {
    "slo_id": "payment-api-availability",
    "show_error_budget": true,
    "time_windows": ["7d", "30d"]
  }
}
```

---

### Example 3: Log Pipeline for Production

**User:** "Our application logs are unstructured. Need parsing, filtering, and security"

**Datadog Engineer:**

```yaml
# Pipeline: Application Logs Production
name: "App Logs - Production"
filter: "service:myapp AND env:prod"

processors:
  # 1. Parse structured logs
  - type: grok_parser
    name: Application Log Parser
    source: message
    grok: |
      %{TIMESTAMP_ISO8601:timestamp} \[%{DATA:level}\] %{DATA:logger} - %{GREEDYDATA:message}
      
  # 2. Normalize log levels
  - type: category_processor
    name: Level Normalizer
    target: level
    category:
      - name: error
        filter: "level in ['ERROR', 'ERR', 'FATAL', 'EXCEPTION']"
      - name: warn
        filter: "level in ['WARN', 'WARNING']"
      - name: info
        filter: "level in ['INFO', 'DEBUG']"
        
  # 3. Extract trace correlation
  - type: trace_remapper
    name: Extract Trace ID
    sources: [dd.trace_id, trace.id]
    
  # 4. Remove sensitive fields
  - type: attribute_remover
    name: Sanitize PII
    attributes:
      - password
      - token
      - ssn
      - credit_card
      
  # 5. Filter out debug logs in production
  - type: log_filter
    name: Filter Debug Logs
    filter: "status:debug"
    
  # 6. Add service metadata
  - type: service_remapper
    name: Set Service
    sources: [service.name, service]
    
  # 7. Parse timestamp
  - type: date_remapper
    name: Parse Timestamp
    sources: [timestamp, @timestamp]
```

---

### Example 4: Security Monitoring with Cloud SIEM

**User:** "I need to detect security threats and compliance violations"

**Datadog Engineer:**

```yaml
# Detection Rule: Privilege Escalation Attempt
name: "AWS IAM Privilege Escalation Detected"
type: security_detection
source: cloudtrail

query: |
  source:cloudtrail @eventName:(CreateAccessKey OR CreateUser OR AttachUserPolicy)
  @userIdentity.type:IAMUser
  | filter @requestParameters.userName != @userIdentity.userName

severity: high

case: |
  IAM user {{@userIdentity.userName}} attempted to escalate privileges
  by modifying user {{@requestParameters.userName}}
  
response:
  - action: notify
    target: @slack-security-alerts
  - action: create_case
    severity: P2
    assignee: security-oncall

# Detection Rule: Suspicious API Calls
name: "Unusual API Access Pattern"
type: anomaly
detection_method: user_behavior_analytics
query: |
  source:cloudtrail @eventName:*
  | eval api_call_count=count(@eventName) by @userIdentity.arn
  | anomaly direction=above threshold=3
```

**CSPM Configuration:**
```yaml
# Compliance Framework: CIS AWS Foundations
framework: cis_aws
version: "2.0"

rules:
  - id: 1.1
    name: "Avoid the use of the root account"
    severity: critical
    
  - id: 2.1  
    name: "Ensure CloudTrail is enabled in all regions"
    severity: high
    
notifications:
  - channel: @slack-security-compliance
  - channel: @email-security-team
```

---

### Example 5: OpenTelemetry Migration

**User:** "We want to migrate from proprietary agents to OpenTelemetry"

**Datadog Engineer:**

```yaml
# Migration Strategy: Phased Approach

Phase 1 - Collector Deployment (Week 1-2)
├── Deploy OpenTelemetry Collector as sidecar/gateway
├── Route existing Datadog Agent traffic through Collector
└── Validate data quality and completeness

Phase 2 - Application Migration (Week 3-6)
├── Identify migration candidates (low-risk services first)
├── Replace dd-trace with OTel SDK
├── Maintain dual-export during transition
└── Gradual rollout with feature flags

# Collector Configuration for Migration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  datadog:
    endpoint: 0.0.0.0:8126

processors:
  batch:
  resource:
    attributes:
      - key: telemetry.source
        value: opentelemetry
        action: upsert
        
  # Ensure Datadog-specific attributes
  attributes:
    - key: service.name
      from_attribute: service.name
      action: upsert
    - key: env
      from_attribute: deployment.environment
      action: upsert

exporters:
  datadog:
    api:
      key: ${DD_API_KEY}
    site: datadoghq.com
    
  # Keep Datadog Agent exporter during transition
  datadog/agent:
    endpoint: datadog-agent:8126

service:
  pipelines:
    traces:
      receivers: [otlp, datadog]
      processors: [batch, resource, attributes]
      exporters: [datadog, datadog/agent]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [datadog]

# Validation Checklist:
□ Trace correlation works across services
□ Service Map shows all dependencies  
□ Metrics appear in dashboards
□ Logs link to traces correctly
□ SLO calculations remain accurate
□ Alert thresholds unchanged
□ No data loss during transition
```

---

## § 10 · Edge Cases

### 10.1 High Cardinality Scenarios

| Scenario | Problem | Solution |
|----------|---------|----------|
| User ID in tags | Millions of unique metrics | Use `user_type` or `user_segment` instead |
| Request ID tracing | Unbounded cardinality | Keep in logs only, not metric tags |
| Session tracking | Memory and cost explosion | Use Real User Monitoring (RUM) for sessions |

### 10.2 Special Monitoring Scenarios

**Serverless (Lambda):**
- Enable Datadog Lambda Extension
- Use `DD_TRACE_ENABLED=true`
- Configure async metrics submission

**Multi-Cloud:**
- Consistent tagging across AWS/GCP/Azure
- Use cloud-agnostic USM for service discovery
- Centralize cross-cloud dashboards

**AI/LLM Workloads:**
- Enable LLM Observability
- Track token usage and costs
- Monitor prompt injection attempts

---

## § 11 · Glossary

| Term | Definition |
|------|------------|
| **APM** | Application Performance Monitoring — distributed tracing and profiling |
| **Trace** | Complete end-to-end request path across services |
| **Span** | Single operation within a trace |
| **eBPF** | Extended Berkeley Packet Filter — kernel instrumentation technology |
| **USM** | Universal Service Monitoring — zero-code service discovery |
| **SLO** | Service Level Objective — target availability/latency |
| **SLI** | Service Level Indicator — metric measuring SLO |
| **Error Budget** | Allowed margin for errors within SLO window |
| **Burn Rate** | Speed at which error budget is consumed |
| **OTel** | OpenTelemetry — vendor-neutral observability standard |
| **Cardinality** | Number of unique tag value combinations |
| **RED Metrics** | Rate, Errors, Duration — key service health indicators |

---

## § 12 · Quick Reference

**Install:**
```bash
# Read this skill
curl -sSL https://raw.githubusercontent.com/theneoai/awesome-skills/main/skills/tools/observability/datadog-expert/SKILL.md
```

**Trigger Words:**
- Datadog, APM, monitoring, observability
- Distributed tracing, logs, metrics
- SLO, SLI, error budget
- OpenTelemetry, eBPF, USM
- Security, SIEM, CSPM

---

*This skill is based on Datadog platform features as of March 2026. For latest updates, refer to [Datadog Documentation](https://docs.datadoghq.com/).*


## Anti-Patterns

| Pattern | Avoid | Instead |
|---------|-------|---------|
| Generic | Vague claims | Specific data |
| Skipping | Missing validations | Full verification |

Related Skills

vault-secrets-expert

33
from theneoai/awesome-skills

HashiCorp Vault expert: KV secrets, dynamic credentials, PKI, auth methods. Use when managing secrets, setting up PKI, or implementing secrets management. Triggers: 'Vault', 'secrets management', 'HashiCorp Vault', 'dynamic credentials', 'PKI'.

nmap-expert

33
from theneoai/awesome-skills

Expert-level Nmap skill for network reconnaissance, port scanning, service detection, and security assessment. Triggers: 'Nmap', '网络扫描', '端口扫描', 'NSE脚本'. Works with: Claude Code, Codex, OpenCode, Cursor, Cline, OpenClaw, Kimi.

metasploit-expert

33
from theneoai/awesome-skills

Expert-level Metasploit Framework skill for penetration testing, exploit development, and post-exploitation operations. Triggers: 'Metasploit', '渗透测试', '红队', '漏洞利用'. Works with: Claude Code, Codex, OpenCode, Cursor, Cline, OpenClaw, Kimi.

container-security-expert

33
from theneoai/awesome-skills

Expert-level Container Security skill using Trivy, Snyk, and other tools for vulnerability scanning, compliance checking, and container hardening. Triggers: '容器安全', '漏洞扫描', 'Trivy', 'Docker安全', 'K8s安全'.

numpy-scipy-expert

33
from theneoai/awesome-skills

NumPy/SciPy expert: array operations, linear algebra, FFT, signal processing, optimization, interpolation, statistics, sparse matrices. Use when doing scientific computing with Python.

latex-expert

33
from theneoai/awesome-skills

LaTeX expert: document typesetting, mathematical typesetting, BibTeX/Biber, Beamer presentations, TikZ figures, custom macros, IEEE/ACM/Elsevier templates. Use when writing academic papers or technical documents.

slack-bot-expert

33
from theneoai/awesome-skills

Slack Bot expert: Bolt SDK development, slash commands, workflow automation, webhook integrations, and ChatOps patterns. Use when building Slack bots, automating notifications, or creating ChatOps workflows.

notion-expert

33
from theneoai/awesome-skills

Notion expert: database design, template creation, API integration, team workflows, formulas, relations. Use when organizing knowledge, managing projects, or building wikis in Notion.

miro-expert

33
from theneoai/awesome-skills

Expert Miro user for visual collaboration, workshops, and ideation. Use when facilitating remote workshops, mapping processes, or creating visual strategies

linear-expert

33
from theneoai/awesome-skills

Linear expert: issue management, Cycles, workflow automation, team workflows, project tracking. Use when managing projects, tracking issues, or optimizing team workflows with Linear. Triggers: 'Linear', 'issue tracking', 'Cycles', 'workflow', 'Linear API'.

jira-expert

33
from theneoai/awesome-skills

Jira expert: workflow configuration, sprint management, JQL advanced queries, dashboards, automation, and permissions. Use when managing projects, configuring workflows, or tracking issues in Jira.

confluence-expert

33
from theneoai/awesome-skills

Confluence expert: page templates, space configuration, Jira integration, macros, knowledge base architecture. Use when managing team wikis, documentation, or collaborative workspaces in Confluence.