datadog

Use when the user says "check Datadog", "查 Datadog", "查日志", "check logs", "crash logs", "查 crash", "gateway crash", "查告警", "check alerts", "check metrics", or needs to investigate production issues via Datadog Logs API.

2,280 stars

Best use case

datadog is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using datadog should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/datadog/SKILL.md --create-dirs "https://raw.githubusercontent.com/nexu-io/nexu/main/skills/localdev/datadog/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/datadog/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How datadog Compares

| Feature / Agent | datadog | Standard Approach |
|-----------------|---------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

SKILL.md Source

# Datadog Log Investigation

Query Datadog Logs API to investigate production issues for the Nexu platform.

## Authentication

**Before making any Datadog API call, you MUST ask the user for these two keys:**

- `DD_API_KEY` — Datadog API Key (Organization Settings → API Keys)
- `DD_APP_KEY` — Datadog Application Key (Organization Settings → Application Keys, requires `logs_read_data` scope)

Store them in shell variables for the session. Never hardcode or commit them.

Site: `datadoghq.com` (US)

## API Base

All requests are `POST` requests to `https://api.datadoghq.com/api/v2/logs/events/search`.

Headers:
```
DD-API-KEY: <api_key>
DD-APPLICATION-KEY: <app_key>
Content-Type: application/json
```
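For callers who prefer Python over curl, the same request setup might look like the sketch below. It assumes the two keys were exported as the environment variables `DD_API_KEY` and `DD_APP_KEY`; `build_request` and `SEARCH_URL` are hypothetical names chosen for illustration, and only the standard library is used.

```python
import json
import os
import urllib.request

SEARCH_URL = "https://api.datadoghq.com/api/v2/logs/events/search"


def build_request(body, env=None):
    """Build (but do not send) a POST request to the Logs search endpoint."""
    env = os.environ if env is None else env
    missing = [k for k in ("DD_API_KEY", "DD_APP_KEY") if not env.get(k)]
    if missing:
        # Fail early instead of sending an unauthenticated request.
        raise RuntimeError("missing credentials: " + ", ".join(missing))
    headers = {
        "DD-API-KEY": env["DD_API_KEY"],
        "DD-APPLICATION-KEY": env["DD_APP_KEY"],
        "Content-Type": "application/json",
    }
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        SEARCH_URL, data=data, headers=headers, method="POST"
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request; keeping construction separate makes it easy to inspect the headers and body without touching the network.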

## Common Queries

### OpenClaw Crash Events

```bash
curl -s "https://api.datadoghq.com/api/v2/logs/events/search" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "filter": {
      "query": "service:nexu-gateway @event:openclaw_crash",
      "from": "now-1h",
      "to": "now"
    },
    "sort": "-timestamp",
    "page": {"limit": 20}
  }'
```

Key fields in results:
- `attributes.attributes.exitCode` — process exit code (1 = fatal error, null = signal)
- `attributes.attributes.signal` — kill signal (SIGKILL, SIGTERM, etc.)
- `attributes.tags` → `pod_name`, `image_tag` — which pod and which version
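The key fields listed above can be pulled out of a single event roughly as follows. This is a minimal sketch: `summarize_crash` is a hypothetical helper, and `sample` is a hand-written event shaped like the fields described here, not real Datadog output.

```python
def summarize_crash(event):
    """Extract exit code, signal, pod, and image tag from one log event."""
    outer = event.get("attributes", {})
    inner = outer.get("attributes", {})
    # Tags arrive as "key:value" strings; split on the first colon only.
    tags = dict(t.split(":", 1) for t in outer.get("tags", []) if ":" in t)
    return {
        "exit_code": inner.get("exitCode"),  # None when killed by a signal
        "signal": inner.get("signal"),
        "pod_name": tags.get("pod_name", "?"),
        "image_tag": tags.get("image_tag", "?"),
    }


# Hand-written sample for illustration only.
sample = {
    "attributes": {
        "attributes": {"exitCode": None, "signal": "SIGKILL"},
        "tags": ["pod_name:nexu-gateway-1", "image_tag:sha-abc"],
    }
}
print(summarize_crash(sample))
```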

### OpenClaw stderr Output (Crash Details)

```bash
curl -s "https://api.datadoghq.com/api/v2/logs/events/search" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "filter": {
      "query": "service:nexu-gateway @stream:stderr",
      "from": "now-1h",
      "to": "now"
    },
    "sort": "-timestamp",
    "page": {"limit": 50}
  }'
```

This shows the actual error output from the OpenClaw process (e.g., `invalid_auth`, `EADDRINUSE`, config validation failures).
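When many stderr lines come back, a quick tally by error marker can help spot the dominant failure. The sketch below only knows the two literal markers named above (`invalid_auth`, `EADDRINUSE`) and buckets everything else as `other`; `classify_stderr` and `tally_stderr` are hypothetical helpers, and substring matching is an assumption about how these markers appear in the messages.

```python
from collections import Counter

# Only the two literal markers named in the text; extend as needed.
KNOWN_MARKERS = ("invalid_auth", "EADDRINUSE")


def classify_stderr(message):
    """Return the first known marker found in the message, else 'other'."""
    for marker in KNOWN_MARKERS:
        if marker in message:
            return marker
    return "other"


def tally_stderr(messages):
    """Count stderr messages per marker."""
    return Counter(classify_stderr(m) for m in messages)
```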

### Gateway Startup / Recovery Events

```bash
curl -s "https://api.datadoghq.com/api/v2/logs/events/search" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "filter": {
      "query": "service:nexu-gateway (\"starting gateway\" OR \"gateway is ready\" OR \"spawned openclaw\")",
      "from": "now-1h",
      "to": "now"
    },
    "sort": "timestamp",
    "page": {"limit": 30}
  }'
```

### Slack Token Health Check

```bash
curl -s "https://api.datadoghq.com/api/v2/logs/events/search" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "filter": {
      "query": "service:nexu-api slack_token_health*",
      "from": "now-1h",
      "to": "now"
    },
    "sort": "-timestamp",
    "page": {"limit": 20}
  }'
```

### API HTTP Request Logs

```bash
curl -s "https://api.datadoghq.com/api/v2/logs/events/search" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "filter": {
      "query": "service:nexu-api http_request @attributes.status:>=500",
      "from": "now-1h",
      "to": "now"
    },
    "sort": "-timestamp",
    "page": {"limit": 20}
  }'
```

### Filter by Pod

Add `pod_name:<name>` to the query:

```
service:nexu-gateway pod_name:nexu-gateway-1 @event:openclaw_crash
```

### Filter by Time Window

Use ISO 8601 timestamps:

```json
{
  "from": "2026-03-10T05:00:00Z",
  "to": "2026-03-10T06:00:00Z"
}
```

Or relative: `"now-30m"`, `"now-1h"`, `"now-24h"`.
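The pod and time-window filters can be combined when building a request body in code. The following is a sketch under the assumption that timestamps are timezone-aware `datetime` objects; `iso_utc` and `build_search_body` are hypothetical helper names, and the body layout mirrors the curl examples in this document.

```python
from datetime import datetime, timedelta, timezone


def iso_utc(dt):
    """Format a timezone-aware datetime as an ISO 8601 UTC string."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_search_body(query, start, end, pod_name=None, limit=20,
                      newest_first=True):
    """Build a search body; start/end may be ISO strings or 'now-1h' style."""
    if pod_name:
        # Narrow the query to a single pod via its tag.
        query = f"{query} pod_name:{pod_name}"
    return {
        "filter": {"query": query, "from": start, "to": end},
        "sort": "-timestamp" if newest_first else "timestamp",
        "page": {"limit": limit},
    }


end = datetime(2026, 3, 10, 6, 0, tzinfo=timezone.utc)
body = build_search_body(
    "service:nexu-gateway @event:openclaw_crash",
    iso_utc(end - timedelta(hours=1)),
    iso_utc(end),
    pod_name="nexu-gateway-1",
)
```

Relative windows work the same way: pass `"now-1h"` and `"now"` as `start` and `end` instead of formatted timestamps.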

## Parsing Results

Use python3 inline to extract key fields:

```bash
curl -s ... | python3 -c "
import json, sys
data = json.load(sys.stdin)
events = data.get('data', [])
print(f'Total events: {len(events)}')
for e in events:
    outer = e.get('attributes', {})
    attrs = outer.get('attributes', {})  # custom log attributes
    tags = outer.get('tags', [])
    # Tags are key:value strings; pull the pod_name value if present.
    pod = next((t.split(':',1)[1] for t in tags if t.startswith('pod_name:')), '?')
    # Prefer the log's own time field, fall back to the event timestamp.
    ts = attrs.get('time') or outer.get('timestamp', '?')
    msg = (outer.get('message') or '')[:120]
    print(f'{ts} | pod={pod} | {msg}')
"
```

## Services and Events Reference

| Service | Description |
|---------|-------------|
| `nexu-gateway` | Gateway sidecar (manages OpenClaw process) |
| `nexu-api` | API server |

| Event | Meaning |
|-------|---------|
| `openclaw_crash` | OpenClaw process exited unexpectedly |
| `openclaw_restart_scheduled` | Sidecar scheduling a restart |
| `openclaw_restart_limit` | Max restart attempts exceeded |
| `openclaw_orphan_killed` | Killed an orphaned (leftover) OpenClaw process |
| `slack_token_health_check_invalidated` | Invalid Slack tokens detected and marked |

## Tag Reference

| Tag | Example |
|-----|---------|
| `pod_name` | `nexu-gateway-1`, `nexu-gateway-2` |
| `image_tag` | `sha-55f13372bb72abc7db1538cca3db2bcda0d35eba` |
| `kube_stateful_set` | `nexu-gateway` |

## Investigation Playbook

When investigating a crash:

1. **Check crash events** — get exit codes, signals, timestamps, affected pods
2. **Check stderr** — get the actual error message from OpenClaw
3. **Check startup events** — correlate crash with deploy times (`image_tag` changes)
4. **Check token health** — if `invalid_auth`, look for `slack_token_health_check_invalidated`
5. **Check API logs** — if API errors are contributing
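The five steps above can be sketched as an ordered list of request bodies, reusing the queries, sort orders, and limits from the Common Queries section. `PLAYBOOK` and `playbook_bodies` are hypothetical names; sending the bodies and summarizing the results is left to the caller.

```python
# (step name, query, sort, limit), in the order the playbook runs them.
PLAYBOOK = [
    ("crash events", "service:nexu-gateway @event:openclaw_crash",
     "-timestamp", 20),
    ("stderr", "service:nexu-gateway @stream:stderr", "-timestamp", 50),
    ("startup events",
     'service:nexu-gateway ("starting gateway" OR "gateway is ready" '
     'OR "spawned openclaw")',
     "timestamp", 30),
    ("token health", "service:nexu-api slack_token_health*",
     "-timestamp", 20),
    ("API errors", "service:nexu-api http_request @attributes.status:>=500",
     "-timestamp", 20),
]


def playbook_bodies(window="now-1h"):
    """Return (step, body) pairs for one investigation pass."""
    return [
        (
            step,
            {
                "filter": {"query": query, "from": window, "to": "now"},
                "sort": sort,
                "page": {"limit": limit},
            },
        )
        for step, query, sort, limit in PLAYBOOK
    ]
```

Calling `playbook_bodies("now-24h")` produces the same five bodies over the wider window suggested in the rules below.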

## Rules

1. **Never hardcode API keys** in skill files or logs — always use variables
2. **Default time window** — start with `now-1h`, expand to `now-24h` if needed
3. **Always parse and summarize** — don't dump raw JSON to the user
4. **Correlate across services** — crashes often involve both gateway and API logs
5. **Check image_tag** to determine if crashes are related to a specific deployment
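For rule 5, grouping crash events by `image_tag` gives a quick view of whether one deployment accounts for most crashes. This is a minimal sketch that assumes events carry tags as `key:value` strings, as in the Tag Reference; `crashes_by_image_tag` is a hypothetical helper.

```python
from collections import Counter


def crashes_by_image_tag(events):
    """Count crash events per image_tag; untagged events go to 'unknown'."""
    counts = Counter()
    for event in events:
        tags = event.get("attributes", {}).get("tags", [])
        image = next(
            (t.split(":", 1)[1] for t in tags if t.startswith("image_tag:")),
            "unknown",
        )
        counts[image] += 1
    return counts
```

A result dominated by a single tag points toward that deployment; an even spread across tags suggests looking elsewhere, such as token health or API errors.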
