grafana-lens

Grafana tools for data visualization, monitoring, alerting, security, SRE investigation, and data collection pipeline management via Alloy. Use grafana_query, grafana_query_logs, grafana_query_traces, grafana_create_dashboard, grafana_update_dashboard, grafana_create_alert, grafana_share_dashboard, grafana_annotate, grafana_explore_datasources, grafana_list_metrics, grafana_search, grafana_get_dashboard, grafana_check_alerts, grafana_push_metrics, grafana_explain_metric, grafana_security_check, grafana_investigate, and alloy_pipeline. Trigger when asked about metrics, dashboards, monitoring, alerts, costs, token usage, data visualization, PromQL, Prometheus, LogQL, Loki, log queries, error logs, log search, TraceQL, Tempo, traces, distributed tracing, span search, find slow traces, debug session traces, annotations, deployments, sharing charts, investigating alert notifications, pushing custom data (calendar, git, fitness, finance) to Grafana for visualization, pushing historical data, backfilling metrics, recording past data with timestamps, modifying dashboards, adding panels, removing panels, changing dashboard settings, updating dashboard time range, explain metric, metric trend, what is this metric, how has this changed, is this metric normal, why did my bill spike, cost visibility, security monitoring, security check, security audit, am I being attacked, is my agent compromised, suspicious activity, threat detection, prompt injection detection, set up security alerts, investigate, debug, triage, root cause, what's wrong, why is X broken, anomaly detection, RED method, USE method, alert fatigue, postmortem, incident summary, collect metrics from, monitor my database, monitor my app, scrape endpoint, set up log collection, collect Docker logs, tail log files, collect Kubernetes logs, receive OTLP, set up trace collection, data collection pipeline, Alloy pipeline, pipeline status, pipeline health, node exporter, system metrics, postgres exporter, mysql exporter, redis exporter, syslog, Grafana Alloy.
3,891 stars
byopenclaw
View on GitHub Installation ↓
Best use case

grafana-lens is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using grafana-lens should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.
Installation

Claude Code / Cursor / Codex
$curl -o ~/.claude/skills/grafana-lens/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/awsome-o/grafana-lens/SKILL.md"
Manual Installation
Download SKILL.md from GitHub
Place it in .claude/skills/grafana-lens/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill
How grafana-lens Compares

Feature / Agent	grafana-lens	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A
Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.
Related Guides

AI Agents for Coding

Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Best AI Skills for ChatGPT

Find the best AI skills to adapt into ChatGPT workflows for research, writing, summarization, planning, and repeatable assistant tasks.
Best AI Skills for Claude

Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
SKILL.md Source

# Grafana Lens

You have full native Grafana access — query data, create dashboards, set alerts, receive alert notifications, annotate events, explore datasources, push custom data, and deliver visualizations inline. Works with ANY data in Grafana, not just agent metrics.

## Musts

- **Always call `grafana_explore_datasources` first** when you need a datasource UID — never guess UIDs
- **Always call `grafana_search` before creating a dashboard** — avoid duplicates
- **Always call `grafana_get_dashboard` before `grafana_share_dashboard`** — you need exact panel IDs
- **Always call `grafana_get_dashboard` before `grafana_update_dashboard`** — you need panel IDs and current structure
- **Prefer `grafana_query` for direct answers** over creating dashboards — "what's my cost?" needs a number, not a URL
- **Prefer `grafana_query` over `grafana_create_dashboard` + `grafana_share_dashboard`** for simple data questions — a number is faster than a chart
- **Use `grafana_query_logs` for log searches** — LogQL for logs, PromQL for metrics, TraceQL for traces. Never use `grafana_query` for Loki datasources
- **Use `grafana_query_traces` for trace searches** — TraceQL for traces, PromQL for metrics, LogQL for logs. Never use `grafana_query` or `grafana_query_logs` for Tempo datasources
- **All tools work with ANY Prometheus datasource** — not just `openclaw_lens_*` metrics
- **When you see "GRAFANA ALERTS" in prompt context**, investigate immediately with `grafana_check_alerts` — use the `suggestedInvestigation` field to go directly to querying (it provides the tool, query, and datasource)
- **Run `grafana_check_alerts` with action `setup` once** before alert notifications can reach the agent — this creates the webhook contact point
- **Push data before querying or dashboarding it** — data is pushed via OTLP and available immediately
- **Prefer `grafana_explain_metric` for "what is this metric?" questions** over manual `grafana_query` — it returns current value, trend, stats, and metadata in one call
- **Use `queryNames` from push response for PromQL queries** — don't guess metric names (counters get `_total` suffix)
- **Use `openclaw_ext_` prefix for custom metrics** — `grafana_push_metrics` auto-prepends it if missing
- **Follow statistics-first discipline for log investigation** — always run count/rate LogQL before reading individual entries. Use `grafana_query_logs` with metric-over-logs queries (`count_over_time`, `rate`, `topk`) before switching to raw log entries
- **Silence alerts during investigation** — use `grafana_check_alerts` with action `silence` to prevent repeat notifications while investigating
- **Use `list_rules` for complete alert health** — `grafana_check_alerts` with action `list_rules` returns all rules with live eval state (normal/firing/pending/nodata/error), health, and lastEvaluation — no need to cross-reference with `list` action
- **Use `dashboardUid` + `panelId` to re-run panel queries** — don't manually extract PromQL/LogQL from `get_dashboard` output. Both `grafana_query` and `grafana_query_logs` accept these params to auto-resolve the panel's query expression and datasource. The tool handles template variable replacement and datasource routing automatically
- **Confirm with user before deleting dashboards or alert rules** — `grafana_update_dashboard` with operation `delete` and `grafana_check_alerts` with action `delete_rule` are permanent and cannot be undone
- **Always use `alloy_pipeline` action `recipes` first** when unsure which pipeline recipe fits the user's request — because recipes provide validation, credential handling, and sample queries that raw config does not
- **Always call `alloy_pipeline` action `status`** after creating a pipeline — because data takes 15-20s to flow through the pipeline, and components may fail silently after reload
- **Never guess Alloy component names** — use recipes for known patterns, or raw `config` only when the user explicitly provides Alloy syntax
- **Prefer recipes over raw config** when a recipe exists — recipes provide validation, sample queries, credential handling, dashboard templates, and automatic export target wiring
- **Never write credentials into raw `config`** — when the user provides a connection string, DSN, password, or API key, ALWAYS use the matching recipe (which routes credentials through `sys.env()`, keeping secrets off disk). If you must use raw config, wrap sensitive values in `sys.env("MY_VAR_NAME")` and tell the user to set that env var where Alloy runs
- **Read `envVarsRequired` from every pipeline create response** — credential recipes may return `pending_credentials` status when env vars aren't set yet. Tell the user the exact var names and that they must set them where Alloy runs, then verify with action `status`
- **Warn users before creating credential-required pipelines** — Alloy config reload is atomic: if a credential recipe's env vars aren't set, the reload failure blocks ALL managed pipelines (not just the new one) until the env vars are set or the pipeline is deleted. Always ask: "Do you have the credentials ready to set as env vars on the Alloy host?"
- **Chain pipeline creation into existing tools** — after pipeline is active: `grafana_list_metrics` or `grafana_query_logs` to discover data, `grafana_create_dashboard` to visualize, `grafana_create_alert` to monitor
- **Use `alloy_pipeline` action `diagnose`** as first step when user reports pipeline issues — because it checks Alloy connectivity, all pipeline health, config file drift, and limits in one call
- **Confirm with user before deleting pipelines** — `alloy_pipeline` with action `delete` removes the config and data stops flowing
- **All log recipes accept processing params** — don't create separate "processing" pipelines. Add `jsonExpressions`, `labelFields`, `structuredMetadata`, `tenantValue`, `matchRoutes`, etc. directly to any log recipe (docker-logs, file-logs, syslog, etc.)
- **Use `samplingPolicies` for multi-policy tail sampling** — don't create raw config when `application-traces` can handle it. `sampleRate` is for simple probabilistic, `samplingPolicies` is for intelligent multi-policy (keep errors, keep slow, sample rest)
- **Use log processing params for multi-tenant routing** — `tenantValue`/`tenantSource`/`matchRoutes` work on ALL log recipes. Don't create separate "routing" pipelines
- **Read `references/alloy-components.md` before composing raw config** — it has copy-pasteable snippets for all common Alloy components

## Quick Decision Tree

- "What is [metric]?" / "Why did it spike?" → `grafana_explain_metric`
- "What's the current value of X?" / complex PromQL → `grafana_query`
- "Find error logs" / "Search logs for..." → `grafana_query_logs`
- "Find slow traces" / "Show trace for session X" / "Debug distributed spans" → `grafana_query_traces`
- "Debug this session" / "Why did it fail?" / "What went wrong?" → `grafana_query_traces` (search error/slow) → `grafana_query_traces` (get → follow `correlationHint`) → `grafana_query_logs` → `grafana_query` → `grafana_annotate`
- "Show me a chart" / "Visualize..." → `grafana_search` → `grafana_get_dashboard` → `grafana_share_dashboard`
- "Create a dashboard for..." → `grafana_search` (check duplicates) → `grafana_create_dashboard`
- "Add a panel to my dashboard" → `grafana_get_dashboard` → `grafana_update_dashboard`
- "Delete this dashboard" → `grafana_update_dashboard` with operation `delete` (confirm with user first)
- "Alert me when..." → `grafana_check_alerts` (setup) → `grafana_create_alert`
- "List my alert rules" / "What alerts do I have?" → `grafana_check_alerts` with action `list_rules`
- "Delete alert rule X" → `grafana_check_alerts` with action `list_rules` → `delete_rule` with `ruleUid`
- "Track my [custom data]" / "Record my [past data]" → `grafana_push_metrics` (with optional `timestamp` for historical data, auto-registers, returns `queryNames`) → `grafana_query` with `queryNames`
- "What data sources do I have?" → `grafana_explore_datasources`
- "What metrics are available?" → `grafana_list_metrics`
- "Set up monitoring" / "Monitor my agent" / "What dashboards should I have?" → `grafana_search` (check existing) → `grafana_create_dashboard` with `llm-command-center` → follow `suggestedNext` chain through remaining templates
- "GenAI observability" / "OTel gen_ai metrics" / "Standard AI monitoring" → `grafana_create_dashboard` with `genai-observability` template
- "What happened in session X?" / "Debug this session" → `grafana_create_dashboard` with `session-explorer` template → paste session ID
- "Show me LLM traces" / "Show agent logs" → `grafana_create_dashboard` with `llm-command-center` template (Loki + Tempo)
- "How much am I spending?" / "Cost analysis" → `grafana_create_dashboard` with `cost-intelligence` template
- "Which tools are slow?" / "Tool errors" → `grafana_create_dashboard` with `tool-performance` template
- "Queue health" / "Webhook issues" / "Stuck sessions" → `grafana_create_dashboard` with `sre-operations` template
- "System health check" / "Status report" / "Review all dashboards" → `grafana_explore_datasources` → `grafana_check_alerts` (list + list_rules) → `grafana_search` → `grafana_get_dashboard` (audit=true for each) → summarize
- "Audit my dashboard" / "Which panels are broken?" → `grafana_get_dashboard` (audit=true) → review `auditSummary` + per-panel `health`
- "Am I being attacked?" / "Security check" / "Security status" → `grafana_security_check`
- "Set up security monitoring" → `grafana_check_alerts` (setup) → `grafana_create_dashboard` (`security-overview`) → `grafana_create_alert` (webhook error burst, cost spike, tool loops, injection signals)
- "Investigate security alert" → `grafana_security_check` → `grafana_query_logs` (correlate) → `grafana_annotate` (mark investigation) → `grafana_check_alerts` (silence)
- "Investigate this alert" / "Why is X broken?" / "Debug this issue" / "Triage" / "Root cause" → `grafana_investigate` (multi-signal triage) → follow `suggestedHypotheses.testWith` for deep-dives
- "Is this metric normal?" / "Is there an anomaly?" → `grafana_explain_metric` (returns `anomaly` z-score + `seasonality` vs 1d/7d ago for 24h period)
- "RED analysis" / "What's the error rate?" / "Service health" → RED Method queries (see sre-investigation.md §2)
- "Alert fatigue" / "Which alerts are noisy?" / "Alert health" → `grafana_check_alerts` with action `analyze` — fatigue report
- "Postmortem" / "Incident summary" / "What happened?" → `grafana_investigate` → 5-Phase methodology → postmortem template (see sre-investigation.md §9)
- "Compare before/after deployment" → `grafana_annotate` (list, tags: ["deploy"]) → `grafana_explain_metric` (compareWith: "previous")

### Data Collection Pipelines (Alloy)

- "Monitor a service/database/app" → `alloy_pipeline` action `recipes` (filter by category) → select recipe → `create` → `status` → query → dashboard → alert
- "Scrape metrics from [endpoint]" / "My app exposes /metrics" → `alloy_pipeline` with recipe `scrape-endpoint` + params `{ url }`
- "Monitor PostgreSQL/MySQL/Redis/MongoDB/Memcached" → `alloy_pipeline` with recipe `[db]-exporter` + params `{ connectionString }`
- "Collect and parse logs with JSON extraction" → `alloy_pipeline` (log recipe + processing params: `jsonExpressions`, `labelFields`, `structuredMetadata`)
- "Collect Docker logs" / "See container logs in Grafana" → `alloy_pipeline` with recipe `docker-logs`
- "Tail log files" / "Collect app logs from /var/log" → `alloy_pipeline` with recipe `file-logs` + params `{ paths }`
- "Accept logs via HTTP push API" / "Centralized log gateway" → `alloy_pipeline` with recipe `loki-push-api`
- "Consume logs from Kafka" → `alloy_pipeline` with recipe `kafka-logs` + params `{ brokers, topics }`
- "Set up syslog collection" → `alloy_pipeline` with recipe `syslog`
- "Monitor endpoint availability" / "Synthetic probing" / "HTTP health checks" → `alloy_pipeline` with recipe `blackbox-exporter` + params `{ targets }`
- "Kubernetes monitoring" / "Monitor my K8s cluster" → `alloy_pipeline` with recipe `kubernetes-pods` + `kubernetes-services` + `kubernetes-logs` (3 pipelines)
- "Receive OTLP data" / "Set up trace collection" → `alloy_pipeline` with recipe `otlp-receiver`
- "Generate RED metrics from traces" / "Span metrics" → `alloy_pipeline` with recipe `span-metrics`
- "Service dependency graph from traces" → `alloy_pipeline` with recipe `service-graph`
- "Monitor Alloy itself" / "Self-monitoring" → `alloy_pipeline` with recipe `self-monitoring`
- "Redact secrets from logs" / "Compliance logging" → `alloy_pipeline` with recipe `secret-filter-logs` + params `{ paths }`
- "Monitor Elasticsearch/Kafka" → `alloy_pipeline` with recipe `elasticsearch-exporter` / `kafka-exporter`
- "System metrics" / "Node monitoring" / "CPU/memory/disk" → `alloy_pipeline` with recipe `node-exporter`
- "Docker container metrics" / "Container resource usage" → `alloy_pipeline` with recipe `docker-metrics`
- "Reduce trace costs" / "Keep only error traces" / "Smart trace sampling" / "Tail sampling" → `alloy_pipeline` with recipe `application-traces` + `samplingPolicies` array (keep errors, keep slow, filter health checks, sample rest)
- "Multi-tenant Loki" / "Route logs by tenant" / "Different tenants for different apps" → any log recipe + `tenantValue` or `matchRoutes` processing param
- "Profile my app" / "CPU profiling" / "Memory profiling" / "Continuous profiling" / "Go pprof" → `alloy_pipeline` with recipe `continuous-profiling` + `targets`
- "Frontend observability" / "Browser RUM" / "Web vitals" / "Faro SDK" → `alloy_pipeline` with recipe `faro-frontend`
- "GELF logs" / "Graylog" / "Docker GELF driver" → `alloy_pipeline` with recipe `gelf-logs`
- "Custom Alloy pattern" / "Advanced pipeline" → Read `references/alloy-components.md` → `alloy_pipeline` with raw `config` + optional `sampleQueries`
- "What data collection recipes are available?" → `alloy_pipeline` with action `recipes`
- "What pipelines do I have?" / "Pipeline list" → `alloy_pipeline` with action `list`
- "Is my pipeline working?" / "Pipeline health" → `alloy_pipeline` with action `status` + name
- "Pipeline problems" / "Why isn't data showing up?" → `alloy_pipeline` with action `diagnose` → follow remediation
- "Delete pipeline" / "Remove monitoring for..." → `alloy_pipeline` with action `delete` + name (confirm with user first)

## Working with Multiple Grafana Instances

When several Grafana environments are configured (dev, staging, prod), every tool accepts an optional `instance` parameter. `grafana_explore_datasources` returns `availableInstances` — use the `name` values from that list.

**Why this matters**: Users often need to query production metrics, create dashboards in dev, or compare environments side by side. Each tool call targets one instance.

**Smart defaults**: Omitting `instance` always targets the configured default — safe and invisible for single-environment setups. Only specify `instance` when the user explicitly names a non-default environment.

**Cross-environment workflows**: Each call is independent. Query prod, create dashboard in dev — just set `instance` differently on each call. No context switching needed.

## Tool Inventory

| Tool | What It Does |
|------|-------------|
| `grafana_explore_datasources` | Discover configured datasources (UIDs, types, query routing) — tells you which tool + query language to use for each datasource |
| `grafana_list_metrics` | Discover available metrics or label values from a datasource. Use `compact: true` with `metadata: true` for minimal fields in multi-tool chains |
| `grafana_query` | Run PromQL instant/range queries — get numbers directly |
| `grafana_query_logs` | Run LogQL queries against Loki — search and filter logs |
| `grafana_query_traces` | Run TraceQL queries against Tempo — search traces or get full trace by ID |
| `grafana_create_dashboard` | Create dashboards from templates or custom JSON |
| `grafana_update_dashboard` | Add/remove/update panels, change dashboard metadata, or delete dashboard |
| `grafana_get_dashboard` | Get dashboard summary (panels, queries). Use `compact: true` for overview scans, `audit: true` to health-check all panels in one call |
| `grafana_search` | Search existing dashboards by title, tags, or starred status |
| `grafana_share_dashboard` | Render panel as image and deliver inline via messaging |
| `grafana_create_alert` | Create Grafana-native alert rules on any metric |
| `grafana_annotate` | Create or list annotations (events) on dashboards |
| `grafana_check_alerts` | Check, acknowledge, list/delete rules, silence/unsilence, or set up Grafana alert webhook notifications. Use `compact: true` with `list_rules` for minimal fields |
| `grafana_push_metrics` | Push custom data (calendar, git, fitness, finance) via OTLP |
| `grafana_explain_metric` | Get metric context: current value, trend, stats, metadata, drill-down queries — agent interprets |
| `grafana_security_check` | Run 6 parallel security checks and return threat-level assessment (green/yellow/red) — "Am I being attacked?" |
| `grafana_investigate` | Multi-signal investigation triage — gathers metrics, logs, traces, and context in parallel, generates hypotheses with specific tool+params for follow-up |
| `alloy_pipeline` | Create and manage Alloy data collection pipelines — 29 recipes for metrics, logs, traces, profiles from any infrastructure (databases, K8s, Docker, apps, profiling, frontend RUM) |

## Tool Details

### `grafana_explore_datasources`
**When**: First step when user mentions data, metrics, or monitoring. Gets datasource UIDs needed by `grafana_query`, `grafana_query_logs`, `grafana_query_traces`, `grafana_list_metrics`, `grafana_create_alert`, and `grafana_explain_metric`.
**Params**: `instance` (optional — target Grafana instance, omit for default).
**Example**: `{}`
**Example (multi-instance)**: `{ "instance": "prod" }`
**Returns**: List of datasources with `uid`, `name`, `type`, `isDefault`, plus routing hints: `queryTool` (which agent tool to use, e.g. `"grafana_query"`, `"grafana_query_logs"`, or `"grafana_query_traces"`), `queryLanguage` (e.g. `"PromQL"`, `"LogQL"`, `"TraceQL"`), and `supported` (boolean — whether an agent tool can query this datasource). Use `queryTool` to pick the right tool for each datasource. When multiple Grafana instances are configured, also returns `instance` (which instance was queried) and `availableInstances` (list of `{ name, url, isDefault }` for all configured instances).

### `grafana_list_metrics`
**When**: User asks "what metrics are available?" or you need to discover metrics before querying or composing dashboards. Also when grouping metrics by function — metadata mode adds `category` to each `openclaw_*` metric. Use `purpose` when user asks about a specific concern (e.g., "performance metrics", "cost metrics").
**Params**: `datasourceUid` (required), `prefix` (filter by prefix), `search` (targeted discovery — server-side regex, only matching metrics returned), `purpose` (`"performance"` | `"cost"` | `"reliability"` | `"capacity"` — pre-filter by intent, composable with prefix and search), `label` (list label values instead), `metadata` (boolean — enriched results with type/help/category), `compact` (boolean — with metadata, returns only name/type/category, ~60% smaller).
**Example names**: `{ "datasourceUid": "prom1", "prefix": "openclaw_lens_" }`
**Example search**: `{ "datasourceUid": "prom1", "search": "steps" }`
**Example purpose**: `{ "datasourceUid": "prom1", "purpose": "performance", "metadata": true }`
**Example combined**: `{ "datasourceUid": "prom1", "prefix": "openclaw_ext_", "search": "fitness" }`
**Example metadata**: `{ "datasourceUid": "prom1", "metadata": true, "prefix": "openclaw_" }`
**Example compact**: `{ "datasourceUid": "prom1", "metadata": true, "compact": true }`
**Returns names**: `{ metrics: ["metric1", "metric2", ...] }`. Truncated at 200.
**Returns metadata**: `{ metadataSource, categorySummary: { cost: 3, usage: 4, session: 5, ... }, metrics: [{ name, type, help, category?, source? }, ...] }`. Use this before composing custom dashboards — type tells you counter vs gauge vs histogram, category groups `openclaw_*` metrics by function. Search also matches help text. Categories: `cost`, `usage`, `session`, `queue`, `messaging`, `webhook`, `tools`, `agent`, `custom`. `categorySummary` gives counts per category for quick overview (omitted when no `openclaw_*` metrics). Purpose maps: `performance` → session + tools, `cost` → cost + usage, `reliability` → webhook + messaging + agent, `capacity` → queue + session. `metadataSource`: `"prometheus"` when Prometheus metadata endpoint has data, `"synthetic"` when OTLP-only (metadata synthesized from known metric registry — histogram sub-metrics deduplicated, type/help from Grafana Lens definitions). On OTLP stacks, includes `hint` explaining why metadata is synthetic. `source: "synthetic"` on individual entries from the registry; `source: "custom"` on entries from the custom metrics store.
**Returns compact**: `{ metadataSource, categorySummary: {...}, metrics: [{ name, type, category? }, ...] }`. Same as metadata but drops `help`, `source`, `labelNames` — use in multi-tool chains where you need metric names and types but not full descriptions.
**Example label**: `{ "datasourceUid": "prom1", "label": "job" }`
**Returns label**: `{ label, count, totalCount, values: ["value1", "value2", ...] }`. Truncated at 200.

### `grafana_query`
**When**: User asks a data question that needs a direct answer, not a dashboard. Also for re-running an existing dashboard panel's query with different time ranges.
**Params**: `datasourceUid`, `expr` (PromQL), `queryType` (`instant`/`range`), `start` (range only, required), `end` (range only, default `"now"`), `step` (range only, optional — auto-calculated from time range if omitted, targeting ~300 datapoints), `dashboardUid` (optional — resolve query from panel), `panelId` (optional — use with `dashboardUid`).
**Example instant**: `{ "datasourceUid": "prom1", "expr": "sum(increase(openclaw_lens_cost_by_model_total[1d])) or vector(0)" }`
**Example range (auto-step)**: `{ "datasourceUid": "prom1", "expr": "rate(openclaw_tokens_total[5m])", "queryType": "range", "start": "now-30d" }`
**Example range (explicit step)**: `{ "datasourceUid": "prom1", "expr": "rate(openclaw_tokens_total[5m])", "queryType": "range", "start": "now-1h", "end": "now", "step": "60" }`
**Example panel re-run**: `{ "dashboardUid": "openclaw-command-center", "panelId": 10, "queryType": "range", "start": "now-7d" }`
**Tip**: `start`/`end` accept Unix seconds or relative expressions like `"now-1h"`, `"now-7d"`. For range queries, just set `start` — `end` defaults to `"now"` and `step` is auto-calculated. Override `step` only when you need specific resolution.
**Tip (panel re-run)**: Set `dashboardUid` + `panelId` to re-run a panel's query without manually extracting PromQL. The tool auto-resolves `expr` and `datasourceUid` from the panel definition. Template variables are replaced with wildcards. You can still override `expr` or `datasourceUid` explicitly if needed. Get panel IDs from `grafana_get_dashboard`.
**Returns instant**: `{ metrics: [{ metric: {...}, value: "1.23", timestamp: "...", healthContext?: { status, thresholds, description, direction } }], datasourceUid, resultCount, warnings?, hint? }` — `healthContext` is included for well-known `openclaw_lens_*` gauge metrics, providing SRE-grade health assessment: `status` ("healthy"/"warning"/"critical"), `thresholds` (warning/critical values), `description` (what the metric means), `direction` ("higher_is_worse"/"lower_is_worse"). Omitted for unknown metrics. Capped at 50 results; when exceeded includes `truncated: true`, `totalResults`, and `truncationHint` advising to narrow the query.
**Returns range**: `{ series: [{ metric: {...}, values: [{ time, value }...] }], datasourceUid, resultCount, warnings?, hint? }` — truncated to 20 points per series and 50 series max. When series are truncated includes `truncated: true`, `totalSeries`, and `truncationHint`. When step is auto-calculated, includes `step: { value: "288s", display: "5m", auto: true }`.
**Returns (panel re-run)**: Includes `resolvedFrom: "panel"`, `panelTitle`, `panelType`, `templateVarsReplaced` alongside normal query results. If the panel uses a Loki datasource, returns an error directing you to use `grafana_query_logs` instead.
**Returns (warnings)**: When Prometheus flags a non-fatal issue (e.g., `rate()` on a gauge), `warnings: [{ cause, suggestion, example? }]` is included. Example: `rate()` on a gauge → cause says "rate() applied to 'metric' which appears to be a gauge", suggestion says "use delta() or deriv() instead", example shows the corrected query.
**Returns (hint)**: When the query returns zero results, `hint: { cause, suggestion }` explains why (metric may not exist, label filters may not match) and suggests using `grafana_list_metrics` to verify.
**Returns (error with guidance)**: On query failure, includes `guidance: { cause, suggestion, example? }` alongside the raw error. Pattern-matched for common PromQL mistakes: unclosed parenthesis, missing range selector, timeout, auth failure, rate on gauge, etc. Omitted when the error is unrecognized.
**Tip (chaining)**: Both instant and range responses include `datasourceUid` — pass it directly to `grafana_create_alert` or other tools without re-calling `grafana_explore_datasources`. This enables zero-friction query→alert chains.

### `grafana_query_logs`
**When**: User asks about logs, errors, or needs to investigate issues by searching log data. Also for session debugging, OTel log investigation, and re-running existing log panel queries.
**Params**: `datasourceUid`, `expr` (LogQL), `queryType` (`instant`/`range`, default `range`), `start`/`end` (default `now-1h`/`now`), `step` (metric queries only), `limit` (default 100), `direction` (`backward`/`forward`), `lineLimit` (max chars per log line, default 500, max 2000), `extractFields` (boolean, default false — extract structured OTel attributes into a clean `fields` object), `dashboardUid` (optional — resolve query from panel), `panelId` (optional — use with `dashboardUid`).
**Example log search**: `{ "datasourceUid": "loki1", "expr": "{job=\"api\"} |= \"error\"" }`
**Example with filters**: `{ "datasourceUid": "loki1", "expr": "{job=\"api\"} |~ \"timeout|refused\"", "limit": 50, "direction": "forward" }`
**Example full stack traces**: `{ "datasourceUid": "loki1", "expr": "{job=\"api\"} |= \"Exception\"", "lineLimit": 2000 }`
**Example session debugging**: `{ "datasourceUid": "loki1", "expr": "{service_name=\"openclaw\"} | json | component=\"lifecycle\"", "extractFields": true }`
**Example metric query**: `{ "datasourceUid": "loki1", "expr": "rate({job=\"api\"}[5m])", "queryType": "range", "start": "now-6h", "end": "now", "step": "60" }`
**Example panel re-run**: `{ "dashboardUid": "openclaw-command-center", "panelId": 18, "start": "now-24h", "extractFields": true }`
**Returns streams**: `{ entries: [{ labels: {...}, timestamp: "...", line: "..." }], datasourceUid, totalEntries, truncated }` — capped at 100 entries, lines at 500 chars (set `lineLimit: 2000` for full stack traces).
**Returns streams (extractFields)**: `{ entries: [{ labels: {...cleaned...}, timestamp: "...", line: "...", fields: { component, event_name, session_id, trace_id, model, duration_s, ... } }], datasourceUid }` — infrastructure noise labels removed, `openclaw_` prefix stripped from field keys, numeric values auto-converted. Also parses JSON log bodies if present.
**Returns streams (traceCorrelation)**: When `extractFields: true` and entries contain `trace_id`, includes `traceCorrelation: { traceIds: [...], tool: "grafana_query_traces", tip }` — up to 5 unique trace IDs ready for `grafana_query_traces` with `queryType: "get"`.
**Returns metric**: Same shape as `grafana_query` range/instant results (matrix capped at 50 series, vector capped at 50 results — includes `datasourceUid`, `truncated`, `totalSeries`/`totalResults`, and `truncationHint` when exceeded).
**Returns (panel re-run)**: Includes `resolvedFrom: "panel"`, `panelTitle`, `panelType`, `templateVarsReplaced` alongside normal results. If the panel uses a Prometheus datasource, returns an error directing you to use `grafana_query` instead.
**Returns (error with guidance)**: On query failure, includes `guidance: { cause, suggestion, example? }` alongside the raw error. Pattern-matched for common LogQL mistakes: bare text without stream selector, empty `{}`, unclosed braces, missing label matchers, auth failure, timeout. Omitted when the error is unrecognized.
**Tip**: LogQL: `{label="value"}` selects streams, `|=` substring filter, `|~` regex, `!=` exclude. Metric wrappers: `rate()`, `count_over_time()`, `bytes_rate()`. Use `extractFields: true` when investigating OTel/lifecycle logs — it surfaces `trace_id`, `session_id`, `event_name`, `model`, and other attributes as first-class fields instead of buried in raw labels.
**Tip (panel re-run)**: Same as `grafana_query` — set `dashboardUid` + `panelId` to auto-resolve LogQL and datasource. The tool routes Prometheus panels to `grafana_query` with a helpful error.

### `grafana_query_traces`
**When**: User asks about traces, distributed tracing, slow spans, session trace hierarchies, or needs to debug request flows across services.
**Params**: `datasourceUid`, `query` (TraceQL expression or trace ID), `queryType` (`search`/`get`, default `search`), `start`/`end` (default `now-1h`/`now`), `limit` (default 20, max 50), `minDuration`/`maxDuration` (e.g., `"1s"`, `"10s"`), `dashboardUid` (optional — resolve query from panel), `panelId` (optional — use with `dashboardUid`).
**Example search**: `{ "datasourceUid": "tempo1", "query": "{ resource.service.name = \"openclaw\" }" }`
**Example search slow**: `{ "datasourceUid": "tempo1", "query": "{ resource.service.name = \"openclaw\" }", "minDuration": "5s" }`
**Example search with time**: `{ "datasourceUid": "tempo1", "query": "{ span.gen_ai.system = \"anthropic\" }", "start": "now-24h", "limit": 50 }`
**Example get**: `{ "datasourceUid": "tempo1", "query": "abc123def456789...", "queryType": "get" }`
**Example panel re-run**: `{ "dashboardUid": "openclaw-session-explorer", "panelId": 12, "start": "now-24h" }`
**Returns search**: `{ traces: [{ traceId, rootServiceName, rootTraceName, startTime, durationMs, spanCount? }], datasourceUid, totalTraces, truncated?, correlationHint? }` — capped at 50 traces. When exceeded includes `truncated: true` and `truncationHint`. When traces are found, includes `correlationHint: { logQuery, tool, tip }` with a ready-to-use LogQL expression for `grafana_query_logs`.
**Returns get**: `{ traceId, spans: [{ traceId, spanId, parentSpanId?, operationName, serviceName, startTime, durationMs, status, kind?, attributes: {...} }], datasourceUid, totalSpans, truncated? }` — flattened OTLP spans with resolved attributes (string/number/boolean). Capped at 200 spans. Sorted by start time (earliest first).
**Returns (panel re-run)**: Includes `resolvedFrom: "panel"`, `panelTitle`, `panelType`, `templateVarsReplaced` alongside normal results. If the panel uses a Prometheus or Loki datasource, returns an error directing you to use the correct tool.
**Returns (error with guidance)**: On query failure, includes `guidance: { cause, suggestion, example? }` alongside the raw error. Pattern-matched for common TraceQL mistakes: syntax errors, invalid attributes, auth failure, timeout, not-found, invalid trace ID. Omitted when the error is unrecognized.
**Returns (no results)**: When search returns zero traces, includes `hint: { cause, suggestion }` suggesting to broaden the query or check the datasource.
**Tip**: TraceQL: `{ }` matches all traces, `resource.service.name` for service filter, `span.http.status_code` for HTTP spans, `name` for operation name, `duration` for span duration, `status` for error/ok filtering. Use `minDuration`/`maxDuration` to find performance outliers. **Trace-to-Log**: search and get results include `correlationHint.logQuery` — pass it directly to `grafana_query_logs` to find correlated logs. **Log-to-Trace**: `grafana_query_logs` results (with `extractFields: true`) include `traceCorrelation.traceIds` — pass any ID to `grafana_query_traces` with `queryType: "get"`.
**Tip (panel re-run)**: Same as `grafana_query` — set `dashboardUid` + `panelId` to auto-resolve TraceQL and datasource. The tool routes Prometheus/Loki panels to the correct tool with a helpful error.

### `grafana_create_dashboard`
**When**: User wants a persistent dashboard for ongoing monitoring.
**Params**: `template` or `dashboard` (custom JSON) — one required. Optional: `title` (overrides template default), `folderUid` (target folder), `overwrite` (default `true`).
**Returns**: `{ uid, url, status, message, suggestedNext?: [{ template, reason }], validation?: DashboardValidation }`. For template-based dashboards, `suggestedNext` lists complementary templates to deploy next. For custom JSON dashboards, `validation` dry-runs each panel's PromQL and reports per-panel health — check `validation.panelsError` for broken queries.

**Choose the right template (3-tier SRE drill-down hierarchy):**

**Tier 1 → System:** Start here for overall health.
**Tier 2 → Session:** Click a session from Tier 1 to investigate.
**Tier 3 → Deep Dive:** Cost, tool, or SRE details.

| Template | Tier | Domain | Variables | Use When |
|----------|------|--------|-----------|----------|
| `llm-command-center` | **Tier 1** | System overview | `$prometheus`, `$loki`, `$tempo`, `$provider`, `$model`, `$channel` | Golden signals, session table with click-to-drill-down, cost, cache, live feeds |
| `session-explorer` | **Tier 2** | Session debug | `$prometheus`, `$loki`, `$tempo`, `$session` (textbox) | Per-session trace hierarchy, LLM calls, tool calls, conversation flow |
| `cost-intelligence` | Tier 3a | Cost analysis | `$prometheus`, `$loki`, `$provider`, `$model` | Spending trends, model attribution, cache savings, per-session cost table |
| `tool-performance` | Tier 3b | Tool analytics | `$prometheus`, `$loki`, `$tempo`, `$tool` | Tool leaderboard, latency ranking, error rates, tool traces |
| `sre-operations` | Tier 3c | SRE operations | `$prometheus`, `$loki` | Queue health, webhooks, stuck sessions, tool loops |
| `genai-observability` | — | **OTel gen_ai standard** | `$prometheus`, `$loki`, `$tempo`, `$model`, `$provider` | Industry-standard AI monitoring: token analytics, LLM performance, traces, logs, cache efficiency. Works with any gen_ai data. |
| `node-exporter` | — | System/DevOps | `$datasource`, `$instance` | Server CPU, memory, disk, network |
| `http-service` | — | Web/DevOps | `$datasource`, `$job` | HTTP request rate, errors, latency (RED signals) |
| `metric-explorer` | — | **Any domain** | `$datasource`, `$metric` | Deep-dive into any single metric from a dropdown |
| `multi-kpi` | — | **Any domain** | `$datasource`, `$metric1`..`$metric4` | 4-metric KPI overview (business, fitness, finance, IoT) |
| `weekly-review` | — | **Any domain** | `$datasource`, `$metric1`, `$metric2` | Weekly overview of 2 external metrics with trends + all openclaw_ext_* table |

All AI templates have Loki log-to-trace correlation via Tempo + stable UIDs for cross-dashboard navigation.

**Example AI health**: `{ "template": "llm-command-center", "title": "My AI Dashboard" }`
**Example session debug**: `{ "template": "session-explorer", "title": "Session Debug" }`
**Example cost analysis**: `{ "template": "cost-intelligence", "title": "My AI Costs" }`
**Example tool analytics**: `{ "template": "tool-performance", "title": "Tool Health" }`
**Example SRE ops**: `{ "template": "sre-operations", "title": "SRE Health" }`
**Example GenAI observability**: `{ "template": "genai-observability", "title": "GenAI Observability" }`
**Example system**: `{ "template": "node-exporter", "title": "Server Health" }`
**Example generic**: `{ "template": "metric-explorer", "title": "Explore My Data" }`
**Example multi-KPI**: `{ "template": "multi-kpi", "title": "Business KPIs" }`
**Example weekly review**: `{ "template": "weekly-review", "title": "My Weekly Review" }`
**Example custom with validation**: `{ "dashboard": { "title": "Model Comparison", "panels": [{ "id": 1, "title": "Cost by Model", "type": "timeseries", "targets": [{ "refId": "A", "expr": "sum by (model) (rate(openclaw_lens_cost_by_token_type[1h]))", "datasource": { "uid": "prometheus" } }] }] } }`

**Custom dashboard validation** (returned only for `dashboard` param, not templates):
`validation: { panelsTotal: 3, panelsValid: 1, panelsNoData: 1, panelsError: 1, panelsSkipped: 0, details: [{ panelId: 1, title: "Cost by Model", status: "ok", queries: [{ refId: "A", expr: "...", valid: true, sampleValue: 0.42 }] }, { panelId: 2, title: "Latency", status: "nodata" }, { panelId: 3, title: "Bad Query", status: "error", error: "parse error at char 5" }] }`
Panel statuses: `ok` (query returned data), `nodata` (valid query, no results — metric may not exist yet), `error` (PromQL syntax error or datasource issue), `skipped` (no datasource UID found). Dashboard is always created regardless — validation is informational.

### `grafana_update_dashboard`
**When**: User wants to add a panel, remove a panel, change a query, update dashboard settings, or delete a dashboard.
**Params**: `uid` (required), `operation` (required: `add_panel`, `remove_panel`, `update_panel`, `update_metadata`, `delete`).
**add_panel params**: `panel` (object with `title`, `type`, `targets`). Auto-layouts below existing panels.
**remove_panel / update_panel params**: `panelId` (preferred) or `panelTitle` (case-insensitive substring fallback). `updates` (object) for update_panel.
**update_metadata params**: `title`, `description`, `tags`, `time` (e.g., `{ "from": "now-7d", "to": "now" }`), `refresh` (e.g., `"1m"`).
**delete params**: None besides `uid` — permanently removes the dashboard. Always confirm with user first.
**Example add**: `{ "uid": "abc123", "operation": "add_panel", "panel": { "title": "Error Rate", "type": "timeseries", "targets": [{ "refId": "A", "expr": "rate(errors_total[5m])", "datasource": { "uid": "prom1" } }] } }`
**Example add (no datasource)**: `{ "uid": "abc123", "operation": "add_panel", "panel": { "title": "Latency", "type": "timeseries", "targets": [{ "refId": "A", "expr": "histogram_quantile(0.99, rate(http_duration_bucket[5m]))" }] } }` — validation skipped if no datasource UID found, panel still saved.
**Example remove**: `{ "uid": "abc123", "operation": "remove_panel", "panelId": 3 }`
**Example update panel**: `{ "uid": "abc123", "operation": "update_panel", "panelId": 1, "updates": { "title": "New Title", "targets": [{ "refId": "A", "expr": "new_query" }] } }`
**Example update metadata**: `{ "uid": "abc123", "operation": "update_metadata", "title": "My Dashboard v2", "time": { "from": "now-7d", "to": "now" }, "refresh": "5m" }`
**Example delete**: `{ "uid": "abc123", "operation": "delete" }`
**Returns update**: `{ status: "updated", uid, url, version, operation, panelCount, affectedPanel?: { id, title }, changedFields?: [...], queryValidation?: { validated, results, datasourceUid?, skippedReason? } }`.
**Returns queryValidation**: For `add_panel` and `update_panel` (when targets change), PromQL queries are dry-run against Grafana. Each result: `{ refId, expr, valid: boolean, error?: string, sampleValue?: number }`. Panel is always saved — validation is informational. If `valid: false`, check the `error` field for PromQL syntax issues. If `skippedReason` is set, no datasource UID was found — include `datasource: { uid: "..." }` on targets to enable validation.
**Returns delete**: `{ status: "deleted", uid, title, message }`.
**Tip**: `targets` in update_panel replaces entirely — include all targets, not just changed ones. Include `datasource.uid` on targets for query validation feedback.

### `grafana_get_dashboard`
**When**: Need to inspect a dashboard's panels — find panel IDs for sharing, verify structure, scan multiple dashboards for an overview, or audit which panels are returning data.
**Params**: `uid` (required). Optional: `compact` (boolean, default `false`) — return panel titles and types only, no queries or metadata (~70% smaller). `audit` (boolean, default `false`) — dry-run each panel's query and add `health` status.
**Example (full)**: `{ "uid": "abc123" }`
**Example (compact overview)**: `{ "uid": "abc123", "compact": true }`
**Example (audit)**: `{ "uid": "abc123", "audit": true }`
**Returns (full)**: `{ uid, title, description?, url, tags, time?, refresh?, panelCount, panels: [{ id, title, type, queries: [{ refId, expr }] }], folderUid, created?, updated? }`.
**Returns (compact)**: `{ uid, title, url, tags, panelCount, panels: [{ id, title, type }] }`.
**Returns (audit)**: Same as full, plus each panel gets `health: { status: "ok"|"nodata"|"error"|"skipped", error?, sampleValue? }` and the response includes `auditSummary: { ok, nodata, error, skipped }`. Resolves template variable datasources (`$prometheus`, `$loki`) and replaces expression template vars with wildcards.
**Tip**: Use `audit: true` when the user asks "which panels are broken?" or "audit my dashboard" — it replaces N separate `grafana_query` calls with one tool call. Use `compact: true` for lightweight overview scans. Omit both when you need query details (before update or share).

### `grafana_search`
**When**: User mentions a dashboard by name, before creating one (check duplicates), or for reporting/audit workflows.
**Params**: `query` (required). Optional: `tags` (array — filter by tags), `starred` (boolean — only starred), `sort` (`"alpha-asc"`/`"alpha-desc"`), `limit` (number, default 100), `enrich` (boolean — add `updatedAt` + `panelCount` per result, default false).
**Example**: `{ "query": "cost" }`
**Example with tags**: `{ "query": "", "tags": ["production"] }`
**Example starred**: `{ "query": "", "starred": true, "limit": 10 }`
**Example enriched**: `{ "query": "", "enrich": true }`
**Returns**: `{ count, enriched, dashboards: [{ uid, title, url, tags, folderTitle?, folderUid?, updatedAt?, panelCount? }] }`. `folderTitle`/`folderUid` always included when dashboard is in a folder. `updatedAt` (ISO 8601) and `panelCount` only present when `enrich: true` — enables staleness detection and reporting without per-dashboard `get_dashboard` calls.
**Tip**: Use `enrich: true` for reporting workflows ("which dashboards are stale?", "give me a summary of all dashboards"). Skip enrichment for simple lookups. After finding a dashboard, use `grafana_get_dashboard` to inspect panels, `grafana_share_dashboard` to render a chart, or `grafana_update_dashboard` to modify it.

### `grafana_share_dashboard`
**When**: User says "show me" or "send me" a chart/dashboard.
**Params**: `dashboardUid`, `panelId` (required). Optional: `from` (default `"now-6h"`), `to` (default `"now"`), `width` (default `1000`), `height` (default `500`), `theme` (`"light"`/`"dark"`, default `"dark"`).
**Example**: `{ "dashboardUid": "abc123", "panelId": 2, "from": "now-6h", "to": "now" }`
**Returns**: Image rendered inline (tier 1), or snapshot URL (tier 2), or deep link (tier 3). Always delivers something. Includes `deliveryTier` (`"image"` | `"snapshot"` | `"link"`), `rendererAvailable` (boolean — false when Image Renderer plugin is missing), `renderFailureReason` (why image rendering failed), and `remediation` (how to fix it). Tier 3 also includes `snapshotFailureReason`.
**Tip**: Use `grafana_get_dashboard` first to find panel IDs. If `rendererAvailable` is false, tell the user to install the grafana-image-renderer plugin.

### `grafana_create_alert`
**When**: User wants notifications when a metric crosses a threshold.
**Params**: `title`, `datasourceUid`, `expr` (PromQL), `threshold` (all required). Optional: `evaluation` (`"instant"`/`"rate"`/`"increase"`, default `"instant"`), `evaluationWindow` (default `"5m"`, used with `rate`/`increase`), `condition` (`gt`/`lt`/`gte`/`lte`, default `gt`), `for` (duration, default `5m`), `folderUid`, `labels` (e.g., `{ "severity": "warning" }`), `annotations` (e.g., `{ "summary": "Cost too high" }`), `noDataState` (`NoData`/`Alerting`/`OK`, default `NoData`).
**IMPORTANT**: For counter metrics (`*_total`), always use `evaluation: "rate"` (per-second rate) or `evaluation: "increase"` (total change over window). Raw counter values always increase and will immediately breach any threshold. Use `"instant"` (default) only for gauges.
**Example gauge alert**: `{ "title": "High Cost Alert", "datasourceUid": "prom1", "expr": "openclaw_lens_daily_cost_usd", "threshold": 5, "condition": "gt" }`
**Example rate alert**: `{ "title": "High Error Rate", "datasourceUid": "prom1", "expr": "openclaw_lens_webhook_error_total", "threshold": 0.1, "evaluation": "rate" }`
**Example increase alert**: `{ "title": "Token Burst", "datasourceUid": "prom1", "expr": "openclaw_lens_tokens_total", "threshold": 10000, "evaluation": "increase", "evaluationWindow": "1h" }`
**Returns**: `{ uid, title, status: "created", datasourceUid, url, evaluation?: { mode, window, evaluatedExpr }, metricValidation: { valid, error?, sampleValue? }, message }`. The `datasourceUid` echoes back which datasource the rule targets (verify correctness). `metricValidation` dry-runs the expression before creation — `valid: true` + `sampleValue` confirms data exists; `valid: false` + `error` warns of typos/missing metrics. Alert is always created regardless (metric may not have data yet). When `evaluation` is `"rate"` or `"increase"`, validation runs the wrapped expression.
**Note**: Auto-creates a "Grafana Lens Alerts" folder if no `folderUid` is specified.

### `grafana_annotate`
**When**: User deploys, changes config, or wants to mark an event for correlation.
**Params**: `action` (`"create"` default, or `"list"`).
**Create params**: `text` (required), `tags`, `dashboardUid`, `panelId`, `time` (epoch ms or relative like `"now-2h"`, default now), `timeEnd` (epoch ms or relative).
**List params**: `from`, `to` (epoch ms or relative like `"now-7d"`, `"now-24h"`, `"now"`), `tags`, `limit` (default `20`).
**Time formats**: All time params accept epoch ms (e.g., `1700000000000`) OR Grafana-style relative strings (`"now"`, `"now-1h"`, `"now-7d"`, `"now-30m"`). Prefer relative strings — they're simpler and avoid arithmetic errors.
**Example create**: `{ "text": "Deployed v2.1.0", "tags": ["deploy", "production"] }`
**Example create past**: `{ "text": "Incident started", "time": "now-2h", "timeEnd": "now-30m", "tags": ["incident"] }`
**Example list recent**: `{ "action": "list", "from": "now-7d", "to": "now", "tags": ["deploy"] }`
**Example list**: `{ "action": "list", "tags": ["deploy"], "limit": 10 }`
**Returns create**: `{ status: "created", id, message, time, comparisonHint: { beforeWindow: { from, to }, afterWindow: { from, to }, suggestion } }`. The `comparisonHint` provides ready-to-use ISO 8601 time ranges (30-min windows) for before/after comparison via `grafana_query` — no manual time math needed. For region annotations (with `timeEnd`), `afterWindow` starts at `timeEnd`.
**Returns list**: `{ annotations: [{ id, text, tags, time, timeEnd?, dashboardUID?, panelId? }] }`.

### `grafana_check_alerts`
**When**: Prompt context shows "GRAFANA ALERTS", need to manage alert rules (list/delete), set up the alert webhook, silence alerts during investigation, or acknowledge an investigated alert.
**Params**: `action` (`"list"` default, `"acknowledge"`, `"list_rules"`, `"delete_rule"`, `"silence"`, `"unsilence"`, `"setup"`).
**List params**: None — returns all pending (unacknowledged) alerts. Instances capped at 5 per alert.
**Acknowledge params**: `alertId` (required) — marks an alert as investigated.
**List rules params**: `compact` (boolean, default false — returns only uid/title/state/condition). Full mode returns all configured alert rules from Grafana with UID, title, condition (PromQL), folder, labels, annotations, AND live evaluation state (normal/firing/pending/nodata/error), health, and lastEvaluation. One call gives the complete alert health picture.
**Delete rule params**: `ruleUid` (required) — permanently deletes an alert rule. Get UIDs from `list_rules`.
**Silence params**: `matchers` (required — array of `{ name, value, isRegex? }` from alert's `commonLabels`), `duration` (default `"2h"`), `comment` (optional).
**Unsilence params**: `silenceId` (required) — removes a silence so alerts resume notifying.
**Setup params**: `webhookUrl` (optional, auto-detected) — creates webhook contact point and notification policy route in Grafana.
**Example list**: `{}`
**Example acknowledge**: `{ "action": "acknowledge", "alertId": "alert-1" }`
**Example list rules**: `{ "action": "list_rules" }`
**Example list rules compact**: `{ "action": "list_rules", "compact": true }`
**Example delete rule**: `{ "action": "delete_rule", "ruleUid": "abc123-def456" }`
**Example silence**: `{ "action": "silence", "matchers": [{ "name": "alertname", "value": "HighCost" }], "duration": "2h", "comment": "Investigating cost spike" }`
**Example unsilence**: `{ "action": "unsilence", "silenceId": "silence-uuid-123" }`
**Example setup**: `{ "action": "setup" }`
**Returns list**: `{ status: "success", alertCount, alerts: [{ id, status, title, message, receivedAt, commonLabels, totalInstances, truncated?, suggestedInvestigation?: { datasourceUid, condition, tool, queryLanguage, hint }, instances: [{ status, labels, annotations, startsAt, values }] }] }`. `suggestedInvestigation` is auto-enriched by matching the alert to its rule — provides the PromQL/LogQL expression, datasource, and tool to use for immediate investigation (eliminates the need for separate `list_rules` + `explore_datasources` calls).
**Returns acknowledge**: `{ status: "acknowledged", alertId }`.
**Returns list_rules**: `{ status: "success", ruleCount, rules: [{ uid, title, folder, ruleGroup, state, health, lastEvaluation, for, labels, annotations, condition, updated }] }`. `state` is the live evaluation state: `"normal"` (not firing), `"firing"`, `"pending"` (within for duration), `"nodata"`, or `"error"`. Falls back to `"unknown"` if the eval state API is unavailable. `health` is `"ok"`, `"nodata"`, `"error"`, or `"unknown"`. `condition` is the extracted PromQL expression from the rule's data queries.
**Returns list_rules (compact)**: `{ status: "success", ruleCount, rules: [{ uid, title, state, condition }] }`. Minimal fields for multi-tool chains — use when you need a quick overview of all rules without details.
**Returns delete_rule**: `{ status: "deleted", ruleUid, message }`.
**Returns silence**: `{ status: "silenced", silenceId, duration, matchers, message }`.
**Returns unsilence**: `{ status: "unsilenced", silenceId, message }`.
**Returns setup**: `{ status: "created", contactPointUid, webhookUrl }` or `{ status: "already_exists", contactPointUid }`.
**Note**: Setup is idempotent — safe to call multiple times. Only alerts with `managed_by=openclaw` label route to the webhook (auto-added by `grafana_create_alert`). Use `list_rules` → `delete_rule` for full alert lifecycle management (create via `grafana_create_alert`, list/delete via `grafana_check_alerts`).

### `grafana_push_metrics`
**When**: User wants to track custom data (calendar events, git commits, fitness stats, financial data) in Grafana.
**Params**: `action` (`"push"` default, `"register"`, `"list"`, `"delete"`).
**Push params**: `metrics` (required array) — each: `{ name, value, labels?, type?, help?, timestamp? }`. Names auto-get `openclaw_ext_` prefix. `timestamp` is optional ISO 8601 for historical data (gauge only).
**Register params**: `name` (required), `type` (`"gauge"`/`"counter"`, default `"gauge"`), `help`, `labelNames` (array), `ttlDays`.
**List params**: None — returns all custom metric definitions.
**Delete params**: `name` (required) — removes a custom metric.
**Example push**: `{ "metrics": [{ "name": "steps_today", "value": 8000 }, { "name": "meetings", "value": 3, "labels": { "type": "standup" } }] }`
**Example backfill**: `{ "metrics": [{ "name": "steps", "value": 8000, "timestamp": "2025-01-15" }, { "name": "steps", "value": 10500, "timestamp": "2025-01-16" }] }`
**Example mixed**: `{ "metrics": [{ "name": "steps", "value": 9000, "timestamp": "2025-01-17" }, { "name": "heart_rate", "value": 72 }] }`
**Example register**: `{ "action": "register", "name": "weight_kg", "type": "gauge", "help": "Body weight", "labelNames": ["person"], "ttlDays": 90 }`
**Example list**: `{ "action": "list" }`
**Example delete**: `{ "action": "delete", "name": "old_metric" }`
**Returns push**: `{ status: "ok", accepted: 2, queryNames: { "openclaw_ext_steps": "openclaw_ext_steps", "openclaw_ext_events": "openclaw_ext_events_total" }, suggestedWorkflow: [{ tool, action, example }], message: "..." }`. `suggestedWorkflow` contains concrete next-step examples using the actual pushed metric names — verify (grafana_query), visualize (grafana_create_dashboard with metric-explorer template), and alert (grafana_create_alert, single-metric only). Partial success supported. Timestamped and real-time points in the same batch are both accepted.
**Returns register**: `{ status: "registered", metric: { name, type, help, labelNames, ttlMs }, queryName: "openclaw_ext_events_total", suggestedWorkflow: [{ tool, action, example }] }`. `suggestedWorkflow` shows how to push data and query the registered metric (with `rate()` wrapping for counters).
**Returns list**: `{ count, metrics: [{ name, type, queryName, help, labelNames, createdAt, updatedAt }] }`.
**Returns delete**: `{ status: "deleted", name }`.
**Note**: Push auto-registers unknown metrics. Response includes `queryNames` with exact PromQL names and `suggestedWorkflow` with concrete next steps. Follow `suggestedWorkflow` to complete the push→visualize pipeline. Timestamped pushes are gauge-only — counters with timestamps are rejected. See [external-data.md](references/external-data.md) for naming conventions and backfill patterns.

### `grafana_explain_metric`
**When**: User asks "what does this metric mean?", "why did it spike?", "is this normal?", or "show me the trend".
**Params**: `datasourceUid` (required), `expr` (PromQL or plain metric name, required), `period` (`24h`/`7d`/`30d`, default `24h`), `compareWith` (`"previous"` — compare current period with the same-length window immediately before it).
**Example**: `{ "datasourceUid": "prom1", "expr": "openclaw_lens_daily_cost_usd" }`
**Example counter**: `{ "datasourceUid": "prom1", "expr": "openclaw_lens_tokens_total" }`
**Example 7d**: `{ "datasourceUid": "prom1", "expr": "openclaw_lens_daily_cost_usd", "period": "7d" }`
**Example comparison**: `{ "datasourceUid": "prom1", "expr": "openclaw_lens_daily_cost_usd", "period": "7d", "compareWith": "previous" }`
**Example PromQL**: `{ "datasourceUid": "prom1", "expr": "rate(http_requests_total[5m])", "period": "24h" }`
**Returns**: `{ metricType?, trendQuery?, current: { value, timestamp }, healthContext?: { status, thresholds, description, direction }, trend: { changePercent, direction, first, last }, stats: { min, max, avg, samples }, comparison?: { previousPeriod: { from, to, avg, min, max, samples }, change: { absolute, percentage, direction } }, metadata: { type, help, unit }, suggestedQueries?: [{ query, description }], suggestedBreakdowns?: string[] }`. Sections omitted when data unavailable. `changePercent` is `null` when first value is zero. `healthContext` is included for well-known `openclaw_lens_*` gauge metrics — same as `grafana_query`.
**Counter-aware**: Auto-detects counter metrics (via metadata `type` or `_total` suffix) and wraps the trend query in `rate(expr[5m])`. The `current` value stays raw (cumulative total), but `trend` and `stats` show rate of change. `metricType` field tells you the detected type (counter/gauge/histogram). `trendQuery` shows the actual PromQL used for trend (only present when different from `expr`).
**Drill-down**: For multi-dimensional metrics (metrics with labels like `model`, `token_type`, `provider`), the response includes `suggestedQueries` — ready-to-use PromQL queries for `grafana_query` that break down the metric by each label. Counter metrics get `rate()` wrapping automatically. Use these to investigate cost attribution, identify top contributors, or decompose aggregates.
**Breakdowns**: `suggestedBreakdowns` provides label names for decomposition — always available for known OpenClaw metrics (cost, session, queue, webhook families) even when the metric has no data yet. For unknown metrics, falls back to labels discovered from the instant query. Use these labels with `grafana_query` to build `sum by (label) (...)` queries for root-cause analysis.
**Period comparison**: Use `compareWith: "previous"` for period-over-period analysis (e.g., this week vs. last week). Returns a `comparison` object with the previous period's stats and the change (absolute, percentage, direction). Works with counters too (compares rates). Eliminates the need for manual multi-query workflows.
**Tip**: For simple trend context, call with just `period`. For "did things improve?" questions, add `compareWith: "previous"`. Metadata only available for plain metric names (not complex PromQL). No need to manually wrap counters in `rate()` — the tool does it automatically.

### `grafana_security_check`
**When**: User asks "am I being attacked?", "security status", "security audit", "security check", or wants a comprehensive threat assessment.
**Params**: `lookback` (time window, default `"1h"`. Use `"24h"` for daily review, `"7d"` for weekly).
**Example**: `{}`
**Example weekly review**: `{ "lookback": "7d" }`
**Returns**: `{ overallThreatLevel: "green"|"yellow"|"red", summary, checks: [{ name, status, value, detail }], securityEventLogs, limitations: [...], suggestedActions: [...], dashboardTemplate: "security-overview" }`.
**Checks** (6 parallel, via Promise.allSettled — partial failures return partial results):
1. `webhook_error_ratio` — webhook error rate vs received rate. Warning 20%, critical 50%.
2. `cost_anomaly` — daily cost. Warning $10, critical $50.
3. `tool_loops` — active tool loops. Warning 1, critical 3.
4. `injection_signals` — prompt injection patterns detected. Warning 1, critical 5.
5. `session_enumeration` — unique sessions in 1h. Warning 50, critical 200.
6. `stuck_sessions` — stuck sessions. Warning 1, critical 3.
**Limitations**: Auth failures are NOT observable — openclaw's auth middleware emits no diagnostic events. The tool monitors observable signals (webhook errors, cost spikes, prompt injection patterns, session anomalies) but cannot detect silent auth-layer attacks (bad tokens, brute-force, rate-limiter lockouts). Always includes limitations in response.
**Auto-discovery**: No datasource UID needed — auto-discovers first Prometheus datasource. Loki is optional (skips log check if absent).
**Follow-up**: Use `dashboardTemplate` to create a persistent security dashboard. Use `suggestedActions` for investigation steps.

### `grafana_investigate`
**When**: First step for "investigate this", "what's wrong?", "triage", "root cause analysis", "debug this issue", or any multi-signal investigation.
**Params**: `focus` (required — alert UID, metric name, or symptom text), `timeWindow` (optional, default `"1h"`, options: `"1h"`, `"6h"`, `"24h"`), `service` (optional, default `"openclaw"`).
**Example**: `{ "focus": "high error rate" }`
**Example with alert**: `{ "focus": "alert-abc", "timeWindow": "6h" }`
**Example with metric**: `{ "focus": "openclaw_lens_daily_cost_usd", "timeWindow": "24h" }`
**Returns**: `{ timeWindow, focus, metricSignals: { focus: { current, trend, anomalyScore, anomalySeverity }, red: { rate, errorRate, p95Latency } }, logSignals: { totalVolume, errorCount, bySeverity, topPatterns, sampleErrors }, traceSignals: { errorTraces, slowTraces }, contextSignals: { recentAnnotations, alertsActive }, suggestedHypotheses: [{ hypothesis, evidence, confidence, testWith: { tool, params } }], limitations }`.
**Auto-discovery**: No datasource UID needed — auto-discovers Prometheus, Loki, and Tempo datasources. Loki and Tempo are optional (graceful degradation with limitations noted).
**Follow-up**: Use `suggestedHypotheses[].testWith` for deep-dives with specific tools. Use `grafana_annotate` to mark findings. Use `grafana_check_alerts` to acknowledge investigated alerts.

### `alloy_pipeline`
**When**: Trigger whenever the user mentions data collection, metrics pipeline, log forwarding, trace collection, infrastructure monitoring, database monitoring, observability setup, scraping endpoints, collecting Docker/container logs, Kubernetes monitoring, syslog, or any mention of getting data into Grafana. Also trigger for managing existing pipelines (status, health, diagnostics). 26 recipes covering metrics (11), logs (8), traces (4), profiles (1), and infrastructure (3). This is the bridge between "I have a data source" and "I can see it in Grafana" — if the user has data somewhere and wants it in Grafana, this tool is the answer.
**Actions**: `create` (deploy pipeline from recipe), `list` (show managed pipelines), `update` (change params), `delete` (remove pipeline), `recipes` (browse catalog), `status` (check health), `diagnose` (Alloy connectivity + all pipelines).
**Params (create with recipe)**: `recipe` (required), `params` (recipe-specific), `name` (optional — auto-generated from recipe name with numeric suffix for uniqueness, e.g., `syslog`, `syslog-2`, `syslog-3`; always read the `name` field from the response to get the actual assigned name). Log recipes also accept optional processing params: `jsonExpressions` (object — JSON path extractions), `labelFields` (object — promote fields to labels), `structuredMetadata` (object — Loki structured metadata), `staticLabels` (object), `timestampSource` (string), `outputSource` (string), `regexExpression` (string).
**Params (create with raw config)**: `config` (raw Alloy River syntax), `name` (required), `signal` (optional — `"metrics"`, `"logs"`, `"traces"`, or `"profiles"`, auto-detected from component IDs if omitted), `sampleQueries` (optional — object with `metrics`/`logs`/`traces` keys containing ready-to-use queries). Response includes `exportTargets` showing where data is being sent.
**Params (update)**: `name` (required), `params` (fields to change — merges with existing). Raw-config pipelines support update via full config replacement — pass `config` param with the complete new config. Recipe-based pipelines use `params` for partial updates (merges with existing). Response includes `sampleQueries` and `suggestedWorkflow` for chaining into query/dashboard tools.
**Params (status)**: `name` (required).
**Params (recipes)**: `category` (optional filter).
**Example create (scrape)**: `{ "recipe": "scrape-endpoint", "params": { "url": "http://myapp:8080/metrics" } }`
**Example create (database)**: `{ "recipe": "postgres-exporter", "params": { "connectionString": "postgres://user:pass@db:5432/mydb" }, "name": "analytics-db" }`
**Example create (logs)**: `{ "recipe": "docker-logs", "params": { "containerNames": ["myapp", "nginx"] } }`
**Example create (files)**: `{ "recipe": "file-logs", "params": { "paths": ["/var/log/app/*.log"] } }`
**Example create (K8s)**: `{ "recipe": "kubernetes-pods", "params": { "namespaces": ["production"] } }`
**Example create (OTLP)**: `{ "recipe": "otlp-receiver" }`
**Example create (traces with sampling)**: `{ "recipe": "application-traces", "params": { "environment": "staging", "sampleRate": 0.5 } }`
**Example create (syslog UDP)**: `{ "recipe": "syslog", "params": { "protocol": "udp", "listenAddress": "0.0.0.0:5514" } }`
**Example create (raw config)**: `{ "config": "prometheus.scrape ...", "name": "custom-pipeline", "signal": "metrics" }`
**Example recipes**: `{ "action": "recipes", "category": "logs" }`
**Example status**: `{ "action": "status", "name": "analytics-db" }`
**Example diagnose**: `{ "action": "diagnose" }`
**Returns (create)**: `{ status, name, recipe, signal, configFile, reloaded, envVarsRequired, warnings, sampleQueries: { metrics|logs|traces: {...} }, suggestedWorkflow }`. `status` is `"created"` (active) or `"pending_credentials"` (env vars not set yet). `sampleQueries` are ready-to-use PromQL/LogQL/TraceQL — pass directly to query tools. `envVarsRequired` lists env vars for credential-bearing recipes. `warnings` contains advisory notes about unrecognized params. Raw-config pipelines may also include `credentialWarnings` if plaintext credentials were detected.
**Returns (recipes)**: `{ categories, recipes: [{ name, category, signal, summary, requiredParams, optionalParams, hasCredentials, dashboardTemplate }] }`. Use `summary` to match user intent.
**Returns (status)**: `{ name, status, components: [{ id, health, detail }], dataVerification: { verifyQuery, tool }, remediation }`. Pending pipelines auto-promote to `active` when components become healthy.
**Returns (diagnose)**: `{ alloyConnectivity, pipelineHealth, driftDetected, orphanFiles, limits }`.
**Tip (credential lifecycle)**: Credential recipes (postgres-exporter, mysql-exporter, mongodb-exporter, redis-exporter, scrape-endpoint with auth, kafka-logs) may return `pending_credentials` status when env vars aren't set. Config stays on disk — tell user the exact env var names from `envVarsRequired`, then verify with action `status`. Pipeline auto-activates once Alloy can connect.
**Tip (chaining)**: `sampleQueries` → `grafana_query`/`grafana_query_logs`. `dashboardTemplate` → `grafana_create_dashboard`. Zero-friction pipeline→query→dashboard→alert chains.
**Tip (multi-pipeline)**: For K8s: `kubernetes-pods` + `kubernetes-logs` + `otlp-receiver` (3 separate creates).
**Tip (raw config escape hatch)**: Use `config` param for patterns not covered by recipes. Signal is auto-detected from component IDs (or set `signal` explicitly). Add `sampleQueries` so chaining tools know what to query. Component IDs are auto-extracted for health checking. Supports update via full config replacement. See `references/alloy-components.md` for copy-pasteable component snippets.
**Tip (port conflicts)**: Trace recipes (otlp-receiver, application-traces, span-metrics, service-graph) default to ports 4317/4318 — creating multiple on the same ports will fail. Set `grpcPort`/`httpPort` params to different values for each.
**Tip (cross-pipeline isolation)**: Pipelines are self-contained — each .alloy file has no cross-file references. For combined patterns (e.g., OTLP receiver + span-metrics + service-graph in one config), use raw `config` param with the full combined pipeline. You cannot wire one pipeline's output into another pipeline.

### Security Monitoring

**Honest limitations**: OpenClaw's auth middleware does not emit diagnostic events for failed authentication attempts (bad tokens, brute-force, rate-limiter lockouts). Gateway-level auth failures are invisible to Grafana Lens. The security-check tool monitors observable signals (webhook errors, cost spikes, prompt injection patterns, session anomalies) but cannot detect silent auth-layer attacks.

**Security PromQL queries**:
- Webhook error spike: `rate(openclaw_lens_webhook_error_total[5m]) > 0.5`
- Webhook error ratio: `rate(openclaw_lens_webhook_error_total[5m]) / (rate(openclaw_lens_webhook_received_total[5m]) + 0.001) > 0.3`
- Active tool loops: `sum(openclaw_lens_tool_loops_active) > 0`
- Prompt injection signals: `sum(increase(openclaw_lens_prompt_injection_signals_total[1h])) > 0`
- Cost spike: `openclaw_lens_daily_cost_usd > 10`
- Unique session anomaly: `openclaw_lens_unique_sessions_1h > 50`
- Gateway restarts: `sum(increase(openclaw_lens_gateway_restarts_total[24h])) > 3`
- Tool error classification: `sum by (error_class) (rate(openclaw_lens_tool_error_classes_total[5m]))`

**Security LogQL queries**:
- Security events: `{service_name="openclaw"} | json | event_name=~"prompt_injection.detected|gateway.start|gateway.stop"`
- Tool loops: `{service_name="openclaw"} | json | component="diagnostic" | event_name="tool.loop"`
- Session resets: `{service_name="openclaw"} | json | event_name="session.reset"`

## Composed Workflow Examples

**"How's my agent doing?"**
1. `grafana_query` with `sum(increase(openclaw_lens_cost_by_model_total[1d])) or vector(0)` and `openclaw_lens_tokens_total` for quick numbers
2. Or `grafana_create_dashboard` with `llm-command-center` template + `grafana_share_dashboard` for a visual

**"Set up monitoring for my agent" / "What dashboards should I create?"**
1. `grafana_search` query: "LLM Command Center" — check for existing agent dashboards
2. `grafana_create_dashboard` template: "llm-command-center" — single-pane health: cost, sessions, errors, latency, cache, live feeds
3. Follow `suggestedNext` from the response — it chains you through the remaining templates (session-explorer → cost-intelligence → tool-performance → sre-operations → genai-observability)
4. Share dashboard URLs with user
5. Optional: `grafana_share_dashboard` to show a key panel inline

The 6 AI observability templates provide comprehensive OpenClaw monitoring with Loki log-to-trace correlation:
- **llm-command-center**: Single-pane health view (daily cost, sessions, error rate, P95 latency, cache hit rate, token flow, live error/session feeds, per-model latency, message type breakdown)
- **session-explorer**: Per-session deep-dive (trace hierarchy, LLM calls, tool calls, conversation flow, cost/token breakdown, TraceQL session spans, latency stats). THE killer feature.
- **cost-intelligence**: Financial analysis (daily/weekly/monthly trends, model/provider/token attribution pie charts, cache savings, per-session cost table, projected monthly cost, token efficiency)
- **tool-performance**: Tool reliability (leaderboard bar gauges, p95 latency ranking, error rates, tool traces, tool calls by model, error traces, duration heatmap)
- **sre-operations**: Infrastructure health (queue depth, webhooks, stuck sessions, tool loops, gen_ai error rate, context window pressure)
- **genai-observability**: Industry-standard AI monitoring using OTel gen_ai semantic conventions — token analytics, LLM latency heatmap, trace explorer, log intelligence, cache efficiency. Works with any gen_ai data, not just OpenClaw.

**"Find error logs in the last hour"**
1. `grafana_explore_datasources` — find Loki datasource UID
2. `grafana_query_logs` with `{job="api"} |= "error"`

**"Why did my service crash?"**
1. `grafana_explore_datasources` — find Loki UID
2. `grafana_query_logs` with `{job="myservice"} |~ "fatal|panic|crash"`
3. `grafana_query` — check related metrics around the same time window

**"Investigate alert: correlate metrics with logs"**
1. `grafana_check_alerts` — get alert details with `suggestedInvestigation` (includes ready-to-use query, datasource, and tool)
2. `grafana_query` / `grafana_query_logs` — use `suggestedInvestigation.tool` with `suggestedInvestigation.condition` and `suggestedInvestigation.datasourceUid`
3. `grafana_query_logs` — search logs around alert time for root cause (if PromQL alert)
4. `grafana_annotate` — mark findings on dashboard
5. `grafana_check_alerts` with `action: "acknowledge"`

**"What data do I have in Grafana?"**
1. `grafana_explore_datasources` — find Prometheus datasources
2. `grafana_list_metrics` with `datasourceUid` and `metadata: true` — list available metrics grouped by category
3. Summarize what's available using `categorySummary` for quick overview (e.g., "5 cost metrics, 8 session metrics, 3 queue metrics")

**"My agent feels slow — what performance metrics can I look at?"**
1. `grafana_explore_datasources` — get Prometheus UID
2. `grafana_list_metrics` with `purpose: "performance"`, `metadata: true` — returns only session + tool metrics (latency, duration, tool calls)
3. `grafana_explain_metric` on key metrics (e.g., `openclaw_lens_session_latency_avg_ms`, `openclaw_lens_tool_duration_ms`) — get current values and trends

**"Alert me if costs exceed $10/day"**
1. `grafana_explore_datasources` — get Prometheus UID
2. `grafana_create_alert` with `expr: "openclaw_lens_daily_cost_usd"`, `threshold: 10`, `condition: "gt"`

**"Query error rate and alert if it's above 1%"** (query→alert chaining)
1. `grafana_query` with `expr: "rate(http_errors_total[5m])"` — response includes `datasourceUid`
2. `grafana_create_alert` with `datasourceUid` from the query response, same `expr`, `threshold: 0.01`, `condition: "gt"` — no extra datasource discovery needed

**"Show me my cost trends"**
1. `grafana_search` with `query: "cost"` — check for existing dashboard
2. If not found: `grafana_create_dashboard` with `cost-intelligence` template
3. `grafana_get_dashboard` — find the right panel ID
4. `grafana_share_dashboard` — deliver the chart inline

**"Re-run the latency panel from my Command Center for the last 7 days"**
1. `grafana_search` with `query: "Command Center"` — get the dashboard UID
2. `grafana_get_dashboard` with `uid` — find the latency panel ID
3. `grafana_query` with `dashboardUid`, `panelId`, `queryType: "range"`, `start: "now-7d"` — auto-resolves PromQL and datasource from the panel

**"Mark that I deployed version 2"**
1. `grafana_annotate` with `text: "Deployed v2"`, `tags: ["deploy"]`

**"I deployed v2.3.0 — compare error rates before and after"**
1. `grafana_annotate` with `text: "Deployed v2.3.0"`, `tags: ["deploy"]` — response includes `comparisonHint` with before/after time windows
2. `grafana_query` with `comparisonHint.beforeWindow` time range — get pre-deploy error rate
3. `grafana_query` with `comparisonHint.afterWindow` time range — get post-deploy error rate
4. Compare and report the difference to the user

**"What happened around 3pm?"**
1. `grafana_annotate` with `action: "list"`, `from`/`to` relative times (e.g., `"now-6h"`) or epoch ms, or specific `tags`

**"Monitor my server's health"**
1. `grafana_explore_datasources` — get Prometheus UID
2. `grafana_search` with `query: "node"` — check for existing
3. `grafana_create_dashboard` with `template: "node-exporter"` — user picks instance from dropdown

**"Show me my e-commerce metrics"**
1. `grafana_explore_datasources` — get Prometheus UID
2. `grafana_list_metrics` with `prefix: "orders_"` — discover available metrics
3. `grafana_create_dashboard` with `template: "multi-kpi"` — user picks 4 business metrics from dropdowns

**"I want to explore my fitness data"**
1. `grafana_explore_datasources` — get Prometheus UID
2. `grafana_create_dashboard` with `template: "metric-explorer"` — user browses all `fitness_*` metrics from dropdown

**"Set up HTTP API monitoring with alerts"**
1. `grafana_explore_datasources` — get Prometheus UID
2. `grafana_create_dashboard` with `template: "http-service"`
3. `grafana_create_alert` with `expr: "http_requests_total{code=~\"5..\"}"`, `threshold: 0.1`, `evaluation: "rate"` — the tool wraps in `rate()` automatically

**"Build a custom Redis dashboard"**
1. `grafana_explore_datasources` — get Prometheus UID
2. `grafana_list_metrics` with `metadata: true`, `prefix: "redis_"` — learn metric types
3. Compose custom `dashboard` JSON using panel snippets from the composition guide
4. `grafana_create_dashboard` with the custom JSON — check `validation.panelsError` for broken queries, fix any errors with `grafana_update_dashboard`

**"Set up full alert monitoring loop"**
1. `grafana_explore_datasources` — get Prometheus UID
2. `grafana_check_alerts` with `action: "setup"` — create webhook contact point
3. `grafana_create_alert` with PromQL condition — rule auto-routes to webhook via `managed_by=openclaw` label
4. When alert fires, agent sees it in prompt context and can investigate

**"List all my alert rules and delete one"**
1. `grafana_check_alerts` with `action: "list_rules"` — get all configured rules with UIDs, titles, conditions, AND live eval state (normal/firing/nodata/error)
2. Identify the target rule by title, condition, or state
3. `grafana_check_alerts` with `action: "delete_rule"`, `ruleUid` — permanently remove the rule

**"Investigate a firing alert"** (triggered by "GRAFANA ALERTS" in prompt context)
1. `grafana_investigate` with `focus` = alert UID or symptom, `timeWindow: "1h"` — parallel multi-signal triage (metrics, logs, traces, context)
2. `grafana_check_alerts` with `action: "silence"`, `matchers` from alert's `commonLabels` — prevent repeat notifications
3. Follow `suggestedHypotheses[].testWith` from investigate response — deep-dive with specific tools
4. `grafana_query_logs` — search logs around alert time for root cause (statistics first: `count_over_time`, then samples)
5. `grafana_query_traces` — inspect error/slow traces for span-level detail
6. `grafana_annotate` — mark findings on dashboard
7. `grafana_check_alerts` with `action: "acknowledge"` — mark as investigated
8. `grafana_check_alerts` with `action: "unsilence"`, `silenceId` — restore notifications
9. Report findings using Evidence Presentation Format (see sre-investigation.md §10)

**"Is this metric anomalous?" / "Statistical assessment"**
1. `grafana_explore_datasources` — get Prometheus UID
2. `grafana_explain_metric` with `period: "24h"` — response includes `anomaly` (z-score, severity) and `seasonality` (vs 1d/7d ago)
3. If `anomaly.severity` >= "significant": `grafana_query` with z-score PromQL for time series view
4. `grafana_query` with `predict_linear(METRIC[6h], 3600)` — where is it heading?
5. Interpret: report σ-level, compare with yesterday/last week, forecast trajectory

**"Alert fatigue analysis — which alerts are noisy?"**
1. `grafana_check_alerts` with `action: "analyze"` — fatigue report with always-firing, flapping, and healthy classifications
2. For always-firing rules: suggest raising thresholds or adding `for:` duration
3. For flapping rules: suggest adding hysteresis or widening threshold bands
4. `grafana_check_alerts` with `action: "silence"` for rules being tuned

**"Generate a postmortem for the last incident"**
1. `grafana_investigate` with `focus` = incident symptom, `timeWindow: "6h"` — gather all evidence
2. `grafana_check_alerts` (list) — find resolved alerts with timeline
3. `grafana_annotate` (list, tags: ["investigation"]) — build event timeline
4. `grafana_explain_metric` for key metrics — trend/stats for incident period
5. `grafana_query_logs` — error patterns (statistics first)
6. `grafana_query_traces` — error traces for root cause evidence
7. Synthesize into blameless postmortem (see sre-investigation.md §9)

**"Add an error rate panel to my API dashboard"**
1. `grafana_search` with `query: "API"` — find the dashboard
2. `grafana_get_dashboard` — get current panels and structure
3. `grafana_update_dashboard` with `operation: "add_panel"`, `panel: { title: "Error Rate", type: "timeseries", targets: [...] }`

**"Remove the old latency panel and add a new one"**
1. `grafana_get_dashboard` — find panel IDs
2. `grafana_update_dashboard` with `operation: "remove_panel"`, `panelId: <old_id>`
3. `grafana_update_dashboard` with `operation: "add_panel"`, `panel: { ... }`

**"Change my dashboard time range to 7 days"**
1. `grafana_update_dashboard` with `operation: "update_metadata"`, `time: { "from": "now-7d", "to": "now" }`

**"Send my team the weekly dashboard"**
1. `grafana_search` with the dashboard name
2. `grafana_get_dashboard` — get panel IDs
3. `grafana_share_dashboard` for each key panel with `from: "now-7d"`

**"Show me my weekly work review"**
1. `grafana_push_metrics` — push work data (commits, hours, meetings)
2. `grafana_create_dashboard` with `template: "weekly-review"` — weekly trends + all custom metrics table
3. `grafana_share_dashboard` — deliver key panels inline

**"Track my daily fitness data"**
1. `grafana_push_metrics` — push fitness metrics (steps, weight, calories). Response has `queryNames` with exact PromQL names.
2. `grafana_create_dashboard` with `template: "weekly-review"` — weekly trends
3. `grafana_create_alert` using `queryNames` from step 1 for exact PromQL — e.g., `threshold: 5000`, `condition: "lt"`

**"Backfill my step count for the past week"**
1. `grafana_push_metrics` — push 7 data points with `timestamp` on each:
   `{ "metrics": [{ "name": "steps", "value": 8000, "timestamp": "2025-01-13" }, { "name": "steps", "value": 10500, "timestamp": "2025-01-14" }, ...] }`
2. `grafana_create_dashboard` with `template: "metric-explorer"` — visualize the trend
3. `grafana_share_dashboard` with `from: "now-7d"` — deliver the chart

**"What custom data have I pushed?"**
1. `grafana_push_metrics` with `action: "list"` — see all custom metric definitions with `queryName` per metric
2. `grafana_query` using `queryName` values — see current values
3. Or `grafana_list_metrics` with `search: "ext"` — server-side discovery from Prometheus

**"System health check — give me a full status report"**
1. `grafana_explore_datasources` — discover available datasources
2. `grafana_check_alerts` — check pending alerts + `action: "list_rules"` for configured rules
3. `grafana_search` with `query: ""` and `enrich: true` — find all dashboards with `updatedAt` and `panelCount` for staleness triage
4. `grafana_get_dashboard` with `audit: true` for stale or suspicious dashboards — check which panels have data vs broken
5. Summarize: datasources available, alerts firing, dashboard health (`auditSummary`), stale dashboards, broken panels

**"Audit my dashboard — which panels are broken?"**
1. `grafana_get_dashboard` with `audit: true` — dry-runs every panel's query
2. Review `auditSummary` for counts (`ok`, `nodata`, `error`, `skipped`)
3. For `error` panels, explain the issue; for `nodata` panels, suggest fixes (missing metrics, wrong datasource)

**"What is this metric doing?"**
1. `grafana_explore_datasources` — get Prometheus UID
2. `grafana_explain_metric` with metric name — get current value, trend, stats, metadata
3. Interpret results for the user — the tool provides data, you provide narrative

**"Compare costs this week vs. last week" / "Did the new model help?"**
1. `grafana_explore_datasources` — get Prometheus UID
2. `grafana_explain_metric` with `period: "7d"`, `compareWith: "previous"` — one call gives this week's stats + last week's stats + percentage change
3. Interpret the `comparison.change` object — "costs are down 25% week-over-week, the model switch is working"

**"Compare metric over different time scales"**
1. `grafana_explain_metric` with `period: "24h"` — recent trend
2. `grafana_explain_metric` with `period: "7d"` — weekly context
3. Compare and explain — "up 15% today but down 5% over the week"

**"Where are my costs going?" / "Why did my costs spike?"**
1. `grafana_explore_datasources` — get Prometheus UID
2. `grafana_explain_metric` with `openclaw_lens_cost_by_token_type` — get trend + `suggestedBreakdowns` (always: `["model", "token_type", "provider"]`) + `suggestedQueries`
3. Use `suggestedBreakdowns` to know which dimensions matter, or pick from `suggestedQueries` for ready-to-use PromQL
4. `grafana_query` with `sum by (model) (rate(openclaw_lens_cost_by_token_type[5m]))` — get concrete breakdown numbers
5. Explain to user: "Opus accounts for 80% of costs, mainly from output tokens"

**"Set up AI observability" / "Show me LLM traces and logs"**
1. `grafana_explore_datasources` — verify Prometheus, Loki, and Tempo datasources are available
2. `grafana_search` with `query: "LLM Command Center"` — check for existing
3. `grafana_create_dashboard` with `template: "llm-command-center"` — health overview with live error/session feeds
4. `grafana_create_dashboard` with `template: "session-explorer"` — per-session trace drill-down
5. `grafana_share_dashboard` — deliver an LLM latency or token usage panel inline

**"Why is my LLM slow?" (multi-signal investigation)**
1. `grafana_explain_metric` with `gen_ai_client_operation_duration_seconds` — check latency trend
2. `grafana_query_logs` with `{service_name="openclaw"} | json | component="lifecycle" |= "llm.output"`, `extractFields: true` — find slow LLM calls with `fields.duration_s` and `fields.model` surfaced directly
3. `grafana_query_traces` with `queryType: "get"` and `fields.trace_id` from step 2 — inspect spans; response includes `correlationHint.logQuery` for correlated logs
4. `grafana_query` with model-specific duration breakdown — identify which model is slow

**"Debug a slow or failing session"**
1. `grafana_query_traces` with `{ resource.service.name = "openclaw" && status = error }` or `minDuration: "10s"` — find problematic traces
2. `grafana_query_traces` with `queryType: "get"` and the trace ID — inspect span hierarchy (response includes `correlationHint`)
3. `grafana_query_logs` with the `correlationHint.logQuery` from step 2, `extractFields: true` — correlated logs
4. `grafana_query` with relevant PromQL — check metrics around the same time window
5. `grafana_annotate` with `text: "Debug: [findings]"`, `tags: ["debug"]` — mark investigation on dashboards

**"Which dashboards are stale?" / "Give me a dashboard summary report"**
1. `grafana_search` with `query: ""` and `enrich: true` — gets `updatedAt`, `panelCount`, `folderTitle` for every dashboard
2. Sort by `updatedAt` to identify stale dashboards (e.g., not updated in 30+ days)
3. Summarize: total dashboards, by folder, stale vs active, panel counts

**"Am I being attacked?" / "Security status" / "Security audit"**
1. `grafana_security_check` — runs 6 parallel security checks, returns threat level (green/yellow/red)
2. If non-green: follow `suggestedActions` from the response
3. If `dashboardTemplate` is present: `grafana_create_dashboard` with `template: "security-overview"` — persistent monitoring
4. `grafana_create_alert` using suggested thresholds for ongoing monitoring

**"Set up security monitoring"**
1. `grafana_check_alerts` with `action: "setup"` — create webhook contact point (if not already done)
2. `grafana_create_dashboard` with `template: "security-overview"` — 15-panel security dashboard
3. `grafana_create_alert` with `expr: "rate(openclaw_lens_webhook_error_total[5m]) / (rate(openclaw_lens_webhook_received_total[5m]) + 0.001)"`, `threshold: 0.3`, `condition: "gt"` — webhook error burst
4. `grafana_create_alert` with `expr: "openclaw_lens_daily_cost_usd"`, `threshold: 10`, `condition: "gt"` — cost spike
5. `grafana_create_alert` with `expr: "sum(increase(openclaw_lens_prompt_injection_signals_total[1h]))"`, `threshold: 3`, `condition: "gt"` — prompt injection signals

**"Investigate security alert"**
1. `grafana_security_check` — get current threat assessment
2. `grafana_query_logs` with `{service_name="openclaw"} | json | component="lifecycle" | event_name=~"prompt_injection.detected|gateway.start|gateway.stop"` — security event logs
3. `grafana_query` with specific PromQL for the affected check (e.g., `rate(openclaw_lens_tool_error_classes_total[5m])`)
4. `grafana_annotate` with `text: "Security investigation"`, `tags: ["security"]` — mark investigation on dashboards
5. `grafana_check_alerts` with `action: "acknowledge"` — mark alerts as investigated

**"Clean up old test dashboards"**
1. `grafana_search` with `query: "test"` or relevant tags and `enrich: true` — includes `updatedAt` for staleness check
2. Confirm with user which dashboards to delete
3. `grafana_update_dashboard` with `operation: "delete"` for each confirmed dashboard

**"Find my production dashboards"**
1. `grafana_search` with `tags: ["production"]` or `starred: true`

**"Why is the queue backing up?"**
1. `grafana_explore_datasources` — get Prometheus UID
2. `grafana_query` with `openclaw_lens_queue_depth` — current queue depth
3. `grafana_query` with `sum(rate(openclaw_message_queued_total[5m])) by (openclaw_source)` — message inflow by source
4. `grafana_query` with `histogram_quantile(0.95, rate(openclaw_queue_wait_ms_milliseconds_bucket[5m]))` — p95 queue wait time
5. `grafana_query` with `sum(rate(openclaw_queue_lane_enqueue_total[5m]))` vs `sum(rate(openclaw_queue_lane_dequeue_total[5m]))` — enqueue vs dequeue rate
6. `grafana_query` with `openclaw_lens_sessions_stuck` — check for stuck sessions blocking the queue
7. Diagnose: if inflow > drain → scale issue; if stuck sessions → investigate with `grafana_explain_metric`

## Dashboard Composition

For composing custom dashboards from discovered metrics — template selection, metric-to-panel mapping, JSON snippets, worked examples across domains — see [references/dashboard-composition.md](references/dashboard-composition.md).

## Data Collection Pipeline Workflows (Alloy)

**"Monitor my PostgreSQL database"**
1. `alloy_pipeline` with recipe `postgres-exporter`, params `{ connectionString: "postgres://..." }`, name `analytics-db`
2. User sets `ALLOY_POSTGRES_EXPORTER_ANALYTICS_DB_CONNECTIONSTRING` env var where Alloy runs
3. `alloy_pipeline` action `status`, name `analytics-db` — verify data flowing
4. `grafana_list_metrics` with search `pg_` — discover available PostgreSQL metrics
5. `grafana_create_dashboard` with template `metric-explorer`, title `Analytics DB Health`
6. `grafana_create_alert` — `pg_up{job="analytics-db"} == 0` (database down)

**"Set up full-stack Kubernetes monitoring"**
1. `alloy_pipeline` recipe `kubernetes-pods` (auto-discovers pod metrics)
2. `alloy_pipeline` recipe `kubernetes-services` (scrapes service endpoints)
3. `alloy_pipeline` recipe `kubernetes-logs` (collects pod logs to Loki)
4. `alloy_pipeline` action `diagnose` — verify all 3 healthy
5. `grafana_create_dashboard` with template `multi-kpi` — cluster health
6. `grafana_create_alert` — pod restart rate, OOMKills

**"Collect Docker container logs and search errors"**
1. `alloy_pipeline` recipe `docker-logs`
2. `alloy_pipeline` action `status`, name `docker-logs`
3. `grafana_query_logs` with `{source="docker"} |= "error"`, start `now-1h`

**"Parse JSON fields from Kafka logs"**
1. `alloy_pipeline` recipe `kafka-logs`, params `{ brokers: ["kafka:9092"], topics: ["app-logs"], jsonExpressions: { level: "", message: "", request_id: "ctx.rid" }, labelFields: { level: "" }, structuredMetadata: { request_id: "" } }`
2. `alloy_pipeline` action `status`, name `kafka-logs`
3. `grafana_query_logs` with `{ expr: '{source="kafka"} | level="error"' }`

**"Generate RED metrics from traces"**
1. `alloy_pipeline` recipe `span-metrics`, params `{ dimensions: ["http.method", "http.status_code"] }`
2. `alloy_pipeline` action `status`
3. `grafana_query` with `{ expr: 'sum(rate(traces_spanmetrics_calls_total[5m]))' }`
4. `grafana_create_dashboard` template `metric-explorer`

**"Monitor endpoint availability"**
1. `alloy_pipeline` recipe `blackbox-exporter`, params `{ targets: [{ name: "api", address: "https://api.example.com", module: "http_2xx" }] }`
2. `alloy_pipeline` action `status`
3. `grafana_create_alert` with `{ expr: 'probe_success == 0', name: 'Endpoint Down' }`

**"My app has /metrics at myapp:9090 — dashboard + alerts"**
1. `alloy_pipeline` recipe `scrape-endpoint`, params `{ url: "http://myapp:9090/metrics" }`
2. `alloy_pipeline` action `status` — verify scraping
3. `grafana_list_metrics` — discover what's exposed
4. `grafana_create_dashboard` using discovered metrics
5. `grafana_create_alert` for critical metrics

**"Set up Linux system monitoring"**
1. `alloy_pipeline` with recipe `node-exporter`, name `linux-metrics` — system CPU, memory, disk metrics
2. `alloy_pipeline` with recipe `journal-logs`, name `linux-journal` — systemd journal logs
3. `alloy_pipeline` with recipe `file-logs`, name `linux-syslog`, params `{ paths: ["/var/log/syslog", "/var/log/messages"] }` — traditional syslog files
4. `alloy_pipeline` action `diagnose` — verify all 3 healthy
5. `grafana_create_dashboard` with template `metric-explorer`, title `Linux System Health`

**"Are all my data collection pipelines healthy?"**
1. `alloy_pipeline` action `diagnose` — connectivity, all pipeline health, drift, limits
2. For degraded pipelines: `alloy_pipeline` action `status` for details
3. Follow `remediation` field for fixes

## Agent Metrics

For metric names, types, labels, and common PromQL — covering both `openclaw_*` (from diagnostics-otel) and `openclaw_lens_*` (from grafana-lens) — see [references/agent-metrics.md](references/agent-metrics.md).

## External Data

For external data naming conventions and integration patterns, see [references/external-data.md](references/external-data.md).

## SRE Investigation Patterns

For 5-phase investigation methodology, anomaly detection PromQL, RED/USE method queries,
LogQL investigation discipline, TraceQL debugging, SLI/SLO burn rates, cross-signal correlation,
and composed investigation workflows — see [references/sre-investigation.md](references/sre-investigation.md).

## Alloy Pipeline Recipes

For the full 26-recipe catalog with detailed parameters, Alloy component mappings, troubleshooting,
and advanced usage patterns — see [references/alloy-pipelines.md](references/alloy-pipelines.md).

For component snippets for raw-config (escape hatch) pipelines — see [references/alloy-components.md](references/alloy-components.md).
Related Skills

multi-lens

3891
from openclaw/skills
当用户分享一个来自书籍、人物或思想的观点时，启动两位辩手从不同视角深度辩论该观点，帮助用户从中道立场全面理解它。触发词包括：「辩一辩」「帮我分析这个观点」「我在读xxx，书里说...」「xxx说过...」「辩论一下」「多角度看」，或用户直接粘贴一段引文或笔记。只要用户在分享一个观点并希望深度理解，就应该触发这个 skill（multi-lens）。
---

3891
from openclaw/skills
name: article-factory-wechat
Content & Documentation
humanizer

3891
from openclaw/skills
Remove signs of AI-generated writing from text. Use when editing or reviewing text to make it sound more natural and human-written. Based on Wikipedia's comprehensive "Signs of AI writing" guide. Detects and fixes patterns including: inflated symbolism, promotional language, superficial -ing analyses, vague attributions, em dash overuse, rule of three, AI vocabulary words, negative parallelisms, and excessive conjunctive phrases.
Content & Documentation
find-skills

3891
from openclaw/skills
Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.
General Utilities
tavily-search

3891
from openclaw/skills
Use Tavily API for real-time web search and content extraction. Use when: user needs real-time web search results, research, or current information from the web. Requires Tavily API key.
Data & Research
baidu-search

3891
from openclaw/skills
Search the web using Baidu AI Search Engine (BDSE). Use for live information, documentation, or research topics.
Data & Research
agent-autonomy-kit

3891
from openclaw/skills
Stop waiting for prompts. Keep working.
Workflow & Productivity
Meeting Prep

3891
from openclaw/skills
Never walk into a meeting unprepared again. Your agent researches all attendees before calendar events—pulling LinkedIn profiles, recent company news, mutual connections, and conversation starters. Generates a briefing doc with talking points, icebreakers, and context so you show up informed and confident. Triggered automatically before meetings or on-demand. Configure research depth, advance timing, and output format. Walking into meetings blind is amateur hour—missed connections, generic small talk, zero leverage. Use when setting up meeting intelligence, researching specific attendees, generating pre-meeting briefs, or automating your prep workflow.
Workflow & Productivity
self-improvement

3891
from openclaw/skills
Captures learnings, errors, and corrections to enable continuous improvement. Use when: (1) A command or operation fails unexpectedly, (2) User corrects Claude ('No, that's wrong...', 'Actually...'), (3) User requests a capability that doesn't exist, (4) An external API or tool fails, (5) Claude realizes its knowledge is outdated or incorrect, (6) A better approach is discovered for a recurring task. Also review learnings before major tasks.
Agent Intelligence & Learning
botlearn-healthcheck

3891
from openclaw/skills
botlearn-healthcheck — BotLearn autonomous health inspector for OpenClaw instances across 5 domains (hardware, config, security, skills, autonomy); triggers on system check, health report, diagnostics, or scheduled heartbeat inspection.
DevOps & Infrastructure