logging-observability

Structured logging, distributed tracing, and metrics collection patterns for building observable systems. Use when implementing logging infrastructure, setting up distributed tracing with OpenTelemetry, designing metrics collection (RED/USE methods), configuring alerting and dashboards, or reviewing observability practices. Covers structured JSON logging, context propagation, trace sampling, Prometheus/Grafana stack, alert design, and PII/secret scrubbing.

7 stars

bywpank

View on GitHub Installation ↓

Best use case

logging-observability is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using logging-observability should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/logging-observability/SKILL.md --create-dirs "https://raw.githubusercontent.com/wpank/ai/main/skills/tools/logging-observability/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/logging-observability/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How logging-observability Compares

Feature / Agent	logging-observability	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Logging & Observability

Patterns for building observable systems across the three pillars: logs, metrics, and traces.

## Three Pillars

| Pillar | Purpose | Question It Answers | Example |
|--------|---------|---------------------|---------|
| **Logs** | What happened | Why did this request fail? | `{"level":"error","msg":"payment declined","user_id":"u_82"}` |
| **Metrics** | How much / how fast | Is latency increasing? | `http_request_duration_seconds{route="/api/orders"} 0.342` |
| **Traces** | Request flow | Where is the bottleneck? | Span: `api-gateway → auth → order-service → db` |

Each pillar is strongest when correlated. Embed `trace_id` in every log line to jump from a log entry to the full distributed trace.


## Installation

### OpenClaw / Moltbot / Clawbot

```bash
npx clawhub@latest install logging-observability
```


---

## Structured Logging

Always emit logs as structured JSON — never free-text strings.

### Required Fields

| Field | Purpose | Required |
|-------|---------|----------|
| `timestamp` | ISO-8601 with milliseconds | Yes |
| `level` | Severity (DEBUG … FATAL) | Yes |
| `service` | Originating service name | Yes |
| `message` | Human-readable description | Yes |
| `trace_id` | Distributed trace correlation | Yes |
| `span_id` | Current span within trace | Yes |
| `correlation_id` | Business-level correlation (order ID) | When applicable |
| `error` | Structured error object | On errors |
| `context` | Request-specific metadata | Recommended |

### Context Enrichment

Attach context at the middleware level so downstream logs inherit automatically:

```typescript
app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  asyncLocalStorage.run(ctx, () => next());
});
```

### Library Recommendations

| Library | Language | Strengths | Perf |
|---------|----------|-----------|------|
| **Pino** | Node.js | Fastest Node logger, low overhead | Excellent |
| **structlog** | Python | Composable processors, context binding | Good |
| **zerolog** | Go | Zero-allocation JSON logging | Excellent |
| **zap** | Go | High performance, typed fields | Excellent |
| **tracing** | Rust | Spans + events, async-aware | Excellent |

Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.

---

## Log Levels

| Level | When to Use | Example |
|-------|-------------|---------|
| **FATAL** | App cannot continue, process will exit | Database connection pool exhausted |
| **ERROR** | Operation failed, needs attention | Payment charge failed: CARD_DECLINED |
| **WARN** | Unexpected but recoverable | Retry 2/3 for upstream timeout |
| **INFO** | Normal business events | Order ORD-1234 placed successfully |
| **DEBUG** | Developer troubleshooting | Cache miss for key user:82:preferences |
| **TRACE** | Very fine-grained (rarely in prod) | Entering validateAddress with payload |

**Rules:** Production default = INFO and above. If you log an ERROR, someone should act on it. Every FATAL should trigger an alert.

---

## Distributed Tracing

### OpenTelemetry Setup

Always prefer OpenTelemetry over vendor-specific SDKs:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```

### Span Creation

```typescript
const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total_cents', order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}
```

### Context Propagation

- Use W3C Trace Context (`traceparent` header) — default in OTel
- Propagate across HTTP, gRPC, and message queues
- For async workers: serialise `traceparent` into the job payload

### Trace Sampling

| Strategy | Use When |
|----------|----------|
| **Always On** | Low-traffic services, debugging |
| **Probabilistic** (N%) | General production use |
| **Rate-limited** (N/sec) | High-throughput services |
| **Tail-based** | When you need all error traces |

Always sample 100% of error traces regardless of strategy.

---

## Metrics Collection

### RED Method (Request-Driven)

Monitor these three for every service endpoint:

| Metric | What It Measures | Prometheus Example |
|--------|-----------------|-------------------|
| **Rate** | Requests/sec | `rate(http_requests_total[5m])` |
| **Errors** | Failed request ratio | `rate(http_requests_total{status=~"5.."}[5m])` |
| **Duration** | Response time | `histogram_quantile(0.99, http_request_duration_seconds)` |

### USE Method (Resource-Driven)

For infrastructure components (CPU, memory, disk, network):

| Metric | What It Measures | Example |
|--------|-----------------|---------|
| **Utilization** | % resource busy | CPU usage at 78% |
| **Saturation** | Work queued/waiting | 12 requests queued in thread pool |
| **Errors** | Error events on resource | 3 disk I/O errors in last minute |

---

## Monitoring Stack

| Tool | Category | Best For |
|------|----------|----------|
| **Prometheus** | Metrics | Pull-based metrics, alerting rules |
| **Grafana** | Visualisation | Dashboards for metrics, logs, traces |
| **Jaeger** | Tracing | Distributed trace visualisation |
| **Loki** | Logs | Log aggregation (pairs with Grafana) |
| **OpenTelemetry** | Collection | Vendor-neutral telemetry collection |

**Recommendation:** Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies cost.

---

## Alert Design

### Severity Levels

| Severity | Response Time | Example |
|----------|---------------|---------|
| **P1** | Immediate | Service fully down, data loss |
| **P2** | < 30 min | Error rate > 5%, latency p99 > 5s |
| **P3** | Business hours | Disk > 80%, cert expiring in 7 days |
| **P4** | Best effort | Non-critical deprecation warning |

### Alert Fatigue Prevention

- **Alert on symptoms, not causes** — "error rate > 5%" not "pod restarted"
- **Multi-window, multi-burn-rate** — catch both sudden spikes and slow burns
- **Require runbook links** — every alert must link to diagnosis and remediation
- **Review monthly** — delete or tune alerts that never fire or always fire
- **Group related alerts** — use inhibition rules to suppress child alerts
- **Set appropriate thresholds** — if alert fires daily and is ignored, raise threshold or delete

---

## Dashboard Patterns

### Overview Dashboard ("War Room")
- Total requests/sec across all services
- Global error rate (%) with trendline
- p50 / p95 / p99 latency
- Active alerts count by severity
- Deployment markers overlaid on graphs

### Service Dashboard (Per-Service)
- RED metrics for each endpoint
- Dependency health (upstream/downstream success rates)
- Resource utilisation (CPU, memory, connections)
- Top errors table with count and last seen

---

## Observability Checklist

Every service must have:

- [ ] Structured JSON logging with consistent schema
- [ ] Correlation / trace IDs propagated on all requests
- [ ] RED metrics exposed for every external endpoint
- [ ] Health check endpoints (`/healthz` and `/readyz`)
- [ ] Distributed tracing with OpenTelemetry
- [ ] Dashboards for RED metrics and resource utilisation
- [ ] Alerts for error rate, latency, and saturation with runbook links
- [ ] Log level configurable at runtime without redeployment
- [ ] PII scrubbing verified and tested
- [ ] Retention policies defined for logs, metrics, and traces

## Anti-Patterns

| Anti-Pattern | Problem | Fix |
|-------------|---------|-----|
| Logging PII | Privacy/compliance violation | Mask or exclude PII; use token references |
| Excessive logging | Storage costs balloon, signal drowns | Log business events, not data flow |
| Unstructured logs | Cannot query or alert on fields | Use structured JSON with consistent schema |
| String interpolation | Breaks structured fields, injection risk | Pass fields as metadata, not in message |
| Missing correlation IDs | Cannot trace across services | Generate and propagate trace_id everywhere |
| Alert storms | On-call fatigue, real issues buried | Use grouping, inhibition, deduplication |
| Metrics with high cardinality | Prometheus OOM, dashboard timeouts | Never use user ID or request ID as label |

## NEVER Do

1. **NEVER log passwords, tokens, API keys, or secrets** — even at DEBUG level
2. **NEVER use console.log / print in production** — use a structured logger
3. **NEVER use user IDs, emails, or request IDs as metric labels** — cardinality will explode
4. **NEVER create alerts without a runbook link** — unactionable alerts erode trust
5. **NEVER rely on logs alone** — you need metrics and traces for full observability
6. **NEVER log request/response bodies by default** — opt-in only, with PII redaction
7. **NEVER ignore log volume** — set budgets and alert when a service exceeds daily quota
8. **NEVER skip context propagation in async flows** — broken traces are worse than no traces

Related Skills

schema-markup

from wpank/ai

Add, fix, or optimize schema markup and structured data. Use when the user mentions schema markup, structured data, JSON-LD, rich snippets, schema.org, FAQ schema, product schema, review schema, or breadcrumb schema.

prompt-engineering

from wpank/ai

Master advanced prompt engineering techniques to maximize LLM performance, reliability, and controllability in production. Use when optimizing prompts, improving LLM outputs, designing production prompt templates, or building AI-powered features.

professional-communication

from wpank/ai

Write effective professional messages for software teams. Use when drafting emails, Slack/Teams messages, meeting agendas, status updates, or translating technical concepts for non-technical audiences. Triggers on email, slack, teams, message, meeting agenda, status update, stakeholder communication, escalation, jargon translation.

persona-docs

from wpank/ai

Create persona documentation for a product or codebase. Use when asked to create persona docs, document target users, define user journeys, document onboarding flows, or when starting a new product and needing to define its audience. Persona docs should be the first documentation created for any product.

mermaid-diagrams

from wpank/ai

Create software diagrams using Mermaid syntax. Use when users need to create, visualize, or document software through diagrams including class diagrams, sequence diagrams, flowcharts, ERDs, C4 architecture diagrams, state diagrams, git graphs, and other diagram types. Triggers include requests to diagram, visualize, model, map out, or show the flow of a system.

game-changing-features

from wpank/ai

Find 10x product opportunities and high-leverage improvements. Use when the user wants strategic product thinking, mentions 10x, wants to find high-impact features, or asks what would make a product dramatically more valuable.

clear-writing

from wpank/ai

Write clear, concise prose for humans — documentation, READMEs, API docs, commit messages, error messages, UI text, reports, and explanations. Combines Strunk's rules for clearer prose with technical documentation patterns, structure templates, and review checklists.

brainstorming

from wpank/ai

Explore ideas before implementation through collaborative dialogue. Use before any creative work — creating features, building components, adding functionality, or modifying behavior. Turns ideas into fully formed designs and specs through structured conversation.

Article Illustrator

from wpank/ai

When the user wants to add illustrations to an article or blog post. Triggers on: "illustrate article", "add images to article", "generate illustrations", "article images", or requests to visually enhance written content. Analyzes article structure, identifies positions for visual aids, and generates illustrations using a Type x Style two-dimension approach.

subagent-driven-development

from wpank/ai

Execute implementation plans by dispatching a fresh subagent per task with two-stage review (spec compliance then code quality). Use when you have an implementation plan with mostly independent tasks and want high-quality, fast iteration within a single session.

skill-judge

from wpank/ai

Evaluate Agent Skill quality against official specifications. Use when reviewing SKILL.md files, auditing skill packages, improving skill design, or checking if a skill follows best practices. Provides 8-dimension scoring (120 points) with actionable improvements. Triggers on review skill, evaluate skill, audit skill, improve skill, skill quality, SKILL.md review.

skill-creator

from wpank/ai

WHAT: Guide for creating effective AI agent skills - modular packages that extend Claude's capabilities with specialized knowledge, workflows, and tools. WHEN: User wants to create, write, author, or update a skill. User asks about skill structure, SKILL.md format, or how to package domain knowledge for AI agents. KEYWORDS: "create a skill", "make a skill", "new skill", "skill template", "SKILL.md", "agent skill", "write a skill", "skill structure", "package a skill"