Implementing Observability
Instrument the application with Logging, Metrics, and Tracing (OpenTelemetry) to understand system behavior and debug production issues.
Best use case
Implementing Observability is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using Implementing Observability should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/implementing-observability/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How Implementing Observability Compares
| Feature / Agent | Implementing Observability | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Instrument the application with Logging, Metrics, and Tracing (OpenTelemetry) to understand system behavior and debug production issues.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Implementing Observability
## Goal
Make the system's internal state inferable from its external outputs. Answer "Why is it slow?" and "Why did it fail?" without SSH-ing into a server.
## When to Use
- Before launching to production.
- When debugging a performance bottleneck.
- When integrating a new microservice or external API.
## Instructions
### 1. Structured Logging
Text logs are hard to query. Use JSON.
- **Context**: Every log entry must carry `trace_id` and `request_id`, plus `user_id` whenever the request is authenticated.
- **Levels**: `INFO` for normal ops, `WARN` for handled issues, `ERROR` for unhandled crashes.
```json
{"level": "info", "msg": "User logged in", "user_id": 123, "request_id": "req-456", "trace_id": "abc-123"}
```
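As a minimal sketch of how such a line could be produced, the formatter below uses only Python's standard `logging` and `json` modules. The names `JsonFormatter` and `build_logger` are illustrative assumptions, not part of this skill; the context fields are passed per call through `extra=`, and missing fields are simply left out.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (illustrative helper)."""

    # Context fields named in the guidance above; callers attach them via `extra=`.
    CONTEXT_FIELDS = ("trace_id", "request_id", "user_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname.lower(), "msg": record.getMessage()}
        for field in self.CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger that writes one JSON object per line to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]   # replace handlers so repeated calls do not duplicate output
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger().info("User logged in", extra={"user_id": 123, "trace_id": "abc-123"})` then emits one JSON line. Putting the trace ID in every entry is what lets a log search jump straight to the matching trace.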
### 2. Distributed Tracing (OpenTelemetry)
Trace a request across boundaries (Frontend -> API -> DB).
- Instrument HTTP clients and server frameworks.
- Visualize the "waterfall" to find the slow span.
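To make "across boundaries" concrete, here is a hand-rolled sketch of the W3C Trace Context `traceparent` header, which is the format OpenTelemetry propagates by default. In practice the OpenTelemetry SDK and its instrumentation libraries do this for you; the helper names below (`start_trace`, `inject`, `extract`, `child_span`) are assumptions made for illustration and are not the OpenTelemetry API.

```python
import re
import secrets
from typing import Dict, NamedTuple, Optional

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
# Only version "00" is handled in this sketch.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str   # 32 hex chars, shared by every span in the request
    span_id: str    # 16 hex chars, unique per span
    sampled: bool


def start_trace(sampled: bool = True) -> SpanContext:
    """Create the root context where the request first enters the system."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def inject(ctx: SpanContext) -> Dict[str, str]:
    """Build the outgoing HTTP header for a downstream call."""
    flags = "01" if ctx.sampled else "00"
    return {"traceparent": f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"}


def extract(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Parse an incoming header; return None if it is missing or malformed."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return SpanContext(trace_id, parent_id, bool(int(flags, 16) & 1))


def child_span(parent: SpanContext) -> SpanContext:
    """Continue the same trace in the next service with a fresh span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)
```

Each service extracts the header, opens a child span, and injects a new header on its own outgoing calls. Because the trace ID never changes along the way, the backend can stitch the spans into one waterfall.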
### 3. Golden Signals (Metrics)
Track the four key metrics for every service:
- **Latency**: Time to serve a request.
- **Traffic**: Request rate (RPS).
- **Errors**: Rate of 5xx responses.
- **Saturation**: CPU/Memory/Disk usage.
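The four signals can be sketched as a plain calculation over one window of request records. This is only an illustration of the definitions; a real service would normally export counters and histograms to a metrics backend such as Prometheus. The `Request` type, the `golden_signals` function, and the choice of p95 for latency and CPU utilization for saturation are assumptions.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    duration_ms: float
    status: int  # HTTP status code


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(
    requests: Sequence[Request], window_seconds: float, cpu_utilization: float
) -> Dict[str, float]:
    """Summarize one observation window into the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return {
        "latency_p95_ms": percentile([r.duration_ms for r in requests], 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": cpu_utilization,  # fraction of capacity in use, 0.0 to 1.0
    }
```

Latency is reported as a percentile rather than a mean because a handful of very slow requests can hide behind a healthy average.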
### 4. Alerting
Alert on symptoms (High Error Rate), not causes (High CPU).
- **Page**: If `Error Rate > 1%` for 5 minutes.
- **Ticket**: If `Disk Usage > 80%`.
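The two rules above can be sketched as follows, assuming the error rate is sampled once per minute so that "for 5 minutes" means the five most recent samples all exceed the threshold. The function names are illustrative; in a real deployment this logic usually lives in the alerting system's rule configuration, not in application code.

```python
from typing import List, Sequence


def sustained_breach(samples: Sequence[float], threshold: float, required: int) -> bool:
    """True only if the most recent `required` samples all exceed `threshold`."""
    if required <= 0:
        raise ValueError("required must be positive")
    if len(samples) < required:
        return False  # not enough history yet to call it sustained
    return all(value > threshold for value in samples[-required:])


def route_alerts(error_rates_per_minute: Sequence[float], disk_usage: float) -> List[str]:
    """Map current readings to actions: 'page' wakes someone up, 'ticket' waits for work hours."""
    actions: List[str] = []
    if sustained_breach(error_rates_per_minute, threshold=0.01, required=5):
        actions.append("page")
    if disk_usage > 0.80:
        actions.append("ticket")
    return actions
```

Requiring the breach to hold for the whole window is also what keeps a brief spike from paging and then resolving seconds later.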
## Constraints
### ✅ Do
- **DO**: Use OpenTelemetry standards for portability.
- **DO**: Correlate logs and traces (inject trace ID into logs).
- **DO**: Sample high-volume traces (10%) to save costs, but keep 100% of errors.
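Keeping every error while sampling the rest implies the decision is made after the outcome of the trace is known (tail-based sampling), since a decision made when the request starts cannot know whether it will fail. A minimal sketch of such a decision, with `keep_trace` as a hypothetical helper and the 10% rate taken from the guidance above:

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, sample_rate: float = 0.10) -> bool:
    """Decide, once a trace has finished, whether to export it."""
    if has_error:
        return True  # errors are always kept
    # Hash the trace ID into a 64-bit bucket so the decision is deterministic:
    # every component that sees the same trace ID reaches the same answer.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(sample_rate * 2**64)
```

Hashing the trace ID, rather than drawing a random number, avoids half-recorded traces where one service kept its spans and another dropped them.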
### ❌ Don't
- **DON'T**: Log PII (Emails, Passwords, Credit Cards).
- **DON'T**: Create alerts that auto-resolve in seconds (flapping).
- **DON'T**: Rely solely on "system up" checks; check "business logic working".
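One way to enforce the PII rule is to scrub every log event before it is serialized. The sketch below is an assumption-laden illustration: the field names in `SENSITIVE_KEYS` are hypothetical, and the email pattern is deliberately simple, so treat it as a safety net and not a guarantee.

```python
import re
from typing import Any, Dict

# Hypothetical field names; adjust to whatever the application actually logs.
SENSITIVE_KEYS = {"email", "password", "credit_card"}

# Simple pattern for email addresses embedded in free-text messages.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a log event with sensitive values masked."""
    clean: Dict[str, Any] = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = _EMAIL.sub("[REDACTED_EMAIL]", value)
        else:
            clean[key] = value
    return clean
```

Running `redact` at the logging boundary means a careless `log.info` call deep in the code cannot leak a password or address into the log store.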
## Output Format
- `docker-compose.yml` with Prometheus/Grafana/Jaeger (for dev).
- Code instrumentation (e.g., `tracing.py`).
## Dependencies
- `backend/managing-flask-middleware/SKILL.md` (where instrumentation lives)
- `shared/debugging/SKILL.md`
Related Skills
implementing-search-filter
Implements search and filter interfaces for both frontend (React/TypeScript) and backend (Python) with debouncing, query management, and database integration. Use when adding search functionality, building filter UIs, implementing faceted search, or optimizing search performance.
implementing-error-handling
Master error handling patterns across languages including exceptions, Result types, error propagation, and graceful degradation to build resilient applications. Use when implementing error handling, designing APIs, or improving application reliability.
observability-review
AI agent that analyzes operational signals (metrics, logs, traces, alerts, SLO/SLI reports) from observability platforms (Prometheus, Datadog, New Relic, CloudWatch, Grafana, Elastic) and produces practical, risk-aware triage and recommendations. Use when reviewing system health, investigating performance issues, analyzing monitoring data, evaluating service reliability, or providing SRE analysis of operational metrics. Distinguishes between critical issues requiring action, items needing investigation, and informational observations requiring no action.
implementing-android-code
This skill should be used when implementing Android code in Bitwarden. Covers critical patterns, gotchas, and anti-patterns unique to this codebase. Triggered by "How do I implement a ViewModel?", "Create a new screen", "Add navigation", "Write a repository", "BaseViewModel pattern", "State-Action-Event", "type-safe navigation", "@Serializable route", "SavedStateHandle persistence", "process death recovery", "handleAction", "sendAction", "Hilt module", "Repository pattern", "implementing a screen", "adding a data source", "handling navigation", "encrypted storage", "security patterns", "Clock injection", "DataState", or any questions about implementing features, screens, ViewModels, data sources, or navigation in the Bitwarden Android app.
implementing-rapid7-insightvm-for-scanning
Deploy and configure Rapid7 InsightVM Security Console and Scan Engines for authenticated and unauthenticated vulnerability scanning across enterprise environments.
implementing-navigation
Implements navigation patterns and routing for both frontend (React/TS) and backend (Python) including menus, tabs, breadcrumbs, client-side routing, and server-side route configuration. Use when building navigation systems or setting up routing.
implementing-api-patterns
API design and implementation across REST, GraphQL, gRPC, and tRPC patterns. Use when building backend services, public APIs, or service-to-service communication. Covers REST frameworks (FastAPI, Axum, Gin, Hono), GraphQL libraries (Strawberry, async-graphql, gqlgen, Pothos), gRPC (Tonic, Connect-Go), tRPC for TypeScript, pagination strategies (cursor-based, offset-based), rate limiting, caching, versioning, and OpenAPI documentation generation. Includes frontend integration patterns for forms, tables, dashboards, and ai-chat skills.
api-testing-observability-api-mock
You are an API mocking expert specializing in realistic mock services for development, testing, and demos. Design mocks that simulate real API behavior and enable parallel development.
bgo
Automates the complete Blender build-go workflow, from building and packaging your extension/add-on to removing old versions, installing, enabling, and launching Blender for quick testing and iteration.
moai-lang-{{LANGUAGE_SLUG}}
{{LANGUAGE_NAME}} best practices with modern frameworks, {{PRIMARY_DOMAIN}}, and performance optimization for 2025
moai-lang-elixir
Elixir 1.17+ development specialist covering Phoenix 1.7, LiveView, Ecto, and OTP patterns. Use when developing real-time applications, distributed systems, or Phoenix projects.
moai-lang-csharp
Enterprise C# 13 development with .NET 9, async/await, LINQ, Entity Framework Core, ASP.NET Core, and Context7 MCP integration for modern backend and enterprise applications.