distributed-tracing

Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.

31,392 stars

Complexity: medium

About this skill

This skill helps an AI agent plan and set up distributed tracing for microservice architectures using Jaeger and Tempo. The agent can guide you through instrumenting services, configuring collectors, and visualizing request paths, so that you can see end-to-end request flows, locate performance bottlenecks, and attribute latency to individual services. It clarifies goals, suggests best practices, and outlines actionable steps with verification.

Best use case

Diagnosing latency issues in microservice architectures.
Troubleshooting failed requests across multiple services.
Gaining end-to-end visibility into complex request flows.
Optimizing performance of distributed applications.
Setting up observability for new microservice deployments.

A detailed plan or set of actionable steps for implementing distributed tracing.
Guidance on configuring Jaeger and Tempo for a specific microservice environment.
Recommendations for best practices in instrumenting services for tracing.
An analysis of potential challenges and verification steps for a successful tracing setup.
An outline of how to use tracing data to understand latency and request flows.

Practical example

Example input

I need to set up distributed tracing for my e-commerce microservices, which include a user service, product catalog, and order processing, to track requests and identify performance issues. Suggest a plan using Jaeger and Tempo.

Example output

{"plan": "Distributed Tracing Implementation Plan for E-commerce Microservices", "summary": "This plan outlines the steps to implement distributed tracing using Jaeger and Tempo across your user, product, and order microservices to gain visibility into request flows and performance.", "steps": [{"step_number": 1, "title": "Clarify Goals and System Architecture", "description": "Confirm specific performance metrics, error rates, and key request paths to monitor. Outline the tech stack for each microservice (language, frameworks) to inform instrumentation choices."}, {"step_number": 2, "title": "Set Up Jaeger and Tempo Infrastructure", "description": "Deploy Jaeger (Collector, Query, Agent) and Tempo (Distributor, Ingester, Querier, Compactor) either as Docker containers, Kubernetes deployments, or cloud services. Ensure proper persistent storage and network configuration."}, {"step_number": 3, "title": "Service Instrumentation with OpenTelemetry", "description": "Integrate OpenTelemetry SDKs into your User, Product, and Order services. Instrument key operations (e.g., API endpoints, database calls, inter-service communication) to generate traces. Ensure context propagation (e.g., W3C Trace Context headers) across service boundaries."}, {"step_number": 4, "title": "Configure OpenTelemetry Exporters", "description": "Configure OpenTelemetry exporters in each service to send traces to the Tempo backend via an OpenTelemetry Collector or directly to Jaeger (if using Jaeger as the primary backend for spans)."}, {"step_number": 5, "title": "Validate Trace Data Ingestion and Storage", "description": "Send test requests through your microservices. Verify that traces are being successfully collected by Tempo and are queryable via Tempo's querier or integrated visualization tools like Grafana. Check Jaeger UI if also used for visualization."}, {"step_number": 6, "title": "Implement Monitoring and Alerting", "description": "Set up Grafana dashboards to visualize trace data from Tempo. Create alerts for critical latency spikes, error rates, or specific trace patterns identified from your goals in Step 1."}, {"step_number": 7, "title": "Review and Optimize", "description": "Regularly review trace data to identify performance bottlenecks, optimize instrumentation, and refine dashboards/alerts. Train team members on how to effectively use the distributed tracing system for debugging and performance analysis."}], "best_practices": ["Use semantic conventions for span naming and attributes.", "Ensure consistent sampling strategies across services.", "Prioritize critical business transactions for detailed instrumentation.", "Integrate tracing with existing logging and metrics systems."], "verification_steps": ["Execute a full end-to-end transaction (e.g., user places an order).", "Query Tempo/Jaeger for the trace of that transaction.", "Confirm all expected spans and service interactions are visible.", "Verify latency measurements align with observed performance."]}

When to use this skill

When working with a microservice architecture.
When needing to understand the flow of requests across multiple services.
When performance bottlenecks or errors are difficult to diagnose in a distributed system.
When Jaeger and Tempo are the preferred distributed tracing tools.

When not to use this skill

The task is unrelated to distributed tracing.
You need a different domain or tool outside this scope.
The system is a monolithic application where distributed tracing is not applicable.
Different tracing tools (e.g., Zipkin, OpenTelemetry without Jaeger/Tempo backend) are required.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/distributed-tracing/SKILL.md --create-dirs "https://raw.githubusercontent.com/sickn33/antigravity-awesome-skills/main/plugins/antigravity-awesome-skills-claude/skills/distributed-tracing/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/distributed-tracing/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How distributed-tracing Compares

Feature / Agent	distributed-tracing	Standard Approach
Platform Support	Claude	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	medium	N/A

Frequently Asked Questions

What does this skill do?

Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.

Which AI agents support this skill?

This skill is designed for Claude.

How difficult is it to install?

The installation complexity is rated as medium. You can find the installation instructions above.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Distributed Tracing

Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.

## Do not use this skill when

- The task is unrelated to distributed tracing
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Purpose

Track requests across distributed systems to understand latency, dependencies, and failure points.

## Use this skill when

- Debugging latency issues
- Understanding service dependencies
- Identifying bottlenecks
- Tracing error propagation
- Analyzing request paths

## Distributed Tracing Concepts

### Trace Structure
```
Trace (Request ID: abc123)
  ↓
Span (frontend) [100ms]
  ↓
Span (api-gateway) [80ms]
  ├→ Span (auth-service) [10ms]
  └→ Span (user-service) [60ms]
      └→ Span (database) [40ms]
```

### Key Components
- **Trace** - End-to-end request journey
- **Span** - Single operation within a trace
- **Context** - Metadata propagated between services
- **Tags** - Key-value pairs for filtering (called attributes in OpenTelemetry)
- **Logs** - Timestamped events within a span (called span events in OpenTelemetry)
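
To make the trace structure above concrete, here is a minimal sketch that models the example trace as a tree of spans and computes each span's self time (its duration minus the time spent in its direct children). The `Span` class and `self_times` helper are illustrative names, not part of any tracing library, and the subtraction assumes children run sequentially.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    """One timed operation; children are the spans it directly caused."""
    name: str
    duration_ms: int
    children: List["Span"] = field(default_factory=list)


def self_times(root: Span) -> Dict[str, int]:
    """Return time spent in each span excluding its direct children.

    Assumes children run one after another; with parallel children a
    real tool would have to merge overlapping intervals instead.
    """
    result: Dict[str, int] = {}
    stack = [root]
    while stack:
        span = stack.pop()
        child_total = sum(child.duration_ms for child in span.children)
        result[span.name] = span.duration_ms - child_total
        stack.extend(span.children)
    return result


# The example trace from the diagram above.
trace_root = Span("frontend", 100, [
    Span("api-gateway", 80, [
        Span("auth-service", 10),
        Span("user-service", 60, [Span("database", 40)]),
    ]),
])

if __name__ == "__main__":
    for name, ms in self_times(trace_root).items():
        print(f"{name}: {ms}ms self time")
```

In this example the database span accounts for 40ms of the 100ms request, which is the kind of breakdown a trace view is meant to surface.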

## Jaeger Setup

### Kubernetes Deployment

```bash
# Deploy Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

# Deploy Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
EOF
```

### Docker Compose

```yaml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"  # UI
      - "14268:14268"  # Collector
      - "14250:14250"  # gRPC
      - "9411:9411"    # Zipkin
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
```

**Reference:** See `references/jaeger-setup.md`

## Application Instrumentation

### OpenTelemetry (Recommended)

#### Python (Flask)
```python
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from flask import Flask

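# Initialize tracer
# Note: the OpenTelemetry Jaeger exporter packages are deprecated upstream in
# favor of OTLP; pin a release that still ships them or switch to an OTLP exporter.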
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/api/users')
def get_users():
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("get_users") as span:
        # Business logic
        users = fetch_users_from_db()
        # Record the real result size rather than a hard-coded value
        span.set_attribute("user.count", len(users))
        return {"users": users}

def fetch_users_from_db():
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM users")
        # Database query (query_database is a placeholder for your data-access code)
        return query_database()
```

#### Node.js (Express)
```javascript
const { trace } = require('@opentelemetry/api');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { Resource } = require('@opentelemetry/resources');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Initialize tracer (the Resource constructor shown here is the SDK 1.x API)
const provider = new NodeTracerProvider({
  resource: new Resource({ 'service.name': 'my-service' })
});

const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces'
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Instrument libraries
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

const express = require('express');
const app = express();

app.get('/api/users', async (req, res) => {
  const tracer = trace.getTracer('my-service');
  const span = tracer.startSpan('get_users');

  try {
    const users = await fetchUsers();
    span.setAttributes({ 'user.count': users.length });
    res.json({ users });
  } finally {
    span.end();
  }
});
```

#### Go
```go
package main

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute" // needed for attribute.String / attribute.Int below
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
    exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
    ))
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-service"),
        )),
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

func getUsers(ctx context.Context) ([]User, error) {
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "get_users")
    defer span.End()

    span.SetAttributes(attribute.String("user.filter", "active"))

    users, err := fetchUsersFromDB(ctx)
    if err != nil {
        span.RecordError(err)
        return nil, err
    }

    span.SetAttributes(attribute.Int("user.count", len(users)))
    return users, nil
}
```

**Reference:** See `references/instrumentation.md`

## Context Propagation

### HTTP Headers
```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
```
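
The `traceparent` value above packs four dash-separated fields: version, trace ID, parent span ID, and trace flags. The following is a minimal standard-library sketch of how a service might validate and unpack it; the `TraceParent` and `parse_traceparent` names are illustrative, and in practice the OpenTelemetry propagators handle this for you.

```python
import re
from typing import NamedTuple, Optional

# Four lowercase-hex fields: version, trace-id, parent-id, trace-flags.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a four-field traceparent header; return None if it is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff is forbidden, and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    sampled = bool(int(flags, 16) & 0x01)
    return TraceParent(version, trace_id, parent_id, sampled)


if __name__ == "__main__":
    print(parse_traceparent(
        "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
    ))
```

Returning `None` for malformed input lets the caller fall back to starting a new trace instead of propagating a broken context.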

### Propagation in HTTP Requests

#### Python
```python
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Injects trace context

response = requests.get('http://downstream-service/api', headers=headers)
```

#### Node.js
```javascript
const axios = require('axios');
const { context, propagation } = require('@opentelemetry/api');

const headers = {};
propagation.inject(context.active(), headers);

axios.get('http://downstream-service/api', { headers });
```

## Tempo Setup (Grafana)

### Kubernetes Deployment

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
        otlp:
          protocols:
            http:
            grpc:

    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com

    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
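spec:
  replicas: 1
  # apps/v1 Deployments require a selector that matches the pod template labels
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
    spec: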
      containers:
      - name: tempo
        image: grafana/tempo:latest
        args:
          - -config.file=/etc/tempo/tempo.yaml
        volumeMounts:
        - name: config
          mountPath: /etc/tempo
      volumes:
      - name: config
        configMap:
          name: tempo-config
```

**Reference:** See `assets/jaeger-config.yaml.template`

## Sampling Strategies

### Probabilistic Sampling
```yaml
# Sample 1% of traces
sampler:
  type: probabilistic
  param: 0.01
```

### Rate Limiting Sampling
```yaml
# Sample max 100 traces per second
sampler:
  type: ratelimiting
  param: 100
```
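
One common way to enforce a traces-per-second cap like the one above is a token bucket. The sketch below illustrates the idea only; it is not Jaeger's actual implementation, and the `RateLimitingSampler` class and its injectable `clock` parameter are assumptions made for the example.

```python
import time
from typing import Callable


class RateLimitingSampler:
    """Token-bucket sketch: allow at most `max_per_second` traces per second."""

    def __init__(self, max_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = max_per_second
        self.capacity = max(max_per_second, 1.0)  # bucket holds one second of budget
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    def should_sample(self) -> bool:
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    sampler = RateLimitingSampler(100)
    kept = sum(sampler.should_sample() for _ in range(1000))
    print(f"kept {kept} of 1000 back-to-back requests")
```

Unlike probabilistic sampling, this keeps trace volume bounded during traffic spikes, at the cost of sampling a smaller share of requests when load is high.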

### Parent-Based Ratio Sampling (OpenTelemetry)
```python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample based on trace ID (deterministic)
sampler = ParentBased(root=TraceIdRatioBased(0.01))
```
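
The "deterministic" part of the ratio-based sampler above comes from deriving the decision from the trace ID itself, so every service reaches the same verdict for the same trace. The following is a rough sketch of that idea, not the SDK's exact code; `should_sample` and `TRACE_ID_MASK` are illustrative names.

```python
TRACE_ID_MASK = (1 << 64) - 1  # use the low 64 bits of the 128-bit trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministically keep roughly `ratio` of traces based on the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Trace IDs are effectively random, so comparing against a bound that is
    # `ratio` of the 64-bit range keeps about that fraction of traces.
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


if __name__ == "__main__":
    example_id = int("0af7651916cd43dd8448eb211c80319c", 16)
    for ratio in (0.01, 0.5, 1.0):
        print(ratio, should_sample(example_id, ratio))
```

Wrapping such a sampler in `ParentBased` means only root spans make this decision; child spans follow the sampled flag they receive from their parent.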

## Trace Analysis

### Finding Slow Requests

**Jaeger Query:**
```
service=my-service
duration > 1s
```

### Finding Errors

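**Jaeger Query:**
```
service=my-service
tags: error=true http.status_code=500
```

Jaeger's tag search matches exact values only, so filter on `error=true` or a specific status code rather than a `>= 500` range.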
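
The same searches can be scripted. The sketch below builds a search URL for the `/api/traces` endpoint that the Jaeger UI calls; that endpoint is internal and unversioned, so treat the parameter names (`service`, `minDuration`, `tags`, `limit`) as assumptions to verify against your Jaeger version. `build_trace_search_url` and the `http://jaeger:16686` base URL are illustrative.

```python
import json
from typing import Dict, Optional
from urllib.parse import urlencode


def build_trace_search_url(base_url: str, service: str,
                           min_duration: Optional[str] = None,
                           tags: Optional[Dict[str, str]] = None,
                           limit: int = 20) -> str:
    """Build a search URL for Jaeger's internal /api/traces endpoint."""
    params = {"service": service, "limit": str(limit)}
    if min_duration:
        params["minDuration"] = min_duration
    if tags:
        # Tags are sent as a JSON object of exact-match string pairs.
        params["tags"] = json.dumps(tags, separators=(",", ":"))
    return f"{base_url.rstrip('/')}/api/traces?{urlencode(params)}"


if __name__ == "__main__":
    # Slow requests: anything in my-service that took at least one second.
    print(build_trace_search_url("http://jaeger:16686", "my-service",
                                 min_duration="1s"))
    # Failed requests: spans tagged error=true.
    print(build_trace_search_url("http://jaeger:16686", "my-service",
                                 tags={"error": "true"}))
```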

### Service Dependency Graph

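Jaeger derives a service dependency graph from collected traces, showing:
- Service relationships
- Call counts between services

Request rates, error rates, and latencies per service are available separately through Jaeger's Service Performance Monitoring view, which requires span metrics to be configured.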

## Best Practices

1. **Sample appropriately** (1-10% in production)
2. **Add meaningful tags** (user_id, request_id)
3. **Propagate context** across all service boundaries
4. **Log exceptions** in spans
5. **Use consistent naming** for operations
6. **Monitor tracing overhead** (<1% CPU impact)
7. **Set up alerts** for trace errors
8. **Implement distributed context** (baggage)
9. **Use span events** for important milestones
10. **Document instrumentation** standards

## Integration with Logging

### Correlated Logs
```python
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id

    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, '032x')}
    )
```
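
Passing `extra=` on every call is easy to forget. As an alternative sketch, a `logging.Filter` can stamp the trace ID on every record automatically. The `TraceIdFilter` class, `build_logger` helper, and `traced-app` logger name are hypothetical; `get_trace_id` is any callable returning the current trace ID as an integer, for example a wrapper around the `trace.get_current_span()` call shown above.

```python
import logging
from typing import Callable, TextIO


class TraceIdFilter(logging.Filter):
    """Attach a hex trace ID to every log record passing through a handler."""

    def __init__(self, get_trace_id: Callable[[], int]) -> None:
        super().__init__()
        self._get_trace_id = get_trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        # Same 32-character hex formatting as the example above.
        record.trace_id = format(self._get_trace_id(), "032x")
        return True  # never drop the record, only annotate it


def build_logger(get_trace_id: Callable[[], int], stream: TextIO) -> logging.Logger:
    """Return a logger whose output lines always carry trace_id=<hex>."""
    logger = logging.getLogger("traced-app")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter(get_trace_id))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    import sys
    log = build_logger(lambda: 0xABC123, sys.stdout)
    log.info("Processing request")
```

With the trace ID in every log line, a log search for that ID returns all messages emitted while handling the traced request.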

## Troubleshooting

**No traces appearing:**
- Check collector endpoint
- Verify network connectivity
- Check sampling configuration
- Review application logs

**High latency overhead:**
- Reduce sampling rate
- Use batch span processor
- Check exporter configuration
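
For the "verify network connectivity" check above, a quick TCP probe from the application's host or container can rule out DNS and firewall problems. This is a minimal sketch; `can_connect` is an illustrative helper, and the `jaeger` hostname and ports are taken from the Docker Compose example, so adjust them to your deployment. It cannot test the UDP agent ports.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # 14268 = collector HTTP, 16686 = UI (from the Docker Compose example).
    for port in (14268, 16686):
        status = "reachable" if can_connect("jaeger", port) else "unreachable"
        print(f"jaeger:{port} {status}")
```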

## Reference Files

- `references/jaeger-setup.md` - Jaeger installation
- `references/instrumentation.md` - Instrumentation patterns
- `assets/jaeger-config.yaml.template` - Jaeger configuration

## Related Skills

- `prometheus-configuration` - For metrics
- `grafana-dashboards` - For visualization
- `slo-implementation` - For latency SLOs
