service-mesh-observability
Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
Best use case
service-mesh-observability is best used when you need a repeatable AI agent workflow instead of a one-off prompt. It is especially useful for teams working in multi. Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
Users should expect a more consistent workflow output, faster repeated execution, and less time spent rewriting prompts from scratch.
Practical example
Example input
Use the "service-mesh-observability" skill to help with this workflow task. Context: Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
Example output
A structured workflow result with clearer steps, more consistent formatting, and an output that is easier to reuse in the next run.
When to use this skill
- Use this skill when you want a reusable workflow rather than writing the same prompt again and again.
When not to use this skill
- Do not use this when you only need a one-off answer and do not need a reusable workflow.
- Do not use it if you cannot install or maintain the related files, repository context, or supporting tools.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/service-mesh-observability/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How service-mesh-observability Compares
| Feature / Agent | service-mesh-observability | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Marketing
Discover AI agents for marketing workflows, from SEO and content production to campaign research, outreach, and analytics.
AI Agents for Startups
Explore AI agent skills for startup validation, product research, growth experiments, documentation, and fast execution with small teams.
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
SKILL.md Source
# Service Mesh Observability
Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
## Do not use this skill when
- The task is unrelated to service mesh observability
- You need a different domain or tool outside this scope
## Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.
## Use this skill when
- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity
## Core Concepts
### 1. Three Pillars of Observability
```
┌─────────────────────────────────────────────────────┐
│ Observability │
├─────────────────┬─────────────────┬─────────────────┤
│ Metrics │ Traces │ Logs │
│ │ │ │
│ • Request rate │ • Span context │ • Access logs │
│ • Error rate │ • Latency │ • Error details │
│ • Latency P50 │ • Dependencies │ • Debug info │
│ • Saturation │ • Bottlenecks │ • Audit trail │
└─────────────────┴─────────────────┴─────────────────┘
```
### 2. Golden Signals for Mesh
| Signal | Description | Alert Threshold |
|--------|-------------|-----------------|
| **Latency** | Request duration P50, P99 | P99 > 500ms |
| **Traffic** | Requests per second | Anomaly detection |
| **Errors** | 5xx error rate | > 1% |
| **Saturation** | Resource utilization | > 80% |
## Templates
### Template 1: Istio with Prometheus & Grafana
```yaml
# Install Prometheus
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus
namespace: istio-system
data:
prometheus.yml: |
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'istio-mesh'
kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- istio-system
relabel_configs:
- source_labels: [__meta_kubernetes_service_name]
action: keep
regex: istio-telemetry
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: istio-mesh
namespace: istio-system
spec:
selector:
matchLabels:
app: istiod
endpoints:
- port: http-monitoring
interval: 15s
```
### Template 2: Key Istio Metrics Queries
```promql
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
# Error rate (5xx)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
/ sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100
# P99 latency
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
# TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)
# Request size
histogram_quantile(0.99,
sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
```
### Template 3: Jaeger Distributed Tracing
```yaml
# Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
enableTracing: true
defaultConfig:
tracing:
sampling: 100.0 # 100% in dev, lower in prod
zipkin:
address: jaeger-collector.istio-system:9411
---
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
namespace: istio-system
spec:
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:1.50
ports:
- containerPort: 5775 # UDP
- containerPort: 6831 # Thrift
- containerPort: 6832 # Thrift
- containerPort: 5778 # Config
- containerPort: 16686 # UI
- containerPort: 14268 # HTTP
- containerPort: 14250 # gRPC
- containerPort: 9411 # Zipkin
env:
- name: COLLECTOR_ZIPKIN_HOST_PORT
value: ":9411"
```
### Template 4: Linkerd Viz Dashboard
```bash
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -
# Access dashboard
linkerd viz dashboard
# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app
# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend
# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend
# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
```
### Template 5: Grafana Dashboard JSON
```json
{
"dashboard": {
"title": "Service Mesh Overview",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
"legendFormat": "{{destination_service_name}}"
}
]
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{
"expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 1, "color": "yellow"},
{"value": 5, "color": "red"}
]
}
}
}
},
{
"title": "P99 Latency",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
"legendFormat": "{{destination_service_name}}"
}
]
},
{
"title": "Service Topology",
"type": "nodeGraph",
"targets": [
{
"expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
}
]
}
]
}
}
```
### Template 6: Kiali Service Mesh Visualization
```yaml
# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
name: kiali
namespace: istio-system
spec:
auth:
strategy: anonymous # or openid, token
deployment:
accessible_namespaces:
- "**"
external_services:
prometheus:
url: http://prometheus.istio-system:9090
tracing:
url: http://jaeger-query.istio-system:16686
grafana:
url: http://grafana.istio-system:3000
```
### Template 7: OpenTelemetry Integration
```yaml
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
data:
config.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
zipkin:
endpoint: 0.0.0.0:9411
processors:
batch:
timeout: 10s
exporters:
jaeger:
endpoint: jaeger-collector:14250
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
service:
pipelines:
traces:
receivers: [otlp, zipkin]
processors: [batch]
exporters: [jaeger]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
---
# Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
name: mesh-default
namespace: istio-system
spec:
tracing:
- providers:
- name: otel
randomSamplingPercentage: 10
```
## Alerting Rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: mesh-alerts
namespace: istio-system
spec:
groups:
- name: mesh.rules
rules:
- alert: HighErrorRate
expr: |
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
/ sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate for {{ $labels.destination_service_name }}"
- alert: HighLatency
expr: |
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
by (le, destination_service_name)) > 1000
for: 5m
labels:
severity: warning
annotations:
summary: "High P99 latency for {{ $labels.destination_service_name }}"
- alert: MeshCertExpiring
expr: |
(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
labels:
severity: warning
annotations:
summary: "Mesh certificate expiring in less than 7 days"
```
## Best Practices
### Do's
- **Sample appropriately** - 100% in dev, 1-10% in prod
- **Use trace context** - Propagate headers consistently
- **Set up alerts** - For golden signals
- **Correlate metrics/traces** - Use exemplars
- **Retain strategically** - Hot/cold storage tiers
### Don'ts
- **Don't over-sample** - Storage costs add up
- **Don't ignore cardinality** - Limit label values
- **Don't skip dashboards** - Visualize dependencies
- **Don't forget costs** - Monitor observability costs
## Resources
- [Istio Observability](https://istio.io/latest/docs/tasks/observability/)
- [Linkerd Observability](https://linkerd.io/2.14/features/dashboard/)
- [OpenTelemetry](https://opentelemetry.io/)
- [Kiali](https://kiali.io/)
## Limitations
- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.Related Skills
microservices-patterns
Master microservices architecture patterns including service boundaries, inter-service communication, data management, and resilience patterns for building distributed systems.
azure-servicebus-ts
Enterprise messaging with queues, topics, and subscriptions.
azure-mgmt-arizeaiobservabilityeval-dotnet
Azure Resource Manager SDK for Arize AI Observability and Evaluation (.NET).
service-mesh-expert
Expert service mesh architect specializing in Istio, Linkerd, and cloud-native networking patterns. Masters traffic management, security policies, observability integration, and multi-cluster mesh con
odoo-project-timesheet
Expert guide for Odoo Project and Timesheets: task stages, billable time tracking, timesheet approval, budget alerts, and invoicing from timesheets.
observability-monitoring-slo-implement
You are an SLO (Service Level Objective) expert specializing in implementing reliability standards and error budget-based engineering practices. Design comprehensive SLO frameworks, establish meaningful SLIs, and create monitoring systems that balance reliability with feature velocity.
observability-monitoring-monitor-setup
You are a monitoring and observability expert specializing in implementing comprehensive monitoring solutions. Set up metrics collection, distributed tracing, log aggregation, and create insightful da
observability-engineer
Build production-ready monitoring, logging, and tracing systems. Implements comprehensive observability strategies, SLI/SLO management, and incident response workflows.
api-testing-observability-api-mock
You are an API mocking expert specializing in realistic mock services for development, testing, and demos. Design mocks that simulate real API behavior and enable parallel development.
nextjs-best-practices
Next.js App Router principles. Server Components, data fetching, routing patterns.
network-101
Configure and test common network services (HTTP, HTTPS, SNMP, SMB) for penetration testing lab environments. Enable hands-on practice with service enumeration, log analysis, and security testing against properly configured target systems.
neon-postgres
Expert patterns for Neon serverless Postgres, branching, connection pooling, and Prisma/Drizzle integration