langsmith-observability
LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.
Best use case
langsmith-observability is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.
Teams using langsmith-observability should expect a more consistent output, faster repeated execution, less prompt rewriting, better workflow continuity with your supporting tools.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
- You already have the supporting tools or dependencies needed by this skill.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/observability-langsmith/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How langsmith-observability Compares
| Feature / Agent | langsmith-observability | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
ChatGPT vs Claude for Agent Skills
Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.
SKILL.md Source
# LangSmith - LLM Observability Platform
Development platform for debugging, evaluating, and monitoring language models and AI applications.
## When to use LangSmith
**Use LangSmith when:**
- Debugging LLM application issues (prompts, chains, agents)
- Evaluating model outputs systematically against datasets
- Monitoring production LLM systems
- Building regression testing for AI features
- Analyzing latency, token usage, and costs
- Collaborating on prompt engineering
**Key features:**
- **Tracing**: Capture inputs, outputs, latency for all LLM calls
- **Evaluation**: Systematic testing with built-in and custom evaluators
- **Datasets**: Create test sets from production traces or manually
- **Monitoring**: Track metrics, errors, and costs in production
- **Integrations**: Works with OpenAI, Anthropic, LangChain, LlamaIndex
**Use alternatives instead:**
- **Weights & Biases**: Deep learning experiment tracking, model training
- **MLflow**: General ML lifecycle, model registry focus
- **Arize/WhyLabs**: ML monitoring, data drift detection
## Quick start
### Installation
```bash
pip install langsmith
# Set environment variables
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_TRACING=true
```
### Basic tracing with @traceable
```python
from langsmith import traceable
from openai import OpenAI
client = OpenAI()
@traceable
def generate_response(prompt: str) -> str:
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
return response.choices[0].message.content
# Automatically traced to LangSmith
result = generate_response("What is machine learning?")
```
### OpenAI wrapper (automatic tracing)
```python
from langsmith.wrappers import wrap_openai
from openai import OpenAI
# Wrap client for automatic tracing
client = wrap_openai(OpenAI())
# All calls automatically traced
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello!"}]
)
```
## Core concepts
### Runs and traces
A **run** is a single execution unit (LLM call, chain, tool). Runs form hierarchical **traces** showing the full execution flow.
```python
from langsmith import traceable
@traceable(run_type="chain")
def process_query(query: str) -> str:
# Parent run
context = retrieve_context(query) # Child run
response = generate_answer(query, context) # Child run
return response
@traceable(run_type="retriever")
def retrieve_context(query: str) -> list:
return vector_store.search(query)
@traceable(run_type="llm")
def generate_answer(query: str, context: list) -> str:
return llm.invoke(f"Context: {context}\n\nQuestion: {query}")
```
### Projects
Projects organize related runs. Set via environment or code:
```python
import os
os.environ["LANGSMITH_PROJECT"] = "my-project"
# Or per-function
@traceable(project_name="my-project")
def my_function():
pass
```
## Client API
```python
from langsmith import Client
client = Client()
# List runs
runs = list(client.list_runs(
project_name="my-project",
filter='eq(status, "success")',
limit=100
))
# Get run details
run = client.read_run(run_id="...")
# Create feedback
client.create_feedback(
run_id="...",
key="correctness",
score=0.9,
comment="Good answer"
)
```
## Datasets and evaluation
### Create dataset
```python
from langsmith import Client
client = Client()
# Create dataset
dataset = client.create_dataset("qa-test-set", description="QA evaluation")
# Add examples
client.create_examples(
inputs=[
{"question": "What is Python?"},
{"question": "What is ML?"}
],
outputs=[
{"answer": "A programming language"},
{"answer": "Machine learning"}
],
dataset_id=dataset.id
)
```
### Run evaluation
```python
from langsmith import evaluate
def my_model(inputs: dict) -> dict:
# Your model logic
return {"answer": generate_answer(inputs["question"])}
def correctness_evaluator(run, example):
prediction = run.outputs["answer"]
reference = example.outputs["answer"]
score = 1.0 if reference.lower() in prediction.lower() else 0.0
return {"key": "correctness", "score": score}
results = evaluate(
my_model,
data="qa-test-set",
evaluators=[correctness_evaluator],
experiment_prefix="v1"
)
print(f"Average score: {results.aggregate_metrics['correctness']}")
```
### Built-in evaluators
```python
from langsmith.evaluation import LangChainStringEvaluator
# Use LangChain evaluators
results = evaluate(
my_model,
data="qa-test-set",
evaluators=[
LangChainStringEvaluator("qa"),
LangChainStringEvaluator("cot_qa")
]
)
```
## Advanced tracing
### Tracing context
```python
from langsmith import tracing_context
with tracing_context(
project_name="experiment-1",
tags=["production", "v2"],
metadata={"version": "2.0"}
):
# All traceable calls inherit context
result = my_function()
```
### Manual runs
```python
from langsmith import trace
with trace(
name="custom_operation",
run_type="tool",
inputs={"query": "test"}
) as run:
result = do_something()
run.end(outputs={"result": result})
```
### Process inputs/outputs
```python
def sanitize_inputs(inputs: dict) -> dict:
if "password" in inputs:
inputs["password"] = "***"
return inputs
@traceable(process_inputs=sanitize_inputs)
def login(username: str, password: str):
return authenticate(username, password)
```
### Sampling
```python
import os
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1" # 10% sampling
```
## LangChain integration
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
# Tracing enabled automatically with LANGSMITH_TRACING=true
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant."),
("user", "{input}")
])
chain = prompt | llm
# All chain runs traced automatically
response = chain.invoke({"input": "Hello!"})
```
## Production monitoring
### Hub prompts
```python
from langsmith import Client
client = Client()
# Pull prompt from hub
prompt = client.pull_prompt("my-org/qa-prompt")
# Use in application
result = prompt.invoke({"question": "What is AI?"})
```
### Async client
```python
from langsmith import AsyncClient
async def main():
client = AsyncClient()
runs = []
async for run in client.list_runs(project_name="my-project"):
runs.append(run)
return runs
```
### Feedback collection
```python
from langsmith import Client
client = Client()
# Collect user feedback
def record_feedback(run_id: str, user_rating: int, comment: str = None):
client.create_feedback(
run_id=run_id,
key="user_rating",
score=user_rating / 5.0, # Normalize to 0-1
comment=comment
)
# In your application
record_feedback(run_id="...", user_rating=4, comment="Helpful response")
```
## Testing integration
### Pytest integration
```python
from langsmith import test
@test
def test_qa_accuracy():
result = my_qa_function("What is Python?")
assert "programming" in result.lower()
```
### Evaluation in CI/CD
```python
from langsmith import evaluate
def run_evaluation():
results = evaluate(
my_model,
data="regression-test-set",
evaluators=[accuracy_evaluator]
)
# Fail CI if accuracy drops
assert results.aggregate_metrics["accuracy"] >= 0.9, \
f"Accuracy {results.aggregate_metrics['accuracy']} below threshold"
```
## Best practices
1. **Structured naming** - Use consistent project/run naming conventions
2. **Add metadata** - Include version, environment, user info
3. **Sample in production** - Use sampling rate to control volume
4. **Create datasets** - Build test sets from interesting production cases
5. **Automate evaluation** - Run evaluations in CI/CD pipelines
6. **Monitor costs** - Track token usage and latency trends
## Common issues
**Traces not appearing:**
```python
import os
# Ensure tracing is enabled
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-key"
# Verify connection
from langsmith import Client
client = Client()
print(client.list_projects()) # Should work
```
**High latency from tracing:**
```python
# Enable background batching (default)
from langsmith import Client
client = Client(auto_batch_tracing=True)
# Or use sampling
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"
```
**Large payloads:**
```python
# Hide sensitive/large fields
@traceable(
process_inputs=lambda x: {k: v for k, v in x.items() if k != "large_field"}
)
def my_function(data):
pass
```
## References
- **[Advanced Usage](references/advanced-usage.md)** - Custom evaluators, distributed tracing, hub prompts
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, performance
## Resources
- **Documentation**: https://docs.smith.langchain.com
- **Python SDK**: https://github.com/langchain-ai/langsmith-sdk
- **Web App**: https://smith.langchain.com
- **Version**: 0.2.0+
- **License**: MITRelated Skills
observability-engineer
Build production-ready monitoring, logging, and tracing systems. Implements comprehensive observability strategies, SLI/SLO management, and incident response workflows.
phoenix-observability
Open-source AI observability platform for LLM tracing, evaluation, and monitoring. Use when debugging LLM applications with detailed traces, running evaluations on datasets, or monitoring production AI systems with real-time insights.
async-python-patterns
Comprehensive guidance for implementing asynchronous Python applications using asyncio, concurrent programming patterns, and async/await for building high-performance, non-blocking systems.
slack-automation
Automate Slack workspace operations including messaging, search, channel management, and reaction workflows through Composio's Slack toolkit.
linear-automation
Automate Linear tasks via Rube MCP (Composio): issues, projects, cycles, teams, labels. Always search tools first for current schemas.
jira-automation
Automate Jira tasks via Rube MCP (Composio): issues, projects, sprints, boards, comments, users. Always search tools first for current schemas.
gitops-workflow
Complete guide to implementing GitOps workflows with ArgoCD and Flux for automated Kubernetes deployments.
github-automation
Automate GitHub repositories, issues, pull requests, branches, CI/CD, and permissions via Rube MCP (Composio). Manage code workflows, review PRs, search code, and handle deployments programmatically.
github-actions-templates
Production-ready GitHub Actions workflow patterns for testing, building, and deploying applications.
zustand-store-ts
Create Zustand stores following established patterns with proper TypeScript types and middleware.
zod-validation-expert
Expert in Zod — TypeScript-first schema validation. Covers parsing, custom errors, refinements, type inference, and integration with React Hook Form, Next.js, and tRPC.
tanstack-query-expert
Expert in TanStack Query (React Query) — asynchronous state management. Covers data fetching, stale time configuration, mutations, optimistic updates, and Next.js App Router (SSR) integration.