langsmith-testing

LangSmith trace validation for RAG observability - every query must be traced

16 stars

by diegosouzapw

View on GitHub Installation ↓

Best use case

langsmith-testing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using langsmith-testing should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/langsmith-testing/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/ai-agents/langsmith-testing/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/langsmith-testing/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How langsmith-testing Compares

Feature / Agent	langsmith-testing	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

LangSmith trace validation for RAG observability - every query must be traced

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# 📊 LangSmith Testing SKILL

## Purpose

**CRITICAL for RAG:** Every RAG query MUST be traced in LangSmith for observability. Silent failures (bad retrieval, hallucinations) are only detectable through tracing.

---

## Auto-Trigger Conditions

**Activate when:**
- User mentions: "LangSmith", "trace", "evaluation", "tracking"
- RAG query execution
- Evaluation tasks
- Performance debugging

---

## LangSmith Setup

### 1. Installation

```bash
# Python
pip install langsmith langchain

# Environment variables
export LANGSMITH_API_KEY="your-api-key"
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_PROJECT="rag-demo"
```

### 2. Basic Integration

```python
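import os

from langchain_core.tracers import LangChainTracer
from langsmith import Client

# Initialize client (Client() also reads LANGSMITH_API_KEY from the environment)
client = Client(api_key=os.environ["LANGSMITH_API_KEY"])

# Create tracer bound to the project and client
tracer = LangChainTracer(project_name="rag-demo", client=client)

# Use in chain (`chain` and `query` are your own RAG chain and input)
chain.invoke(query, config={"callbacks": [tracer]})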
```

---

## Trace Collection

### 1. Automatic Tracing (Recommended)

```python
from langchain_core.tracers.context import tracing_v2_enabled

with tracing_v2_enabled(project_name="rag-demo"):
    # All operations automatically traced
    result = rag_chain.invoke(query)
```

### 2. Manual Tracing

```python
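from langchain_core.tracers import LangChainTracer

tracer = LangChainTracer(
    project_name="rag-demo",
    # Optional: link to a dataset example. The value must be that example's
    # UUID; a label such as "example-123" is rejected. `example_id` is a placeholder.
    example_id=example_id
)

result = chain.invoke(query, config={"callbacks": [tracer]})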
```

### 3. Trace Metadata

```python
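from langchain_core.tracers.context import tracing_v2_enabled

# tracing_v2_enabled() takes project_name, example_id, tags, and client, but no
# metadata argument, so run metadata is passed through the invoke config.
with tracing_v2_enabled(project_name="rag-demo"):
    result = chain.invoke(
        query,
        config={
            "metadata": {
                "user_id": "user-123",
                "session_id": "session-456",
                "environment": "production"
            }
        }
    )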
```

---

## Metrics Collection

### 1. Core RAG Metrics

**Collected automatically:**
- **Latency:** Total pipeline time (ms)
- **Token usage:** Input + output tokens
- **Steps:** Number of LLM calls
- **Errors:** Failed operations

**Example trace data (illustrative shape; field names are simplified and do not match the LangSmith run schema exactly):**
```json
{
  "run_id": "abc-123",
  "name": "RAGChain",
  "latency_ms": 1250,
  "total_tokens": 850,
  "prompt_tokens": 600,
  "completion_tokens": 250,
  "steps": [
    {"name": "embedder", "latency_ms": 50},
    {"name": "retriever", "latency_ms": 200},
    {"name": "generator", "latency_ms": 1000}
  ]
}
```
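
Given trace data in the simplified shape above, the slowest step and its share of total latency can be read off directly. This is a minimal sketch over that illustrative JSON, not over the real LangSmith run schema; `slowest_step` is a hypothetical helper name.

```python
import json
from typing import Tuple

# Same illustrative trace as above (simplified field names, not the real schema)
TRACE_JSON = """
{
  "run_id": "abc-123",
  "name": "RAGChain",
  "latency_ms": 1250,
  "steps": [
    {"name": "embedder", "latency_ms": 50},
    {"name": "retriever", "latency_ms": 200},
    {"name": "generator", "latency_ms": 1000}
  ]
}
"""


def slowest_step(trace: dict) -> Tuple[str, float]:
    """Return the name of the slowest step and its share of total latency."""
    slowest = max(trace["steps"], key=lambda step: step["latency_ms"])
    return slowest["name"], slowest["latency_ms"] / trace["latency_ms"]


trace = json.loads(TRACE_JSON)
name, share = slowest_step(trace)
print(f"{name}: {share:.0%} of total latency")  # generator: 80% of total latency
```

In this example the generator accounts for most of the pipeline time, so prompt size or model choice is the first place to look.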

### 2. Custom Metrics (Faithfulness, Relevance)

```python
from langsmith import Client

client = Client()

# After getting RAG result
client.create_feedback(
    run_id=run_id,
    key="faithfulness",
    score=0.85,  # 0-1 scale
    comment="Answer is mostly accurate to context"
)

client.create_feedback(
    run_id=run_id,
    key="relevance",
    score=0.92,  # 0-1 scale
    comment="Retrieved docs are highly relevant"
)
```

**See:** `rag-accuracy-SKILL.md` for metric calculations

---

## Trace Validation

### 1. Query Trace

```python
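from langsmith import Client

client = Client()

# Get a specific run by its UUID (`run_id` is the UUID of the traced query);
# child runs are only populated when requested
run = client.read_run(run_id, load_child_runs=True)

# Latency is derived from the run's timestamps
latency_ms = (run.end_time - run.start_time).total_seconds() * 1000

# Validate trace
assert run.error is None, "Query failed"
assert latency_ms < 2000, "Query too slow"
assert len(run.child_runs or []) >= 3, "Missing pipeline steps"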
```

### 2. Batch Validation

```python
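from datetime import datetime, timedelta, timezone

from langsmith import Client

client = Client()

# Get root runs from the last hour; list_runs() returns an iterator, so materialize it
runs = list(client.list_runs(
    project_name="rag-demo",
    start_time=datetime.now(timezone.utc) - timedelta(hours=1),
    is_root=True
))
finished = [r for r in runs if r.end_time is not None]

# Aggregate metrics (guard against an empty window)
if finished:
    avg_latency = sum(
        (r.end_time - r.start_time).total_seconds() * 1000 for r in finished
    ) / len(finished)
    error_rate = sum(1 for r in finished if r.error) / len(finished)

    print(f"Avg latency: {avg_latency:.0f}ms")
    print(f"Error rate: {error_rate:.2%}")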
```
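
The aggregation step can also be kept separate from the LangSmith client so that it is easy to unit test. The sketch below assumes only that each run exposes `start_time`, `end_time`, and `error`; `RunRecord` and `summarize` are hypothetical names used for illustration, not part of the LangSmith SDK.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


@dataclass
class RunRecord:
    """Hypothetical stand-in for the few run fields the aggregation needs."""
    start_time: datetime
    end_time: Optional[datetime]
    error: Optional[str] = None


def summarize(runs: Iterable[RunRecord]) -> dict:
    """Average latency over finished runs and error rate over all runs."""
    runs = list(runs)
    latencies = [
        (r.end_time - r.start_time).total_seconds() * 1000
        for r in runs
        if r.end_time is not None  # skip runs that are still in flight
    ]
    return {
        "count": len(runs),
        "avg_latency_ms": sum(latencies) / len(latencies) if latencies else None,
        "error_rate": sum(1 for r in runs if r.error) / len(runs) if runs else None,
    }


t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
sample = [
    RunRecord(t0, t0 + timedelta(milliseconds=1000)),
    RunRecord(t0, t0 + timedelta(milliseconds=3000), error="timeout"),
]
print(summarize(sample))
```

Returning `None` for an empty window avoids a division by zero and lets the caller decide whether "no traffic" is itself an alert condition.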

---

## Run Comparisons (A/B Testing)

### 1. Compare Prompt Variants

```python
from langchain_core.tracers.context import tracing_v2_enabled

# Tag each variant through the invoke config;
# tracing_v2_enabled() has no metadata argument
with tracing_v2_enabled(project_name="rag-demo"):
    # Run A: Original prompt
    result_a = chain_v1.invoke(
        query, config={"metadata": {"variant": "prompt-v1"}}
    )

    # Run B: New prompt
    result_b = chain_v2.invoke(
        query, config={"metadata": {"variant": "prompt-v2"}}
    )

# Compare in LangSmith UI
# Filter by the "variant" metadata key to see the performance difference
```

### 2. Automated Comparison

```python
from langsmith import Client

client = Client()

def avg_latency_ms(variant):
    """Average root-run latency for one variant, derived from run timestamps."""
    runs = [
        r for r in client.list_runs(
            project_name="rag-demo",
            is_root=True,
            # LangSmith filter syntax for a metadata key/value pair
            filter=f'and(eq(metadata_key, "variant"), eq(metadata_value, "{variant}"))'
        )
        if r.end_time is not None
    ]
    if not runs:
        raise ValueError(f"No finished runs for variant {variant!r}")
    total_seconds = sum((r.end_time - r.start_time).total_seconds() for r in runs)
    return total_seconds * 1000 / len(runs)

# Compare metrics
v1_latency = avg_latency_ms("prompt-v1")
v2_latency = avg_latency_ms("prompt-v2")

print(f"V1 avg latency: {v1_latency:.0f}ms")
print(f"V2 avg latency: {v2_latency:.0f}ms")
print(f"Improvement: {((v1_latency - v2_latency) / v1_latency) * 100:.1f}%")
```

---

## Dataset Evaluation

### 1. Create Dataset

```python
from langsmith import Client

client = Client()

# Create evaluation dataset
dataset = client.create_dataset("rag-eval-v1")

# Add examples
client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"query": "What is RAG?"},
        {"query": "How does vector search work?"}
    ],
    outputs=[
        {"expected_answer": "RAG is Retrieval Augmented Generation..."},
        {"expected_answer": "Vector search uses embeddings..."}
    ]
)
```

### 2. Run Evaluation

```python
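from langsmith.evaluation import evaluate

# Define evaluator. calculate_faithfulness is not defined here;
# see rag-accuracy-SKILL.md for metric calculations.
def faithfulness_evaluator(run, example):
    score = calculate_faithfulness(
        run.outputs["answer"],
        run.outputs["source_documents"]
    )
    return {"key": "faithfulness", "score": score}

# Run evaluation (the target must return a dict of outputs)
results = evaluate(
    lambda inputs: rag_chain.invoke(inputs["query"]),
    data="rag-eval-v1",
    evaluators=[faithfulness_evaluator],
    experiment_prefix="rag-eval-results"
)

# Aggregate scores from the per-example result rows
scores = [
    r.score
    for row in results
    for r in row["evaluation_results"]["results"]
    if r.key == "faithfulness" and r.score is not None
]
if scores:
    print(f"Avg faithfulness: {sum(scores) / len(scores):.2f}")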
```

---

## Error Detection

### 1. Failed Retrievals

```python
from langsmith import Client

client = Client()

# Find successful root runs that returned no source documents.
# The check runs client-side on each run's outputs.
failed_runs = [
    run for run in client.list_runs(project_name="rag-demo", is_root=True)
    if run.error is None and not (run.outputs or {}).get("source_documents")
]

for run in failed_runs:
    print(f"Query: {run.inputs['query']}")
    print("Reason: No source documents in the run outputs")
```

### 2. High Latency Queries

```python
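from langsmith import Client

client = Client()

def latency_ms(run):
    """Latency in milliseconds, derived from the run's timestamps."""
    return (run.end_time - run.start_time).total_seconds() * 1000

# Find slow queries (>2s)
slow_runs = client.list_runs(
    project_name="rag-demo",
    is_root=True,
    filter='gt(latency, "2s")'
)

for run in slow_runs:
    print(f"Query: {run.inputs['query']}")
    print(f"Latency: {latency_ms(run):.0f}ms")

    # Identify bottleneck (child runs must be loaded explicitly)
    full_run = client.read_run(run.id, load_child_runs=True)
    for child in full_run.child_runs or []:
        if child.end_time and latency_ms(child) > 1000:
            print(f"  Bottleneck: {child.name} ({latency_ms(child):.0f}ms)")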
```

---

## Best Practices

### 1. Always Trace Production

```python
# Enable tracing in production
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "rag-production"

# Sample rate for high-volume apps (optional)
os.environ["LANGCHAIN_TRACING_SAMPLING_RATE"] = "0.1"  # 10% of requests
```

### 2. Use Meaningful Project Names

```python
# Bad: Generic names
project_name = "test"

# Good: Environment + purpose
project_name = f"rag-{environment}-{feature}"  # "rag-prod-hybrid-search"
```

### 3. Add Context with Metadata

```python
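from langchain_core.tracers.context import tracing_v2_enabled

# Metadata goes in the invoke config; tracing_v2_enabled() has no metadata argument
with tracing_v2_enabled(project_name="rag-demo"):
    result = chain.invoke(
        query,
        config={
            "metadata": {
                "user_tier": "premium",
                "retriever_type": "hybrid",
                "llm_model": "claude-3-sonnet",
                "chunk_size": 512
            }
        }
    )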
```

### 4. Monitor Key Metrics

**Daily checks:**
- [ ] Avg latency < 2000ms
- [ ] Error rate < 1%
- [ ] Faithfulness score > 0.7
- [ ] Relevance score > 0.7

**Weekly checks:**
- [ ] Compare A/B test variants
- [ ] Review slow queries
- [ ] Identify common failure patterns
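
The daily checks above can be automated once the metrics have been aggregated. The following is a minimal sketch, assuming the metrics arrive as a plain dict; the key names, `THRESHOLDS`, and `failed_checks` are hypothetical and only mirror the checklist values.

```python
# Thresholds copied from the daily checklist: ("max", x) means the value must
# be below x, ("min", x) means it must be above x.
THRESHOLDS = {
    "avg_latency_ms": ("max", 2000),
    "error_rate": ("max", 0.01),
    "faithfulness": ("min", 0.7),
    "relevance": ("min", 0.7),
}


def failed_checks(metrics: dict) -> list:
    """Return one message per daily check that fails or cannot be evaluated."""
    failures = []
    for key, (kind, limit) in THRESHOLDS.items():
        value = metrics.get(key)
        if value is None:
            failures.append(f"{key}: missing")
        elif kind == "max" and not value < limit:
            failures.append(f"{key}: {value} is not below {limit}")
        elif kind == "min" and not value > limit:
            failures.append(f"{key}: {value} is not above {limit}")
    return failures


daily = {"avg_latency_ms": 1250, "error_rate": 0.005, "faithfulness": 0.6}
for message in failed_checks(daily):
    print(message)
```

Treating a missing metric as a failure keeps a broken feedback pipeline from silently passing the check.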

---

## Integration with Other Skills

### Workflow

```
User Query
   ↓
LangSmith Tracing (THIS SKILL)
   ↓
RAG Pipeline Execution
   ↓
Trace Collection (automatic)
   ↓
Metrics Calculation (rag-accuracy-SKILL.md)
   ↓
Feedback to LangSmith
   ↓
Dashboard Analysis
```

### Related Files

| Skill | Purpose |
|-------|---------|
| `rag-accuracy-SKILL.md` | Calculate faithfulness, relevance scores |
| `e2e-testing-SKILL.md` | E2E tests with LangSmith assertions |
| `llm-integration-SKILL.md` | LLM provider switching (affects traces) |

---

## Troubleshooting

### Issue: Traces not appearing

**Check:**
1. `LANGCHAIN_TRACING_V2=true` set?
2. `LANGSMITH_API_KEY` valid?
3. Project name correct?

**Solution:**
```python
import os
print(os.environ.get("LANGCHAIN_TRACING_V2"))     # Should be "true"
print(bool(os.environ.get("LANGSMITH_API_KEY")))  # Should be True; avoid printing the key itself
```
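
The three checks can be bundled into one function that reports every problem at once without echoing the key. This is a sketch that assumes the environment variable names used in this document; `tracing_config_problems` is a hypothetical helper, and it only verifies that values are present, not that the API key is valid.

```python
import os
from typing import List, Mapping


def tracing_config_problems(env: Mapping[str, str]) -> List[str]:
    """List configuration problems that would stop traces from appearing."""
    problems = []
    if env.get("LANGCHAIN_TRACING_V2", "").lower() != "true":
        problems.append("LANGCHAIN_TRACING_V2 is not set to 'true'")
    if not env.get("LANGSMITH_API_KEY"):
        problems.append("LANGSMITH_API_KEY is missing or empty")
    if not env.get("LANGCHAIN_PROJECT"):
        problems.append("LANGCHAIN_PROJECT is not set; check the default project for traces")
    return problems


for problem in tracing_config_problems(os.environ):
    print(problem)
```

Passing the environment in as a mapping makes the function testable with a plain dict instead of the real process environment.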

### Issue: High latency in traces

**Check:**
1. Which step is slow? (embedder, retriever, generator)
2. Network latency to LangSmith?

**Solution:**
```python
import os

# Disable tracing for testing
os.environ["LANGCHAIN_TRACING_V2"] = "false"
# If latency drops → LangSmith network issue
# If latency same → RAG pipeline issue
```

---

## Example: Complete RAG with LangSmith

```python
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_core.tracers.context import tracing_v2_enabled
from langsmith import Client

# Setup
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "rag-demo"

# Build RAG chain
embeddings = OpenAIEmbeddings()
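# Recent langchain_community versions require opting in to pickle loading;
# only do this for an index you created yourself
vector_store = FAISS.load_local(
    "./index", embeddings, allow_dangerous_deserialization=True
)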
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

llm = ChatOpenAI(model="gpt-4", temperature=0)
chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)

# Query with tracing
query = "What is retrieval augmented generation?"

with tracing_v2_enabled(project_name="rag-demo") as cb:
    # Metadata goes in the invoke config; tracing_v2_enabled() has no metadata argument
    result = chain.invoke(query, config={"metadata": {"environment": "demo"}})
    run_id = cb.latest_run.id   # root run recorded by the tracer
    run_url = cb.get_run_url()  # URL of that run in the LangSmith UI

# Add feedback
client = Client()
client.create_feedback(
    run_id=run_id,
    key="faithfulness",
    score=0.9,
    comment="Accurate answer"
)

print(f"Answer: {result['result']}")
print(f"Trace: {run_url}")
```

---

**Last Updated:** 2025-12-04
**Version:** 1.0
**Priority:** CRITICAL (Auto-loads for all RAG queries)

Related Skills

minitest-testing

from diegosouzapw/awesome-omni-skill

Write, review, and improve Minitest tests for Ruby on Rails applications. Covers model tests, controller tests, system tests, fixtures, and best practices from Rails Testing Guide.

langsmith-fetch

from diegosouzapw/awesome-omni-skill

Debug LangChain and LangGraph agents by fetching execution traces from LangSmith Studio. Use when debugging agent behavior, investigating errors, analyzing tool calls, checking memory operations, or examining agent performance. Automatically fetches recent traces and analyzes execution patterns. Requires langsmith-fetch CLI installed.

ai-powered-pentesting

from diegosouzapw/awesome-omni-skill

Guide for AI-powered penetration testing tools, red teaming frameworks, and autonomous security agents.

ab-testing-analyzer

from diegosouzapw/awesome-omni-skill

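A comprehensive A/B test analysis tool that supports experiment design, statistical testing, user segmentation analysis, and visual report generation. Use it to analyze A/B test results for product redesigns, marketing campaigns, feature optimizations, and similar changes, with statistical significance testing and in-depth insights.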

cli-e2e-testing

from diegosouzapw/awesome-omni-skill

CLI E2E testing patterns with BATS - parallelization, state sharing, and timeout management

bats-testing-patterns

from diegosouzapw/awesome-omni-skill

Comprehensive guide for writing shell script tests using Bats (Bash Automated Testing System). Use when writing or improving tests for Bash/shell scripts, creating test fixtures, mocking commands, or setting up CI/CD for shell script testing. Includes patterns for assertions, setup/teardown, mocking, fixtures, and integration with GitHub Actions.

adb-device-testing

from diegosouzapw/awesome-omni-skill

Use when testing Android apps on ADB-connected devices/emulators - UI automation, screenshots, location spoofing, navigation, app management. Triggers on ADB, emulator, Android testing, location mock, UI test, screenshot walkthrough.

sqlmap-database-pentesting

from diegosouzapw/awesome-omni-skill

This skill should be used when the user asks to "automate SQL injection testing," "enumerate database structure," "extract database credentials using sqlmap," "dump tables and columns...

sql-injection-testing

from diegosouzapw/awesome-omni-skill

This skill should be used when the user asks to "test for SQL injection vulnerabilities", "perform SQLi attacks", "bypass authentication using SQL injection", "extract database inform...

Contract Testing Pact

from diegosouzapw/awesome-omni-skill

Contract testing validates that service consumers and providers agree on request/response expectations. Pact implements consumer-driven contracts (CDC) with shareable pact files and provider verificat

Async Testing Expert

from diegosouzapw/awesome-omni-skill

Comprehensive pytest skill for async Python testing with proper mocking, fixtures, and patterns from production-ready test suites. Use when writing or improving async tests for Python applications, especially FastAPI backends with database interactions.

api-testing

from diegosouzapw/awesome-omni-skill

Test FastAPI endpoints with pytest and generate API documentation. Use when creating new APIs or verifying existing endpoints work correctly.