# Braintrust — AI Evaluation and Observability
You are an expert in Braintrust, the evaluation and observability platform for AI applications. You help developers run systematic evaluations, compare model versions, track experiments, log production traces, and measure quality metrics — with a focus on making AI development as rigorous as traditional software testing.
## Best use case
Braintrust is best used when you need a repeatable AI agent workflow instead of a one-off prompt. Teams using this skill should expect more consistent output, faster repeated execution, and less prompt rewriting.
## When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
## When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
## Installation
### Manual installation (Claude Code / Cursor / Codex)
- Download `SKILL.md` from GitHub
- Place it in `.claude/skills/braintrust/SKILL.md` inside your project
- Restart your AI agent — it will auto-discover the skill
## How Braintrust — AI Evaluation and Observability Compares
| Feature | This skill | Standard approach |
|---|---|---|
| Platform support | Claude Code, Cursor, Codex | Limited / varies |
| Context awareness | High | Baseline |
| Installation complexity | Low (a single `SKILL.md` file) | N/A |
## Frequently Asked Questions
### What does this skill do?
It makes your agent an expert in Braintrust, the evaluation and observability platform for AI applications. The agent helps you run systematic evaluations, compare model versions, track experiments, log production traces, and measure quality metrics, with a focus on making AI development as rigorous as traditional software testing.
### Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
## SKILL.md Source
# Braintrust — AI Evaluation and Observability
You are an expert in Braintrust, the evaluation and observability platform for AI applications. You help developers run systematic evaluations, compare model versions, track experiments, log production traces, and measure quality metrics — with a focus on making AI development as rigorous as traditional software testing.
## Core Capabilities
```typescript
import { Eval } from "braintrust";
// Built-in scorers live in the companion autoevals package
import { Factuality, ClosedQA } from "autoevals";

// The SDK reads BRAINTRUST_API_KEY from the environment.

// Run an evaluation — each run appears as an experiment in the dashboard
await Eval("support-chatbot", {
  data: () => [
    { input: "How do I reset my password?", expected: "Go to Settings > Security > Reset Password" },
    { input: "What's the pricing?", expected: "Plans start at $29/month" },
    { input: "I need a refund", expected: "Contact support at help@example.com" },
  ],
  task: async (input) => {
    const response = await callChatbot(input); // your application's chat function
    return response.text;
  },
  scores: [
    // Built-in LLM-based scorers
    Factuality, // does the output match the expected facts?
    ClosedQA,   // is the answer correct given the context?
    // Custom scorer — scorers receive a single { input, output, expected } object
    ({ output, expected }) => {
      const containsKey = expected
        .toLowerCase()
        .split(" ")
        .some((word) => output.toLowerCase().includes(word));
      return { name: "keyword_match", score: containsKey ? 1 : 0 };
    },
  ],
});

// Results are visible in the Braintrust dashboard with diffs, regressions, and improvements
```
```python
# Python
from braintrust import Eval
# Factuality and the RAGAS-based AnswerRelevancy scorer come from autoevals
# (autoevals has no plain "Relevance" scorer)
from autoevals import AnswerRelevancy, Factuality

# test_pairs and rag_pipeline are your own test data and application code
Eval(
    "rag-pipeline",
    data=lambda: [{"input": q, "expected": a} for q, a in test_pairs],
    task=lambda input: rag_pipeline.query(input),
    scores=[Factuality, AnswerRelevancy],
)
```
## Installation
```bash
npm install braintrust autoevals
# or
pip install braintrust autoevals
```
## Best Practices
1. **Eval-driven development** — Write evals first, then iterate on prompts/models; measure before optimizing
2. **Built-in scorers** — Use `Factuality`, `ClosedQA`, and the relevancy scorers from `autoevals` for LLM-based quality scoring
3. **Custom scorers** — Add domain-specific metrics; combine with built-in for comprehensive evaluation
4. **Experiments** — Each eval run is an experiment; compare side-by-side in dashboard
5. **Production logging** — Use `braintrust.traced()` for production observability; same dashboard as evals (see the first sketch after this list)
6. **CI integration** — Run evals in CI; fail builds on quality regressions
7. **Dataset management** — Store test datasets in Braintrust; version and share them across the team (see the second sketch after this list)
8. **A/B comparison** — Compare two model versions on the same dataset; statistical significance is reported
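Braintrust's `traced` wrapper (best practice 5) sends production traffic to the same project that holds your evals. Below is a minimal sketch, assuming `BRAINTRUST_API_KEY` is set in the environment; the project name, span name, and `callChatbot` helper are illustrative placeholders, not part of the Braintrust API:

```typescript
import { initLogger, traced } from "braintrust";

declare function callChatbot(q: string): Promise<{ text: string }>; // placeholder app code

// Initialize the production logger; credentials come from BRAINTRUST_API_KEY
initLogger({ projectName: "support-chatbot" });

export async function handleChat(question: string): Promise<string> {
  // traced() wraps the handler in a span that is logged to Braintrust
  return traced(
    async (span) => {
      const response = await callChatbot(question);
      span.log({ input: question, output: response.text });
      return response.text;
    },
    { name: "handle-chat" }
  );
}
```

Because traces and experiments share a dashboard, a production failure can be copied into a dataset and replayed as a regression test.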
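For dataset management (best practice 7), `initDataset` can feed a Braintrust-hosted dataset into an eval in place of an inline array. A sketch under the same assumptions; the project and dataset names are hypothetical:

```typescript
import { Eval, initDataset } from "braintrust";
import { Factuality } from "autoevals";

declare function callChatbot(q: string): Promise<{ text: string }>; // placeholder app code

await Eval("support-chatbot", {
  // Pull versioned, team-shared test cases from Braintrust instead of hardcoding them
  data: initDataset("support-chatbot", { dataset: "golden-questions" }),
  task: async (input) => (await callChatbot(input)).text,
  scores: [Factuality],
});
```

Storing the cases in Braintrust means edits to them are versioned, so successive experiments stay comparable across the team.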
## Related Skills

### model-evaluation-metrics
Model Evaluation Metrics - Auto-activating skill for ML Training. Triggers on: model evaluation metrics. Part of the ML Training skill category.
### exa-observability
Set up monitoring, metrics, and alerting for Exa search integrations. Use when implementing monitoring for Exa operations, building dashboards, or configuring alerting for search quality and latency. Trigger with phrases like "exa monitoring", "exa metrics", "exa observability", "monitor exa", "exa alerts", "exa dashboard".
### evernote-observability
Implement observability for Evernote integrations. Use when setting up monitoring, logging, tracing, or alerting for Evernote applications. Trigger with phrases like "evernote monitoring", "evernote logging", "evernote metrics", "evernote observability".
### documenso-observability
Implement monitoring, logging, and tracing for Documenso integrations. Use when setting up observability, implementing metrics collection, or debugging production issues. Trigger with phrases like "documenso monitoring", "documenso metrics", "documenso logging", "documenso tracing", "documenso observability".
### deepgram-observability
Set up comprehensive observability for Deepgram integrations. Use when implementing monitoring, setting up dashboards, or configuring alerting for Deepgram integration health. Trigger: "deepgram monitoring", "deepgram metrics", "deepgram observability", "monitor deepgram", "deepgram alerts", "deepgram dashboard".
### databricks-observability
Set up comprehensive observability for Databricks with metrics, traces, and alerts. Use when implementing monitoring for Databricks jobs, setting up dashboards, or configuring alerting for pipeline health. Trigger with phrases like "databricks monitoring", "databricks metrics", "databricks observability", "monitor databricks", "databricks alerts", "databricks logging".
### customerio-observability
Set up Customer.io monitoring and observability. Use when implementing metrics, structured logging, alerting, or Grafana dashboards for Customer.io integrations. Trigger: "customer.io monitoring", "customer.io metrics", "customer.io dashboard", "customer.io alerts", "customer.io observability".
### coreweave-observability
Set up GPU monitoring and observability for CoreWeave workloads. Use when implementing GPU metrics dashboards, configuring alerts, or tracking inference latency and throughput. Trigger with phrases like "coreweave monitoring", "coreweave observability", "coreweave gpu metrics", "coreweave grafana".
### cohere-observability
Set up comprehensive observability for Cohere API v2 with metrics, traces, and alerts. Use when implementing monitoring for Chat/Embed/Rerank operations, setting up dashboards, or configuring alerts for Cohere integrations. Trigger with phrases like "cohere monitoring", "cohere metrics", "cohere observability", "monitor cohere", "cohere alerts", "cohere tracing".
### coderabbit-observability
Monitor CodeRabbit review effectiveness with metrics, dashboards, and alerts. Use when tracking review coverage, measuring comment acceptance rates, or building dashboards for CodeRabbit adoption across your organization. Trigger with phrases like "coderabbit monitoring", "coderabbit metrics", "coderabbit observability", "monitor coderabbit", "coderabbit alerts", "coderabbit dashboard".
### clickup-observability
Monitor ClickUp API integrations with metrics, tracing, structured logging, and alerting using Prometheus, OpenTelemetry, and Grafana. Trigger: "clickup monitoring", "clickup metrics", "clickup observability", "monitor clickup", "clickup alerts", "clickup tracing", "clickup dashboard".
### clickhouse-observability
Monitor ClickHouse with Prometheus metrics, Grafana dashboards, system table queries, and alerting for query performance, merge health, and resource usage. Use when setting up ClickHouse monitoring, building Grafana dashboards, or configuring alerts for production ClickHouse deployments. Trigger: "clickhouse monitoring", "clickhouse metrics", "clickhouse Grafana", "clickhouse observability", "monitor clickhouse", "clickhouse Prometheus".