Weave — AI Application Tracking by Weights & Biases
You are an expert in Weave, the lightweight toolkit by Weights & Biases for tracking and evaluating AI applications. You help developers trace LLM calls, evaluate outputs, compare model versions, track experiments, and debug AI pipelines — with automatic logging via decorators and a visual dashboard for exploring traces, costs, and quality metrics.
Best use case
Weave is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using Weave should expect more consistent outputs, faster repeated execution, and less prompt rewriting.
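For instance, the core loop is just an init call plus a decorator. A minimal sketch (mirroring the SKILL.md source below; the function body is a stand-in for a real LLM call):
```python
import weave

weave.init("my-ai-project")  # one-time setup; the project name appears in the dashboard

@weave.op()
def summarize(text: str) -> str:
    """Any decorated function is logged as a trace with its inputs and outputs."""
    return text[:100]  # stand-in for an LLM call

summarize("Weave records the inputs, output, and latency of this call.")
```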
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Works with Claude Code, Cursor, and Codex.
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/weave/SKILL.md` inside your project
- Restart your AI agent — it will auto-discover the skill
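To confirm the placement, a quick check (a sketch; the path is the one from the steps above):
```python
from pathlib import Path

# Path from the manual-installation steps above.
skill = Path(".claude/skills/weave/SKILL.md")
print("skill installed:", skill.exists())
```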
How Weave — AI Application Tracking by Weights & Biases Compares
| Feature / Agent | Weave | Standard Approach |
|---|---|---|
| Platform Support | Python (`pip install weave`) | Limited / Varies |
| Context Awareness | High: hierarchical traces with tokens, latency, and cost per call | Baseline |
| Installation Complexity | Low: one package, one decorator, no separate infrastructure | N/A |
Frequently Asked Questions
What does this skill do?
This skill turns your AI agent into an expert in Weave, the lightweight toolkit by Weights & Biases for tracking and evaluating AI applications. It helps you trace LLM calls, evaluate outputs, compare model versions, track experiments, and debug AI pipelines — with automatic logging via decorators and a visual dashboard for exploring traces, costs, and quality metrics.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Weave — AI Application Tracking by Weights & Biases
You are an expert in Weave, the lightweight toolkit by Weights & Biases for tracking and evaluating AI applications. You help developers trace LLM calls, evaluate outputs, compare model versions, track experiments, and debug AI pipelines — with automatic logging via decorators and a visual dashboard for exploring traces, costs, and quality metrics.
## Core Capabilities
### Automatic Tracing
```python
import json

import weave
import openai
weave.init("my-ai-project") # Initialize with project name
client = openai.OpenAI()
# OpenAI calls are automatically traced
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Explain transformers"}],
)
# Weave captures: model, tokens, latency, cost, input/output — viewable in dashboard
# Custom function tracing
@weave.op()
def extract_entities(text: str) -> list[str]:
"""Extract named entities from text."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": f"Extract entities from: {text}\nReturn JSON list."}],
)
return json.loads(response.choices[0].message.content)
@weave.op()
def retrieve_documents(query: str) -> list[str]:
    """Placeholder retriever (hypothetical); swap in a real vector-store or search lookup."""
    return [f"Document relevant to: {query}"]

@weave.op()
def rag_pipeline(query: str) -> str:
    """Full RAG pipeline — each step traced as child span."""
    docs = retrieve_documents(query)  # Traced as a child span because it's decorated
context = "\n".join(docs)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": f"Answer using:\n{context}"},
{"role": "user", "content": query},
],
)
return response.choices[0].message.content
```
### Evaluations
```python
# Define evaluation dataset
eval_dataset = [
{"query": "What is Python?", "expected": "programming language"},
{"query": "Who created Linux?", "expected": "Linus Torvalds"},
{"query": "What is Docker?", "expected": "containerization platform"},
]
# Define scoring functions
@weave.op()
def relevance_scorer(output: str, expected: str) -> dict:
"""Score if output contains expected information."""
contains = expected.lower() in output.lower()
return {"relevance": 1.0 if contains else 0.0}
@weave.op()
def length_scorer(output: str) -> dict:
"""Score response length (prefer concise)."""
words = len(output.split())
return {"conciseness": min(1.0, 50 / max(words, 1))}
# Run evaluation
evaluation = weave.Evaluation(
dataset=eval_dataset,
scorers=[relevance_scorer, length_scorer],
)
results = await evaluation.evaluate(rag_pipeline)
# Results visible in Weave dashboard with per-example scores
# Compare across model versions, prompts, parameters
```
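`evaluation.evaluate` is a coroutine, which is why the call above uses `await`. In a notebook you can await it directly; in a plain script, drive it explicitly:
```python
import asyncio

# Equivalent to the `await` above when running as a regular script.
results = asyncio.run(evaluation.evaluate(rag_pipeline))
```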
### Model Versioning
```python
# Track model/prompt versions
class SupportAgent(weave.Model):
model_name: str = "gpt-4o"
system_prompt: str = "You are a helpful support agent."
temperature: float = 0.7
@weave.op()
def predict(self, query: str) -> str:
response = client.chat.completions.create(
model=self.model_name,
temperature=self.temperature,
messages=[
{"role": "system", "content": self.system_prompt},
{"role": "user", "content": query},
],
)
return response.choices[0].message.content
# Version 1
agent_v1 = SupportAgent(system_prompt="Be concise and helpful.")
# Version 2 — compare in dashboard
agent_v2 = SupportAgent(model_name="gpt-4o-mini", system_prompt="Be detailed and empathetic.")
# Evaluate both versions
for agent in [agent_v1, agent_v2]:
await evaluation.evaluate(agent)
# Dashboard shows side-by-side comparison: quality, cost, latency
```
## Installation
```bash
pip install weave
# Uses your W&B account — set WANDB_API_KEY
```
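Once installed and authenticated, auto-instrumentation (Best Practice 2 below) needs nothing beyond `weave.init()`. A sketch with the Anthropic SDK, assuming `anthropic` is installed and `ANTHROPIC_API_KEY` is set:
```python
import weave
import anthropic

weave.init("my-ai-project")  # patches supported SDKs after init

client = anthropic.Anthropic()
# Traced automatically: model, tokens, latency, and cost, with no decorator.
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain transformers briefly."}],
)
print(message.content[0].text)
```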
## Best Practices
1. **@weave.op() decorator** — Add to any function to trace it; creates hierarchical spans for nested calls
2. **Auto-instrumentation** — OpenAI, Anthropic, LangChain calls traced automatically after `weave.init()`
3. **Evaluations** — Define datasets + scorers; run systematically; compare versions in dashboard
4. **weave.Model** — Subclass for versioned models; parameters tracked, comparable across evaluations
5. **W&B integration** — Weave data appears in your W&B workspace; share with team, add to reports
6. **Cost tracking** — Automatic per-call cost calculation; aggregate by function, model, or user
7. **Production monitoring** — Use in production for continuous quality tracking; alert on regressions (see the sketch after this list)
8. **Lightweight** — Single `@weave.op()` decorator; no complex setup, no separate infrastructure
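For item 7, a minimal production-monitoring sketch, assuming a separate W&B project for live traffic and reusing `rag_pipeline` from above:
```python
import weave

weave.init("my-ai-project-prod")  # hypothetical separate project for production traffic

@weave.op()
def handle_request(user_query: str) -> str:
    """Production entrypoint; every call is traced for later quality review."""
    return rag_pipeline(user_query)  # the pipeline defined earlier in this SKILL.md
```
Related Skills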
tracking-token-launches
Track new token launches across DEXes with risk analysis and contract verification. Use when discovering new token launches, monitoring IDOs, or analyzing token contracts. Trigger with phrases like "track launches", "find new tokens", "new pairs on uniswap", "token risk analysis", or "monitor IDOs".
tracking-service-reliability
Define and track SLAs, SLIs, and SLOs for service reliability including availability, latency, and error rates. Use when establishing reliability targets or monitoring service health. Trigger with phrases like "define SLOs", "track SLI metrics", or "calculate error budget".
tracking-resource-usage
Track and optimize resource usage across application stack including CPU, memory, disk, and network I/O. Use when identifying bottlenecks or optimizing costs. Trigger with phrases like "track resource usage", "monitor CPU and memory", or "optimize resource allocation".
tracking-model-versions
This skill enables the AI assistant to track and manage AI/ML model versions using the model-versioning-tracker plugin. It should be used when the user asks to manage model versions, track model lineage, log model performance, or implement version control f... Use when appropriate context detected. Trigger with relevant phrases based on skill purpose.
tracking-crypto-prices
Track real-time cryptocurrency prices across exchanges with historical data and alerts. Provides price data infrastructure for dependent skills (portfolio, tax, DeFi, arbitrage). Use when checking crypto prices, monitoring markets, or fetching historical price data. Trigger with phrases like "check price", "BTC price", "crypto prices", "price history", "get quote for", "what's ETH trading at", "show me top coins", or "track my watchlist".
tracking-crypto-portfolio
Track cryptocurrency portfolio with real-time valuations, allocation analysis, and P&L tracking. Use when checking portfolio value, viewing holdings breakdown, analyzing allocations, or exporting portfolio data. Trigger with phrases like "show my portfolio", "check crypto holdings", "portfolio allocation", "track my crypto", or "export portfolio".
tracking-crypto-derivatives
Track cryptocurrency futures, options, and perpetual swaps with funding rates, open interest, liquidations, and comprehensive derivatives market analysis. Use when monitoring derivatives markets, analyzing funding rates, tracking open interest, finding liquidation levels, or researching options flow. Trigger with phrases like "funding rate", "open interest", "perpetual swap", "futures basis", "liquidation levels", "options flow", "put call ratio", "derivatives analysis", or "BTC perps".
tracking-application-response-times
Track and optimize application response times across API endpoints, database queries, and service calls. Use when monitoring performance or identifying bottlenecks. Trigger with phrases like "track response times", "monitor API performance", or "analyze latency".
setting-up-experiment-tracking
Implement machine learning experiment tracking using MLflow or Weights & Biases. Configures environment and provides code for logging parameters, metrics, and artifacts. Use when asked to "setup experiment tracking" or "initialize MLflow". Trigger with relevant phrases based on skill purpose.
tracking-regression-tests
This skill enables Claude to track and run regression tests, ensuring new changes don't break existing functionality. It is triggered when the user asks to "track regression", "run regression tests", or uses the shortcut "reg". The skill helps in maintaining code stability by identifying critical tests, automating their execution, and analyzing the impact of changes. It also provides insights into test history and identifies flaky tests. The skill uses the `regression-test-tracker` plugin.
profiling-application-performance
This skill enables the AI assistant to profile application performance, analyzing CPU usage, memory consumption, and execution time. It is triggered when the user requests performance analysis, bottleneck identification, or optimization recommendations. The... Use when optimizing performance. Trigger with phrases like 'optimize', 'performance', or 'speed up'.
mlflow-tracking-setup
MLflow Tracking Setup - Auto-activating skill for ML Training. Triggers on: mlflow tracking setup. Part of the ML Training skill category.