cost-aware-pipeline

Cost optimization for LLM pipelines including model routing, prompt caching, token budgets, and retry logic for Claude API usage.

39 stars

byInugamiDev

View on GitHub Installation ↓

Best use case

cost-aware-pipeline is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Cost optimization for LLM pipelines including model routing, prompt caching, token budgets, and retry logic for Claude API usage.

Teams using cost-aware-pipeline should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/cost-aware-pipeline/SKILL.md --create-dirs "https://raw.githubusercontent.com/InugamiDev/ultrathink-oss/main/.claude/skills/cost-aware-pipeline/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/cost-aware-pipeline/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How cost-aware-pipeline Compares

Feature / Agent	cost-aware-pipeline	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Cost optimization for LLM pipelines including model routing, prompt caching, token budgets, and retry logic for Claude API usage.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Cost-Aware Pipeline — LLM Cost Optimization

Optimize cost, latency, and quality across LLM pipelines.

## Model Routing by Task

Route tasks to the cheapest model that meets quality requirements.

| Task Type | Model | Why | Cost/MTok (input) |
|-----------|-------|-----|-------------------|
| Classification, extraction | Haiku | Fast, cheap, sufficient | $0.25 |
| Summarization, simple Q&A | Haiku | Good enough quality | $0.25 |
| Code generation, refactoring | Sonnet | Best code quality/cost ratio | $3.00 |
| Code review, debugging | Sonnet | Solid reasoning for code | $3.00 |
| Architecture, planning | Opus | Deep reasoning needed | $15.00 |
| Complex analysis, research | Opus | Multi-step reasoning | $15.00 |
| Safety-critical decisions | Opus | Highest reliability | $15.00 |

**UltraThink note**: Per user preference — Opus for thinking/planning, Sonnet for coding/implementing. No Haiku for user-facing tasks (Haiku only for internal pipeline stages).

### Routing Logic

```
if task.requires_deep_reasoning:
    model = "opus"
elif task.is_code or task.is_implementation:
    model = "sonnet"
elif task.is_simple_extraction or task.is_classification:
    model = "haiku"
else:
    model = "sonnet"  # safe default
```

## Prompt Caching

Cache static context to reduce costs on repeated calls.

### What to Cache
- System prompts (amortized across many calls)
- Long documents being analyzed (multiple questions against same doc)
- Few-shot examples (reused across similar tasks)
- Tool schemas (same across all calls in a session)

### Cache Strategy
```
# Mark cache breakpoints in API calls
system_prompt = [
  {"type": "text", "text": static_instructions, "cache_control": {"type": "ephemeral"}},
  {"type": "text", "text": dynamic_context}
]
```

**Cache pricing** (Claude):
- Cache write: 1.25× base input price
- Cache read: 0.1× base input price (90% savings)
- Cache TTL: 5 minutes (refreshed on use)

## Token Budget Management

### Per-Request Budgets

```
max_tokens_by_task = {
  "classification": 100,
  "extraction": 500,
  "code_generation": 4000,
  "analysis": 2000,
  "planning": 3000,
}
```

### Session Budget Tracking

```python
class BudgetTracker:
    def __init__(self, max_cost_usd: float):
        self.max_cost = max_cost_usd
        self.spent = 0.0

    def can_proceed(self, estimated_cost: float) -> bool:
        return self.spent + estimated_cost <= self.max_cost

    def record(self, input_tokens: int, output_tokens: int, model: str):
        self.spent += calculate_cost(input_tokens, output_tokens, model)
```

### Cost Estimation

```
estimated_cost = (input_tokens × input_price + output_tokens × output_price) / 1_000_000
```

## Retry Logic

Only retry on transient errors. Never retry on:
- 400 (bad request) — fix the request
- 401/403 (auth) — fix credentials
- 429 sustained — back off and reduce rate

### Retry Strategy

```
retryable_errors = [429, 500, 502, 503, 529]

for attempt in range(max_retries):
    try:
        response = call_api(...)
        break
    except APIError as e:
        if e.status not in retryable_errors:
            raise  # Don't retry non-transient errors
        wait = min(base_delay * (2 ** attempt), max_delay)
        sleep(wait + random_jitter)
```

## Pricing Reference (Claude, as of 2025)

| Model | Input/MTok | Output/MTok | Context |
|-------|-----------|------------|---------|
| Haiku 3.5 | $0.80 | $4.00 | 200K |
| Sonnet 4 | $3.00 | $15.00 | 200K |
| Opus 4 | $15.00 | $75.00 | 200K |

*Extended thinking multiplies output cost. Prompt caching reduces input cost by up to 90%.*

## Pipeline Design Patterns

### Cascade (cheap → expensive)
Try Haiku first. If confidence < threshold, escalate to Sonnet. If still uncertain, escalate to Opus.
**Saves**: 60-80% on tasks where Haiku suffices.

### Fan-out (parallel cheap, merge expensive)
Run N Haiku calls in parallel, merge results with one Sonnet call.
**Saves**: Avoids one expensive call for embarrassingly parallel tasks.

### Critic Loop (generate cheap, review expensive)
Generate with Sonnet, review with Opus. Fix with Sonnet. Repeat until Opus approves.
**Saves**: Opus only reads, never generates (output tokens are 5× more expensive).

## UltraThink Integration

- Use with `autonomous-loops` to set cost budgets on loop patterns
- Use with `context-budget` to audit where tokens are being consumed
- VFS reduces token consumption by 60-98% — always prefer over full file reads
- Memory system avoids re-discovering context across sessions (amortized cost)
- Tekiō tracks cost patterns: if a loop consistently overruns budget, it adapts

Related Skills

/forge — Product Builder Pipeline (GSD Template)

from InugamiDev/ultrathink-oss

> GSD template for product-building lifecycle. Forge = GSD with product-specific presets.

ultrathink

from InugamiDev/ultrathink-oss

UltraThink Workflow OS — 4-layer skill mesh with persistent memory and privacy hooks for complex engineering tasks. Routes prompts through intent detection to activate the right domain skills automatically.

ultrathink_review

from InugamiDev/ultrathink-oss

Multi-pass code review powered by UltraThink's quality gate — checks correctness, security (OWASP), performance, readability, and project conventions in a single structured pass.

ultrathink_memory

from InugamiDev/ultrathink-oss

Persistent memory system for UltraThink — search, save, and recall project context, decisions, and patterns across sessions using Postgres-backed fuzzy search with synonym expansion.

ui-design

from InugamiDev/ultrathink-oss

Comprehensive UI design system: 230+ font pairings, 48 themes, 65 design systems, 23 design languages, 30 UX laws, 14 color systems, Swiss grid, Gestalt principles, Pencil.dev workflow. Inherits ui-ux-pro-max (99 UX rules) + impeccable-frontend-design (anti-AI-slop). Triggers on any design, UI, layout, typography, color, theme, or styling task.

Zod

from InugamiDev/ultrathink-oss

> TypeScript-first schema validation with static type inference.

webinar-registration-page

from InugamiDev/ultrathink-oss

Build a webinar or live event registration page as a self-contained HTML file with countdown timer, speaker bio, agenda, and registration form. Triggers on: "build a webinar registration page", "create a webinar sign-up page", "event registration landing page", "live training registration page", "workshop sign-up page", "create a webinar page", "build an event page", "free webinar landing page", "live demo registration page", "online event page", "create a registration page for my webinar", "build a training event page".

webhooks

from InugamiDev/ultrathink-oss

Webhook design patterns — delivery, retry with exponential backoff, HMAC signature verification, payload validation, idempotency keys

web-workers

from InugamiDev/ultrathink-oss

Offload heavy computation from the main thread using Web Workers, SharedWorkers, and Comlink — structured messaging, transferable objects, and off-main-thread architecture patterns

web-vitals

from InugamiDev/ultrathink-oss

Core Web Vitals monitoring (LCP, FID, CLS, INP, TTFB), measurement with web-vitals library, reporting to analytics, and optimization strategies for Next.js

web-components

from InugamiDev/ultrathink-oss

Native Web Components, custom elements API, Shadow DOM, HTML templates, slots, lifecycle callbacks, and framework-agnostic design patterns

wasm

from InugamiDev/ultrathink-oss

WebAssembly integration — Rust to WASM with wasm-pack/wasm-bindgen, WASI, browser usage, server-side WASM, and performance considerations