LLM Cost Optimizer

Audits an AI application for unnecessary token spend and recommends prompt caching, model routing, and token reduction techniques to cut costs.

by Notysoty

Best use case

LLM Cost Optimizer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using LLM Cost Optimizer should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/llm-cost-optimizer/SKILL.md --create-dirs "https://raw.githubusercontent.com/Notysoty/openagentskills/main/skills/llm-cost-optimizer/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/llm-cost-optimizer/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How LLM Cost Optimizer Compares

Feature / Agent	LLM Cost Optimizer	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Audits an AI application for unnecessary token spend and recommends prompt caching, model routing, and token reduction techniques to cut costs.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# LLM Cost Optimizer

## What this skill does

This skill audits an LLM application's prompts, call patterns, and model selection to identify cost reduction opportunities. It covers prompt caching, model routing (right-sizing), token reduction, batching, and output length control — the techniques that typically cut LLM costs by 40–80% without sacrificing quality.

## How to use

### Claude Code / Cline

Copy this file to `.agents/skills/llm-cost-optimizer/SKILL.md` in your project root.

Then ask:
- *"Use the LLM Cost Optimizer to audit our AI application."*
- *"How can I reduce our OpenAI API costs? Here are our prompts..."*

Provide:
- Your system prompt(s)
- Approximate daily call volume
- Which model(s) you're using
- Typical input/output token counts if known
- Whether calls are real-time (low latency required) or batch (latency tolerant)

### Cursor / Codex

Paste your prompts, call patterns, and current monthly spend alongside these instructions.

## The Prompt / Instructions for the Agent

When asked to optimize LLM costs, audit the following areas in order of typical savings impact:

### Audit 1 — Prompt Caching (savings: 50–90% on repeated prefixes)

**Check:** Does the system prompt stay the same across calls?

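If yes, enable prompt caching. The system prompt is still sent with every request, but once the prefix is cached, subsequent calls pay a reduced rate for those tokens and full price only for the new user tokens. On Anthropic, cache reads are billed at roughly 10% of the base input price, while the initial cache write costs about 25% more than base.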

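```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Anthropic Claude: mark the system prompt as a cacheable prefix
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,  # required by the Messages API
    system=[{
        "type": "text",
        "text": your_system_prompt,
        "cache_control": {"type": "ephemeral"}  # default cache lifetime is 5 minutes
    }],
    messages=[{"role": "user", "content": user_message}]
)
# Cache activity is reported in response.usage
# (cache_creation_input_tokens, cache_read_input_tokens)

# OpenAI: prompt caching is automatic for prompts of 1024 tokens or more.
# No code change needed; check usage.prompt_tokens_details.cached_tokens
```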

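**When it applies:** Any app where the system prompt is at least ~1024 tokens (the minimum cacheable length varies by provider and model) and is reused across calls within the cache lifetime. Typical cases: support bots, coding assistants, document analyzers.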

**Savings estimate:** If system prompt = 2000 tokens, 10,000 calls/day → ~20M input tokens/day are billed at the discounted cached rate instead of full price.
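
The estimate above can be turned into a rough dollar comparison. This is a minimal sketch under stated assumptions: the $3 per million tokens price is a placeholder, the 0.10 read and 1.25 write multipliers follow Anthropic's published 5-minute cache pricing but should be checked against your provider, and `daily_prefix_cost` and the 95% hit rate are hypothetical.

```python
def daily_prefix_cost(prefix_tokens, calls_per_day, price_per_mtok,
                      hit_rate=0.0, read_mult=0.10, write_mult=1.25):
    """Daily input cost (USD) of the shared prompt prefix.

    hit_rate is the fraction of calls served from cache; misses pay the
    cache-write rate. With hit_rate=0 and write_mult=1 this is the
    uncached cost.
    """
    per_call = prefix_tokens * price_per_mtok / 1_000_000
    hits = calls_per_day * hit_rate
    misses = calls_per_day - hits
    return per_call * (hits * read_mult + misses * write_mult)


# Placeholder price: $3 per million input tokens (an assumption, not real pricing)
uncached = daily_prefix_cost(2000, 10_000, 3.0, hit_rate=0.0, write_mult=1.0)
cached = daily_prefix_cost(2000, 10_000, 3.0, hit_rate=0.95)
print(f"uncached ${uncached:.2f}/day, cached ${cached:.2f}/day")
```

The hit rate matters: a cache that expires between calls pays the write premium every time, so low-traffic routes may see little or no benefit.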

### Audit 2 — Model Right-Sizing (savings: 60–90% on over-specified models)

**Check:** Are you using a frontier model (GPT-4o, Claude Opus) for tasks that a smaller model handles just as well?

| Task | Recommended Model |
|---|---|
| Classification, routing, yes/no decisions | GPT-4o-mini, Claude Haiku |
| Summarization, extraction, translation | GPT-4o-mini, Claude Haiku |
| Complex reasoning, code generation | GPT-4o, Claude Sonnet |
| Novel research, multi-step agent planning | Claude Opus, o1 |

**Implement a model router:**
```python
def route_model(task_type: str, complexity: str) -> str:
    if task_type in ("classify", "extract", "translate"):
        return "claude-haiku-4-5-20251001"
    if complexity == "high" or task_type == "code_generation":
        return "claude-sonnet-4-6"
    return "claude-haiku-4-5-20251001"  # default to cheap
```
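
To gauge what routing might save before building it, the traffic mix can be weighted by the price of the model each task type would use. A rough sketch: `blended_cost` is a hypothetical helper, and the traffic shares and per-million-token prices are placeholders rather than real pricing.

```python
def blended_cost(traffic_share, price_per_mtok, tokens_per_day):
    """Daily cost (USD) when each task type is sent to its own model tier.

    traffic_share: {task_type: fraction of daily tokens}, summing to 1
    price_per_mtok: {task_type: USD per million tokens for its model}
    """
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(
        share * tokens_per_day * price_per_mtok[task] / 1_000_000
        for task, share in traffic_share.items()
    )


# Placeholder traffic mix and prices, for illustration only
share = {"classify": 0.6, "summarize": 0.3, "code_generation": 0.1}
all_frontier = {task: 15.0 for task in share}
routed = {"classify": 1.0, "summarize": 1.0, "code_generation": 3.0}

before = blended_cost(share, all_frontier, tokens_per_day=50_000_000)
after = blended_cost(share, routed, tokens_per_day=50_000_000)
print(f"${before:.2f}/day -> ${after:.2f}/day")
```

The sketch uses a single blended price per model; separating input and output prices would give a tighter estimate.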

### Audit 3 — Token Reduction (savings: 20–40% on bloated prompts)

**Check:** Is the system prompt longer than it needs to be?

Common bloat patterns:
- Repeating the same instruction multiple ways ("Be concise. Keep answers short. Don't ramble.")
- Long examples when one would do
- Full document context when only a section is needed
- Verbose role descriptions

**Token reduction techniques:**

1. **Compress examples** — use 1 example instead of 3 if the task is clear
2. **Use structured format** — bullet points use fewer tokens than prose instructions
3. **Trim RAG context** — retrieve top-3 chunks, not top-10; rerank before sending
4. **Limit output length** — set `max_tokens` to the minimum needed:
```python
# If you only need a one-sentence answer, cap the output length
response = client.messages.create(
    model=model, max_tokens=100, messages=messages  # model and messages as in your existing call
)
```
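
Technique 3 (trimming RAG context) can be sketched as a small filter applied after reranking. This is an illustrative sketch only: `trim_context` is a hypothetical helper, the scores are assumed to come from your own reranker, and tokens are approximated as `len(text) // 4`, which is a crude heuristic rather than a real tokenizer.

```python
def trim_context(chunks, top_k=3, token_budget=1500):
    """Keep the highest-scoring chunks that fit within a token budget.

    chunks: list of (score, text) pairs, higher score = more relevant.
    Token counts are approximated as len(text) // 4 (rough heuristic).
    """
    kept, used = [], 0
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)[:top_k]
    for _score, text in ranked:
        cost = len(text) // 4
        if used + cost > token_budget:
            continue  # skip chunks that would exceed the budget
        kept.append(text)
        used += cost
    return kept


# Synthetic chunks: three short ones and one very long one
chunks = [(0.2, "a" * 400), (0.9, "b" * 400), (0.5, "c" * 4000), (0.7, "d" * 400)]
selected = trim_context(chunks, top_k=3, token_budget=300)
print(len(selected), "chunks kept")
```

For accurate budgets, swap the length heuristic for your provider's token counter.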

### Audit 4 — Response Caching (savings: 30–70% for repetitive queries)

**Check:** Do users ask similar questions repeatedly?

Cache model responses by a hash of the (system_prompt + user_input) pair:

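```python
import hashlib, json

import redis

redis_client = redis.Redis()  # assumes a reachable Redis instance with default settings

def get_cached_or_call(system: str, user: str) -> str:
    # Serialize the pair as JSON so ("a:b", "c") and ("a", "b:c") get different keys
    key = hashlib.sha256(json.dumps([system, user]).encode()).hexdigest()
    cached = redis_client.get(key)
    if cached is not None:
        return json.loads(cached)

    response = call_llm(system, user)  # call_llm: your existing LLM wrapper
    redis_client.setex(key, 3600, json.dumps(response))  # cache for 1 hour
    return response
```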

Use semantic similarity for fuzzy cache hits if exact-match cache rate is low.
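
A fuzzy cache of this kind might look like the in-memory sketch below. Everything here is illustrative: `SemanticCache` is a hypothetical class, the 0.95 threshold is arbitrary, and `toy_embed` (letter counts) is only a stand-in so the example runs without dependencies. A real deployment would plug in an embedding model and usually a vector index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored response when a new query's embedding is close
    enough to a previously cached query."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # callable: str -> list[float], supplied by you
        self.threshold = threshold
        self.entries = []           # list of (embedding, response)

    def get(self, query):
        vec = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))


def toy_embed(text):
    """Stand-in embedding (letter counts); NOT suitable for real use."""
    text = text.lower()
    return [text.count(ch) for ch in "abcdefghijklmnopqrstuvwxyz"]


cache = SemanticCache(toy_embed, threshold=0.95)
cache.put("How do I reset my password?", "Go to Settings > Security.")
near_hit = cache.get("how do i reset my password")   # near-duplicate query
miss = cache.get("What is your refund policy?")      # unrelated query
print(near_hit, miss)
```

Set the threshold conservatively: a false cache hit returns a wrong answer, which usually costs more than the tokens saved.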

### Audit 5 — Batching (savings: 50% cost for latency-tolerant async workloads)

**Check:** Are you running background jobs (document processing, bulk analysis) one-at-a-time?

Both OpenAI and Anthropic offer Batch APIs at 50% discount for async workloads:

```python
# Anthropic Batch API
batch = client.messages.batches.create(
    requests=[
        {"custom_id": f"doc_{i}", "params": {"model": "...", "messages": [...]}}
        for i, doc in enumerate(documents)
    ]
)
# Results available within 24hrs at 50% of standard price
```

Use when: processing 100+ documents, nightly summarization jobs, bulk classification.
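
For larger jobs it can help to build the request list separately and submit it in groups. A sketch under assumptions: `build_batch_requests` and `chunked` are hypothetical helpers, the `custom_id`/`params` shape mirrors the batch example above, and the model ID, instruction text, and group size are placeholders to replace with your own values.

```python
def build_batch_requests(documents, model, max_tokens=512, instruction="Summarize:"):
    """Build one request dict per document in the custom_id/params shape."""
    return [
        {
            "custom_id": f"doc_{i}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [
                    {"role": "user", "content": f"{instruction}\n\n{doc}"}
                ],
            },
        }
        for i, doc in enumerate(documents)
    ]


def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


docs = ["first document text", "second document text", "third document text"]
requests = build_batch_requests(docs, model="your-model-id")  # placeholder model ID
groups = chunked(requests, 2)  # group size is a placeholder
print(len(requests), "requests in", len(groups), "groups")

# Each group could then be submitted with your existing client, e.g.:
# for group in groups:
#     client.messages.batches.create(requests=group)
```

Keeping `custom_id` stable per document makes it straightforward to match results back to inputs when the batch completes.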

### Audit 6 — Streaming Efficiency

**Check:** Are you streaming responses but storing the full output anyway?

If you don't need to stream to the user, disable streaming: it adds slight overhead for short responses and does not reduce token charges. Only stream when showing real-time output to users.

### Cost Estimate Template

After auditing, produce a cost breakdown:

| Optimization | Monthly Savings Estimate | Effort |
|---|---|---|
| Prompt caching | $X | Low |
| Switch summarization to Haiku | $X | Low |
| Cap max_tokens on short-answer routes | $X | Low |
| Response caching (top 20% queries) | $X | Medium |
| Batch API for nightly jobs | $X | Medium |
| **Total** | **$X** | |
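
The template can be filled programmatically once per-optimization estimates exist. A small sketch: `cost_table` is a hypothetical helper and the dollar figures are placeholders, not results.

```python
def cost_table(rows):
    """Render (optimization, monthly_savings_usd, effort) rows as a Markdown table."""
    lines = [
        "| Optimization | Monthly Savings Estimate | Effort |",
        "|---|---|---|",
    ]
    for name, savings, effort in rows:
        lines.append(f"| {name} | ${savings:,.0f} | {effort} |")
    total = sum(savings for _, savings, _ in rows)
    lines.append(f"| **Total** | **${total:,.0f}** | |")
    return "\n".join(lines)


# Placeholder figures for illustration only
table = cost_table([
    ("Prompt caching", 900, "Low"),
    ("Batch API for nightly jobs", 350, "Medium"),
])
print(table)
```

Summing rows assumes the savings do not overlap. In practice they often do (for example, caching saves less after a switch to a cheaper model), so treat the total as an upper bound.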

## Example

**Input:**
> "We use Claude Opus for everything. System prompt is 3000 tokens. We do 5000 calls/day for customer support — mostly classifying intent and drafting short replies."

**Output:**
> **Critical finding: Wrong model for workload.**
> Intent classification and short reply drafting = Haiku-level tasks. Switching to claude-haiku-4-5-20251001 saves ~85% per token.
>
> **Prompt caching:** 3000-token system prompt × 5000 calls = 15M cached tokens/day. Enable `cache_control` on your system prompt.
>
> **Combined monthly savings estimate: ~$2,800/month** based on Anthropic pricing, down from ~$3,400 to ~$600.

Related Skills

SQL Query Optimizer

from Notysoty/openagentskills

Reviews SQL queries for performance issues and rewrites them with optimized execution plans.

SEO Content Optimizer

from Notysoty/openagentskills

Analyzes and rewrites content to maximize search engine visibility without sounding robotic.

Form UX Optimizer

from Notysoty/openagentskills

Reviews web forms for usability issues — field ordering, validation messages, error states, and accessibility.

Unit Test Writer

from Notysoty/openagentskills

Generates comprehensive unit tests for any function or module with edge cases.

Unit Test Improver

from Notysoty/openagentskills

Reviews existing unit tests for gaps, weak assertions, and missing edge cases, then rewrites them to be more robust.

Troubleshooting Guide Builder

from Notysoty/openagentskills

Builds a structured troubleshooting guide with symptom → cause → fix format for any tool or system.

Tech Debt Auditor

from Notysoty/openagentskills

Identifies and prioritizes technical debt in a codebase with an effort/impact matrix.

Technical Blog Post Writer

from Notysoty/openagentskills

Writes engaging, accurate technical blog posts targeted at developer audiences.

Stack Trace Analyzer

from Notysoty/openagentskills

Interprets error stack traces to pinpoint root cause, explain what went wrong, and suggest fixes.

Sprint Summary Generator

from Notysoty/openagentskills

Converts a list of completed tickets or commits into a clear sprint summary for stakeholders.

Social Post Thread Writer

from Notysoty/openagentskills

Converts a blog post, idea, or document into an engaging Twitter/X or LinkedIn thread with hooks and CTAs.

SEO Metadata Generator

from Notysoty/openagentskills

Generates optimized title tags, meta descriptions, Open Graph tags, and structured data for any web page.