llm-caching

Optimize LLM costs and latency through KV caching and prompt caching. Use when (1) structuring prompts for cache hits, (2) configuring API cache_control for Anthropic/Cohere/OpenAI/Gemini, (3) setting up self-hosted inference with vLLM/SGLang/Ollama, (4) building agentic workflows with prefix reuse, (5) designing batch processing pipelines, or (6) understanding cache pricing and tradeoffs.

16 stars

Best use case

llm-caching is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Optimize LLM costs and latency through KV caching and prompt caching. Use when (1) structuring prompts for cache hits, (2) configuring API cache_control for Anthropic/Cohere/OpenAI/Gemini, (3) setting up self-hosted inference with vLLM/SGLang/Ollama, (4) building agentic workflows with prefix reuse, (5) designing batch processing pipelines, or (6) understanding cache pricing and tradeoffs.

Teams using llm-caching should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/llm-caching/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/llm-caching/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/llm-caching/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How llm-caching Compares

Feature / Agentllm-cachingStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Optimize LLM costs and latency through KV caching and prompt caching. Use when (1) structuring prompts for cache hits, (2) configuring API cache_control for Anthropic/Cohere/OpenAI/Gemini, (3) setting up self-hosted inference with vLLM/SGLang/Ollama, (4) building agentic workflows with prefix reuse, (5) designing batch processing pipelines, or (6) understanding cache pricing and tradeoffs.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# LLM Caching

Maximize KV cache reuse to reduce costs and latency.

## Core Concept

LLMs compute Key (K) and Value (V) vectors for each token during inference. These encode the model's "understanding" of context. Caching avoids recomputation.

```
Level 1: KV Cache (inference)     - Within one generation, reuse previous tokens' K,V
Level 2: Prompt Cache (API)       - Across requests, persist KV state server-side
Level 3: Prefix Sharing (batch)   - Across users/requests, share common prefixes
```

## The Golden Rule

**Static content first, variable content last.**

```
[System prompt]         <- cacheable, same every request
[Tool definitions]      <- cacheable
[Few-shot examples]     <- cacheable (same order!)
[Reference documents]   <- cacheable if stable
[User message]          <- variable, at the end
```

Cache hits require the **prefix** (beginning) to match exactly. Any difference breaks caching for everything after.

## Prompt Structure Template

```
┌─────────────────────────────────────┐
│  1. System instructions (static)    │  <- cache_control
├─────────────────────────────────────┤
│  2. Tool definitions (static)       │  <- cache_control
├─────────────────────────────────────┤
│  3. Few-shot examples (static)      │  <- cache_control
├─────────────────────────────────────┤
│  4. Documents/context (semi-static) │  <- cache_control if reused
├─────────────────────────────────────┤
│  5. Conversation history (growing)  │  <- cache after N turns
├─────────────────────────────────────┤
│  6. Current user message (variable) │  <- no caching
└─────────────────────────────────────┘
```

## Anti-Patterns

| Anti-Pattern | Why It Breaks Caching |
|--------------|----------------------|
| Variable content early | Prefix changes every request |
| Randomizing few-shot order | Different order = different prefix |
| Timestamps in system prompt | Changes every request |
| User ID in prefix | Per-user cache = no sharing |
| Prompts < minimum threshold | Too small to cache (1024 tokens for Claude) |
| Shuffling tool definitions | Tool order is part of prefix |

## Cost Impact

| Operation | Typical Pricing | Notes |
|-----------|-----------------|-------|
| Cache write | ~1.25x input | One-time, stores KV state |
| Cache read | ~0.1x input | 90% savings on cache hit |
| No caching | 1x input | Full recomputation every time |

**Example:** 50k token system prompt, 100 requests
- Without cache: 50k × 100 × $3/1M = $15.00
- With cache: 50k × $3.75/1M + 50k × 99 × $0.30/1M = $1.67 (**89% savings**)

## Provider References

- **Anthropic Claude** (recommended): [references/claude.md](references/claude.md)
- **Cohere**: [references/cohere.md](references/cohere.md)
- **Self-hosted (vLLM, SGLang, Ollama, HuggingFace)**: [references/self-hosted.md](references/self-hosted.md)
- **OpenAI**: [references/openai.md](references/openai.md)
- **Google Gemini**: [references/gemini.md](references/gemini.md)

## Cookbooks

Practical examples: [references/cookbooks.md](references/cookbooks.md)

| Pattern | Key Insight |
|---------|-------------|
| Web scraping agent | Same tools + system prompt, different URLs |
| RAG pipeline | Cache document chunks, vary queries |
| Multi-turn chat | Growing prefix, cache conversation history |
| Batch processing | Same prompt template, different inputs |
| Agentic tool use | Cache tool definitions + examples |
| Multi-tenant SaaS | Shared base prompt, tenant-specific suffix |

Related Skills

gitlab-ci-artifacts-caching

16
from diegosouzapw/awesome-omni-skill

Use when configuring artifacts for inter-job data passing or caching for faster builds. Covers cache strategies and artifact management.

apollo-caching-strategies

16
from diegosouzapw/awesome-omni-skill

Use when implementing Apollo caching strategies including cache policies, optimistic UI, cache updates, and normalization.

prompt-caching

16
from diegosouzapw/awesome-omni-skill

Caching strategies for LLM prompts including Anthropic prompt caching, response caching, and CAG (Cache Augmented Generation) Use when: prompt caching, cache prompt, response cache, cag, cache augmented.

bgo

10
from diegosouzapw/awesome-omni-skill

Automates the complete Blender build-go workflow, from building and packaging your extension/add-on to removing old versions, installing, enabling, and launching Blender for quick testing and iteration.

Coding & Development

obsidian-daily

16
from diegosouzapw/awesome-omni-skill

Manage Obsidian Daily Notes via obsidian-cli. Create and open daily notes, append entries (journals, logs, tasks, links), read past notes by date, and search vault content. Handles relative dates like "yesterday", "last Friday", "3 days ago".

obsidian-additions

16
from diegosouzapw/awesome-omni-skill

Create supplementary materials attached to existing notes: experiments, meetings, reports, logs, conspectuses, practice sessions, annotations, AI outputs, links collections. Two-step process: (1) create aggregator space, (2) create concrete addition in base/additions/. INVOKE when user wants to attach any supplementary material to an existing note. Triggers: "addition", "create addition", "experiment", "meeting notes", "report", "conspectus", "log", "practice", "annotations", "links", "link collection", "аддишн", "конспект", "встреча", "отчёт", "эксперимент", "практика", "аннотации", "ссылки", "добавь к заметке".

observe

16
from diegosouzapw/awesome-omni-skill

Query and manage Observe using the Observe CLI. Use when the user wants to run OPAL queries, list datasets, manage objects, or interact with their Observe tenant from the command line.

observability-review

16
from diegosouzapw/awesome-omni-skill

AI agent that analyzes operational signals (metrics, logs, traces, alerts, SLO/SLI reports) from observability platforms (Prometheus, Datadog, New Relic, CloudWatch, Grafana, Elastic) and produces practical, risk-aware triage and recommendations. Use when reviewing system health, investigating performance issues, analyzing monitoring data, evaluating service reliability, or providing SRE analysis of operational metrics. Distinguishes between critical issues requiring action, items needing investigation, and informational observations requiring no action.

nvidia-nim

16
from diegosouzapw/awesome-omni-skill

NVIDIA NIM inference microservices for deploying AI models with OpenAI-compatible APIs, self-hosted or cloud

numpy-string-ops

16
from diegosouzapw/awesome-omni-skill

Vectorized string manipulation using the char module and modern string alternatives, including cleaning and search operations. Triggers: string operations, numpy.char, text cleaning, substring search.

nova-act-usability

16
from diegosouzapw/awesome-omni-skill

AI-orchestrated usability testing using Amazon Nova Act. The agent generates personas, runs tests to collect raw data, interprets responses to determine goal achievement, and generates HTML reports. Tests real user workflows (booking, checkout, posting) with safety guardrails. Use when asked to "test website usability", "run usability test", "generate usability report", "evaluate user experience", "test checkout flow", "test booking process", or "analyze website UX".

notebook-writer

16
from diegosouzapw/awesome-omni-skill

Create and document Jupyter notebooks for reproducible analyses