prompt-caching
Prompt caching strategies for LLM APIs — cache breakpoints, system prompt caching, and cost optimization.
Best use case
prompt-caching is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using prompt-caching should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/prompt-caching/SKILL.md` inside your project
- Restart your AI agent — it will auto-discover the skill
How prompt-caching Compares
| Feature / Agent | prompt-caching | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Prompt caching strategies for LLM APIs — cache breakpoints, system prompt caching, and cost optimization.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Prompt Caching Strategies for LLM APIs
## Purpose
Optimize LLM API costs and latency by leveraging prompt caching features across providers. Covers Anthropic's cache breakpoints, OpenAI's automatic caching, cache-friendly prompt architecture, and cost modeling.
## Key Patterns
### Anthropic Prompt Caching
Anthropic supports explicit cache breakpoints on content blocks. Cache reads are billed at a steep discount, while the request that first writes the cache pays a small premium.
**System prompt caching** — Place `cache_control` on the system message:
```typescript
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();
const response = await client.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 1024,
system: [
{
type: 'text',
text: `You are an expert assistant with deep knowledge of our codebase.
Here is the full project documentation:
${largeDocumentation}`, // Large static content
cache_control: { type: 'ephemeral' },
},
],
messages: [{ role: 'user', content: 'How do I add a new API endpoint?' }],
});
// Check cache performance in the usage object on the response
// response.usage.cache_creation_input_tokens — tokens written to cache
// response.usage.cache_read_input_tokens — tokens read from cache
```
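On the first request, the cached block is reported under `cache_creation_input_tokens`; identical requests sent while the cache is still warm report it under `cache_read_input_tokens` instead, which is the signal that the discount is being applied.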
**Multi-turn conversation caching** — Cache the conversation prefix:
```typescript
async function cachedMultiTurn(
systemPrompt: string,
conversationHistory: Anthropic.Messages.MessageParam[],
newMessage: string
) {
// Strategy: cache the system prompt + all previous turns
// Only the new user message is uncached
const messages: Anthropic.Messages.MessageParam[] = [
...conversationHistory.map((msg, i) => {
if (i === conversationHistory.length - 1) {
// Place cache breakpoint on the last historical message
return {
...msg,
content:
typeof msg.content === 'string'
? [
{
type: 'text' as const,
text: msg.content,
cache_control: { type: 'ephemeral' as const },
},
]
: msg.content,
};
}
return msg;
}),
{ role: 'user', content: newMessage },
];
return client.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 4096,
system: [
{
type: 'text',
text: systemPrompt,
cache_control: { type: 'ephemeral' },
},
],
messages,
});
}
```
**Tool definition caching** — Cache large tool arrays:
```typescript
const response = await client.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 4096,
system: [
{
type: 'text',
text: systemPrompt,
cache_control: { type: 'ephemeral' },
},
],
tools: largeToolArray, // Tools precede the system block in the prompt prefix, so the system breakpoint also covers them
messages,
});
```
### OpenAI Automatic Caching
OpenAI caches prompts automatically when the prefix matches a previous request. No explicit cache control needed, but prompt structure matters.
**Optimize for prefix matching** — Keep static content at the beginning:
```typescript
import OpenAI from 'openai';
const openai = new OpenAI();
// GOOD: Static prefix, variable suffix
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{
role: 'system',
// Large static system prompt — cached automatically if prefix matches
content: `${largeStaticInstructions}\n\n${staticContext}`,
},
// Previous conversation turns — stable prefix
...previousMessages,
// New message — only this varies
{ role: 'user', content: newUserMessage },
],
});
// Cached tokens shown in usage:
// response.usage.prompt_tokens_details.cached_tokens
```
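To verify that prefix matching is actually working, compute the cached-token ratio from the usage object. A minimal sketch (the helper name is ours; `prompt_tokens_details` may be absent on models or SDK versions without caching support):
```typescript
// Fraction of prompt tokens served from OpenAI's automatic cache
function openAICacheHitRatio(usage: OpenAI.CompletionUsage | undefined): number {
  if (!usage || usage.prompt_tokens === 0) return 0;
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return cached / usage.prompt_tokens;
}

// Usage with the response from the example above
console.log(`cache hit ratio: ${(openAICacheHitRatio(response.usage) * 100).toFixed(1)}%`);
```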
### Cache-Friendly Prompt Architecture
**Layer your prompts** — Place content in order of stability:
```
Layer 1 (most stable): System instructions, personality, rules
Layer 2 (stable): Reference documents, RAG context, tool definitions
Layer 3 (semi-stable): Conversation history
Layer 4 (volatile): Current user message
```
```typescript
// Template for cache-optimized prompt construction
function buildCacheOptimizedPrompt(config: {
systemRules: string; // Layer 1 - rarely changes
referenceContext: string; // Layer 2 - changes per session
conversationHistory: Anthropic.Messages.MessageParam[]; // Layer 3 - grows per turn
userMessage: string; // Layer 4 - changes every call
}) {
return {
system: [
{
type: 'text' as const,
text: config.systemRules,
cache_control: { type: 'ephemeral' as const },
},
{
type: 'text' as const,
text: config.referenceContext,
cache_control: { type: 'ephemeral' as const },
},
],
messages: [
...config.conversationHistory,
{ role: 'user' as const, content: config.userMessage },
],
};
}
```
### Cost Modeling
**Anthropic pricing model (approximate):**
| Token Type | Relative Cost |
|------------|--------------|
| Regular input | 1x (base) |
| Cache write | 1.25x (25% premium) |
| Cache read | 0.1x (90% discount) |
| Output | ~5x input (varies by model) |
```typescript
// Calculate expected savings
function estimateCacheSavings(config: {
cachedTokens: number;
uncachedTokens: number;
turnsPerSession: number;
inputPricePerMToken: number; // e.g., $3 for Sonnet
}) {
const { cachedTokens, uncachedTokens, turnsPerSession, inputPricePerMToken } = config;
// Without caching: all tokens charged at full rate every turn
const noCacheCost =
((cachedTokens + uncachedTokens) * turnsPerSession * inputPricePerMToken) / 1_000_000;
// With caching:
// Turn 1: cache write (1.25x) + uncached (1x)
// Turn 2+: cache read (0.1x) + uncached (1x)
const cacheWriteCost = (cachedTokens * 1.25 * inputPricePerMToken) / 1_000_000;
const cacheReadCost =
(cachedTokens * 0.1 * (turnsPerSession - 1) * inputPricePerMToken) / 1_000_000;
const uncachedCost =
(uncachedTokens * turnsPerSession * inputPricePerMToken) / 1_000_000;
const withCacheCost = cacheWriteCost + cacheReadCost + uncachedCost;
return {
withoutCache: noCacheCost,
withCache: withCacheCost,
savings: noCacheCost - withCacheCost,
savingsPercent: ((noCacheCost - withCacheCost) / noCacheCost) * 100,
};
}
// Example: 10k cached tokens, 500 uncached, 10 turns, $3/M tokens
// Savings: ~75% overall (~79% on the cached tokens alone)
```
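A quick usage check, plugging the numbers from the comment above into the function (output values rounded):
```typescript
const estimate = estimateCacheSavings({
  cachedTokens: 10_000,
  uncachedTokens: 500,
  turnsPerSession: 10,
  inputPricePerMToken: 3,
});
// withoutCache ~ $0.315, withCache ~ $0.0795, savingsPercent ~ 75
console.log(estimate);
```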
### Cache Invalidation Awareness
```typescript
// Anthropic ephemeral caches have a TTL (typically 5 minutes), refreshed each time the cached prefix is read
// Design your system to re-warm caches for active sessions
class CacheWarmingManager {
private lastCallTime = new Map<string, number>();
private readonly CACHE_TTL_MS = 5 * 60 * 1000; // 5 minutes
shouldRewarm(sessionId: string): boolean {
const last = this.lastCallTime.get(sessionId);
if (!last) return false;
return Date.now() - last > this.CACHE_TTL_MS * 0.8; // Re-warm at 80% TTL
}
recordCall(sessionId: string) {
this.lastCallTime.set(sessionId, Date.now());
}
// Send a minimal request to keep the cache warm
async keepWarm(sessionId: string, cachedSystem: string) {
if (this.shouldRewarm(sessionId)) {
await client.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 1,
system: [
{
type: 'text',
text: cachedSystem,
cache_control: { type: 'ephemeral' },
},
],
messages: [{ role: 'user', content: 'ping' }],
});
this.recordCall(sessionId);
}
}
}
```
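A sketch of how this might be wired into a chat loop. Here `activeSessionIds`, `systemPrompt`, and the 60-second interval are placeholders; the key point is calling `recordCall` after every real request, since a cache read also refreshes the TTL:
```typescript
const warmer = new CacheWarmingManager();

async function handleUserTurn(sessionId: string, userMessage: string) {
  const response = await client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 4096,
    system: [
      { type: 'text', text: systemPrompt, cache_control: { type: 'ephemeral' } },
    ],
    messages: [{ role: 'user', content: userMessage }],
  });
  warmer.recordCall(sessionId); // a real call refreshes the cache too
  return response;
}

// Periodically re-warm sessions that are still active but idle
setInterval(() => {
  for (const sessionId of activeSessionIds) {
    void warmer.keepWarm(sessionId, systemPrompt);
  }
}, 60_000);
```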
### Minimum Token Thresholds
Anthropic requires a minimum number of tokens for caching to activate:
| Model | Minimum Tokens |
|-------|---------------|
| Claude Sonnet | 1,024 |
| Claude Haiku | 2,048 |
| Claude Opus | 1,024 |
```typescript
// Check if content meets caching threshold
function shouldCache(content: string, model: string): boolean {
// Rough token estimate: ~4 chars per token
const estimatedTokens = Math.ceil(content.length / 4);
const thresholds: Record<string, number> = {
'claude-sonnet-4-20250514': 1024,
'claude-haiku-4-20250414': 2048,
'claude-opus-4-20250514': 1024,
};
return estimatedTokens >= (thresholds[model] ?? 1024);
}
```
## Best Practices
1. **Place the most stable content first** — System instructions and reference docs should be the prefix; user messages go last.
2. **Use at most 4 cache breakpoints** — Anthropic supports up to 4 `cache_control` markers; place them at natural content boundaries.
3. **Measure cache hit rates** — Track `cache_read_input_tokens` vs `cache_creation_input_tokens` to verify your strategy works (see the sketch after this list).
4. **Avoid mutating cached content** — Even a single character change invalidates the cache for all downstream content.
5. **Bundle reference documents together** — Combine multiple small docs into one large cached block rather than many small ones.
6. **Account for cache write cost** — For single-use prompts, caching adds 25% cost with no benefit; only cache repeated content.
7. **Keep user-specific data outside cached blocks** — User names, IDs, and dynamic values should come after the cache breakpoint.
8. **Monitor TTL expiry** — Anthropic caches expire after ~5 minutes of inactivity; long idle sessions lose cache benefits.
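For point 3, a minimal measurement sketch using the usage fields from the Anthropic examples above (the `Anthropic.Messages.Usage` type comes from the SDK; the aggregation logic itself is illustrative):
```typescript
interface CacheStats {
  cacheReadTokens: number;
  cacheWriteTokens: number;
  uncachedInputTokens: number;
}

// Accumulate usage from each response in a session
function recordUsage(stats: CacheStats, usage: Anthropic.Messages.Usage): CacheStats {
  return {
    cacheReadTokens: stats.cacheReadTokens + (usage.cache_read_input_tokens ?? 0),
    cacheWriteTokens: stats.cacheWriteTokens + (usage.cache_creation_input_tokens ?? 0),
    uncachedInputTokens: stats.uncachedInputTokens + usage.input_tokens,
  };
}

// Share of input tokens served from cache; this should climb after turn 1
function cacheHitRate(stats: CacheStats): number {
  const total = stats.cacheReadTokens + stats.cacheWriteTokens + stats.uncachedInputTokens;
  return total === 0 ? 0 : stats.cacheReadTokens / total;
}
```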
## Common Pitfalls
| Pitfall | Problem | Fix |
|---------|---------|-----|
| Caching single-use prompts | 25% write premium with zero reads | Only cache content reused across turns |
| Dynamic content in cached block | Cache miss every call | Move dynamic content after the breakpoint |
| Below minimum token threshold | Cache silently not created | Ensure cached content meets model-specific minimums |
| Too many small cached blocks | Sub-optimal cache utilization | Consolidate into fewer, larger blocks |
| Ignoring cache metrics | No visibility into cost savings | Log and dashboard `cache_read_input_tokens` per session |
| Cache warming too aggressively | Extra API costs from keep-alive calls | Only warm for active sessions with high-value caches |
Related Skills
promptfoo
LLM red teaming and security testing — automated vulnerability scanning for AI agents, RAGs, and LLM pipelines. Covers prompt injection, jailbreaks, data leaks, PII exposure, and 50+ vulnerability types.
prompt-engineering
Prompt design, chain-of-thought, few-shot learning, system prompts, and structured output patterns
caching
Design and implement caching strategies across all layers — in-memory, distributed (Redis/Memcached), CDN, HTTP cache headers, and application-level memoization
ai-prompts
AI prompt library covering system prompts, few-shot templates, structured output schemas, and prompt engineering patterns for production LLM applications
ultrathink
UltraThink Workflow OS — 4-layer skill mesh with persistent memory and privacy hooks for complex engineering tasks. Routes prompts through intent detection to activate the right domain skills automatically.
ultrathink_review
Multi-pass code review powered by UltraThink's quality gate — checks correctness, security (OWASP), performance, readability, and project conventions in a single structured pass.
ultrathink_memory
Persistent memory system for UltraThink — search, save, and recall project context, decisions, and patterns across sessions using Postgres-backed fuzzy search with synonym expansion.
ui-design
Comprehensive UI design system: 230+ font pairings, 48 themes, 65 design systems, 23 design languages, 30 UX laws, 14 color systems, Swiss grid, Gestalt principles, Pencil.dev workflow. Inherits ui-ux-pro-max (99 UX rules) + impeccable-frontend-design (anti-AI-slop). Triggers on any design, UI, layout, typography, color, theme, or styling task.
Zod
> TypeScript-first schema validation with static type inference.
webinar-registration-page
Build a webinar or live event registration page as a self-contained HTML file with countdown timer, speaker bio, agenda, and registration form. Triggers on: "build a webinar registration page", "create a webinar sign-up page", "event registration landing page", "live training registration page", "workshop sign-up page", "create a webinar page", "build an event page", "free webinar landing page", "live demo registration page", "online event page", "create a registration page for my webinar", "build a training event page".
webhooks
Webhook design patterns — delivery, retry with exponential backoff, HMAC signature verification, payload validation, idempotency keys
web-workers
Offload heavy computation from the main thread using Web Workers, SharedWorkers, and Comlink — structured messaging, transferable objects, and off-main-thread architecture patterns