prompt-caching

Prompt caching strategies for LLM APIs — cache breakpoints, system prompt caching, and cost optimization.

Best use case

prompt-caching is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using prompt-caching should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/prompt-caching/SKILL.md --create-dirs "https://raw.githubusercontent.com/InugamiDev/ultrathink-oss/main/.claude/skills/prompt-caching/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/prompt-caching/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How prompt-caching Compares

| Feature / Agent | prompt-caching | Standard Approach |
|-----------------|----------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Prompt caching strategies for LLM APIs — cache breakpoints, system prompt caching, and cost optimization.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Prompt Caching Strategies for LLM APIs

## Purpose

Optimize LLM API costs and latency by leveraging prompt caching features across providers. Covers Anthropic's cache breakpoints, OpenAI's automatic caching, cache-friendly prompt architecture, and cost modeling.

## Key Patterns

### Anthropic Prompt Caching

Anthropic supports explicit cache breakpoints on content blocks. Cache reads are billed at a steep discount, while writing content to the cache carries a small premium over the base input rate.

**System prompt caching** — Place `cache_control` on the system message:

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const response = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: `You are an expert assistant with deep knowledge of our codebase.
Here is the full project documentation:
${largeDocumentation}`, // Large static content
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{ role: 'user', content: 'How do I add a new API endpoint?' }],
});

// Check cache performance via the usage object on the response
// response.usage.cache_creation_input_tokens — tokens written to cache
// response.usage.cache_read_input_tokens — tokens read from cache
```

**Multi-turn conversation caching** — Cache the conversation prefix:

```typescript
async function cachedMultiTurn(
  systemPrompt: string,
  conversationHistory: Anthropic.Messages.MessageParam[],
  newMessage: string
) {
  // Strategy: cache the system prompt + all previous turns
  // Only the new user message is uncached
  const messages: Anthropic.Messages.MessageParam[] = [
    ...conversationHistory.map((msg, i) => {
      if (i === conversationHistory.length - 1) {
        // Place cache breakpoint on the last historical message
        return {
          ...msg,
          content:
            typeof msg.content === 'string'
              ? [
                  {
                    type: 'text' as const,
                    text: msg.content,
                    cache_control: { type: 'ephemeral' as const },
                  },
                ]
              : msg.content, // already block content; a breakpoint would need cache_control on its last block
        };
      }
      return msg;
    }),
    { role: 'user', content: newMessage },
  ];

  return client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 4096,
    system: [
      {
        type: 'text',
        text: systemPrompt,
        cache_control: { type: 'ephemeral' },
      },
    ],
    messages,
  });
}
```

**Tool definition caching** — Cache large tool arrays:

```typescript
const response = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 4096,
  system: [
    {
      type: 'text',
      text: systemPrompt,
      cache_control: { type: 'ephemeral' },
    },
  ],
  tools: largeToolArray, // Tools precede the system prompt in the cached prefix, so this breakpoint covers them too
  messages,
});
```
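
Anthropic also accepts `cache_control` directly on a tool definition. Placing a breakpoint on the last tool caches the whole tool array independently of the system prompt, which is useful when tools are stable but the system text varies. A minimal sketch; `search_codebase` and `otherTools` are hypothetical placeholders:

```typescript
const cachedTools: Anthropic.Messages.Tool[] = [
  ...otherTools, // hypothetical: the rest of your stable tool definitions
  {
    name: 'search_codebase', // hypothetical example tool
    description: 'Search the project codebase for a symbol or string.',
    input_schema: {
      type: 'object',
      properties: { query: { type: 'string' } },
      required: ['query'],
    },
    // A breakpoint on the last tool caches the entire tool array
    cache_control: { type: 'ephemeral' },
  },
];
```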

### OpenAI Automatic Caching

OpenAI caches prompts automatically when the prefix matches a previous request. No explicit cache control needed, but prompt structure matters.

**Optimize for prefix matching** — Keep static content at the beginning:

```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

// GOOD: Static prefix, variable suffix
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'system',
      // Large static system prompt — cached automatically if prefix matches
      content: `${largeStaticInstructions}\n\n${staticContext}`,
    },
    // Previous conversation turns — stable prefix
    ...previousMessages,
    // New message — only this varies
    { role: 'user', content: newUserMessage },
  ],
});

// Cached tokens shown in usage:
// response.usage.prompt_tokens_details.cached_tokens
```

### Cache-Friendly Prompt Architecture

**Layer your prompts** — Place content in order of stability:

```
Layer 1 (most stable):  System instructions, personality, rules
Layer 2 (stable):       Reference documents, RAG context, tool definitions
Layer 3 (semi-stable):  Conversation history
Layer 4 (volatile):     Current user message
```

```typescript
// Template for cache-optimized prompt construction
function buildCacheOptimizedPrompt(config: {
  systemRules: string;        // Layer 1 - rarely changes
  referenceContext: string;   // Layer 2 - changes per session
  conversationHistory: Anthropic.Messages.MessageParam[]; // Layer 3 - grows per turn
  userMessage: string;        // Layer 4 - changes every call
}) {
  return {
    system: [
      {
        type: 'text' as const,
        text: config.systemRules,
        cache_control: { type: 'ephemeral' as const },
      },
      {
        type: 'text' as const,
        text: config.referenceContext,
        cache_control: { type: 'ephemeral' as const },
      },
    ],
    messages: [
      ...config.conversationHistory,
      { role: 'user' as const, content: config.userMessage },
    ],
  };
}
```

### Cost Modeling

**Anthropic pricing model (approximate):**

| Token Type | Relative Cost |
|------------|--------------|
| Regular input | 1x (base) |
| Cache write | 1.25x (25% premium) |
| Cache read | 0.1x (90% discount) |
| Output | ~5x input (varies by model) |

```typescript
// Calculate expected savings
function estimateCacheSavings(config: {
  cachedTokens: number;
  uncachedTokens: number;
  turnsPerSession: number;
  inputPricePerMToken: number; // e.g., $3 for Sonnet
}) {
  const { cachedTokens, uncachedTokens, turnsPerSession, inputPricePerMToken } = config;

  // Without caching: all tokens charged at full rate every turn
  const noCacheCost =
    ((cachedTokens + uncachedTokens) * turnsPerSession * inputPricePerMToken) / 1_000_000;

  // With caching:
  // Turn 1: cache write (1.25x) + uncached (1x)
  // Turn 2+: cache read (0.1x) + uncached (1x)
  const cacheWriteCost = (cachedTokens * 1.25 * inputPricePerMToken) / 1_000_000;
  const cacheReadCost =
    (cachedTokens * 0.1 * (turnsPerSession - 1) * inputPricePerMToken) / 1_000_000;
  const uncachedCost =
    (uncachedTokens * turnsPerSession * inputPricePerMToken) / 1_000_000;
  const withCacheCost = cacheWriteCost + cacheReadCost + uncachedCost;

  return {
    withoutCache: noCacheCost,
    withCache: withCacheCost,
    savings: noCacheCost - withCacheCost,
    savingsPercent: ((noCacheCost - withCacheCost) / noCacheCost) * 100,
  };
}

// Example: 10k cached tokens, 500 uncached, 10 turns, $3/M input tokens
// Savings: ~79% on the cached portion (~75% overall; worked through below)
```
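
Plugging those numbers into the function (pure arithmetic, no API call) confirms the figures:

```typescript
const estimate = estimateCacheSavings({
  cachedTokens: 10_000,
  uncachedTokens: 500,
  turnsPerSession: 10,
  inputPricePerMToken: 3,
});
// withoutCache ≈ $0.315   (10,500 tokens × 10 turns × $3/M)
// withCache   ≈ $0.0795  ($0.0375 write + $0.027 reads + $0.015 uncached)
// savings     ≈ $0.2355  → savingsPercent ≈ 74.8
console.log(estimate);
```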

### Cache Invalidation Awareness

```typescript
// Anthropic ephemeral caches have a TTL (typically 5 minutes)
// Design your system to re-warm caches for active sessions

class CacheWarmingManager {
  private lastCallTime = new Map<string, number>();
  private readonly CACHE_TTL_MS = 5 * 60 * 1000; // 5 minutes

  shouldRewarm(sessionId: string): boolean {
    const last = this.lastCallTime.get(sessionId);
    if (!last) return false;
    return Date.now() - last > this.CACHE_TTL_MS * 0.8; // Re-warm at 80% TTL
  }

  recordCall(sessionId: string) {
    this.lastCallTime.set(sessionId, Date.now());
  }

  // Send a minimal request to keep the cache warm
  async keepWarm(sessionId: string, cachedSystem: string) {
    if (this.shouldRewarm(sessionId)) {
      await client.messages.create({
        model: 'claude-sonnet-4-20250514',
        max_tokens: 1,
        system: [
          {
            type: 'text',
            text: cachedSystem,
            cache_control: { type: 'ephemeral' },
          },
        ],
        messages: [{ role: 'user', content: 'ping' }],
      });
      this.recordCall(sessionId);
    }
  }
}
```

### Minimum Token Thresholds

Anthropic requires a minimum number of tokens for caching to activate:

| Model | Minimum Tokens |
|-------|---------------|
| Claude Sonnet | 1,024 |
| Claude Haiku | 2,048 |
| Claude Opus | 1,024 |

```typescript
// Check if content meets caching threshold
function shouldCache(content: string, model: string): boolean {
  // Rough token estimate: ~4 chars per token
  const estimatedTokens = Math.ceil(content.length / 4);
  const thresholds: Record<string, number> = {
    'claude-sonnet-4-20250514': 1024,
    'claude-haiku-4-20250414': 2048,
    'claude-opus-4-20250514': 1024,
  };
  return estimatedTokens >= (thresholds[model] ?? 1024);
}
```

## Best Practices

1. **Place the most stable content first** — System instructions and reference docs should be the prefix; user messages go last.
2. **Use at most 4 cache breakpoints** — Anthropic supports up to 4 `cache_control` markers; place them at natural content boundaries.
3. **Measure cache hit rates** — Track `cache_read_input_tokens` vs `cache_creation_input_tokens` to verify your strategy works (see the sketch after this list).
4. **Avoid mutating cached content** — Even a single character change invalidates the cache for all downstream content.
5. **Bundle reference documents together** — Combine multiple small docs into one large cached block rather than many small ones.
6. **Account for cache write cost** — For single-use prompts, caching adds 25% cost with no benefit; only cache repeated content.
7. **Keep user-specific data outside cached blocks** — User names, IDs, and dynamic values should come after the cache breakpoint.
8. **Monitor TTL expiry** — Anthropic caches expire after ~5 minutes of inactivity; long idle sessions lose cache benefits.
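
A minimal sketch for practice 3, assuming the Anthropic `usage` fields shown in the earlier examples (where `input_tokens` counts only uncached input):

```typescript
// Log per-response cache utilization so hit rates can be dashboarded
function logCacheMetrics(
  sessionId: string,
  usage: {
    input_tokens: number;
    cache_creation_input_tokens?: number | null;
    cache_read_input_tokens?: number | null;
  }
) {
  const written = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;
  const total = usage.input_tokens + written + read;
  const hitRate = total > 0 ? (read / total) * 100 : 0;
  console.log(
    `[${sessionId}] cacheWrite=${written} cacheRead=${read} hitRate=${hitRate.toFixed(1)}%`
  );
}

// After each call: logCacheMetrics(sessionId, response.usage);
```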

## Common Pitfalls

| Pitfall | Problem | Fix |
|---------|---------|-----|
| Caching single-use prompts | 25% write premium with zero reads | Only cache content reused across turns |
| Dynamic content in cached block | Cache miss every call | Move dynamic content after the breakpoint |
| Below minimum token threshold | Cache silently not created | Ensure cached content meets model-specific minimums |
| Too many small cached blocks | Sub-optimal cache utilization | Consolidate into fewer, larger blocks |
| Ignoring cache metrics | No visibility into cost savings | Log and dashboard `cache_read_input_tokens` per session |
| Cache warming too aggressively | Extra API costs from keep-alive calls | Only warm for active sessions with high-value caches |

Related Skills

promptfoo (from InugamiDev/ultrathink-oss)

LLM red teaming and security testing — automated vulnerability scanning for AI agents, RAGs, and LLM pipelines. Covers prompt injection, jailbreaks, data leaks, PII exposure, and 50+ vulnerability types.

prompt-engineering (from InugamiDev/ultrathink-oss)

Prompt design, chain-of-thought, few-shot learning, system prompts, and structured output patterns.

caching (from InugamiDev/ultrathink-oss)

Design and implement caching strategies across all layers — in-memory, distributed (Redis/Memcached), CDN, HTTP cache headers, and application-level memoization.

ai-prompts (from InugamiDev/ultrathink-oss)

AI prompt library covering system prompts, few-shot templates, structured output schemas, and prompt engineering patterns for production LLM applications.

ultrathink (from InugamiDev/ultrathink-oss)

UltraThink Workflow OS — 4-layer skill mesh with persistent memory and privacy hooks for complex engineering tasks. Routes prompts through intent detection to activate the right domain skills automatically.

ultrathink_review (from InugamiDev/ultrathink-oss)

Multi-pass code review powered by UltraThink's quality gate — checks correctness, security (OWASP), performance, readability, and project conventions in a single structured pass.

ultrathink_memory (from InugamiDev/ultrathink-oss)

Persistent memory system for UltraThink — search, save, and recall project context, decisions, and patterns across sessions using Postgres-backed fuzzy search with synonym expansion.

ui-design (from InugamiDev/ultrathink-oss)

Comprehensive UI design system: 230+ font pairings, 48 themes, 65 design systems, 23 design languages, 30 UX laws, 14 color systems, Swiss grid, Gestalt principles, Pencil.dev workflow. Inherits ui-ux-pro-max (99 UX rules) + impeccable-frontend-design (anti-AI-slop). Triggers on any design, UI, layout, typography, color, theme, or styling task.

Zod (from InugamiDev/ultrathink-oss)

TypeScript-first schema validation with static type inference.

webinar-registration-page (from InugamiDev/ultrathink-oss)

Build a webinar or live event registration page as a self-contained HTML file with countdown timer, speaker bio, agenda, and registration form. Triggers on: "build a webinar registration page", "create a webinar sign-up page", "event registration landing page", "live training registration page", "workshop sign-up page", "create a webinar page", "build an event page", "free webinar landing page", "live demo registration page", "online event page", "create a registration page for my webinar", "build a training event page".

webhooks (from InugamiDev/ultrathink-oss)

Webhook design patterns — delivery, retry with exponential backoff, HMAC signature verification, payload validation, idempotency keys.

web-workers (from InugamiDev/ultrathink-oss)

Offload heavy computation from the main thread using Web Workers, SharedWorkers, and Comlink — structured messaging, transferable objects, and off-main-thread architecture patterns.