prompt-caching

Caching strategies for LLM prompts including Anthropic prompt caching, response caching, and CAG (Cache Augmented Generation)

31,392 stars

Best use case

prompt-caching is best used when you need a repeatable AI agent workflow instead of a one-off prompt. It is especially useful for teams that run the same prompts repeatedly. The skill casts the agent as a caching specialist who has reduced LLM costs by 90% through strategic caching, with experience implementing systems that cache at multiple levels: prompt prefixes, full responses, and semantic similarity matches.

Users should expect a more consistent workflow output, faster repeated execution, and less time spent rewriting prompts from scratch.

Practical example

Example input

Use the "prompt-caching" skill to help with this workflow task. Context: You're a caching specialist who has reduced LLM costs by 90% through strategic caching. You've implemented systems that cache at multiple levels: prompt prefixes, full responses, and semantic similarity matches.

Example output

A structured workflow result with clearer steps, more consistent formatting, and an output that is easier to reuse in the next run.

When to use this skill

  • Use this skill when you want a reusable workflow rather than writing the same prompt again and again.

When not to use this skill

  • Do not use this when you only need a one-off answer and do not need a reusable workflow.
  • Do not use it if you cannot install or maintain the related files, repository context, or supporting tools.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/prompt-caching/SKILL.md --create-dirs "https://raw.githubusercontent.com/sickn33/antigravity-awesome-skills/main/plugins/antigravity-awesome-skills-claude/skills/prompt-caching/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/prompt-caching/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How prompt-caching Compares

| Feature / Agent | prompt-caching | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Caching strategies for LLM prompts including Anthropic prompt caching, response caching, and CAG (Cache Augmented Generation)

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Prompt Caching

Caching strategies for LLM prompts including Anthropic prompt caching, response caching, and CAG (Cache Augmented Generation)

## Capabilities

- prompt-cache
- response-cache
- kv-cache
- cag-patterns
- cache-invalidation

## Prerequisites

- Knowledge: caching fundamentals, LLM API usage, hash functions
- Recommended skills: context-window-management

## Scope

- Does not cover: CDN caching, database query caching, static asset caching
- Boundaries: focus is LLM-specific caching; covers prompt and response caching

## Ecosystem

### Primary tools

- Anthropic Prompt Caching - Native prompt caching in Claude API
- Redis - In-memory cache for responses
- OpenAI Caching - Automatic caching in OpenAI API
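Of these, only Anthropic's caching needs explicit opt-in via `cache_control` (see the first pattern below); OpenAI's caching is automatic for long repeated prefixes, and hit volume can be read back from the usage payload. A minimal sketch, assuming the official `openai` Node SDK:

```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

// OpenAI caches long, repeated prompt prefixes automatically; the usage
// payload reports how many input tokens were served from the cache.
async function checkCacheUsage(prompt: string) {
    const completion = await openai.chat.completions.create({
        model: 'gpt-4o',
        messages: [{ role: 'user', content: prompt }],
    });
    console.log(completion.usage?.prompt_tokens_details?.cached_tokens);
    return completion;
}
```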

## Patterns

### Anthropic Prompt Caching

Use Claude's native prompt caching for repeated prefixes

**When to use**: Using Claude API with stable system prompts or context

```typescript
import Anthropic from '@anthropic-ai/sdk';

// Placeholders for your stable prompt content.
declare const LONG_SYSTEM_PROMPT: string;
declare const KNOWLEDGE_BASE: string;

const client = new Anthropic();

// Cache the stable parts of your prompt
async function queryWithCaching(userQuery: string) {
    const response = await client.messages.create({
        model: "claude-sonnet-4-20250514",
        max_tokens: 1024,
        system: [
            {
                type: "text",
                text: LONG_SYSTEM_PROMPT,  // Your detailed instructions
                cache_control: { type: "ephemeral" }  // Cache this!
            },
            {
                type: "text",
                text: KNOWLEDGE_BASE,  // Large static context
                cache_control: { type: "ephemeral" }
            }
        ],
        messages: [
            { role: "user", content: userQuery }  // Dynamic part
        ]
    });

    // Check cache usage
    console.log(`Cache read: ${response.usage.cache_read_input_tokens}`);
    console.log(`Cache write: ${response.usage.cache_creation_input_tokens}`);

    return response;
}

// Cost savings: 90% reduction on cached tokens
// Latency savings: Up to 2x faster
```

### Response Caching

Cache full LLM responses for identical or similar queries

**When to use**: Same queries asked repeatedly

```typescript
import { createHash } from 'crypto';
import Redis from 'ioredis';

// Assumed helpers for the semantic-similarity path (not part of ioredis):
// an embedding function and a vector index over cached prompt embeddings.
declare function embed(text: string): Promise<number[]>;

interface VectorIndex {
    search(embedding: number[], k: number): Promise<Array<{ id: string; similarity: number }>>;
}

const redis = new Redis(process.env.REDIS_URL);

class ResponseCache {
    private ttl = 3600;  // 1 hour default
    private vectorCache!: VectorIndex;  // assumed to be wired up elsewhere

    // Exact match caching
    async getCached(prompt: string): Promise<string | null> {
        const key = this.hashPrompt(prompt);
        return await redis.get(`response:${key}`);
    }

    async setCached(prompt: string, response: string): Promise<void> {
        const key = this.hashPrompt(prompt);
        await redis.set(`response:${key}`, response, 'EX', this.ttl);
    }

    private hashPrompt(prompt: string): string {
        return createHash('sha256').update(prompt).digest('hex');
    }

    // Semantic similarity caching
    async getSemanticallySimilar(
        prompt: string,
        threshold: number = 0.95
    ): Promise<string | null> {
        const embedding = await embed(prompt);
        const similar = await this.vectorCache.search(embedding, 1);

        if (similar.length && similar[0].similarity > threshold) {
            return await redis.get(`response:${similar[0].id}`);
        }
        return null;
    }

    // Temperature-aware caching
    async getCachedWithParams(
        prompt: string,
        params: { temperature: number; model: string }
    ): Promise<string | null> {
        // Only cache low-temperature responses
        if (params.temperature > 0.5) return null;

        const key = this.hashPrompt(
            `${prompt}|${params.model}|${params.temperature}`
        );
        return await redis.get(`response:${key}`);
    }
}
```

### Cache Augmented Generation (CAG)

Pre-cache documents in the prompt instead of retrieving them per query with RAG

**When to use**: Document corpus is stable and fits in context

```typescript
// CAG: Pre-compute document context, cache in prompt
// Better than RAG when:
// - Documents are stable
// - Total fits in context window
// - Latency is critical

// Assumptions for this sketch: an Anthropic client as in the first
// example, and a minimal Document shape.
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

interface Document { title: string; content: string; }

class CAGSystem {
    private cachedContext: string | null = null;
    private lastUpdate: number = 0;

    async buildCachedContext(documents: Document[]): Promise<void> {
        // Pre-process and format documents
        const formatted = documents.map(d =>
            `## ${d.title}\n${d.content}`
        ).join('\n\n');

        // Store with timestamp
        this.cachedContext = formatted;
        this.lastUpdate = Date.now();
    }

    async query(userQuery: string): Promise<string> {
        // Use cached context directly in prompt
        const response = await client.messages.create({
            model: "claude-sonnet-4-20250514",
            max_tokens: 1024,
            system: [
                {
                    type: "text",
                    text: "You are a helpful assistant with access to the following documentation.",
                    cache_control: { type: "ephemeral" }
                },
                {
                    type: "text",
                    text: this.cachedContext!,  // Pre-cached docs
                    cache_control: { type: "ephemeral" }
                }
            ],
            messages: [{ role: "user", content: userQuery }]
        });

        const block = response.content[0];
        return block.type === 'text' ? block.text : '';
    }

    // Periodic refresh
    async refreshIfNeeded(documents: Document[]): Promise<void> {
        const stale = Date.now() - this.lastUpdate > 3600000;  // 1 hour
        if (stale) {
            await this.buildCachedContext(documents);
        }
    }
}

// CAG vs RAG decision matrix:
// | Factor           | CAG Better | RAG Better |
// |------------------|------------|------------|
// | Corpus size      | < 100K tokens | > 100K tokens |
// | Update frequency | Low | High |
// | Latency needs    | Critical | Flexible |
// | Query specificity | General | Specific |
```

## Sharp Edges

### Cache miss causes latency spike with additional overhead

Severity: HIGH

Situation: Responses on a cache miss are slower than with no caching at all

Symptoms:
- Slow responses on cache miss
- Cache hit rate below 50%
- Higher latency than uncached

Why this breaks:
Cache check adds latency.
Cache write adds more latency.
Miss + overhead > no caching.

Recommended fix:

```typescript
// Optimize for cache misses, not just hits.

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

class OptimizedCache {
    // Assumed collaborators (illustrative); wire to your own cache and LLM client.
    constructor(
        private cache: { get(k: string): Promise<string | null>; set(k: string, v: string): Promise<void> },
        private queryLLM: (prompt: string) => Promise<string>,
        private hash: (s: string) => string,
    ) {}
    async queryWithCache(prompt: string): Promise<string> {
        const cacheKey = this.hash(prompt);

        // Non-blocking cache check
        const cachedPromise = this.cache.get(cacheKey);
        const llmPromise = this.queryLLM(prompt);

        // Race: use cache if available before LLM returns
        const cached = await Promise.race([
            cachedPromise,
            sleep(50).then(() => null)  // 50ms cache timeout
        ]);

        if (cached) {
            // Cancel LLM request if possible
            return cached;
        }

        // Cache miss: continue with LLM
        const response = await llmPromise;

        // Async cache write (don't block response)
        this.cache.set(cacheKey, response).catch(console.error);

        return response;
    }
}

// Alternative: Probabilistic caching
// Only cache if queries match known high-frequency patterns.
class SelectiveCache {
    private patterns: Map<string, number> = new Map();

    shouldCache(prompt: string): boolean {
        const pattern = this.extractPattern(prompt);
        const frequency = this.patterns.get(pattern) || 0;

        // Only cache high-frequency patterns
        return frequency > 10;
    }

    recordQuery(prompt: string): void {
        const pattern = this.extractPattern(prompt);
        this.patterns.set(pattern, (this.patterns.get(pattern) || 0) + 1);
    }

    // Assumed pattern normalizer (illustrative): collapse digits so
    // similar prompts map to the same template key.
    private extractPattern(prompt: string): string {
        return prompt.replace(/\d+/g, '#');
    }
}
```

### Cached responses become incorrect over time

Severity: HIGH

Situation: Users get outdated or wrong information from cache

Symptoms:
- Users report wrong information
- Answers don't match current data
- Complaints about outdated responses

Why this breaks:
Source data changed.
No cache invalidation.
Long TTLs for dynamic data.

Recommended fix:

```typescript
// Implement proper cache invalidation.

class InvalidatingCache {
    // Assumed collaborators (illustrative); `cache` stores JSON-serializable values.
    constructor(
        private cache: {
            get(k: string): Promise<any | null>;
            set(k: string, v: any): Promise<void>;
            delete(k: string): Promise<void>;
        },
        private hash: (s: string) => string,
        private invalidateByTag: (tag: string) => void,
    ) {}
    // Version-based invalidation
    private cacheVersion = 1;

    getCacheKey(prompt: string): string {
        return `v${this.cacheVersion}:${this.hash(prompt)}`;
    }

    invalidateAll(): void {
        this.cacheVersion++;
        // Old keys automatically become orphaned
    }

    // Content-hash invalidation
    async setWithContentHash(
        key: string,
        response: string,
        sourceContent: string
    ): Promise<void> {
        const contentHash = this.hash(sourceContent);
        await this.cache.set(key, {
            response,
            contentHash,
            timestamp: Date.now()
        });
    }

    async getIfValid(
        key: string,
        currentSourceContent: string
    ): Promise<string | null> {
        const cached = await this.cache.get(key);
        if (!cached) return null;

        // Check if source content changed
        const currentHash = this.hash(currentSourceContent);
        if (cached.contentHash !== currentHash) {
            await this.cache.delete(key);
            return null;
        }

        return cached.response;
    }

    // Event-based invalidation
    onSourceUpdate(sourceId: string): void {
        // Invalidate all caches that used this source
        this.invalidateByTag(`source:${sourceId}`);
    }
}
```

### Prompt caching doesn't work due to prefix changes

Severity: MEDIUM

Situation: Cache misses despite similar prompts

Symptoms:
- Cache hit rate lower than expected
- Cache creation tokens high, read low
- Similar prompts not hitting cache

Why this breaks:
Anthropic caching requires exact prefix match.
Timestamps or dynamic content in prefix.
Different message order.

Recommended fix:

```typescript
// Structure prompts for optimal caching.

// `SystemMessage` approximates Anthropic's system content-block shape;
// the STATIC_* constants are placeholders for your stable content.
type SystemMessage = {
    type: "text";
    text: string;
    cache_control?: { type: "ephemeral" };
};

declare const STATIC_SYSTEM_PROMPT: string;
declare const STATIC_KNOWLEDGE_BASE: string;

class CacheOptimizedPrompts {
    // WRONG: Dynamic content in cached prefix
    buildPromptBad(query: string): SystemMessage[] {
        return [
            {
                type: "text",
                text: `You are helpful. Current time: ${new Date()}`,  // BREAKS CACHE!
                cache_control: { type: "ephemeral" }
            }
        ];
    }

    // RIGHT: Static prefix, dynamic at end
    buildPromptGood(query: string): SystemMessage[] {
        return [
            {
                type: "text",
                text: STATIC_SYSTEM_PROMPT,  // Never changes
                cache_control: { type: "ephemeral" }
            },
            {
                type: "text",
                text: STATIC_KNOWLEDGE_BASE,  // Rarely changes
                cache_control: { type: "ephemeral" }
            }
            // Dynamic content goes in messages, NOT system
        ];
    }

    // Prefix ordering matters
    buildWithConsistentOrder(components: string[]): SystemMessage[] {
        // Sort components for consistent ordering
        const sorted = [...components].sort();
        return sorted.map((c, i) => ({
            type: "text",
            text: c,
            cache_control: i === sorted.length - 1
                ? { type: "ephemeral" }
                : undefined  // Only cache the full prefix
        }));
    }
}
```

## Validation Checks

### Caching High Temperature Responses

Severity: WARNING

Message: Caching with high temperature. Responses are non-deterministic.

Fix action: Only cache responses with temperature <= 0.5

### Cache Without TTL

Severity: WARNING

Message: Cache without TTL. May serve stale data indefinitely.

Fix action: Set appropriate TTL based on data freshness requirements

### Dynamic Content in Cached Prefix

Severity: WARNING

Message: Dynamic content in cached prefix. Will cause cache misses.

Fix action: Move dynamic content outside of cache_control blocks

### No Cache Metrics

Severity: INFO

Message: Cache without hit/miss tracking. Can't measure effectiveness.

Fix action: Add cache hit/miss metrics and logging
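Taken together, these checks are mechanical enough to enforce in code. A minimal sketch, assuming a simple request shape (all names here are illustrative):

```typescript
// Hypothetical request shape for a cache wrapper; adapt to your own.
interface CacheRequest {
    prompt: string;
    temperature: number;
    ttlSeconds?: number;
    trackMetrics?: boolean;
}

interface CheckResult {
    severity: 'WARNING' | 'INFO';
    message: string;
}

function validateCacheRequest(req: CacheRequest): CheckResult[] {
    const issues: CheckResult[] = [];

    // Caching High Temperature Responses
    if (req.temperature > 0.5) {
        issues.push({ severity: 'WARNING', message: 'Caching with high temperature; responses are non-deterministic.' });
    }

    // Cache Without TTL
    if (req.ttlSeconds === undefined) {
        issues.push({ severity: 'WARNING', message: 'Cache without TTL; may serve stale data indefinitely.' });
    }

    // Dynamic Content in Cached Prefix (crude heuristic: ISO-style timestamps)
    if (/\d{4}-\d{2}-\d{2}T\d{2}:\d{2}/.test(req.prompt)) {
        issues.push({ severity: 'WARNING', message: 'Dynamic content in cached prefix; will cause cache misses.' });
    }

    // No Cache Metrics
    if (!req.trackMetrics) {
        issues.push({ severity: 'INFO', message: "Cache without hit/miss tracking; can't measure effectiveness." });
    }

    return issues;
}
```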

## Collaboration

### Delegation Triggers

- context window|token -> context-window-management (Need context optimization)
- rag|retrieval -> rag-implementation (Need retrieval system)
- memory -> conversation-memory (Need memory persistence)
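Read mechanically, each trigger is a keyword pattern mapped to a skill; a hypothetical sketch of that routing:

```typescript
// The pipe-separated keywords above, as regex -> skill routes.
const delegationTriggers: Array<[RegExp, string]> = [
    [/context window|token/i, 'context-window-management'],
    [/rag|retrieval/i, 'rag-implementation'],
    [/memory/i, 'conversation-memory'],
];

function delegate(query: string): string | null {
    for (const [pattern, skill] of delegationTriggers) {
        if (pattern.test(query)) return skill;
    }
    return null;  // no delegation needed
}
```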

### High-Performance LLM System

Skills: prompt-caching, context-window-management, rag-implementation

Workflow:

```
1. Analyze query patterns
2. Implement prompt caching for stable prefixes
3. Add response caching for frequent queries
4. Consider CAG for stable document sets
5. Monitor and optimize hit rates
```
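As a rough illustration of steps 2-3, the layers compose so the response cache is consulted first and the prompt-cache-friendly call only happens on a miss. A sketch reusing the `ResponseCache` and `queryWithCaching` ideas from earlier in this file (wiring assumed):

```typescript
// Layered lookup: full-response cache first, then an LLM call whose
// stable prefix is served from Anthropic's prompt cache.
const responseCache = new ResponseCache();

async function answer(userQuery: string): Promise<string> {
    // Layer 1: full-response cache (zero tokens on a hit)
    const cached = await responseCache.getCached(userQuery);
    if (cached) return cached;

    // Layer 2: prompt-cached LLM call (cached prefix tokens are ~90% cheaper)
    const response = await queryWithCaching(userQuery);
    const block = response.content[0];
    const text = block.type === 'text' ? block.text : '';

    // Write-through so the next identical query hits layer 1
    await responseCache.setCached(userQuery, text);
    return text;
}
```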

## Related Skills

Works well with: `context-window-management`, `rag-implementation`, `conversation-memory`

## When to Use
- User mentions or implies: prompt caching
- User mentions or implies: cache prompt
- User mentions or implies: response cache
- User mentions or implies: cag
- User mentions or implies: cache augmented

## Limitations
- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.

Related Skills

turborepo-caching

31392
from sickn33/antigravity-awesome-skills

Configure Turborepo for efficient monorepo builds with local and remote caching. Use when setting up Turborepo, optimizing build pipelines, or implementing distributed caching.

prompt-library

31392
from sickn33/antigravity-awesome-skills

A comprehensive collection of battle-tested prompts inspired by [awesome-chatgpt-prompts](https://github.com/f/awesome-chatgpt-prompts) and community best practices.

prompt-engineering

31392
from sickn33/antigravity-awesome-skills

Expert guide on prompt engineering patterns, best practices, and optimization techniques. Use when user wants to improve prompts, learn prompting strategies, or debug agent behavior.

prompt-engineering-patterns

31392
from sickn33/antigravity-awesome-skills

Master advanced prompt engineering techniques to maximize LLM performance, reliability, and controllability.

prompt-engineer

31392
from sickn33/antigravity-awesome-skills

Transforms user prompts into optimized prompts using frameworks (RTF, RISEN, Chain of Thought, RODES, Chain of Density, RACE, RISE, STAR, SOAP, CLEAR, GROW)

nextjs-best-practices

31392
from sickn33/antigravity-awesome-skills

Next.js App Router principles. Server Components, data fetching, routing patterns.

network-101

31392
from sickn33/antigravity-awesome-skills

Configure and test common network services (HTTP, HTTPS, SNMP, SMB) for penetration testing lab environments. Enable hands-on practice with service enumeration, log analysis, and security testing against properly configured target systems.

neon-postgres

31392
from sickn33/antigravity-awesome-skills

Expert patterns for Neon serverless Postgres, branching, connection pooling, and Prisma/Drizzle integration

nanobanana-ppt-skills

31392
from sickn33/antigravity-awesome-skills

AI-powered PPT generation with document analysis and styled images

multi-agent-patterns

31392
from sickn33/antigravity-awesome-skills

This skill should be used when the user asks to "design multi-agent system", "implement supervisor pattern", "create swarm architecture", "coordinate multiple agents", or mentions multi-agent patterns, context isolation, agent handoffs, sub-agents, or parallel agent execution.

monorepo-management

31392
from sickn33/antigravity-awesome-skills

Build efficient, scalable monorepos that enable code sharing, consistent tooling, and atomic changes across multiple packages and applications.

monetization

31392
from sickn33/antigravity-awesome-skills

Estrategia e implementacao de monetizacao para produtos digitais - Stripe, subscriptions, pricing experiments, freemium, upgrade flows, churn prevention, revenue optimization e modelos de negocio SaaS.