prompt-cache-optimizer

Optimize token usage through prompt caching and compression

170 stars

Best use case

prompt-cache-optimizer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using prompt-cache-optimizer should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/prompt-cache-optimizer/SKILL.md --create-dirs "https://raw.githubusercontent.com/Miosa-osa/canopy/main/library/skills/ai-patterns/prompt-cache-optimizer/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/prompt-cache-optimizer/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How prompt-cache-optimizer Compares

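| Feature / Agent         | prompt-cache-optimizer | Standard Approach |
|-------------------------|------------------------|-------------------|
| Platform Support        | Not specified          | Limited / Varies  |
| Context Awareness       | High                   | Baseline          |
| Installation Complexity | Unknown                | N/A               |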

Frequently Asked Questions

What does this skill do?

Optimize token usage through prompt caching and compression

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Prompt Cache Optimizer Skill

Targets a 50-90% reduction in token costs through semantic caching and prompt compression. Actual savings depend on cache hit rate and the compression level chosen.

## When to Activate

- Large context windows (>50K tokens)
- Repeated similar queries
- Long-running sessions
- Cost-conscious operations

## Optimization Layers

### Layer 1: Semantic Caching
```
Query → Embedding → Similarity Search → Cache Hit/Miss
         ↓              ↓
    Vector Store    Return cached or call LLM
```

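A cache hit skips the LLM call entirely, so it saves 100% of that call's LLM tokens and returns almost instantly. The embedding and similarity lookup still add a small cost.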
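The lookup flow above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector store, and the `SemanticCache` class, the `answer` helper, and the `0.8` threshold are all hypothetical names and values chosen for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: replace with a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """In-memory stand-in for a vector store (hypothetical helper)."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # similarity cutoff; tune on real traffic
        self.entries = []           # list of (embedding, response) pairs

    def get(self, query):
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((embed(query), response))


def answer(query, cache, call_llm):
    """Return (response, was_cache_hit); only call the LLM on a miss."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True
    response = call_llm(query)
    cache.put(query, response)
    return response, False
```

A threshold that is too low returns stale or mismatched answers, so it is worth validating hits on a sample before trusting the cache in production.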

### Layer 2: Prompt Compression (LLMLingua-2)
- **Light**: 2-3x reduction, <5% accuracy impact
- **Moderate**: 5-7x reduction, 5-15% accuracy impact
- **Aggressive**: 10-20x reduction, requires validation

### Layer 3: Strategic Context Placement
Mitigate the "lost in the middle" problem:
- Place most important information at START and END
- Middle content has 30-50% lower retention
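One way to apply this placement rule, assuming the chunks have already been ranked by importance (most important first), is to alternate them between the front and the back of the prompt so the least important ones end up in the middle. The function name `place_at_boundaries` is hypothetical.

```python
def place_at_boundaries(chunks_by_importance):
    """Reorder chunks so the most important sit at the start and end.

    Input must be sorted most-important-first. Even-ranked chunks fill the
    front, odd-ranked chunks fill the back (reversed), so importance
    decreases toward the middle of the returned list.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_importance):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]
```

With five chunks ranked A (highest) to E (lowest), the result is A, C, E, D, B: the two top-ranked chunks sit at the boundaries and the lowest-ranked one sits in the middle.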

### Layer 4: Hierarchical Memory Tiering
```
Working Memory (registers)  → Always in context
FIFO Queue (L1/L2 cache)    → Recent exchanges
Archival Memory (disk)      → Semantic search only
```
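A rough sketch of the three tiers, under stated assumptions: the `TieredMemory` class and its method names are hypothetical, the queue size is arbitrary, and the archive search uses simple keyword overlap as a stand-in for real semantic search.

```python
from collections import deque


class TieredMemory:
    """Three-tier memory: working set, FIFO queue, searchable archive."""

    def __init__(self, system_prompt, queue_size=4):
        self.working = [system_prompt]  # always included in the context
        self.queue = deque()            # most recent exchanges
        self.queue_size = queue_size
        self.archive = []               # only reachable through search

    def add_exchange(self, text):
        self.queue.append(text)
        # Evict the oldest exchanges into the archive when the queue is full.
        while len(self.queue) > self.queue_size:
            self.archive.append(self.queue.popleft())

    def search_archive(self, query, k=2):
        # Keyword overlap as a placeholder for embedding-based search.
        q = set(query.lower().split())
        scored = [(len(q & set(t.lower().split())), t) for t in self.archive]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [t for _, t in scored[:k]]

    def build_context(self, query):
        # Working memory first, then retrieved archive items, then recent turns.
        return self.working + self.search_archive(query) + list(self.queue)
```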

## Implementation Workflow

1. **Check semantic cache** before any LLM call
2. **Compress context** using appropriate level
3. **Structure placement** - critical info at boundaries
4. **Tier management** - evict low-importance content
5. **Cache response** for future queries

## Compression Decision Matrix

| Context Size | Latency Need | Accuracy Need | Strategy |
|--------------|--------------|---------------|----------|
| <10K tokens  | Any          | Any           | No compression |
| 10K-50K      | Low          | High          | Light (2-3x) |
| 10K-50K      | High         | Medium        | Moderate (5-7x) |
| 50K-100K     | Any          | Medium        | Aggressive (10-20x) |
| >100K        | Any          | Any           | Hierarchical + Aggressive |
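The matrix can be encoded as a small lookup function. The sketch below is an assumption-laden reading of the table: `choose_strategy` is a hypothetical name, boundary values (exactly 50K or 100K) are assigned to the lower bucket, and combinations the table does not list fall back to the next lighter level.

```python
def choose_strategy(context_tokens, latency_need="low", accuracy_need="high"):
    """Map context size and needs to a compression strategy label."""
    if context_tokens < 10_000:
        return "none"
    if context_tokens <= 50_000:
        if latency_need == "high" and accuracy_need == "medium":
            return "moderate"
        # Covers the listed low-latency/high-accuracy row; unlisted
        # combinations also fall back to light (assumption).
        return "light"
    if context_tokens <= 100_000:
        if accuracy_need == "medium":
            return "aggressive"
        # Not covered by the table: assume a lighter level when accuracy
        # needs are not "medium".
        return "moderate"
    return "hierarchical+aggressive"
```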

## Key Patterns

### Attention Sink Preservation
For streaming or long sessions, preserve the first 4 tokens as attention sinks:
```
[attention_sinks (4 tokens)] + [rolling_window (window - 4)]
```
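This keeps generation stable over arbitrarily long streams. It does not extend what the model can recall: tokens evicted from the rolling window are no longer visible to it.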
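The eviction policy can be illustrated on a plain token list. This is only a sketch: StreamingLLM applies the same idea to KV-cache entries inside the model rather than to a Python list, and `evict_with_sinks` is a hypothetical helper name.

```python
def evict_with_sinks(tokens, window, num_sinks=4):
    """Keep the first `num_sinks` tokens plus the most recent ones.

    The result never exceeds `window` items: the sinks, followed by the
    last `window - num_sinks` tokens.
    """
    if window <= num_sinks:
        raise ValueError("window must be larger than num_sinks")
    if len(tokens) <= window:
        return list(tokens)
    recent = tokens[-(window - num_sinks):]
    return list(tokens[:num_sinks]) + list(recent)
```

For a 20-token stream with `window=8`, this keeps tokens 0 to 3 and tokens 16 to 19, dropping everything in between.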

### Hybrid Search for RAG
```
Hybrid = Dense (semantic) + Sparse (BM25)
Fusion = Reciprocal Rank Fusion (RRF)
```
Can achieve a 50-100x reduction in the documents passed to the model while maintaining relevance.
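Reciprocal Rank Fusion scores each document as the sum of `1 / (k + rank)` over every ranking it appears in. A minimal sketch follows, assuming the dense and sparse retrievers each return an ordered list of document IDs. The function name is hypothetical, and `k=60` is the constant commonly used in the RRF literature, not a value taken from this skill.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) across the lists that contain
    it, with rank starting at 1. Ties are broken by document ID.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_results = ["a", "b", "c"]   # e.g. from embedding search
sparse_results = ["b", "d", "a"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense_results, sparse_results])
```

Documents that appear in both lists ("a" and "b") outrank those found by only one retriever, which is the main reason to fuse rather than pick one list.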

## Metrics to Track

- Cache hit rate (target: >60%)
- Compression ratio achieved
- Accuracy impact (sample validation)
- Token savings per session
- Latency impact
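A small per-session counter is enough to track the first, second, and fourth metrics. The sketch below is illustrative only: the `SessionMetrics` class and its field names are hypothetical, and it counts only the tokens saved by compression, not those saved by cache hits.

```python
from dataclasses import dataclass


@dataclass
class SessionMetrics:
    """Running counters for one session."""

    cache_hits: int = 0
    cache_misses: int = 0
    original_tokens: int = 0
    compressed_tokens: int = 0

    def record_lookup(self, hit):
        if hit:
            self.cache_hits += 1
        else:
            self.cache_misses += 1

    def record_compression(self, original, compressed):
        self.original_tokens += original
        self.compressed_tokens += compressed

    @property
    def hit_rate(self):
        total = self.cache_hits + self.cache_misses
        return self.cache_hits / total if total else 0.0

    @property
    def compression_ratio(self):
        if not self.compressed_tokens:
            return 0.0
        return self.original_tokens / self.compressed_tokens

    @property
    def tokens_saved_by_compression(self):
        return self.original_tokens - self.compressed_tokens
```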

## Integration Points

- Pre-prompt: Apply compression
- Post-response: Cache result
- Session start: Load cached context
- Memory pressure: Tier eviction

---

*Based on LLMLingua, GPTCache, MemGPT, and StreamingLLM research*