prompt-cache-optimizer
Optimize token usage through prompt caching and compression
Best use case
prompt-cache-optimizer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using prompt-cache-optimizer should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/prompt-cache-optimizer/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How prompt-cache-optimizer Compares
| Feature / Agent | prompt-cache-optimizer | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Optimize token usage through prompt caching and compression
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Prompt Cache Optimizer Skill
Reduces token costs by 50-90% through intelligent caching and compression.
## When to Activate
- Large context windows (>50K tokens)
- Repeated similar queries
- Long-running sessions
- Cost-conscious operations
## Optimization Layers
### Layer 1: Semantic Caching
```
Query → Embedding → Similarity Search → Cache Hit/Miss
                            ↓                  ↓
                    Vector Store   Return cached or call LLM
```
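A cache hit skips the LLM call entirely, so it saves 100% of that request's LLM tokens and returns almost instantly; the embedding and similarity lookup still add a small cost.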
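The lookup flow above can be sketched in a few lines. This is a minimal illustration, not the skill's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, the list scan stands in for a vector store, and the `SemanticCache` class, `answer` helper, and 0.8 threshold are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical cache: a linear scan stands in for a vector store."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # assumed cutoff; tune on real traffic
        self.entries: List[Tuple[Counter, str]] = []  # (embedding, response)

    def lookup(self, query: str) -> Optional[str]:
        q = embed(query)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(q, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))


def answer(query: str, cache: SemanticCache, call_llm: Callable[[str], str]) -> str:
    """Return a cached response on a hit; otherwise call the LLM and cache it."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached
    response = call_llm(query)
    cache.store(query, response)
    return response
```

The threshold is the main design choice: set too low, the cache returns answers to questions that only look similar; set too high, it rarely hits.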
### Layer 2: Prompt Compression (LLMLingua-2)
- **Light**: 2-3x reduction, <5% accuracy impact
- **Moderate**: 5-7x reduction, 5-15% accuracy impact
- **Aggressive**: 10-20x reduction, requires validation
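As a rough worked example of what these ratios mean for a token budget, the sketch below converts a level into an expected size range. The ratios come from the list above; the `COMPRESSION_LEVELS` table and `compressed_size_range` helper are hypothetical names, and the function only does arithmetic (it does not call LLMLingua-2).

```python
# Reduction ratios taken from the levels listed above: (min, max).
COMPRESSION_LEVELS = {
    "light": (2, 3),
    "moderate": (5, 7),
    "aggressive": (10, 20),
}


def compressed_size_range(tokens: int, level: str) -> tuple:
    """Return (smallest, largest) expected token count after compression."""
    low_ratio, high_ratio = COMPRESSION_LEVELS[level]
    return tokens // high_ratio, tokens // low_ratio


# A 30,000-token context at each level:
for level in COMPRESSION_LEVELS:
    print(level, compressed_size_range(30_000, level))
```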
### Layer 3: Strategic Context Placement
Mitigate the "lost in the middle" problem:
- Place most important information at START and END
- Middle content has 30-50% lower retention
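One way to apply this placement rule, assuming each chunk already carries an importance score, is to rank the chunks and deal them alternately to the front and the back of the prompt. The `place_at_boundaries` function and the scores below are illustrative assumptions, not part of the skill.

```python
from typing import List, Tuple


def place_at_boundaries(chunks: List[Tuple[float, str]]) -> List[str]:
    """Order chunks so the highest-importance ones sit at the start and end.

    chunks: (importance, text) pairs; a higher importance matters more.
    """
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)
    front, back = [], []
    for i, (_, text) in enumerate(ranked):
        # Alternate: best goes first, second-best goes last, and so on inward.
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]


chunks = [(0.9, "A"), (0.1, "E"), (0.7, "B"), (0.3, "D"), (0.5, "C")]
print(place_at_boundaries(chunks))  # ['A', 'C', 'E', 'D', 'B']
```

The least important chunk ("E") ends up in the middle, where the skill expects retention to be lowest.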
### Layer 4: Hierarchical Memory Tiering
```
Working Memory (registers)  → Always in context
FIFO Queue (L1/L2 cache)    → Recent exchanges
Archival Memory (disk)      → Semantic search only
```
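The three tiers can be sketched as a small class. This is an assumption-laden illustration: the `TieredMemory` name, the queue size, and the keyword-overlap search (standing in for real semantic search over embeddings) are all hypothetical.

```python
from collections import deque
from typing import List


class TieredMemory:
    """Hypothetical three-tier memory: pinned, recent (FIFO), and archived."""

    def __init__(self, pinned: List[str], queue_size: int = 4):
        self.working = list(pinned)   # always included in the context
        self.recent = deque()         # FIFO queue of recent exchanges
        self.queue_size = queue_size
        self.archive: List[str] = []  # only reachable through search

    def add(self, message: str) -> None:
        self.recent.append(message)
        # Overflow from the FIFO queue is demoted to the archive, not dropped.
        while len(self.recent) > self.queue_size:
            self.archive.append(self.recent.popleft())

    def search_archive(self, query: str, top_k: int = 1) -> List[str]:
        # Keyword overlap is a stand-in for semantic (embedding) search.
        q = set(query.lower().split())
        scored = [(len(q & set(m.lower().split())), m) for m in self.archive]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [m for _, m in scored[:top_k]]

    def build_context(self, query: str) -> List[str]:
        return self.working + self.search_archive(query) + list(self.recent)
```

Archived messages re-enter the context only when a query matches them, which keeps the always-present portion small.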
## Implementation Workflow
1. **Check semantic cache** before any LLM call
2. **Compress context** using appropriate level
3. **Structure placement** - critical info at boundaries
4. **Tier management** - evict low-importance content
5. **Cache response** for future queries
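The ordering of the five steps can be sketched as one function that takes each step as a callable. Everything here is hypothetical: `run_workflow` and its parameter names are invented for the example, and the usage below plugs in trivial stand-ins (an exact-match dict instead of a semantic cache, truncation instead of real compression).

```python
from typing import Callable, List, Optional


def run_workflow(
    query: str,
    context: List[str],
    *,
    cache_get: Callable[[str], Optional[str]],
    cache_put: Callable[[str, str], None],
    compress: Callable[[List[str]], List[str]],
    place: Callable[[List[str]], List[str]],
    evict: Callable[[List[str]], List[str]],
    call_llm: Callable[[str, List[str]], str],
) -> str:
    cached = cache_get(query)            # 1. check the cache before any LLM call
    if cached is not None:
        return cached
    context = compress(context)          # 2. compress context
    context = place(context)             # 3. put critical info at the boundaries
    context = evict(context)             # 4. evict low-importance content
    response = call_llm(query, context)
    cache_put(query, response)           # 5. cache the response for future queries
    return response


# Usage with trivial stand-ins for every step.
store = {}
response = run_workflow(
    "summarize the design doc",
    ["intro " * 50, "details " * 50, "conclusion " * 50],
    cache_get=store.get,
    cache_put=store.__setitem__,
    compress=lambda ctx: [c[:40] for c in ctx],
    place=lambda ctx: ctx,
    evict=lambda ctx: ctx[:2],
    call_llm=lambda q, ctx: f"answered with {len(ctx)} chunks",
)
print(response)  # answered with 2 chunks
```

Passing the steps in as callables keeps the ordering visible and lets each layer be swapped or tested on its own.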
## Compression Decision Matrix
| Context Size | Latency Need | Accuracy Need | Strategy |
|--------------|--------------|---------------|----------|
| <10K tokens | Any | Any | No compression |
| 10K-50K | Low | High | Light (2-3x) |
| 10K-50K | High | Medium | Moderate (5-7x) |
| 50K-100K | Any | Medium | Aggressive (10-20x) |
| >100K | Any | Any | Hierarchical + Aggressive |
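The matrix can be encoded as a lookup function. The table leaves some cases open, so this sketch makes labeled assumptions: exactly 50K tokens is treated as part of the 10K-50K band, exactly 100K as part of 50K-100K, and combinations the table does not list fall back to the more conservative option. The `choose_strategy` name and its string values are hypothetical.

```python
def choose_strategy(
    context_tokens: int,
    latency_need: str = "any",
    accuracy_need: str = "any",
) -> str:
    """Map the decision matrix to a strategy name."""
    if context_tokens < 10_000:
        return "none"
    if context_tokens > 100_000:
        return "hierarchical+aggressive"
    if context_tokens <= 50_000:  # assumption: 50K belongs to the lower band
        if latency_need == "high" and accuracy_need == "medium":
            return "moderate"
        # Listed row: low latency need + high accuracy need -> light.
        # Assumption: unlisted combinations also use light (least lossy).
        return "light"
    # 50K-100K band: the table lists only medium accuracy -> aggressive.
    if accuracy_need == "high":
        # Assumption, not in the table: back off to moderate when accuracy matters.
        return "moderate"
    return "aggressive"


print(choose_strategy(30_000, "low", "high"))     # light
print(choose_strategy(80_000, "any", "medium"))   # aggressive
```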
## Key Patterns
### Attention Sink Preservation
For streaming/long sessions, preserve first 4 tokens as attention sinks:
```
[attention_sinks (4 tokens)] + [rolling_window (window - 4)]
```
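This keeps generation stable over arbitrarily long streams. It does not extend what the model can recall: tokens evicted from the rolling window are gone.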
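The eviction rule can be illustrated on a plain list of token positions. This is a simplified sketch: StreamingLLM applies the idea to the model's key/value cache rather than to a token list, and the `evict_with_sinks` helper is a hypothetical name.

```python
from typing import List, Sequence


def evict_with_sinks(tokens: Sequence[int], window: int, num_sinks: int = 4) -> List[int]:
    """Keep the first `num_sinks` tokens plus the most recent `window - num_sinks`."""
    if window <= num_sinks:
        raise ValueError("window must be larger than the number of sink tokens")
    if len(tokens) <= window:
        return list(tokens)  # nothing to evict yet
    return list(tokens[:num_sinks]) + list(tokens[-(window - num_sinks):])


# 20 tokens, window of 8: keep positions 0-3 (sinks) and 16-19 (recent).
print(evict_with_sinks(list(range(20)), window=8))  # [0, 1, 2, 3, 16, 17, 18, 19]
```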
### Hybrid Search for RAG
```
Hybrid = Dense (semantic) + Sparse (BM25)
Fusion = Reciprocal Rank Fusion (RRF)
```
Passing only the top fused results to the model achieves a 50-100x reduction in documents while maintaining relevance.
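Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over the rankings it appears in. The sketch below assumes the dense and sparse retrievers have already produced ranked lists of document IDs; k = 60 is a commonly used default and an assumption here, as are the function name and sample IDs.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


dense = ["d1", "d2", "d3"]   # hypothetical semantic-search results
sparse = ["d3", "d1", "d4"]  # hypothetical BM25 results
print(reciprocal_rank_fusion([dense, sparse]))  # ['d1', 'd3', 'd2', 'd4']
```

RRF uses only ranks, so dense and sparse scores do not need to be normalized to a common scale before fusing.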
## Metrics to Track
- Cache hit rate (target: >60%)
- Compression ratio achieved
- Accuracy impact (sample validation)
- Token savings per session
- Latency impact
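A small counter object is enough to track the first, second, and fourth metrics per session. This is a sketch with hypothetical names (`SessionMetrics`, `record_hit`, `record_miss`); accuracy and latency need separate measurement and are not covered here.

```python
from dataclasses import dataclass


@dataclass
class SessionMetrics:
    hits: int = 0
    misses: int = 0
    original_tokens: int = 0   # tokens all prompts would have used unoptimized
    sent_tokens: int = 0       # tokens actually sent to the LLM
    compressed_from: int = 0   # original size of the prompts that were sent

    def record_hit(self, original: int) -> None:
        """A cache hit: no tokens are sent to the LLM."""
        self.hits += 1
        self.original_tokens += original

    def record_miss(self, original: int, sent: int) -> None:
        """A cache miss: `sent` tokens go to the LLM after compression."""
        self.misses += 1
        self.original_tokens += original
        self.sent_tokens += sent
        self.compressed_from += original

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def compression_ratio(self) -> float:
        return self.compressed_from / self.sent_tokens if self.sent_tokens else 1.0

    @property
    def tokens_saved(self) -> int:
        return self.original_tokens - self.sent_tokens


m = SessionMetrics()
m.record_miss(original=12_000, sent=4_000)
m.record_hit(original=12_000)
m.record_hit(original=8_000)
print(m.compression_ratio, m.tokens_saved)  # 3.0 28000
```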
## Integration Points
- Pre-prompt: Apply compression
- Post-response: Cache result
- Session start: Load cached context
- Memory pressure: Tier eviction
---
*Based on LLMLingua, GPTCache, MemGPT, and StreamingLLM research*