context-engineering

Use when designing agent system prompts, optimizing RAG retrieval, or when context is too expensive or slow. Reduces tokens while maintaining quality through strategic positioning and attention-aware design.

16 stars

bydiegosouzapw

View on GitHub Installation ↓

Best use case

context-engineering is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using context-engineering should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/context-engineering/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/context-engineering/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/context-engineering/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How context-engineering Compares

Feature / Agent	context-engineering	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Context Engineering

**The discipline of curating the smallest high-signal token set that achieves desired outcomes.**

## Overview

Context engineering goes beyond prompt engineering. While prompts focus on *what* you ask, context engineering focuses on *everything* the model sees—system instructions, tool definitions, documents, message history, and tool outputs.

**Key Insight:** Context windows are constrained not by raw token capacity but by attention mechanics. As context grows, models experience degradation.

## Overview

- Designing agent system prompts
- Optimizing RAG retrieval pipelines
- Managing long-running conversations
- Building multi-agent architectures
- Reducing token costs while maintaining quality

---

## The "Lost in the Middle" Phenomenon

Models pay unequal attention across the context window:

```
Attention
Strength   ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░████████
           ↑                                                      ↑
        START              MIDDLE (weakest attention)           END
```

**Practical Implications:**

| Position | Attention | Best For |
|----------|-----------|----------|
| START | High | System identity, critical instructions, constraints |
| MIDDLE | Low | Background context, optional details |
| END | High | Current task, recent messages, immediate query |

---

## The Five Context Layers

### 1. System Prompts (Identity Layer)

Establishes agent identity at the right "altitude":

```
TOO HIGH (vague):        "You are a helpful assistant"
TOO LOW (brittle):       "Always respond with exactly 3 bullet points..."
OPTIMAL (principled):    "You are a senior engineer who values clarity,
                          tests assumptions, and explains trade-offs"
```

**Best Practices:**
- Define role and expertise level
- State core principles (not rigid rules)
- Include what NOT to do (boundaries)
- Position at START of context

### 2. Tool Definitions (Capability Layer)

Tools steer behavior through descriptions:

```python
# ❌ BAD: Ambiguous - when would you use this?
@tool
def search(query: str) -> str:
    """Search for information."""
    pass

# ✅ GOOD: Clear trigger conditions
@tool
def search_documentation(query: str) -> str:
    """
    Search internal documentation for technical answers.

    USE WHEN:
    - User asks about internal APIs or services
    - Question requires company-specific knowledge
    - Public information is insufficient

    DO NOT USE WHEN:
    - Question is general programming knowledge
    - User explicitly wants external sources
    """
    pass
```

**Rule:** If a human cannot definitively say which tool to use, an agent cannot either.

### 3. Retrieved Documents (Knowledge Layer)

Just-in-time loading beats pre-loading:

```python
# ❌ BAD: Pre-load everything
context = load_all_documentation()  # 50k tokens!

# ✅ GOOD: Progressive disclosure
def build_context(query: str) -> str:
    # Stage 1: Lightweight retrieval (500 tokens)
    summaries = search_summaries(query, top_k=5)

    # Stage 2: Selective deep loading (only if needed)
    if needs_detail(summaries):
        full_docs = load_full_documents(summaries[:2])
        return summaries + full_docs

    return summaries
```

### 4. Message History (Memory Layer)

Treat as scratchpad, not permanent storage:

```python
# Implement sliding window with compression
MAX_MESSAGES = 20
COMPRESSION_TRIGGER = 0.7  # 70% of context budget

def manage_history(messages: list, budget: int) -> list:
    current_tokens = count_tokens(messages)

    if current_tokens > budget * COMPRESSION_TRIGGER:
        # Compress older messages, keep recent
        old = messages[:-5]
        recent = messages[-5:]

        summary = summarize(old)  # Anchored compression
        return [summary] + recent

    return messages
```

### 5. Tool Outputs (Observation Layer)

**Critical Finding:** Tool outputs can reach 83.9% of total context usage!

```python
# ❌ BAD: Return raw output
def search_web(query: str) -> str:
    results = web_search(query)
    return json.dumps(results)  # Could be 10k+ tokens!

# ✅ GOOD: Structured, bounded output
def search_web(query: str) -> str:
    results = web_search(query)

    # Extract only what's needed
    extracted = [
        {
            "title": r["title"],
            "snippet": r["snippet"][:200],  # Truncate
            "url": r["url"]
        }
        for r in results[:5]  # Limit count
    ]

    return json.dumps(extracted)  # ~500 tokens max
```

---

## The 95% Finding

Research shows what actually drives agent performance:

```
┌────────────────────────────────────────────────────────────────┐
│  TOKEN USAGE        ████████████████████████████████████  80%  │
│  TOOL CALLS         █████  10%                                 │
│  MODEL CHOICE       ██  5%                                     │
│  OTHER              ██  5%                                     │
└────────────────────────────────────────────────────────────────┘
```

**Key Insight:** Optimize context efficiency BEFORE switching models.

---

## Context Budget Management

### Token Budget Calculator

```python
def calculate_budget(model: str, task_type: str) -> dict:
    """Calculate optimal token allocation."""

    MAX_CONTEXT = {
        "gpt-4o": 128_000,
        "claude-3": 200_000,
        "llama-3": 128_000,
    }

    # Reserve 20% for response generation
    available = MAX_CONTEXT[model] * 0.8

    # Allocation by task type
    ALLOCATIONS = {
        "chat": {
            "system": 0.05,      # 5%
            "tools": 0.05,       # 5%
            "history": 0.60,    # 60%
            "retrieval": 0.20,  # 20%
            "current": 0.10,    # 10%
        },
        "agent": {
            "system": 0.10,     # 10%
            "tools": 0.15,      # 15%
            "history": 0.30,    # 30%
            "retrieval": 0.25,  # 25%
            "observations": 0.20, # 20%
        },
    }

    alloc = ALLOCATIONS[task_type]
    return {k: int(v * available) for k, v in alloc.items()}
```

### Compression Triggers

```python
COMPRESSION_CONFIG = {
    "trigger_threshold": 0.70,    # Start compressing at 70%
    "target_threshold": 0.50,     # Compress down to 50%
    "preserve_recent": 5,         # Always keep last 5 messages
    "preserve_system": True,      # Never compress system prompt
}
```

---

## Attention-Aware Positioning

### Template Structure

```markdown
[START - HIGH ATTENTION]
## System Identity
You are a {role} specialized in {domain}.

## Critical Constraints
- NEVER {dangerous_action}
- ALWAYS {required_behavior}

[MIDDLE - LOWER ATTENTION]
## Background Context
{retrieved_documents}
{older_conversation_history}

[END - HIGH ATTENTION]
## Current Task
{recent_messages}
{user_query}

## Response Guidelines
{output_format_instructions}
```

### Priority Positioning Rules

1. **Identity & Constraints** → START (immutable)
2. **Critical instructions** → START or END
3. **Retrieved documents** → MIDDLE (expandable)
4. **Conversation history** → MIDDLE (compressible)
5. **Current query** → END (always visible)
6. **Output format** → END (guides generation)

---

## Metrics: Tokens-Per-Task

**Optimize for total task completion, not individual requests:**

```python
@dataclass
class TaskMetrics:
    task_id: str
    total_tokens: int = 0
    request_count: int = 0
    retrieval_tokens: int = 0
    generation_tokens: int = 0

    @property
    def tokens_per_request(self) -> float:
        return self.total_tokens / max(self.request_count, 1)

    @property
    def efficiency_ratio(self) -> float:
        """Lower is better - generation vs total context."""
        return self.generation_tokens / max(self.total_tokens, 1)
```

**Anti-pattern:** Aggressive compression that loses critical details forces expensive re-fetching, consuming MORE tokens overall.

---

## Common Pitfalls

| Pitfall | Problem | Solution |
|---------|---------|----------|
| Token stuffing | "More context = better" | Quality over quantity |
| Flat structure | No priority signaling | Use headers, positioning |
| Static context | Same context for all queries | Dynamic, query-relevant retrieval |
| Ignoring middle | Important info gets lost | Position critically |
| No compression | Context grows unbounded | Sliding window + summarization |

---

## Integration with OrchestKit

### Agent System Prompts

Apply attention-aware positioning to agent definitions:

```markdown
# Agent: backend-system-architect

[HIGH ATTENTION - START]
## Identity
Senior backend architect with 15+ years experience.

## Constraints
- NEVER suggest unvalidated security patterns
- ALWAYS consider multi-tenant isolation

[LOWER ATTENTION - MIDDLE]
## Domain Knowledge
{dynamically_loaded_patterns}

[HIGH ATTENTION - END]
## Current Task
{user_request}
```

### Skill Loading

Progressive skill disclosure:

```python
# Stage 1: Load skill metadata only (~100 tokens)
skill_index = load_skill_summaries()

# Stage 2: Load relevant skill on demand (~500 tokens)
if task_matches("database"):
    full_skill = load_skill("pgvector-search")
```

---

---

## CC 2.1.7: MCP Auto-Discovery and Deferral

### MCP Search Mode

CC 2.1.7 introduces intelligent MCP tool discovery. When context usage exceeds 10% of the effective window, MCPs are automatically deferred to reduce token overhead.

```
Context < 10%:  MCP tools immediately available
Context > 10%:  MCP tools discovered via MCPSearch (deferred loading)

Savings: ~7200 tokens per session average
```

### How Auto-Deferral Works

The context budget monitor tracks usage against the effective window:

1. **Below 10%**: MCP tool definitions loaded in context (~1200 tokens)
2. **Above 10%**: MCP tools deferred, available via MCPSearch on-demand
3. **State file**: `/tmp/claude-mcp-defer-state-{session}.json`

### Best Practices for MCP with Auto-Deferral

1. **Use MCPs early** - Before context fills up
2. **Batch MCP calls** - Multiple queries in one turn
3. **Cache MCP results** - Store retrieved docs in context
4. **Monitor statusline** - Watch for `mcp.deferred: true`

### Checking MCP Deferral State

```bash
cat /tmp/claude-mcp-defer-state-${CLAUDE_SESSION_ID}.json
```


## Related Skills

- `context-compression` - Compression strategies and anchored summarization
- `multi-agent-orchestration` - Context isolation across agents
- `rag-retrieval` - Optimizing retrieved document context
- `prompt-caching` - Reducing redundant context transmission

---

**Version:** 1.0.0 (January 2026)
**Based on:** Context Engineering research, BrowseComp evaluation findings
**Key Metric:** 80% of agent performance variance explained by token usage

## Capability Details

### attention-mechanics
**Keywords:** context window, attention, lost in the middle, token budget
**Solves:**
- Understand lost-in-the-middle effect (high attention at START/END)
- Position critical info strategically
- Optimize tokens-per-task not tokens-per-request

### context-layers
**Keywords:** context anatomy, context structure, five layers
**Solves:**
- Understand 5 context layers (system, tools, docs, history, outputs)
- Implement just-in-time document loading
- Manage tool output truncation

### budget-allocation
**Keywords:** token budget, context budget, allocation
**Solves:**
- Allocate tokens across context layers
- Implement compression triggers at 70% utilization
- Target 50% utilization after compression

Related Skills

extracting-ai-context

from diegosouzapw/awesome-omni-skill

Extracts and manages AI context (skills, AGENTS.md) from workflow-kotlin library JARs. Use when setting up AI tooling for a workflow-kotlin project, updating skills after a library version change, or configuring agent-specific directories.

data-engineering-data-pipeline

from diegosouzapw/awesome-omni-skill

You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing.

create-agent-with-sanity-context

from diegosouzapw/awesome-omni-skill

Build AI agents with structured access to Sanity content via Context MCP. Covers Studio setup, agent implementation, and advanced patterns like client-side tools and custom rendering.

context-optimizer

from diegosouzapw/awesome-omni-skill

Analyzes Copilot Chat debug logs, agent definitions, skills, and instruction files to audit context window utilization. Provides log parsing, turn-cost profiling, redundancy detection, hand-off gap analysis, and optimization recommendations. Use when optimizing agent context efficiency, identifying where to add subagent hand-offs, or reducing token waste across agent systems.

context-fundamentals

from diegosouzapw/awesome-omni-skill

Understand the components, mechanics, and constraints of context in agent systems. Use when designing agent architectures, debugging context-related failures, or optimizing context usage.

context-degradation

from diegosouzapw/awesome-omni-skill

Recognize patterns of context failure: lost-in-middle, poisoning, distraction, and clash

context-assembler

from diegosouzapw/awesome-omni-skill

Assembles relevant context for agent spawns with prioritized ranking. Ranks packages by relevance, enforces token budgets with graduated zones, captures error patterns for learning, and supports configurable per-agent retrieval limits.

Codebase context

from diegosouzapw/awesome-omni-skill

Create a lightweight codebase_context.md that anchors the idea in the existing repo (modules, constraints, extension points). Generic framework prompt.

Build Your Data Engineering Skill

from diegosouzapw/awesome-omni-skill

Create your LLMOps data engineering skill in one prompt, then learn to improve it throughout the chapter

ai-engineering-skill

from diegosouzapw/awesome-omni-skill

Practical guide for building production ML systems based on Chip Huyen's AI Engineering book. Use when users ask about model evaluation, deployment strategies, monitoring, data pipelines, feature engineering, cost optimization, or MLOps. Covers metrics, A/B testing, serving patterns, drift detection, and production best practices.

ai-data-engineering

from diegosouzapw/awesome-omni-skill

Data pipelines, feature stores, and embedding generation for AI/ML systems. Use when building RAG pipelines, ML feature serving, or data transformations. Covers feature stores (Feast, Tecton), embedding pipelines, chunking strategies, orchestration (Dagster, Prefect, Airflow), dbt transformations, data versioning (LakeFS), and experiment tracking (MLflow, W&B).

agent-context-system

from diegosouzapw/awesome-omni-skill

A persistent local-only memory system for AI coding agents. Two files, one idea — AGENTS.md (committed, shared) + .agents.local.md (gitignored, personal). Agents read both at session start, update the scratchpad at session end, and promote stable patterns over time. Works across Claude Code, Cursor, Copilot, Windsurf. Subagent-ready. No plugins, no infrastructure, no background processes.