llm-app-patterns
Production-ready patterns for building LLM applications, inspired by [Dify](https://github.com/langgenius/dify) and industry best practices.
Best use case
llm-app-patterns is best used when you need a repeatable AI agent workflow instead of a one-off prompt. It is especially useful for teams that want production-ready patterns for building LLM applications, inspired by [Dify](https://github.com/langgenius/dify) and industry best practices.
Users should expect more consistent workflow output, faster repeated execution, and less time spent rewriting prompts from scratch.
Practical example
Example input
Use the "llm-app-patterns" skill to help with this workflow task. Context: Production-ready patterns for building LLM applications, inspired by [Dify](https://github.com/langgenius/dify) and industry best practices.
Example output
A structured workflow result with clearer steps, more consistent formatting, and an output that is easier to reuse in the next run.
When to use this skill
- Use this skill when you want a reusable workflow rather than writing the same prompt again and again.
When not to use this skill
- Do not use this when you only need a one-off answer and do not need a reusable workflow.
- Do not use it if you cannot install or maintain the related files, repository context, or supporting tools.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/llm-app-patterns/SKILL.md` inside your project
- Restart your AI agent — it will auto-discover the skill
How llm-app-patterns Compares
| Feature | llm-app-patterns | Standard Approach |
|---|---|---|
| Platform Support | Claude Code, Cursor, Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Low (single SKILL.md file) | N/A |
Frequently Asked Questions
What does this skill do?
Production-ready patterns for building LLM applications, inspired by [Dify](https://github.com/langgenius/dify) and industry best practices.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Startups
Explore AI agent skills for startup validation, product research, growth experiments, documentation, and fast execution with small teams.
AI Agent for Product Research
Browse AI agent skills for product research, competitive analysis, customer discovery, and structured product decision support.
AI Agent for SaaS Idea Validation
Use AI agent skills for SaaS idea validation, market research, customer discovery, competitor analysis, and documenting startup hypotheses.
SKILL.md Source
# 🤖 LLM Application Patterns
> Production-ready patterns for building LLM applications, inspired by [Dify](https://github.com/langgenius/dify) and industry best practices.
## When to Use This Skill
Use this skill when:
- Designing LLM-powered applications
- Implementing RAG (Retrieval-Augmented Generation)
- Building AI agents with tools
- Setting up LLMOps monitoring
- Choosing between agent architectures
---
## 1. RAG Pipeline Architecture
### Overview
RAG (Retrieval-Augmented Generation) grounds LLM responses in your data.
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Ingest    │────▶│  Retrieve   │────▶│  Generate   │
│  Documents  │     │   Context   │     │  Response   │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
 ┌─────────┐         ┌───────────┐       ┌───────────┐
 │ Chunking│         │  Vector   │       │    LLM    │
 │Embedding│         │  Search   │       │ + Context │
 └─────────┘         └───────────┘       └───────────┘
```
### 1.1 Document Ingestion
```python
# Chunking strategies
class ChunkingStrategy:
    # Fixed-size chunks (simple but may break context)
    FIXED_SIZE = "fixed_size"  # e.g., 512 tokens

    # Semantic chunking (preserves meaning)
    SEMANTIC = "semantic"  # Split on paragraphs/sections

    # Recursive splitting (tries multiple separators)
    RECURSIVE = "recursive"  # ["\n\n", "\n", " ", ""]

    # Document-aware (respects structure)
    DOCUMENT_AWARE = "document_aware"  # Headers, lists, etc.

# Recommended settings
CHUNK_CONFIG = {
    "chunk_size": 512,      # tokens
    "chunk_overlap": 50,    # token overlap between chunks
    "separators": ["\n\n", "\n", ". ", " "],
}
```
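As an illustration of the recursive strategy, here is a minimal splitter sketch. It uses character counts as a stand-in for tokens (an assumption for brevity; a production version would measure lengths with the model's tokenizer):

```python
def recursive_split(text: str, chunk_size: int = 512,
                    separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split by the coarsest separator that works, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) == 1:
            continue  # separator not present; try a finer one
        chunks, current = [], ""
        for part in parts:
            candidate = f"{current}{sep}{part}" if current else part
            if len(candidate) <= chunk_size:
                current = candidate  # keep packing parts into this chunk
            else:
                if current:
                    chunks.append(current)
                # A single part may still be too large: recurse with finer separators
                chunks.extend(recursive_split(part, chunk_size, separators))
                current = ""
        if current:
            chunks.append(current)
        return chunks
    # No separator matched at all: hard split by size
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```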
### 1.2 Embedding & Storage
```python
# Vector database selection
VECTOR_DB_OPTIONS = {
    "pinecone": {
        "use_case": "Production, managed service",
        "scale": "Billions of vectors",
        "features": ["Hybrid search", "Metadata filtering"]
    },
    "weaviate": {
        "use_case": "Self-hosted, multi-modal",
        "scale": "Millions of vectors",
        "features": ["GraphQL API", "Modules"]
    },
    "chromadb": {
        "use_case": "Development, prototyping",
        "scale": "Thousands of vectors",
        "features": ["Simple API", "In-memory option"]
    },
    "pgvector": {
        "use_case": "Existing Postgres infrastructure",
        "scale": "Millions of vectors",
        "features": ["SQL integration", "ACID compliance"]
    }
}

# Embedding model selection
EMBEDDING_MODELS = {
    "openai/text-embedding-3-small": {
        "dimensions": 1536,
        "cost": "$0.02/1M tokens",
        "quality": "Good for most use cases"
    },
    "openai/text-embedding-3-large": {
        "dimensions": 3072,
        "cost": "$0.13/1M tokens",
        "quality": "Best for complex queries"
    },
    "local/bge-large": {
        "dimensions": 1024,
        "cost": "Free (compute only)",
        "quality": "Comparable to OpenAI small"
    }
}
```
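As a concrete starting point, here is a minimal ingestion-and-query sketch using ChromaDB (the prototyping option above). It relies on ChromaDB's default embedding function; the chunk text and collection name are placeholders:

```python
import chromadb

client = chromadb.Client()  # in-memory; persists nothing between runs
collection = client.create_collection(name="docs")

# Assume `chunks` came out of the chunking step above
chunks = [
    "Employees accrue 1.5 vacation days per month...",
    "Remote work requests go through your manager...",
]

collection.add(
    documents=chunks,  # embedded with ChromaDB's default embedding function
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    metadatas=[{"source": "handbook", "position": i} for i in range(len(chunks))],
)

results = collection.query(query_texts=["What is the vacation policy?"], n_results=2)
print(results["documents"][0])  # top-matching chunks for the first query
```

Swapping in Pinecone, Weaviate, or pgvector later mostly means changing this ingestion layer; the retrieval strategies below stay the same.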
### 1.3 Retrieval Strategies
```python
# Basic semantic search
def semantic_search(query: str, top_k: int = 5):
    query_embedding = embed(query)
    results = vector_db.similarity_search(
        query_embedding,
        top_k=top_k
    )
    return results

# Hybrid search (semantic + keyword)
def hybrid_search(query: str, top_k: int = 5, alpha: float = 0.5):
    """
    alpha=1.0: Pure semantic
    alpha=0.0: Pure keyword (BM25)
    alpha=0.5: Balanced
    """
    semantic_results = vector_db.similarity_search(query)
    keyword_results = bm25_search(query)
    # Reciprocal Rank Fusion
    return rrf_merge(semantic_results, keyword_results, alpha)

# Multi-query retrieval
def multi_query_retrieval(query: str):
    """Generate multiple query variations for better recall"""
    queries = llm.generate_query_variations(query, n=3)
    all_results = []
    for q in queries:
        all_results.extend(semantic_search(q))
    return deduplicate(all_results)

# Contextual compression
def compressed_retrieval(query: str):
    """Retrieve then compress to relevant parts only"""
    docs = semantic_search(query, top_k=10)
    compressed = llm.extract_relevant_parts(docs, query)
    return compressed
```
```
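The `rrf_merge` call above is left undefined; here is a minimal Reciprocal Rank Fusion sketch. It assumes each result exposes a stable `id` attribute, and the `alpha` weighting between the two lists is this document's extension (standard RRF weights both lists equally):

```python
def rrf_merge(semantic_results, keyword_results, alpha: float = 0.5, k: int = 60):
    """Fuse two ranked lists; k=60 is the conventional RRF smoothing constant."""
    scores: dict = {}
    for rank, doc in enumerate(semantic_results):
        scores[doc.id] = scores.get(doc.id, 0.0) + alpha / (k + rank + 1)
    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0.0) + (1 - alpha) / (k + rank + 1)
    # Map ids back to documents, preferring whichever list saw them first
    by_id = {doc.id: doc for doc in semantic_results + keyword_results}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [by_id[doc_id] for doc_id, _ in ranked]
```

RRF works on ranks rather than raw scores, which is why it fuses cosine similarities and BM25 scores without any score normalization.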
### 1.4 Generation with Context
```python
RAG_PROMPT_TEMPLATE = """
Answer the user's question based ONLY on the following context.
If the context doesn't contain enough information, say "I don't have enough information to answer that."
Context:
{context}
Question: {question}
Answer:"""
def generate_with_rag(question: str):
    # Retrieve
    context_docs = hybrid_search(question, top_k=5)
    context = "\n\n".join([doc.content for doc in context_docs])

    # Generate
    prompt = RAG_PROMPT_TEMPLATE.format(
        context=context,
        question=question
    )
    response = llm.generate(prompt)

    # Return with citations
    return {
        "answer": response,
        "sources": [doc.metadata for doc in context_docs]
    }
```
---
## 2. Agent Architectures
### 2.1 ReAct Pattern (Reasoning + Acting)
```
Thought: I need to search for information about X
Action: search("X")
Observation: [search results]
Thought: Based on the results, I should...
Action: calculate(...)
Observation: [calculation result]
Thought: I now have enough information
Action: final_answer("The answer is...")
```
```python
REACT_PROMPT = """
You are an AI assistant that can use tools to answer questions.
Available tools:
{tools_description}
Use this format:
Thought: [your reasoning about what to do next]
Action: [tool_name(arguments)]
Observation: [tool result - this will be filled in]
... (repeat Thought/Action/Observation as needed)
Thought: I have enough information to answer
Final Answer: [your final response]
Question: {question}
"""
class ReActAgent:
    def __init__(self, tools: list, llm):
        self.tools = {t.name: t for t in tools}
        self.llm = llm
        self.max_iterations = 10

    def run(self, question: str) -> str:
        prompt = REACT_PROMPT.format(
            tools_description=self._format_tools(),
            question=question
        )
        for _ in range(self.max_iterations):
            response = self.llm.generate(prompt)
            if "Final Answer:" in response:
                return self._extract_final_answer(response)
            action = self._parse_action(response)
            observation = self._execute_tool(action)
            # Append the model's Thought/Action AND the tool result, so the
            # next iteration sees the full trajectory so far
            prompt += f"\n{response}\nObservation: {observation}\n"
        return "Max iterations reached"
```
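`_parse_action` is referenced but not shown. A minimal regex parser for the `tool_name(arguments)` format in the prompt might look like this (a sketch; it assumes a single string argument per call):

```python
import re

def parse_action(response: str) -> tuple[str, str]:
    """Extract the last 'Action: tool_name(arguments)' line from model output."""
    matches = re.findall(r"Action:\s*(\w+)\((.*)\)", response)
    if not matches:
        raise ValueError(f"No parseable action in response: {response[:200]}")
    tool_name, raw_args = matches[-1]  # take the most recent action
    return tool_name, raw_args.strip().strip("\"'")
```

Parsing failures are common with free-form ReAct output, which is one reason the function-calling pattern below is usually preferred when the provider supports it.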
### 2.2 Function Calling Pattern
```python
# Define tools as functions with schemas
TOOLS = [
    {
        "name": "search_web",
        "description": "Search the web for current information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query"
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "calculate",
        "description": "Perform mathematical calculations",
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "Math expression to evaluate"
                }
            },
            "required": ["expression"]
        }
    }
]
class FunctionCallingAgent:
    def run(self, question: str) -> str:
        messages = [{"role": "user", "content": question}]
        while True:
            response = self.llm.chat(
                messages=messages,
                tools=TOOLS,
                tool_choice="auto"
            )
            if response.tool_calls:
                # Echo the assistant's tool-call message back into the history
                # so the model can match results to calls (assumes the client
                # exposes it as `response.message`)
                messages.append(response.message)
                for tool_call in response.tool_calls:
                    result = self._execute_tool(
                        tool_call.name,
                        tool_call.arguments
                    )
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tool_call.id,
                        "content": str(result)
                    })
            else:
                return response.content
```
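`_execute_tool` can be a simple dispatch table mapping tool names to plain Python functions. A sketch, assuming the provider returns arguments as a JSON string (the `search_web` and `calculate` bodies here are placeholders, not real implementations):

```python
import json

def search_web(query: str) -> str:
    ...  # call your search API here

def calculate(expression: str) -> str:
    # eval() is unsafe on untrusted input; use a real expression parser in production
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOL_REGISTRY = {"search_web": search_web, "calculate": calculate}

def execute_tool(name: str, arguments: str):
    kwargs = json.loads(arguments)  # arguments arrive as a JSON-encoded string
    return TOOL_REGISTRY[name](**kwargs)
```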
### 2.3 Plan-and-Execute Pattern
```python
class PlanAndExecuteAgent:
    """
    1. Create a plan (list of steps)
    2. Execute each step
    3. Replan if needed
    """
    def run(self, task: str) -> str:
        # Planning phase
        plan = self.planner.create_plan(task)
        # Returns: ["Step 1: ...", "Step 2: ...", ...]

        results = []
        i = 0
        # Use an index-based loop so replanning can actually swap in new steps
        # (a `for step in plan` loop would keep iterating the original list)
        while i < len(plan):
            result = self.executor.execute(plan[i], context=results)
            results.append(result)
            i += 1

            # Check if replan needed
            if self._needs_replan(task, results):
                # Assumes replan returns a fresh full plan, completed steps first
                plan = self.planner.replan(
                    task,
                    completed=results,
                    remaining=plan[i:]
                )
                i = len(results)

        # Synthesize final answer
        return self.synthesizer.summarize(task, results)
```
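The planner itself can be a single LLM call that returns a numbered list, parsed into steps. A sketch, following the pseudo `llm.generate` interface used throughout this document (the prompt wording is an assumption):

```python
PLANNER_PROMPT = """Break the following task into a short numbered list of
concrete steps. One step per line, no extra commentary.

Task: {task}

Steps:"""

def create_plan(task: str) -> list[str]:
    response = llm.generate(PLANNER_PROMPT.format(task=task))
    steps = []
    for line in response.splitlines():
        line = line.strip()
        if line and line[0].isdigit():
            # Strip the leading "1." / "2)" numbering
            steps.append(line.lstrip("0123456789.) ").strip())
    return steps
```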
### 2.4 Multi-Agent Collaboration
```python
class AgentTeam:
    """
    Specialized agents collaborating on complex tasks
    """
    def __init__(self):
        self.agents = {
            "researcher": ResearchAgent(),
            "analyst": AnalystAgent(),
            "writer": WriterAgent(),
            "critic": CriticAgent()
        }
        self.coordinator = CoordinatorAgent()

    def solve(self, task: str) -> str:
        # Coordinator assigns subtasks
        assignments = self.coordinator.decompose(task)

        results = {}
        for assignment in assignments:
            agent = self.agents[assignment.agent]
            result = agent.execute(
                assignment.subtask,
                context=results
            )
            results[assignment.id] = result

        # Critic reviews
        critique = self.agents["critic"].review(results)
        if critique.needs_revision:
            # Iterate with feedback
            return self.solve_with_feedback(task, results, critique)

        return self.coordinator.synthesize(results)
```
---
## 3. Prompt IDE Patterns
### 3.1 Prompt Templates with Variables
```python
class PromptTemplate:
    def __init__(self, template: str, variables: list[str]):
        self.template = template
        self.variables = variables

    def format(self, **kwargs) -> str:
        # Validate all variables provided
        missing = set(self.variables) - set(kwargs.keys())
        if missing:
            raise ValueError(f"Missing variables: {missing}")
        return self.template.format(**kwargs)

    def with_examples(self, examples: list[dict]) -> str:
        """Add few-shot examples"""
        example_text = "\n\n".join([
            f"Input: {ex['input']}\nOutput: {ex['output']}"
            for ex in examples
        ])
        return f"{example_text}\n\n{self.template}"

# Usage
summarizer = PromptTemplate(
    template="Summarize the following text in {style} style:\n\n{text}",
    variables=["style", "text"]
)
prompt = summarizer.format(
    style="professional",
    text="Long article content..."
)
```
### 3.2 Prompt Versioning & A/B Testing
```python
import hashlib
from datetime import datetime

class PromptRegistry:
    def __init__(self, db):
        self.db = db

    def register(self, name: str, template: str, version: str):
        """Store prompt with version"""
        self.db.save({
            "name": name,
            "template": template,
            "version": version,
            "created_at": datetime.now(),
            "metrics": {}
        })

    def get(self, name: str, version: str = "latest") -> str:
        """Retrieve specific version"""
        return self.db.get(name, version)

    def ab_test(self, name: str, user_id: str) -> str:
        """Return variant based on user bucket"""
        variants = self.db.get_all_versions(name)
        # Use a stable hash: Python's built-in hash() varies between processes,
        # which would re-bucket users on every restart
        digest = hashlib.sha256(user_id.encode()).hexdigest()
        bucket = int(digest, 16) % len(variants)
        return variants[bucket]

    def record_outcome(self, prompt_id: str, outcome: dict):
        """Track prompt performance"""
        self.db.update_metrics(prompt_id, outcome)
```
### 3.3 Prompt Chaining
```python
class PromptChain:
    """
    Chain prompts together, passing output as input to next
    """
    def __init__(self, steps: list[dict]):
        self.steps = steps

    def run(self, initial_input: str) -> dict:
        context = {"input": initial_input}
        results = []

        for step in self.steps:
            prompt = step["prompt"].format(**context)
            output = llm.generate(prompt)

            # Parse output if needed
            if step.get("parser"):
                output = step["parser"](output)

            context[step["output_key"]] = output
            results.append({
                "step": step["name"],
                "output": output
            })

        return {
            "final_output": context[self.steps[-1]["output_key"]],
            "intermediate_results": results
        }

# Example: Research → Analyze → Summarize
chain = PromptChain([
    {
        "name": "research",
        "prompt": "Research the topic: {input}",
        "output_key": "research"
    },
    {
        "name": "analyze",
        "prompt": "Analyze these findings:\n{research}",
        "output_key": "analysis"
    },
    {
        "name": "summarize",
        "prompt": "Summarize this analysis in 3 bullet points:\n{analysis}",
        "output_key": "summary"
    }
])
```
---
## 4. LLMOps & Observability
### 4.1 Metrics to Track
```python
LLM_METRICS = {
    # Performance
    "latency_p50": "50th percentile response time",
    "latency_p99": "99th percentile response time",
    "tokens_per_second": "Generation speed",

    # Quality
    "user_satisfaction": "Thumbs up/down ratio",
    "task_completion": "% tasks completed successfully",
    "hallucination_rate": "% responses with factual errors",

    # Cost
    "cost_per_request": "Average $ per API call",
    "tokens_per_request": "Average tokens used",
    "cache_hit_rate": "% requests served from cache",

    # Reliability
    "error_rate": "% failed requests",
    "timeout_rate": "% requests that timed out",
    "retry_rate": "% requests needing retry"
}
```
### 4.2 Logging & Tracing
```python
import json
import logging
from datetime import datetime

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

class LLMLogger:
    def log_request(self, request_id: str, data: dict):
        """Log LLM request for debugging and analysis"""
        log_entry = {
            "request_id": request_id,
            "timestamp": datetime.now().isoformat(),
            "model": data["model"],
            "prompt": data["prompt"][:500],  # Truncate for storage
            "prompt_tokens": data["prompt_tokens"],
            "temperature": data.get("temperature", 1.0),
            "user_id": data.get("user_id"),
        }
        logging.info(f"LLM_REQUEST: {json.dumps(log_entry)}")

    def log_response(self, request_id: str, data: dict):
        """Log LLM response"""
        log_entry = {
            "request_id": request_id,
            "completion_tokens": data["completion_tokens"],
            "total_tokens": data["total_tokens"],
            "latency_ms": data["latency_ms"],
            "finish_reason": data["finish_reason"],
            "cost_usd": self._calculate_cost(data),
        }
        logging.info(f"LLM_RESPONSE: {json.dumps(log_entry)}")

# Distributed tracing
@tracer.start_as_current_span("llm_call")
def call_llm(prompt: str) -> str:
    span = trace.get_current_span()
    span.set_attribute("prompt.length", len(prompt))
    response = llm.generate(prompt)
    span.set_attribute("response.length", len(response.content))
    span.set_attribute("tokens.total", response.usage.total_tokens)
    return response.content
```
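`_calculate_cost` above is left undefined. A sketch with a static price table follows; the per-token prices are illustrative only (check your provider's current pricing), and it assumes the log record carries `model` and `prompt_tokens` alongside `completion_tokens`:

```python
# Illustrative $ per 1K tokens -- NOT current list prices
PRICE_PER_1K = {
    "gpt-4-turbo": {"prompt": 0.01, "completion": 0.03},
    "gpt-3.5-turbo": {"prompt": 0.0005, "completion": 0.0015},
}

def calculate_cost(data: dict) -> float:
    """Estimate request cost in USD from token counts and a price table."""
    prices = PRICE_PER_1K.get(data["model"], {"prompt": 0.0, "completion": 0.0})
    cost = (data["prompt_tokens"] / 1000 * prices["prompt"]
            + data["completion_tokens"] / 1000 * prices["completion"])
    return round(cost, 6)
```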
### 4.3 Evaluation Framework
```python
class LLMEvaluator:
    """
    Evaluate LLM outputs for quality
    """
    def evaluate_response(self,
                          question: str,
                          response: str,
                          ground_truth: str = None) -> dict:
        scores = {}

        # Relevance: Does it answer the question?
        scores["relevance"] = self._score_relevance(question, response)

        # Coherence: Is it well-structured?
        scores["coherence"] = self._score_coherence(response)

        # Groundedness: Is it based on provided context?
        scores["groundedness"] = self._score_groundedness(response)

        # Accuracy: Does it match ground truth?
        if ground_truth:
            scores["accuracy"] = self._score_accuracy(response, ground_truth)

        # Harmfulness: Is it safe?
        scores["safety"] = self._score_safety(response)

        return scores

    def run_benchmark(self, test_cases: list[dict]) -> dict:
        """Run evaluation on test set"""
        results = []
        for case in test_cases:
            response = llm.generate(case["prompt"])
            scores = self.evaluate_response(
                question=case["prompt"],
                response=response,
                ground_truth=case.get("expected")
            )
            results.append(scores)
        return self._aggregate_scores(results)
```
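The individual `_score_*` methods are not shown. One common approach is LLM-as-judge: ask a strong model to grade the output on a fixed scale. A relevance-scoring sketch (the judge prompt wording is an assumption):

```python
JUDGE_PROMPT = """Rate how well the answer addresses the question on a scale
of 1 (irrelevant) to 5 (fully answers it). Reply with the number only.

Question: {question}
Answer: {response}

Score:"""

def score_relevance(question: str, response: str) -> float:
    raw = llm.generate(JUDGE_PROMPT.format(question=question, response=response))
    try:
        return float(raw.strip()) / 5.0  # normalize to a 0-1 score
    except ValueError:
        return 0.0  # an unparseable judgment counts as a failed evaluation
```

LLM judges are convenient but biased (they favor longer, more confident answers), so spot-check them against human labels before trusting benchmark numbers.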
---
## 5. Production Patterns
### 5.1 Caching Strategy
```python
import hashlib
import json

class LLMCache:
    def __init__(self, redis_client, ttl_seconds=3600):
        self.redis = redis_client
        self.ttl = ttl_seconds

    def _cache_key(self, prompt: str, model: str, **kwargs) -> str:
        """Generate deterministic cache key"""
        content = f"{model}:{prompt}:{json.dumps(kwargs, sort_keys=True)}"
        return hashlib.sha256(content.encode()).hexdigest()

    def get_or_generate(self, prompt: str, model: str, **kwargs) -> str:
        key = self._cache_key(prompt, model, **kwargs)

        # Check cache
        cached = self.redis.get(key)
        if cached:
            return cached.decode()

        # Generate
        response = llm.generate(prompt, model=model, **kwargs)

        # Cache (only cache deterministic outputs)
        if kwargs.get("temperature", 1.0) == 0:
            self.redis.setex(key, self.ttl, response)

        return response
```
### 5.2 Rate Limiting & Retry
```python
import time
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

class RateLimiter:
    def __init__(self, requests_per_minute: int):
        self.rpm = requests_per_minute
        self.timestamps = []

    def acquire(self):
        """Wait if rate limit would be exceeded"""
        now = time.time()
        # Remove timestamps older than the 60-second window
        self.timestamps = [t for t in self.timestamps if now - t < 60]
        if len(self.timestamps) >= self.rpm:
            sleep_time = 60 - (now - self.timestamps[0])
            time.sleep(sleep_time)
        self.timestamps.append(time.time())

class RetryableLLMError(Exception):
    """Raised for failures worth retrying (rate limits, server errors)."""

# Retry with exponential backoff -- but only for retryable failures;
# a bare `raise` would otherwise retry client errors too
@retry(
    wait=wait_exponential(multiplier=1, min=4, max=60),
    stop=stop_after_attempt(5),
    retry=retry_if_exception_type(RetryableLLMError)
)
def call_llm_with_retry(prompt: str) -> str:
    try:
        return llm.generate(prompt)
    except RateLimitError as e:
        raise RetryableLLMError() from e  # Will trigger retry
    except APIError as e:
        if e.status_code >= 500:
            raise RetryableLLMError() from e  # Retry server errors
        raise  # Don't retry client errors
```
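The two pieces compose naturally; a usage sketch (the 60 rpm budget is an arbitrary example):

```python
limiter = RateLimiter(requests_per_minute=60)

def safe_generate(prompt: str) -> str:
    limiter.acquire()  # block until a request slot is free
    return call_llm_with_retry(prompt)
```

Note that retries inside `call_llm_with_retry` bypass the limiter; for strict budgets, call `limiter.acquire()` inside the retried function instead.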
### 5.3 Fallback Strategy
```python
import logging

class LLMWithFallback:
    def __init__(self, primary: str, fallbacks: list[str]):
        self.primary = primary
        self.fallbacks = fallbacks

    def generate(self, prompt: str, **kwargs) -> str:
        models = [self.primary] + self.fallbacks
        for model in models:
            try:
                return llm.generate(prompt, model=model, **kwargs)
            except (RateLimitError, APIError) as e:
                logging.warning(f"Model {model} failed: {e}")
                continue
        raise AllModelsFailedError("All models exhausted")

# Usage
llm_client = LLMWithFallback(
    primary="gpt-4-turbo",
    fallbacks=["gpt-3.5-turbo", "claude-3-sonnet"]
)
```
---
## Architecture Decision Matrix
| Pattern | Use When | Complexity | Cost |
| :------------------- | :--------------- | :--------- | :-------- |
| **Simple RAG** | FAQ, docs search | Low | Low |
| **Hybrid RAG** | Mixed queries | Medium | Medium |
| **ReAct Agent** | Multi-step tasks | Medium | Medium |
| **Function Calling** | Structured tools | Low | Low |
| **Plan-Execute** | Complex tasks | High | High |
| **Multi-Agent** | Research tasks | Very High | Very High |
---
## Resources
- [Dify Platform](https://github.com/langgenius/dify)
- [LangChain Docs](https://python.langchain.com/)
- [LlamaIndex](https://www.llamaindex.ai/)
- [Anthropic Cookbook](https://github.com/anthropics/anthropic-cookbook)

Related Skills
multi-agent-patterns
This skill should be used when the user asks to "design multi-agent system", "implement supervisor pattern", "create swarm architecture", "coordinate multiple agents", or mentions multi-agent patterns, context isolation, agent handoffs, sub-agents, or parallel agent execution.
modern-javascript-patterns
Comprehensive guide for mastering modern JavaScript (ES6+) features, functional programming patterns, and best practices for writing clean, maintainable, and performant code.
microservices-patterns
Master microservices architecture patterns including service boundaries, inter-service communication, data management, and resilience patterns for building distributed systems.
javascript-testing-patterns
Comprehensive guide for implementing robust testing strategies in JavaScript/TypeScript applications using modern testing frameworks and best practices.
error-handling-patterns
Build resilient applications with robust error handling strategies that gracefully handle failures and provide excellent debugging experiences.
e2e-testing-patterns
Build reliable, fast, and maintainable end-to-end test suites that provide confidence to ship code quickly and catch regressions before users do.
dbt-transformation-patterns
Production-ready patterns for dbt (data build tool) including model organization, testing strategies, documentation, and incremental processing.
binary-analysis-patterns
Comprehensive patterns and techniques for analyzing compiled binaries, understanding assembly code, and reconstructing program logic.
bash-defensive-patterns
Master defensive Bash programming techniques for production-grade scripts. Use when writing robust shell scripts, CI/CD pipelines, or system utilities requiring fault tolerance and safety.
zapier-make-patterns
No-code automation democratizes workflow building. Zapier and Make (formerly Integromat) let non-developers automate business processes without writing code. But no-code doesn't mean no-complexity - these platforms have their own patterns, pitfalls, and breaking points.
workflow-patterns
Use this skill when implementing tasks according to Conductor's TDD workflow, handling phase checkpoints, managing git commits for tasks, or understanding the verification protocol.
workflow-orchestration-patterns
Master workflow orchestration architecture with Temporal, covering fundamental design decisions, resilience patterns, and best practices for building reliable distributed systems.