rag-implementation
Build Retrieval-Augmented Generation (RAG) systems for LLM applications with vector databases and semantic search. Use when implementing knowledge-grounded AI, building document Q&A systems, or integrating LLMs with external knowledge bases.
Best use case
rag-implementation is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using rag-implementation should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/rag-implementation/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How rag-implementation Compares
| Feature / Agent | rag-implementation | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Build Retrieval-Augmented Generation (RAG) systems for LLM applications with vector databases and semantic search. Use when implementing knowledge-grounded AI, building document Q&A systems, or integrating LLMs with external knowledge bases.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# RAG Implementation
Master Retrieval-Augmented Generation (RAG) to build LLM applications that provide accurate, grounded responses using external knowledge sources.
## When to Use This Skill
- Building Q&A systems over proprietary documents
- Creating chatbots with current, factual information
- Implementing semantic search with natural language queries
- Reducing hallucinations with grounded responses
- Enabling LLMs to access domain-specific knowledge
- Building documentation assistants
- Creating research tools with source citation
## Core Components
### 1. Vector Databases
**Purpose**: Store and retrieve document embeddings efficiently
**Options:**
- **Pinecone**: Managed, scalable, fast queries
- **Weaviate**: Open-source, hybrid search
- **Milvus**: High performance, on-premise
- **Chroma**: Lightweight, easy to use
- **Qdrant**: Fast, filtered search
- **FAISS**: Meta's library, local deployment
### 2. Embeddings
**Purpose**: Convert text to numerical vectors for similarity search
**Models:**
- **text-embedding-ada-002** (OpenAI): General purpose, 1536 dims
- **all-MiniLM-L6-v2** (Sentence Transformers): Fast, lightweight
- **e5-large-v2**: High quality, English only (multilingual-e5-large is the multilingual variant)
- **Instructor**: Task-specific instructions
- **bge-large-en-v1.5**: Strong English retrieval performance
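To make "similarity search" concrete, here is a minimal sketch of ranking documents by cosine similarity. The three-dimensional vectors are hypothetical toy values, not output from any of the models listed above, and `top_k` is an illustrative helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy "embeddings" (hypothetical values for illustration only)
doc_vecs = {
    "billing": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "returns": [0.2, 0.7, 0.1],
}
query_vec = [0.0, 1.0, 0.0]

nearest = top_k(query_vec, doc_vecs, k=2)
print(nearest)
```

A vector database performs the same ranking, typically with an approximate index so it does not have to compare the query against every stored vector.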
### 3. Retrieval Strategies
**Approaches:**
- **Dense Retrieval**: Semantic similarity via embeddings
- **Sparse Retrieval**: Keyword matching (BM25, TF-IDF)
- **Hybrid Search**: Combine dense + sparse
- **Multi-Query**: Generate multiple query variations
- **HyDE**: Generate hypothetical documents
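As a rough illustration of sparse retrieval, the sketch below scores documents with a BM25 formula. It assumes whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and a non-negative IDF variant; the function name `bm25_scores` and the sample documents are made up for this example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for a whitespace-tokenized query."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer documents need more matches
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "reset your password from the login page",
    "shipping takes three business days",
    "password rules require twelve characters",
]
scores = bm25_scores("reset password", docs)
print(scores)
```

The first document matches both query terms and scores highest; the second matches neither and scores zero. Unlike dense retrieval, exact keyword overlap is the only signal used here.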
### 4. Reranking
**Purpose**: Improve retrieval quality by reordering results
**Methods:**
- **Cross-Encoders**: BERT-based reranking
- **Cohere Rerank**: API-based reranking
- **Maximal Marginal Relevance (MMR)**: Diversity + relevance
- **LLM-based**: Use LLM to score relevance
## Quick Start
```python
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# 1. Load documents
loader = DirectoryLoader('./docs', glob="**/*.txt")
documents = loader.load()

# 2. Split into chunks (sizes are in characters because length_function=len)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = text_splitter.split_documents(documents)

# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True
)

# 5. Query
result = qa_chain({"query": "What are the main features?"})
print(result['result'])
print(result['source_documents'])
```
## Advanced RAG Patterns
### Pattern 1: Hybrid Search
```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever
# Sparse retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# Dense retriever (embeddings)
embedding_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Combine with weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, embedding_retriever],
    weights=[0.3, 0.7]
)
```
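One common way to merge a sparse ranking with a dense ranking is weighted reciprocal rank fusion (RRF). The sketch below is a standalone illustration of that idea, not LangChain's code; the function name `weighted_rrf`, the constant `c=60`, and the document ids are assumptions for the example.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings: each list adds weight / (c + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # hypothetical keyword results
dense_ranking = ["doc_b", "doc_d", "doc_a"]   # hypothetical embedding results

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.3, 0.7])
print(fused)
```

Because RRF works on ranks rather than raw scores, it avoids having to normalize BM25 scores and cosine similarities onto a shared scale.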
### Pattern 2: Multi-Query Retrieval
```python
from langchain.retrievers.multi_query import MultiQueryRetriever
# Generate multiple query perspectives
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=OpenAI()
)
# Single query → multiple variations → combined results
results = retriever.get_relevant_documents("What is the main topic?")
```
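The "combined results" step can be pictured as an order-preserving union of the hits returned for each query variation. This is a minimal sketch with hypothetical document ids; `merge_unique` is an illustrative name, not a LangChain function.

```python
def merge_unique(result_lists):
    """Concatenate result lists, keeping only the first occurrence of each document."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical hits for three rewordings of the same question
hits_per_variation = [
    ["doc_1", "doc_2"],
    ["doc_2", "doc_3"],
    ["doc_1", "doc_4"],
]

combined = merge_unique(hits_per_variation)
print(combined)
```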
### Pattern 3: Contextual Compression
```python
from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# The compressor uses an LLM to extract the passages relevant to the query
llm = OpenAI()
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever()
)
# Returns only relevant parts of documents
compressed_docs = compression_retriever.get_relevant_documents("query")
```
### Pattern 4: Parent Document Retriever
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Store for parent documents
store = InMemoryStore()

# Small chunks for retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

# Index the raw documents; the retriever splits them into parents and children
retriever.add_documents(documents)
```
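The idea behind this pattern (match on small chunks, return the larger chunk they came from) can be sketched without any library. The example below assumes fixed-width splitting and uses substring matching as a stand-in for vector search; all function names and the sample text are hypothetical.

```python
def split_fixed(text, size):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(documents, parent_size=20, child_size=5):
    """Return (parents, child_to_parent): parent texts by id, and child -> parent links."""
    parents = {}
    child_to_parent = []
    for doc in documents:
        for parent_text in split_fixed(doc, parent_size):
            parent_id = len(parents)
            parents[parent_id] = parent_text
            for child_text in split_fixed(parent_text, child_size):
                child_to_parent.append((child_text, parent_id))
    return parents, child_to_parent

def retrieve_parents(query, parents, child_to_parent):
    """Match small chunks (substring match stands in for vector search), return parents."""
    hit_ids = []
    for child_text, parent_id in child_to_parent:
        if query.lower() in child_text.lower() and parent_id not in hit_ids:
            hit_ids.append(parent_id)
    return [parents[pid] for pid in hit_ids]

documents = ["blue fish swim deep cold wind blow hard "]
parents, child_to_parent = build_index(documents)
context = retrieve_parents("wind", parents, child_to_parent)
print(context)
```

The query matches the five-character child `"wind "`, but the caller receives the full 20-character parent, which gives the LLM more surrounding context.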
## Document Chunking Strategies
### Recursive Character Text Splitter
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Try these in order
)
```
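To show how `chunk_size` and `chunk_overlap` interact, here is a simplified sliding-window splitter. It is only a sketch: unlike `RecursiveCharacterTextSplitter`, it ignores separators and cuts at fixed character offsets, and `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

pieces = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(pieces)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so with the settings above (1000 and 200) a new chunk would begin every 800 characters.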
### Token-Based Splitting
```python
from langchain.text_splitter import TokenTextSplitter

# Sizes here are measured in tokens, not characters
splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50
)
```
### Semantic Chunking
```python
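# SemanticChunker ships in the separate langchain_experimental package
from langchain_experimental.text_splitter import SemanticChunker
from langchain.embeddings import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)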
```
### Markdown Header Splitter
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Each tuple maps a header marker to the metadata key stored on the chunk
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
```
## Vector Store Configurations
### Pinecone
```python
import pinecone
from langchain.vectorstores import Pinecone
pinecone.init(api_key="your-api-key", environment="us-west1-gcp")
index = pinecone.Index("your-index-name")
vectorstore = Pinecone(index, embeddings.embed_query, "text")
```
### Weaviate
```python
import weaviate
from langchain.vectorstores import Weaviate
client = weaviate.Client("http://localhost:8080")
vectorstore = Weaviate(client, "Document", "content", embeddings)
```
### Chroma (Local)
```python
from langchain.vectorstores import Chroma
vectorstore = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
```
## Retrieval Optimization
### 1. Metadata Filtering
```python
# Add metadata during indexing
# determine_category is a placeholder for your own classification function
chunks_with_metadata = []
for i, chunk in enumerate(chunks):
    chunk.metadata = {
        "source": chunk.metadata.get("source"),
        "page": i,  # chunk index, used here as a stand-in for a page number
        "category": determine_category(chunk.page_content)
    }
    chunks_with_metadata.append(chunk)

# Filter during retrieval
results = vectorstore.similarity_search(
    "query",
    filter={"category": "technical"},
    k=5
)
```
### 2. Maximal Marginal Relevance
```python
# Balance relevance with diversity
results = vectorstore.max_marginal_relevance_search(
    "query",
    k=5,
    fetch_k=20,  # Fetch 20, return top 5 diverse
    lambda_mult=0.5  # 0=max diversity, 1=max relevance
)
```
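The effect of `lambda_mult` is easier to see in a small standalone version of the MMR selection loop. This is a sketch under assumed toy vectors, not the vector store's implementation; `mmr`, `cosine`, and the document ids are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k ids, trading query relevance against similarity to earlier picks."""
    candidates = list(doc_vecs)
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(doc_id):
            relevance = cosine(query_vec, doc_vecs[doc_id])
            redundancy = max(
                (cosine(doc_vecs[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy vectors: intro_v1 and intro_v2 are near-duplicates, pricing is different
doc_vecs = {
    "intro_v1": [1.0, 0.02, 0.0],
    "intro_v2": [1.0, 0.05, 0.0],
    "pricing": [0.0, 1.0, 0.0],
}
query_vec = [1.0, 1.0, 0.0]

relevance_only = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)
balanced = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5)
print(relevance_only, balanced)
```

With `lambda_mult=1.0` the two near-duplicate documents are returned together; with `lambda_mult=0.5` the second slot goes to the dissimilar document instead.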
### 3. Reranking with Cross-Encoder
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "What are the main features?"

# Get initial results
candidates = vectorstore.similarity_search(query, k=20)

# Rerank: score each (query, passage) pair jointly
pairs = [[query, doc.page_content] for doc in candidates]
scores = reranker.predict(pairs)

# Sort by score and take top k
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]
```
## Prompt Engineering for RAG
### Contextual Prompt
```python
prompt_template = """Use the following context to answer the question. If you cannot answer based on the context, say "I don't have enough information."
Context:
{context}
Question: {question}
Answer:"""
```
### With Citations
```python
prompt_template = """Answer the question based on the context below. Include citations using [1], [2], etc.
Context:
{context}
Question: {question}
Answer (with citations):"""
```
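For the citation markers to be meaningful, the retrieved passages need to be numbered when they are placed into `{context}`. The sketch below shows one way to do that; `build_cited_prompt` and the sample passages are hypothetical.

```python
PROMPT_TEMPLATE = """Answer the question based on the context below. Include citations using [1], [2], etc.
Context:
{context}
Question: {question}
Answer (with citations):"""

def build_cited_prompt(question, passages):
    """Number each passage so the model can cite it as [1], [2], ..."""
    context = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)

passages = [
    "Refunds are issued within 14 days.",
    "Shipping takes 3 to 5 business days.",
]
prompt = build_cited_prompt("How long do refunds take?", passages)
print(prompt)
```

Keeping the same numbering when displaying sources lets a reader map each `[n]` in the answer back to the passage that supports it.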
### With Confidence
```python
prompt_template = """Answer the question using the context. Provide a confidence score (0-100%) for your answer.
Context:
{context}
Question: {question}
Answer:
Confidence:"""
```
## Evaluation Metrics
```python
def evaluate_rag_system(qa_chain, test_cases):
    """Average three metrics over test cases.

    calculate_accuracy, evaluate_retrieved_docs, and check_groundedness
    are placeholders for scoring functions you supply.
    """
    metrics = {
        'accuracy': [],
        'retrieval_quality': [],
        'groundedness': []
    }
    for test in test_cases:
        result = qa_chain({"query": test['question']})

        # Check if answer matches expected
        accuracy = calculate_accuracy(result['result'], test['expected'])
        metrics['accuracy'].append(accuracy)

        # Check if relevant docs were retrieved
        retrieval_quality = evaluate_retrieved_docs(
            result['source_documents'],
            test['relevant_docs']
        )
        metrics['retrieval_quality'].append(retrieval_quality)

        # Check if answer is grounded in context
        groundedness = check_groundedness(
            result['result'],
            result['source_documents']
        )
        metrics['groundedness'].append(groundedness)

    # Mean of each metric across all test cases
    return {k: sum(v)/len(v) for k, v in metrics.items()}
```
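The three scoring helpers used above are not defined in this document. As a starting point, they could be approximated with simple token-overlap heuristics like the ones below. These are assumptions, not established metrics for this skill: the sketch works on plain strings and document ids rather than LangChain `Document` objects, and production systems often use LLM-based or human grading instead.

```python
import re
from collections import Counter

def _tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def calculate_accuracy(answer, expected):
    """Token-level F1 between the generated answer and the expected answer."""
    answer_counts, expected_counts = Counter(_tokens(answer)), Counter(_tokens(expected))
    overlap = sum((answer_counts & expected_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(answer_counts.values())
    recall = overlap / sum(expected_counts.values())
    return 2 * precision * recall / (precision + recall)

def evaluate_retrieved_docs(retrieved_ids, relevant_ids):
    """Recall: fraction of the relevant document ids that were retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0
    return len(relevant & set(retrieved_ids)) / len(relevant)

def check_groundedness(answer, context_texts):
    """Fraction of answer tokens that also appear somewhere in the context."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set(_tokens(" ".join(context_texts)))
    return sum(t in context_tokens for t in answer_tokens) / len(answer_tokens)

print(calculate_accuracy("Paris is the capital", "Paris"))
print(evaluate_retrieved_docs(["a", "b"], ["a", "c"]))
print(check_groundedness("the cat sat", ["The cat slept"]))
```

Token overlap is cheap and deterministic, but it rewards surface matches; treat these scores as a regression signal rather than a measure of answer correctness.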
## Resources
- **references/vector-databases.md**: Detailed comparison of vector DBs
- **references/embeddings.md**: Embedding model selection guide
- **references/retrieval-strategies.md**: Advanced retrieval techniques
- **references/reranking.md**: Reranking methods and when to use them
- **references/context-window.md**: Managing context limits
- **assets/vector-store-config.yaml**: Configuration templates
- **assets/retriever-pipeline.py**: Complete RAG pipeline
- **assets/embedding-models.md**: Model comparison and benchmarks
## Best Practices
1. **Chunk Size**: Balance between context and specificity (500-1000 tokens)
2. **Overlap**: Use 10-20% overlap to preserve context at boundaries
3. **Metadata**: Include source, page, timestamp for filtering and debugging
4. **Hybrid Search**: Combine semantic and keyword search for best results
5. **Reranking**: Improve top results with cross-encoder
6. **Citations**: Always return source documents for transparency
7. **Evaluation**: Continuously test retrieval quality and answer accuracy
8. **Monitoring**: Track retrieval metrics in production
## Common Issues
- **Poor Retrieval**: Check embedding quality, chunk size, query formulation
- **Irrelevant Results**: Add metadata filtering, use hybrid search, rerank
- **Missing Information**: Ensure documents are properly indexed
- **Slow Queries**: Optimize vector store, use caching, reduce k
- **Hallucinations**: Improve grounding prompt, add verification step