rag-methodology-guide

RAG architecture for academic knowledge retrieval and synthesis

191 stars

Best use case

rag-methodology-guide is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

RAG architecture for academic knowledge retrieval and synthesis

Teams using rag-methodology-guide should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/rag-methodology-guide/SKILL.md --create-dirs "https://raw.githubusercontent.com/wentorai/research-plugins/main/skills/tools/knowledge-graph/rag-methodology-guide/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/rag-methodology-guide/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How rag-methodology-guide Compares

Feature / Agentrag-methodology-guideStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

RAG architecture for academic knowledge retrieval and synthesis

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# RAG Methodology Guide

Design and implement Retrieval-Augmented Generation (RAG) systems for academic research, including document chunking, embedding strategies, retrieval pipelines, and evaluation.

## What Is RAG?

Retrieval-Augmented Generation (RAG) augments a language model's generation with relevant information retrieved from an external knowledge base. For academic research, this enables:

- Question answering over a personal paper library
- Literature synthesis across hundreds of papers
- Fact-checking claims against source documents
- Generating citations with provenance

### RAG Pipeline Architecture

```
Query: "What are the main challenges of protein folding?"
    |
    v
[1. Query Processing]
    |-- Embed query using embedding model
    |-- Optional: Query expansion / HyDE
    |
    v
[2. Retrieval]
    |-- Search vector database for top-k relevant chunks
    |-- Optional: Reranking with cross-encoder
    |
    v
[3. Context Assembly]
    |-- Combine retrieved chunks into a prompt
    |-- Add metadata (source, page, citation)
    |
    v
[4. Generation]
    |-- LLM generates answer grounded in retrieved context
    |-- Include inline citations
    |
    v
Answer with citations
```

## Step 1: Document Ingestion and Chunking

### Chunking Strategies

| Strategy | Description | Best For |
|----------|-------------|----------|
| **Fixed-size** | Split every N characters/tokens | Simple, fast, baseline |
| **Sentence-based** | Split on sentence boundaries | Natural reading units |
| **Paragraph-based** | Split on paragraph breaks | Coherent semantic units |
| **Section-based** | Split on document headings | Academic papers |
| **Recursive** | Hierarchically split (heading > paragraph > sentence) | General purpose |
| **Semantic** | Split on topic shifts using embeddings | Best quality, slower |

### Implementation

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_academic_paper(text, chunk_size=1000, chunk_overlap=200):
    """Chunk an academic paper using recursive splitting."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=[
            "\n## ",     # H2 headings (section breaks)
            "\n### ",    # H3 headings (subsection breaks)
            "\n\n",      # Paragraph breaks
            "\n",        # Line breaks
            ". ",        # Sentence breaks
            " ",         # Word breaks
        ],
        length_function=len
    )
    chunks = splitter.split_text(text)
    return chunks

# Add metadata to each chunk
def create_documents(paper_text, metadata):
    """Create chunks with source metadata for citation tracking."""
    chunks = chunk_academic_paper(paper_text)
    documents = []
    for i, chunk in enumerate(chunks):
        documents.append({
            "text": chunk,
            "metadata": {
                **metadata,
                "chunk_index": i,
                "chunk_total": len(chunks)
            }
        })
    return documents

# Example usage
docs = create_documents(
    paper_text=extracted_text,
    metadata={
        "title": "Attention Is All You Need",
        "authors": "Vaswani et al.",
        "year": 2017,
        "doi": "10.48550/arXiv.1706.03762",
        "source_file": "vaswani2017attention.pdf"
    }
)
```

## Step 2: Embedding and Indexing

### Embedding Model Selection

| Model | Dimensions | Quality | Speed | Cost |
|-------|-----------|---------|-------|------|
| OpenAI text-embedding-3-small | 1536 | Good | Fast | $0.02/1M tokens |
| OpenAI text-embedding-3-large | 3072 | Excellent | Fast | $0.13/1M tokens |
| Cohere embed-v3 | 1024 | Excellent | Fast | $0.10/1M tokens |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | Good | Very fast | Free (local) |
| BAAI/bge-large-en-v1.5 | 1024 | Excellent | Medium | Free (local) |
| nomic-embed-text | 768 | Good | Fast | Free (local) |

### Vector Database Options

| Database | Type | Scalability | Features |
|----------|------|------------|----------|
| ChromaDB | Embedded | Small-medium | Simple, good for prototyping |
| FAISS | Library | Large | Facebook research, GPU support |
| Pinecone | Cloud | Large | Managed, serverless |
| Weaviate | Self-hosted/Cloud | Large | Hybrid search, filters |
| Qdrant | Self-hosted/Cloud | Large | Rich filtering, payload storage |
| pgvector | PostgreSQL extension | Medium | SQL integration |

### Building the Index

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Initialize embedding model (local, free)
embed_model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Initialize ChromaDB
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="research_papers",
    metadata={"hnsw:space": "cosine"}
)

# Index documents
def index_documents(documents):
    """Add documents to the vector database."""
    texts = [doc["text"] for doc in documents]
    embeddings = embed_model.encode(texts, show_progress_bar=True).tolist()
    ids = [f"doc_{i}" for i in range(len(documents))]
    metadatas = [doc["metadata"] for doc in documents]

    collection.add(
        documents=texts,
        embeddings=embeddings,
        metadatas=metadatas,
        ids=ids
    )
    print(f"Indexed {len(documents)} chunks")

index_documents(docs)
```

## Step 3: Retrieval

### Basic Retrieval

```python
def retrieve(query, top_k=5):
    """Retrieve the most relevant chunks for a query."""
    query_embedding = embed_model.encode([query]).tolist()

    results = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )

    retrieved = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        retrieved.append({
            "text": doc,
            "metadata": meta,
            "similarity": 1 - dist  # Convert distance to similarity
        })

    return retrieved

# Example
results = retrieve("What are the main components of the Transformer architecture?")
for r in results:
    print(f"[{r['similarity']:.3f}] {r['metadata'].get('title', 'N/A')}")
    print(f"  {r['text'][:150]}...")
```

### Advanced Retrieval: Hybrid Search

```python
def hybrid_retrieve(query, top_k=5, alpha=0.7):
    """Combine dense (semantic) and sparse (keyword) retrieval."""

    # Dense retrieval (vector similarity)
    dense_results = retrieve(query, top_k=top_k * 2)

    # Sparse retrieval (BM25 keyword matching)
    from rank_bm25 import BM25Okapi

    # Assume all_documents is a list of all chunk texts
    tokenized_corpus = [doc.split() for doc in all_documents]
    bm25 = BM25Okapi(tokenized_corpus)
    bm25_scores = bm25.get_scores(query.split())
    sparse_top_k = bm25_scores.argsort()[-top_k * 2:][::-1]

    # Reciprocal Rank Fusion (RRF)
    rrf_scores = {}
    k = 60  # RRF constant

    for rank, result in enumerate(dense_results):
        doc_id = result["metadata"].get("chunk_index", rank)
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + alpha / (k + rank + 1)

    for rank, idx in enumerate(sparse_top_k):
        rrf_scores[idx] = rrf_scores.get(idx, 0) + (1 - alpha) / (k + rank + 1)

    # Sort by RRF score and return top-k
    sorted_results = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:top_k]
```

## Step 4: Generation with Citations

```python
def generate_answer(query, retrieved_contexts):
    """Generate an answer with inline citations using an LLM."""

    # Build context string with citation markers
    context_parts = []
    for i, ctx in enumerate(retrieved_contexts, 1):
        source = f"{ctx['metadata'].get('authors', 'Unknown')}, {ctx['metadata'].get('year', 'N/A')}"
        context_parts.append(f"[{i}] ({source}): {ctx['text']}")

    context_string = "\n\n".join(context_parts)

    prompt = f"""Based on the following research paper excerpts, answer the question.
Use inline citations like [1], [2] to reference specific sources.
Only use information from the provided excerpts.
If the excerpts do not contain enough information, say so.

EXCERPTS:
{context_string}

QUESTION: {query}

ANSWER (with inline citations):"""

    # Send to LLM (example with OpenAI)
    # response = openai.chat.completions.create(
    #     model="gpt-4",
    #     messages=[{"role": "user", "content": prompt}],
    #     temperature=0.1
    # )
    # return response.choices[0].message.content

    return prompt  # Return prompt for inspection
```

## Evaluation Metrics

| Metric | Measures | Tool |
|--------|----------|------|
| **Retrieval precision** | Are retrieved chunks relevant? | Manual annotation |
| **Retrieval recall** | Are all relevant chunks retrieved? | Known-relevant set |
| **NDCG** | Ranking quality of retrieved results | BEIR benchmark |
| **Answer correctness** | Is the generated answer factually correct? | Human evaluation |
| **Faithfulness** | Does the answer only use information from retrieved context? | RAGAS framework |
| **Answer relevance** | Does the answer address the question? | RAGAS framework |
| **Context relevance** | Are the retrieved contexts relevant to the question? | RAGAS framework |

```python
# Using RAGAS for automated RAG evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Prepare evaluation dataset
eval_data = {
    "question": ["What is the Transformer architecture?"],
    "answer": ["The Transformer uses self-attention mechanisms..."],
    "contexts": [["The Transformer model architecture eschews recurrence..."]],
    "ground_truth": ["The Transformer is a neural network architecture..."]
}

result = evaluate(
    dataset=eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision]
)
print(result)
```

## Best Practices for Academic RAG

1. **Chunk by section**: Academic papers have natural section boundaries. Use them.
2. **Preserve metadata**: Always store title, authors, year, DOI, and page number with each chunk for proper citation.
3. **Use domain-specific embeddings**: Models fine-tuned on scientific text (e.g., SPECTER2) outperform general models for academic content.
4. **Rerank after retrieval**: A cross-encoder reranker significantly improves precision over embedding-only retrieval.
5. **Handle tables and figures**: Extract tables as text or structured data; do not ignore them during chunking.
6. **Evaluate systematically**: Use RAGAS or a custom evaluation set to measure retrieval and generation quality before deploying.