scholarag

Build PRISMA 2020-compliant systematic literature review systems with RAG-powered analysis in VS Code. Use when researcher needs automated paper retrieval (Semantic Scholar, OpenAlex, arXiv), AI-assisted PRISMA screening (50% or 90% threshold), vector database creation (ChromaDB), or research conversation interface. Supports knowledge_repository (comprehensive, 15K+ papers, teaching/exploration) and systematic_review (publication-quality, 50-300 papers, meta-analysis) modes. Conversation-first workflow with 7 stages.

16 stars

Best use case

scholarag is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Build PRISMA 2020-compliant systematic literature review systems with RAG-powered analysis in VS Code. Use when researcher needs automated paper retrieval (Semantic Scholar, OpenAlex, arXiv), AI-assisted PRISMA screening (50% or 90% threshold), vector database creation (ChromaDB), or research conversation interface. Supports knowledge_repository (comprehensive, 15K+ papers, teaching/exploration) and systematic_review (publication-quality, 50-300 papers, meta-analysis) modes. Conversation-first workflow with 7 stages.

Teams using scholarag should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/scholarag-majiayu000/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/scholarag-majiayu000/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/scholarag-majiayu000/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How scholarag Compares

Feature / AgentscholaragStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Build PRISMA 2020-compliant systematic literature review systems with RAG-powered analysis in VS Code. Use when researcher needs automated paper retrieval (Semantic Scholar, OpenAlex, arXiv), AI-assisted PRISMA screening (50% or 90% threshold), vector database creation (ChromaDB), or research conversation interface. Supports knowledge_repository (comprehensive, 15K+ papers, teaching/exploration) and systematic_review (publication-quality, 50-300 papers, meta-analysis) modes. Conversation-first workflow with 7 stages.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# ScholaRAG: Systematic Review Automation Skill

**For**: Claude Code (AI assistant in VS Code)
**Purpose**: Guide researchers through PRISMA 2020 systematic literature review + RAG-powered analysis

---

## Quick Start (5 minutes)

### For Researchers
1. **Initialize project**: `python scholarag_cli.py init`
2. **Paste Stage 1 prompt**: Copy from [https://www.scholarag.com/guide/01-introduction](https://www.scholarag.com/guide/01-introduction)
3. **Answer Claude's questions** → Config created automatically
4. **Proceed through 7 stages** conversationally

### For AI Assistants (Claude Code)
When researcher provides a ScholaRAG prompt:
1. **Check for HTML metadata block** (`<!-- METADATA ... -->` at top of prompt)
2. **Identify current stage** (1-7) from metadata `stage` field
3. **Follow conversation pattern** (from metadata `conversation_flow`)
4. **Validate completion** (using metadata `validation_rules`)
5. **Auto-execute commands** (when `auto_execute: true`)
6. **Update `.claude/context.json`** (track progress)
7. **Show next stage prompt** (from metadata `next_stage.prompt_file`)

**Researcher should NEVER touch terminal**. You execute all scripts automatically.

---

## 7-Stage Workflow Overview

| Stage | Name | Read This File | Duration | Auto-Execute |
|-------|------|----------------|----------|--------------|
| 1 | Research Setup | [skills/claude_only/stage1_research_setup.md](skills/claude_only/stage1_research_setup.md) | 15-20 min | ✅ `scholarag init` |
| 2 | Query Strategy | [skills/claude_only/stage2_query_strategy.md](skills/claude_only/stage2_query_strategy.md) | 15-25 min | ❌ Manual review |
| 3 | PRISMA Config | [skills/claude_only/stage3_prisma_config.md](skills/claude_only/stage3_prisma_config.md) | 20-30 min | ❌ Manual review |
| 4 | RAG Design | [skills/claude_only/stage4_rag_design.md](skills/claude_only/stage4_rag_design.md) | 10-15 min | ❌ Manual review |
| 5 | Execution | [skills/claude_only/stage5_execution.md](skills/claude_only/stage5_execution.md) | 2-4 hours | ✅ Run all 5 scripts |
| 6 | Research Conversation | [skills/claude_only/stage6_research_conversation.md](skills/claude_only/stage6_research_conversation.md) | Ongoing | ❌ Interactive |
| 7 | Documentation | [skills/claude_only/stage7_documentation.md](skills/claude_only/stage7_documentation.md) | 30-60 min | ✅ Generate PRISMA |

**Progressive Disclosure**: Load stage file **only when researcher enters that stage**. Don't preload all 7 stages (token waste).

---

## Critical Branching Points

### 🔴 project_type (Stage 1 Decision)

**Two modes available**:

| Mode | Threshold | Output | Best For |
|------|-----------|--------|----------|
| `knowledge_repository` | 50% (lenient) | 15K-20K papers | Teaching, AI assistant, exploration |
| `systematic_review` | 90% (strict) | 50-300 papers | Meta-analysis, publication |

**Quick decision**:
- Publishing systematic review? → `systematic_review` ✅
- Comprehensive domain coverage? → `knowledge_repository` ✅

**Detailed decision tree**: [skills/reference/project_type_decision_tree.md](skills/reference/project_type_decision_tree.md)

**When to read decision tree**:
- Researcher asks: "Which project_type should I choose?"
- Researcher says: "I'm unsure about my research goals"
- Stage 1 initialization (proactively offer decision helper)

---

### 🔴 Stage 6 Scenarios (7 Research Modes)

**Stage 6 branches into 7 specialized conversation scenarios**:

1. **overview** (Context Scanning): High-level themes, methods, findings
2. **hypothesis** (Hypothesis Validation): Evidence for/against with effect sizes
3. **statistics** (Statistical Extraction): RCT data table (tools, Cohen's d, samples)
4. **methods** (Methodology Comparison): RCT vs quasi vs mixed methods
5. **contradictions** (Contradiction Detection): Conflicting results + analysis
6. **policy** (Policy Translation): Actionable recommendations for stakeholders
7. **grant** (Future Research Design): Follow-up study design + hypotheses

**Details**: [skills/claude_only/stage6_research_conversation.md](skills/claude_only/stage6_research_conversation.md)

**When to read**: Stage 6 entry (researcher asks "What can I query?")

---

## Error Recovery

**When errors occur**: [skills/reference/error_recovery.md](skills/reference/error_recovery.md)

**Quick fixes** (common issues):

| Error | Quick Fix | Detailed Guide |
|-------|-----------|----------------|
| Too many papers (>30K) | Refine query in Stage 2, re-run fetch | error_recovery.md §2.1 |
| API key missing | Add `ANTHROPIC_API_KEY` to `.env` | error_recovery.md §3.1 |
| Low PDF success (<30%) | Filter for `open_access` in Stage 1 | error_recovery.md §4.1 |
| All papers excluded (0 papers) | Lower threshold or broaden query | error_recovery.md §3.2 |

---

## Reference Materials (Load Only When Needed)

**Progressive disclosure**: Don't preload these. Read **only when** researcher asks specific questions.

| Topic | File | When to Read |
|-------|------|--------------|
| API endpoints | [skills/reference/api_reference.md](skills/reference/api_reference.md) | Researcher asks about Semantic Scholar, OpenAlex, arXiv |
| Config schema | [skills/reference/config_schema.md](skills/reference/config_schema.md) | Researcher asks "What fields are in config.yaml?" |
| PRISMA checklist | [skills/reference/prisma_guidelines.md](skills/reference/prisma_guidelines.md) | Researcher asks about PRISMA 2020 compliance |
| Troubleshooting | [skills/reference/troubleshooting.md](skills/reference/troubleshooting.md) | Researcher reports errors not in Quick Fixes |

---

## Architecture Overview

**File dependencies**: [https://www.scholarag.com/codebook/architecture](https://www.scholarag.com/codebook/architecture)

**Key principle**: Scripts read from `config.yaml` (single source of truth), **never hardcode values**.

**Critical scripts** (read `project_type` from config):
- `03_screen_papers.py`: Sets threshold (50% or 90%)
- `07_generate_prisma.py`: Changes diagram title ("Knowledge Repository" vs "Systematic Review")

---

## For Codex Users

**If researcher is using OpenAI Codex instead of Claude Code**:

See [AGENTS.md](AGENTS.md) for bash-based task workflows.

Codex workflow differs:
- **Task-oriented** (not conversation-oriented)
- **Bash commands** (not validation rules)
- **Exit codes** (not metadata parsing)

**Universal reference files** (Claude + Codex both use):
- `skills/reference/project_type_decision_tree.md`
- `skills/reference/api_reference.md`
- `skills/reference/config_schema.md`

---

## Token Optimization Notes

**This file**: ~400 lines (loaded once per conversation)

**Stage-specific files**: ~300-500 lines each (loaded on-demand)

**Total per conversation**: ~700 lines (this file + current stage file)

**Previous approach**: ~2,000 lines (all context upfront)

**Token reduction**: **65%** ✅

**How it works**:
1. Researcher starts Stage 1 → You load this file + `stage1_research_setup.md`
2. Researcher moves to Stage 2 → You load `stage2_query_strategy.md` (Stage 1 file unloaded)
3. Reference files loaded **only when** researcher asks (e.g., "How does Semantic Scholar API work?")

---

## Metadata Block Format

**All prompts in `prompts/*.md` contain HTML comment metadata at top**:

```html
<!-- METADATA
stage: 1
stage_name: "Research Domain Setup"
expected_duration: "15-20 minutes"
conversation_mode: "interactive"
expected_turns: "6-10"
outputs:
  required:
    - project_name: "Descriptive name"
    - research_question: "Clear, answerable question"
    - project_type: "knowledge_repository OR systematic_review"
validation_rules:
  project_type:
    required: true
    allowed_values: ["knowledge_repository", "systematic_review"]
cli_commands:
  - command: "python scholarag_cli.py init ..."
    auto_execute: true
next_stage:
  stage: 2
  prompt_file: "prompts/02_query_strategy.md"
-->
```

**How to use**:
1. **Parse YAML inside HTML comment** (lines between `<!-- METADATA` and `-->`)
2. **Extract fields**: `stage`, `expected_turns`, `validation_rules`, `cli_commands`, `next_stage`
3. **Follow conversation pattern**: Ask questions matching `expected_turns` count
4. **Validate**: Check user inputs against `validation_rules`
5. **Execute**: Run `cli_commands` when conversation complete
6. **Transition**: Show prompt from `next_stage.prompt_file`

---

## Divergence Handling

**Common researcher confusions** (from metadata `divergence_handling`):

### Divergence 1: "Can you help me download PDFs?" (in Stage 1)
**Response**: "PDF downloading happens in Stage 4 (after screening in Stage 3). Right now in Stage 1, let's first define your research scope and choose project_type. We'll design queries in Stage 2, configure PRISMA in Stage 3, then download PDFs in Stage 4."

### Divergence 2: "I want to skip systematic review" (in Stage 1)
**Response**: "If you don't need publication-quality systematic review, choose `project_type: knowledge_repository` in the next question. This mode uses lenient filtering (50% threshold) for comprehensive domain coverage (15K-20K papers). It's perfect for teaching materials, AI assistants, or exploratory research."

### Divergence 3: "What's the difference between the two modes?" (in Stage 1)
**Response**: "Let me explain:

**knowledge_repository**:
- 50% threshold (lenient, removes only spam)
- 15,000-20,000 papers output
- For: Teaching, exploration, AI assistant

**systematic_review**:
- 90% threshold (strict, PRISMA 2020)
- 50-300 papers output
- For: Meta-analysis, publication

See full decision tree: [skills/reference/project_type_decision_tree.md](skills/reference/project_type_decision_tree.md)"

---

## Conversation Flow Example (Stage 1)

**Typical pattern** (6-10 turns):

1. **Turn 1**: Researcher provides research topic
   - **You ask**: "Is this for exploratory domain mapping or publication-quality systematic review?"

2. **Turn 2-3**: Researcher answers scope questions
   - **You suggest**: `project_type` based on answers, explain threshold implications
   - **Example**: "Based on your goal of meta-analysis, I recommend `systematic_review` mode with 90% screening threshold."

3. **Turn 4-5**: Researcher confirms project_type choice
   - **You suggest**: Year range, publication types, expected databases
   - **Example**: "For language learning studies, I recommend 2015-2025 (10 years) focusing on Semantic Scholar and ERIC."

4. **Turn 6-8**: Researcher provides final details (domain, year range)
   - **You summarize**: All decisions, ask for confirmation
   - **Example**: "Here's what I'll create: [summary]. Ready to initialize?"

5. **Turn 9-10**: Researcher confirms initialization
   - **You execute**: `scholarag_cli.py init`, create `config.yaml`, show next steps
   - **Example**: "✅ Project initialized! Next, let's design your search query in Stage 2."

---

## Completion Checklist (Stage-Specific)

**Stage 1 example** (from metadata `completion_checklist`):

- [ ] `project_name` is descriptive and unique (≥10 chars)
- [ ] `research_question` is specific and answerable (≥20 chars)
- [ ] `project_type` chosen with understanding of implications (50% vs 90%)
- [ ] `year_range` is realistic for scope (≤25 years, not before 2000)
- [ ] `config.yaml` created successfully (file exists, valid YAML)

**When all checked** → Auto-execute `scholarag_cli.py init` → Show Stage 2 prompt

---

## Example Commands You Will Execute

### Stage 1: Initialize
```bash
python scholarag_cli.py init \
  --name "AI-Chatbots-Language-Learning" \
  --question "How do AI chatbots improve speaking proficiency in EFL learners?" \
  --domain education
```

### Stage 5: Run Pipeline (All 5 Scripts)
```bash
# Fetch papers
python scripts/01_fetch_papers.py --project projects/YYYY-MM-DD_ProjectName

# Deduplicate
python scripts/02_deduplicate.py --project projects/YYYY-MM-DD_ProjectName

# Screen with AI
python scripts/03_screen_papers.py --project projects/YYYY-MM-DD_ProjectName

# Download PDFs
python scripts/04_download_pdfs.py --project projects/YYYY-MM-DD_ProjectName

# Build RAG
python scripts/05_build_rag.py --project projects/YYYY-MM-DD_ProjectName
```

### Stage 7: Generate PRISMA
```bash
python scripts/07_generate_prisma.py --project projects/YYYY-MM-DD_ProjectName
```

---

## Integration with .claude/context.json

**You should update this file after each stage**:

```json
{
  "current_stage": {
    "stage": 2,
    "name": "Query Strategy",
    "status": "in_progress",
    "started_at": "2025-10-24T10:30:00Z"
  },
  "completed_stages": [
    {
      "stage": 1,
      "name": "Research Setup",
      "completed_at": "2025-10-24T10:25:00Z",
      "outputs": {
        "project_name": "AI-Chatbots-Language-Learning",
        "research_question": "How do AI chatbots improve speaking proficiency?",
        "project_type": "systematic_review"
      }
    }
  ],
  "project": {
    "name": "AI-Chatbots-Language-Learning",
    "created": "2025-10-24",
    "research_question": "How do AI chatbots improve speaking proficiency in EFL learners?",
    "project_type": "systematic_review"
  }
}
```

**Purpose**: Track progress, enable `scholarag status` command to show current stage.

---

## FAQ for AI Assistants

### Q: Should I always read stage files in order (1→2→3...)?
**A**: No! Read **only the file for the current stage** researcher is in. Use progressive disclosure.

### Q: What if researcher jumps to Stage 5 without completing Stages 1-4?
**A**: Check `.claude/context.json` for completed stages. If missing prerequisites, politely redirect:
"Stage 5 requires config.yaml from Stage 1, search_query from Stage 2, and PRISMA criteria from Stage 3. Let's complete those first."

### Q: When should I read `skills/reference/` files?
**A**: **Only when researcher explicitly asks**. Examples:
- "How does Semantic Scholar API work?" → Read `api_reference.md`
- "What are all the config.yaml fields?" → Read `config_schema.md`
- "Why should I choose systematic_review?" → Read `project_type_decision_tree.md`

### Q: What if I don't understand metadata in prompts/*.md?
**A**: All metadata fields are documented in `skills/claude_only/metadata_spec.md`. Read that file if you encounter unknown fields.

---

## Additional Resources

**Detailed implementation guide**: See [CLAUDE.md](CLAUDE.md) for:
- 🎓 User profile (researchers with limited coding experience)
- How Claude Code should behave (DO/DON'T guidelines)
- Auto-execution patterns (echo pipes, CLI arguments)
- Full CLI reference and troubleshooting

**For Codex/Cursor users**: See [AGENTS.md](AGENTS.md) for task-based bash workflows

---

**Last Updated**: 2025-10-24 (v2.0 - Agent Skills Integration)
**Companion files**: CLAUDE.md (detailed guide), AGENTS.md (Codex workflows)
**Compatible with**: Claude Code v1.0+, Anthropic API
**Token Budget**: ~380 lines (this file) + ~300-500 lines (stage file) = ~700-900 lines per conversation

Related Skills

bgo

10
from diegosouzapw/awesome-omni-skill

Automates the complete Blender build-go workflow, from building and packaging your extension/add-on to removing old versions, installing, enabling, and launching Blender for quick testing and iteration.

Coding & Development

skill-coach

16
from diegosouzapw/awesome-omni-skill

Guides creation of high-quality Agent Skills with domain expertise, anti-pattern detection, and progressive disclosure best practices. Use when creating skills, reviewing existing skills, or when users mention improving skill quality, encoding expertise, or avoiding common AI tooling mistakes. Activate on keywords: create skill, review skill, skill quality, skill best practices, skill anti-patterns. NOT for general coding advice or non-skill Claude Code features.

skild

16
from diegosouzapw/awesome-omni-skill

Skill package manager for AI Agents — install, manage, and publish Agent Skills.

sitrep-coordinator

16
from diegosouzapw/awesome-omni-skill

Military-style Situation Report (SITREP) generation for multi-agent coordination. Creates structured status updates with completed/in-progress/blocked sections, authorization codes, handoff protocols, and clear next actions. Optimized for complex project management across multiple AI agents and human operators.

sitespeakai-automation

16
from diegosouzapw/awesome-omni-skill

Automate Sitespeakai tasks via Rube MCP (Composio). Always search tools first for current schemas.

simulation-dry-run

16
from diegosouzapw/awesome-omni-skill

How to run scenario tests against Gorlami fork RPCs (dry runs) before broadcasting live transactions. Covers config, seeding balances, runner flags, and safe script patterns.

simple-pr

16
from diegosouzapw/awesome-omni-skill

Create a simple PR from staged changes with an auto-generated commit message

simple-gemini

16
from diegosouzapw/awesome-omni-skill

Collaborative documentation and test code writing workflow using zen mcp's clink to launch gemini CLI session in WSL (via 'gemini' command) where all writing operations are executed. Use this skill when the user requests "use gemini to write test files", "use gemini to write documentation", "generate related test files", "generate an explanatory document", or similar document/test writing tasks. The gemini CLI session acts as the specialist writer, working with the main Claude model for context gathering, outline approval, and final review. For test code, codex CLI (also launched via clink) validates quality after gemini completes writing.

simo-multiomics-integration-agent

16
from diegosouzapw/awesome-omni-skill

AI-powered spatial integration of multi-omics datasets using probabilistic alignment for comprehensive tissue atlas construction and cellular state mapping.

simla-com-automation

16
from diegosouzapw/awesome-omni-skill

Automate Simla Com tasks via Rube MCP (Composio). Always search tools first for current schemas.

simd-optimize

16
from diegosouzapw/awesome-omni-skill

SIMD vectorization for Rust — detects ISA features, identifies vectorizable patterns, generates platform-specific intrinsics (ARM NEON/SVE, x86 SSE/AVX/AVX-512), validates correctness and performance. Uses tiered research with baked-in references and /deep-research fallback.

signalwire-agents-sdk

16
from diegosouzapw/awesome-omni-skill

Expert assistance for building SignalWire AI Agents in Python. Automatically activates when working with AgentBase, SWAIG functions, skills, SWML, voice configuration, DataMap, or any signalwire_agents code. Provides patterns, best practices, and complete working examples.