eval-rag
Evaluate retrieval and generation quality in RAG pipelines. Separate scoring for retrieval (recall, precision, MRR) and generation (faithfulness, relevance, completeness). End-to-end pipeline assessment with bottleneck identification. Triggers on: "eval rag", "rag evaluation", "retrieval evaluation", "rag quality", "rag metrics"
Best use case
eval-rag is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Evaluate retrieval and generation quality in RAG pipelines. Separate scoring for retrieval (recall, precision, MRR) and generation (faithfulness, relevance, completeness). End-to-end pipeline assessment with bottleneck identification. Triggers on: "eval rag", "rag evaluation", "retrieval evaluation", "rag quality", "rag metrics"
Teams using eval-rag should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/eval-rag/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How eval-rag Compares
| Feature / Agent | eval-rag | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Evaluate retrieval and generation quality in RAG pipelines. Separate scoring for retrieval (recall, precision, MRR) and generation (faithfulness, relevance, completeness). End-to-end pipeline assessment with bottleneck identification. Triggers on: "eval rag", "rag evaluation", "retrieval evaluation", "rag quality", "rag metrics"
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# /eval-rag > Evaluate retrieval + generation quality in RAG pipelines. ## Purpose Assess a Retrieval-Augmented Generation pipeline by evaluating retrieval and generation independently, then measuring end-to-end quality. Retrieval scoring checks if the right documents are fetched (recall, precision, MRR, NDCG). Generation scoring checks if the answer is faithful to retrieved context (no hallucination), relevant to the question, and complete. Identifies whether failures originate in retrieval, generation, or both, so you know where to invest improvement effort. ## Usage ```bash # Full RAG evaluation /eval-rag --pipeline rag/ --queries eval/queries.jsonl --golden eval/golden.jsonl # Evaluate retrieval only /eval-rag --pipeline rag/ --queries eval/queries.jsonl --golden eval/golden.jsonl --stage retrieval # Evaluate generation only (with pre-fetched contexts) /eval-rag --contexts retrieved.jsonl --queries eval/queries.jsonl --golden eval/golden.jsonl --stage generation # Custom retrieval depth /eval-rag --pipeline rag/ --queries eval/queries.jsonl --golden eval/golden.jsonl --k 10 # Compare two retrieval configurations /eval-rag --pipeline rag-v1/ --pipeline-b rag-v2/ --queries eval/queries.jsonl --golden eval/golden.jsonl ``` ## Arguments | Flag | Type | Default | Description | |------|------|---------|-------------| | `--pipeline` | string | required | Path to RAG pipeline configuration or directory | | `--pipeline-b` | string | — | Second pipeline for A/B comparison | | `--queries` | string | required | Path to evaluation queries (JSONL) | | `--golden` | string | required | Path to golden answers with relevant doc IDs (JSONL) | | `--contexts` | string | — | Pre-retrieved contexts (skips retrieval stage) | | `--stage` | enum | `both` | Evaluate: `retrieval`, `generation`, `both` | | `--k` | int | `5` | Retrieval depth (top-K documents) | | `--output` | string | stdout | Write report to file | | `--format` | enum | `markdown` | Output format: `markdown`, `json` | | `--faithfulness-judge` | string | built-in | Custom judge prompt for faithfulness scoring | | `--sample` | int | all | Sample size from query set | ## Workflow 1. **Load** — Parse pipeline config, queries, and golden answers. Each query should have: question, relevant document IDs (for retrieval scoring), and golden answer (for generation scoring). 2. **Retrieval evaluation** — For each query, run retrieval and compare fetched documents against golden relevant docs. Compute per-query and aggregate metrics: - **Recall@K**: fraction of relevant docs retrieved in top K - **Precision@K**: fraction of top K that are relevant - **MRR**: reciprocal rank of first relevant doc - **NDCG@K**: normalized discounted cumulative gain - **Hit rate**: fraction of queries with at least one relevant doc in top K 3. **Context analysis** — Examine retrieved contexts for: relevance distribution (how many retrieved docs are actually useful), noise ratio (irrelevant docs that might confuse generation), context ordering (is the most relevant doc first). 4. **Generation evaluation** — For each query + retrieved context, run generation and score the output: - **Faithfulness**: Does the answer only use information from retrieved context? (No hallucination) - **Relevance**: Does the answer address the query? - **Completeness**: Does the answer cover all aspects of the golden answer? - **Conciseness**: Is the answer free of unnecessary information? 5. **Bottleneck identification** — Cross-reference retrieval and generation scores. Classify each failure as: retrieval failure (right answer not in context), generation failure (right context but wrong answer), or compound failure (both). 6. **Comparison** — If `--pipeline-b` is provided, run both pipelines and produce a side-by-side comparison with statistical significance tests. 7. **Report** — Produce the full evaluation report with per-stage metrics, bottleneck analysis, and improvement recommendations. ## Examples ### Full RAG evaluation ``` /eval-rag --pipeline rag/ --queries eval/queries.jsonl --golden eval/golden.jsonl --k 5 ## RAG Evaluation Report ### Retrieval Metrics (K=5) | Metric | Score | |--------|-------| | Recall@5 | 0.78 | | Precision@5 | 0.41 | | MRR | 0.72 | | NDCG@5 | 0.68 | | Hit Rate | 0.89 | ### Generation Metrics | Metric | Score | |--------|-------| | Faithfulness | 0.91 | | Relevance | 0.85 | | Completeness | 0.67 | | Conciseness | 0.88 | ### Bottleneck Analysis | Failure Type | Count | % of Failures | |-------------|-------|---------------| | Retrieval failure | 31 | 58.5% | | Generation failure | 14 | 26.4% | | Compound failure | 8 | 15.1% | ### Recommendation Primary bottleneck is retrieval (58.5% of failures). Focus on: 1. Improve chunking strategy — current chunks miss relevant context 2. Add hybrid search (keyword + semantic) — 12 queries failed on keyword-dependent lookups 3. Increase K to 10 for complex queries — recall jumps to 0.89 at K=10 ``` ## Output ```markdown ## RAG Evaluation Report ### Pipeline: <path> ### Queries: N evaluated ### Retrieval Metrics | Metric | Score | CI | |--------|-------|----| ### Generation Metrics | Metric | Score | CI | |--------|-------|----| ### Bottleneck Analysis | Type | Count | % | |------|-------|----| ### Per-Query Breakdown (worst N) | Query | Retrieval | Generation | Failure Type | ### Recommendations 1. ... ### Comparison (if --pipeline-b) | Metric | Pipeline A | Pipeline B | Delta | Significant? | ``` ## Dependencies - RAG pipeline (retrieval + generation components) - Evaluation queries with golden answers and relevant doc IDs - `/judge-prompt` — For custom faithfulness judges - `/eval-audit` — Upstream pipeline health check - `/error-analysis` — Deep-dive on failure patterns - LLM access for generation evaluation scoring
Related Skills
eval-audit
Audit an LLM evaluation pipeline for correctness, coverage, and reliability. 6 diagnostic areas with structured Check/Finding output. Produces prioritized findings by severity and recommends next skills to run. Catches common eval pitfalls before they corrupt your metrics. Triggers on: "eval audit", "audit evals", "evaluation audit", "check eval pipeline", "eval health"
validate-evaluator
Calibrate LLM-as-Judge evaluators against human labels. Computes TPR, TNR, precision, recall, F1, and Cohen's kappa. Detects systematic biases and recommends prompt corrections. Produces a calibration report with confidence intervals. Triggers on: "validate evaluator", "calibrate judge", "judge accuracy", "evaluator validation", "judge metrics"
/do
> The agent's primary skill. Customize this to match your agent's purpose.
/report
> Generate structured reports. Director-owned.
/primary
> Main workflow execution and routing. Director-owned.
Qualify
## Command
Prospect
## Command
Close Plan
## Command
Battlecard
## Command
Spec
## Command
Schedule
## Command
Repurpose
## Command