eval-rag

Evaluate retrieval and generation quality in RAG pipelines. Separate scoring for retrieval (recall, precision, MRR) and generation (faithfulness, relevance, completeness). End-to-end pipeline assessment with bottleneck identification. Triggers on: "eval rag", "rag evaluation", "retrieval evaluation", "rag quality", "rag metrics"

170 stars

by Miosa-osa

Best use case

eval-rag is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using eval-rag should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/eval-rag/SKILL.md --create-dirs "https://raw.githubusercontent.com/Miosa-osa/canopy/main/library/skills/ai-patterns/eval-rag/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/eval-rag/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How eval-rag Compares

| Feature / Agent | eval-rag | Standard Approach |
|-----------------|----------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

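What does this skill do?

It evaluates retrieval and generation quality in RAG pipelines, scoring each stage separately and identifying which stage is the bottleneck.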

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# /eval-rag

> Evaluate retrieval + generation quality in RAG pipelines.

## Purpose

Assess a Retrieval-Augmented Generation pipeline by evaluating retrieval and generation independently, then measuring end-to-end quality. Retrieval scoring checks if the right documents are fetched (recall, precision, MRR, NDCG). Generation scoring checks if the answer is faithful to retrieved context (no hallucination), relevant to the question, and complete. Identifies whether failures originate in retrieval, generation, or both, so you know where to invest improvement effort.

## Usage

```bash
# Full RAG evaluation
/eval-rag --pipeline rag/ --queries eval/queries.jsonl --golden eval/golden.jsonl

# Evaluate retrieval only
/eval-rag --pipeline rag/ --queries eval/queries.jsonl --golden eval/golden.jsonl --stage retrieval

# Evaluate generation only (with pre-fetched contexts)
/eval-rag --contexts retrieved.jsonl --queries eval/queries.jsonl --golden eval/golden.jsonl --stage generation

# Custom retrieval depth
/eval-rag --pipeline rag/ --queries eval/queries.jsonl --golden eval/golden.jsonl --k 10

# Compare two retrieval configurations
/eval-rag --pipeline rag-v1/ --pipeline-b rag-v2/ --queries eval/queries.jsonl --golden eval/golden.jsonl
```

## Arguments

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--pipeline` | string | required | Path to RAG pipeline configuration or directory (the generation-only usage example omits it when `--contexts` is supplied) |
| `--pipeline-b` | string | — | Second pipeline for A/B comparison |
| `--queries` | string | required | Path to evaluation queries (JSONL) |
| `--golden` | string | required | Path to golden answers with relevant doc IDs (JSONL) |
| `--contexts` | string | — | Pre-retrieved contexts (skips retrieval stage) |
| `--stage` | enum | `both` | Evaluate: `retrieval`, `generation`, `both` |
| `--k` | int | `5` | Retrieval depth (top-K documents) |
| `--output` | string | stdout | Write report to file |
| `--format` | enum | `markdown` | Output format: `markdown`, `json` |
| `--faithfulness-judge` | string | built-in | Custom judge prompt for faithfulness scoring |
| `--sample` | int | all | Sample size from query set |
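
The schema of the `--queries` and `--golden` JSONL files is not spelled out here, so the following is only a sketch of how the two files might be joined and sampled. The field names (`id`, `question`, `relevant_doc_ids`, `answer`), the `EvalQuery` class, and both helper functions are assumptions made for illustration, and the sample records are made up.

```python
import json
import random
from dataclasses import dataclass


@dataclass
class EvalQuery:
    """One evaluation item (hypothetical shape, not the skill's actual schema)."""
    query_id: str
    question: str
    relevant_doc_ids: list[str]
    golden_answer: str


def parse_eval_records(query_lines, golden_lines):
    """Join query and golden JSONL lines on a shared "id" field (assumed)."""
    golden_by_id = {}
    for line in golden_lines:
        if line.strip():
            rec = json.loads(line)
            golden_by_id[rec["id"]] = rec

    records = []
    for line in query_lines:
        if not line.strip():
            continue  # tolerate blank lines in the file
        q = json.loads(line)
        g = golden_by_id.get(q["id"])
        if g is None:
            raise ValueError(f"query {q['id']!r} has no golden record")
        records.append(
            EvalQuery(q["id"], q["question"], list(g["relevant_doc_ids"]), g["answer"])
        )
    return records


def sample_records(records, n=None, seed=0):
    """Mimic a --sample style option: n=None keeps every record."""
    if n is None or n >= len(records):
        return list(records)
    return random.Random(seed).sample(records, n)


# Made-up data standing in for the contents of the two JSONL files.
query_lines = [
    '{"id": "q1", "question": "What is the refund window?"}',
    '{"id": "q2", "question": "Which plans include SSO?"}',
]
golden_lines = [
    '{"id": "q1", "relevant_doc_ids": ["d12", "d40"], "answer": "30 days."}',
    '{"id": "q2", "relevant_doc_ids": ["d7"], "answer": "Enterprise only."}',
]
records = parse_eval_records(query_lines, golden_lines)
subset = sample_records(records, n=1, seed=0)
```

Failing loudly on a query without a golden record is a deliberate choice in this sketch: silently skipping such queries would shrink the evaluation set without anyone noticing.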

## Workflow

1. **Load** — Parse pipeline config, queries, and golden answers. Each query should have: question, relevant document IDs (for retrieval scoring), and golden answer (for generation scoring).
2. **Retrieval evaluation** — For each query, run retrieval and compare fetched documents against golden relevant docs. Compute per-query and aggregate metrics:
   - **Recall@K**: fraction of relevant docs retrieved in top K
   - **Precision@K**: fraction of top K that are relevant
   - **MRR**: reciprocal rank of first relevant doc
   - **NDCG@K**: normalized discounted cumulative gain
   - **Hit rate**: fraction of queries with at least one relevant doc in top K
3. **Context analysis** — Examine retrieved contexts for: relevance distribution (how many retrieved docs are actually useful), noise ratio (irrelevant docs that might confuse generation), context ordering (is the most relevant doc first).
4. **Generation evaluation** — For each query + retrieved context, run generation and score the output:
   - **Faithfulness**: Does the answer only use information from retrieved context? (No hallucination)
   - **Relevance**: Does the answer address the query?
   - **Completeness**: Does the answer cover all aspects of the golden answer?
   - **Conciseness**: Is the answer free of unnecessary information?
5. **Bottleneck identification** — Cross-reference retrieval and generation scores. Classify each failure as: retrieval failure (right answer not in context), generation failure (right context but wrong answer), or compound failure (both).
6. **Comparison** — If `--pipeline-b` is provided, run both pipelines and produce a side-by-side comparison with statistical significance tests.
7. **Report** — Produce the full evaluation report with per-stage metrics, bottleneck analysis, and improvement recommendations.
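
The retrieval metrics in step 2 can be sketched as below. This is a minimal illustration using binary relevance (a document is either in the golden set or not), not the skill's actual implementation; the function names are hypothetical. It assumes retrieved IDs are unique, that Precision@K divides by K even when fewer than K documents come back, and that the reciprocal rank is computed over whatever ranked list is passed in.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k slots filled by relevant docs."""
    rel = set(relevant)
    return sum(1 for doc in retrieved[:k] if doc in rel) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    rel = set(relevant)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in rel:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """NDCG with binary gains: DCG of the ranking over DCG of an ideal ranking."""
    rel = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in rel
    )
    ideal_hits = min(len(rel), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def hit_at_k(retrieved, relevant, k):
    """1.0 if at least one relevant doc is in the top k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0


def evaluate_retrieval(runs, k):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    if not runs:
        raise ValueError("no queries to evaluate")
    per_query = [
        {
            "recall": recall_at_k(ret, rel, k),
            "precision": precision_at_k(ret, rel, k),
            "rr": reciprocal_rank(ret, rel),
            "ndcg": ndcg_at_k(ret, rel, k),
            "hit": hit_at_k(ret, rel, k),
        }
        for ret, rel in runs
    ]
    # Aggregates are plain means; the mean of "rr" is MRR, the mean of "hit" is hit rate.
    aggregate = {
        name: sum(row[name] for row in per_query) / len(per_query)
        for name in per_query[0]
    }
    return per_query, aggregate


# Made-up example: two of three relevant docs are retrieved, at ranks 2 and 4.
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = ["d1", "d2", "d4"]
per_query, aggregate = evaluate_retrieval([(retrieved, relevant)], k=5)
```

For this single query, recall is 2/3, precision is 2/5, and the reciprocal rank is 1/2, because the first relevant document sits at rank 2.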

## Examples

### Full RAG evaluation
```
/eval-rag --pipeline rag/ --queries eval/queries.jsonl --golden eval/golden.jsonl --k 5

## RAG Evaluation Report

### Retrieval Metrics (K=5)
| Metric | Score |
|--------|-------|
| Recall@5 | 0.78 |
| Precision@5 | 0.41 |
| MRR | 0.72 |
| NDCG@5 | 0.68 |
| Hit Rate | 0.89 |

### Generation Metrics
| Metric | Score |
|--------|-------|
| Faithfulness | 0.91 |
| Relevance | 0.85 |
| Completeness | 0.67 |
| Conciseness | 0.88 |

### Bottleneck Analysis
| Failure Type | Count | % of Failures |
|-------------|-------|---------------|
| Retrieval failure | 31 | 58.5% |
| Generation failure | 14 | 26.4% |
| Compound failure | 8 | 15.1% |

### Recommendation
Primary bottleneck is retrieval (58.5% of failures). Focus on:
1. Improve chunking strategy — current chunks miss relevant context
2. Add hybrid search (keyword + semantic) — 12 queries failed on keyword-dependent lookups
3. Increase K to 10 for complex queries — recall jumps to 0.89 at K=10
```
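
The bottleneck table in the report above can be reproduced with a small sketch. It assumes each query carries one retrieval score and one generation score, and that a stage "fails" when its score falls below a cutoff. The thresholds, the `classify` and `failure_breakdown` names, and the number of passing queries are all hypothetical; only the failure counts 31, 14, and 8 come from the example.

```python
from collections import Counter

RETRIEVAL_THRESHOLD = 0.5   # hypothetical cutoff, tune per pipeline
GENERATION_THRESHOLD = 0.5  # hypothetical cutoff, tune per pipeline


def classify(retrieval_score, generation_score,
             r_min=RETRIEVAL_THRESHOLD, g_min=GENERATION_THRESHOLD):
    """Label one query as pass, retrieval, generation, or compound failure."""
    retrieval_bad = retrieval_score < r_min
    generation_bad = generation_score < g_min
    if retrieval_bad and generation_bad:
        return "compound"
    if retrieval_bad:
        return "retrieval"
    if generation_bad:
        return "generation"
    return "pass"


def failure_breakdown(labels):
    """Return {failure type: (count, percent of all failures)}."""
    counts = Counter(label for label in labels if label != "pass")
    total = sum(counts.values())
    return {
        kind: (counts[kind], round(100 * counts[kind] / total, 1) if total else 0.0)
        for kind in ("retrieval", "generation", "compound")
    }


# Failure counts from the example report; the 100 passing queries are arbitrary
# and do not affect the percentages, which are taken over failures only.
labels = ["retrieval"] * 31 + ["generation"] * 14 + ["compound"] * 8 + ["pass"] * 100
breakdown = failure_breakdown(labels)
```

With 53 failures in total, 31/53, 14/53, and 8/53 round to 58.5%, 26.4%, and 15.1%, matching the table.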

## Output

```markdown
## RAG Evaluation Report

### Pipeline: <path>
### Queries: N evaluated

### Retrieval Metrics
| Metric | Score | CI |
|--------|-------|----|

### Generation Metrics
| Metric | Score | CI |
|--------|-------|----|

### Bottleneck Analysis
| Type | Count | % |
|------|-------|----|

### Per-Query Breakdown (worst N)
| Query | Retrieval | Generation | Failure Type |
|-------|-----------|------------|--------------|

### Recommendations
1. ...

### Comparison (if --pipeline-b)
| Metric | Pipeline A | Pipeline B | Delta | Significant? |
|--------|------------|------------|-------|--------------|
```
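
The template includes a `CI` column and a `Significant?` column, but the method behind them is not stated. One common choice, shown here purely as an assumption, is a percentile bootstrap over per-query scores: resample queries with replacement to get an interval for a mean, and for A/B comparison bootstrap the paired per-query differences and call the delta significant when the interval excludes zero. The function names and scores below are illustrative.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-query scores."""
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    n = len(scores)
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def paired_bootstrap_delta(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Compare two pipelines scored on the same queries, in the same order."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both pipelines must be scored on the same queries")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    lower, upper = bootstrap_ci(diffs, n_resamples, alpha, seed)
    significant = lower > 0 or upper < 0  # interval excludes zero
    return mean(diffs), (lower, upper), significant


# Made-up per-query scores for one metric under pipelines A and B.
scores_a = [0.60, 0.70, 0.50, 0.80, 0.65, 0.55, 0.70, 0.60]
scores_b = [0.70, 0.85, 0.60, 0.85, 0.80, 0.60, 0.90, 0.70]

ci_a = bootstrap_ci(scores_a)
delta, delta_ci, significant = paired_bootstrap_delta(scores_a, scores_b)
```

Pairing matters in this sketch: both pipelines see the same queries, so resampling the per-query differences removes the variance that comes from some queries simply being harder than others.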

## Dependencies

- RAG pipeline (retrieval + generation components)
- Evaluation queries with golden answers and relevant doc IDs
- `/judge-prompt` — For custom faithfulness judges
- `/eval-audit` — Upstream pipeline health check
- `/error-analysis` — Deep-dive on failure patterns
- LLM access for generation evaluation scoring

Related Skills

eval-audit

from Miosa-osa/canopy

Audit an LLM evaluation pipeline for correctness, coverage, and reliability. 6 diagnostic areas with structured Check/Finding output. Produces prioritized findings by severity and recommends next skills to run. Catches common eval pitfalls before they corrupt your metrics. Triggers on: "eval audit", "audit evals", "evaluation audit", "check eval pipeline", "eval health"

validate-evaluator

from Miosa-osa/canopy

Calibrate LLM-as-Judge evaluators against human labels. Computes TPR, TNR, precision, recall, F1, and Cohen's kappa. Detects systematic biases and recommends prompt corrections. Produces a calibration report with confidence intervals. Triggers on: "validate evaluator", "calibrate judge", "judge accuracy", "evaluator validation", "judge metrics"