llm-evaluation-guide
Evaluate and benchmark large language models for research applications
Best use case
llm-evaluation-guide is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using llm-evaluation-guide should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/llm-evaluation-guide/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How llm-evaluation-guide Compares
| Feature / Agent | llm-evaluation-guide | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Evaluate and benchmark large language models for research applications
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# LLM Evaluation Guide
A skill for evaluating and benchmarking large language models (LLMs) in research settings. Covers automatic metrics, human evaluation protocols, benchmark suites, evaluation pitfalls, and best practices for reporting LLM performance.
## Evaluation Taxonomy
### Types of Evaluation
```
1. Intrinsic evaluation:
   Measures model quality on its own terms
   - Perplexity, likelihood, calibration
   - Useful for comparing architectures and training procedures

2. Extrinsic evaluation:
   Measures model quality on downstream tasks
   - Task-specific benchmarks (QA, summarization, classification)
   - Closer to real-world usefulness

3. Human evaluation:
   Human judges rate model outputs
   - Fluency, correctness, helpfulness, safety
   - Gold standard but expensive and slow
```
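The two intrinsic measures named above, perplexity and calibration, can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes you already have per-token natural-log probabilities and per-answer confidence scores from your model, and the function names `perplexity` and `expected_calibration_error` are hypothetical. The calibration measure shown is expected calibration error with equal-width bins, which is one common choice among several.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               n_bins: int = 10) -> float:
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / len(confidences) * abs(accuracy - avg_conf)
    return ece


# Hypothetical log-probabilities for a 4-token sequence
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = perplexity(logprobs)
```

A model that assigns uniform probability over four choices at every step has perplexity 4, which is a convenient sanity check for the first function.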
## Automatic Metrics
### Common Metrics by Task
| Task | Metric | Description |
|------|--------|-------------|
| Language modeling | Perplexity | Lower is better; measures prediction quality |
| Machine translation | BLEU, COMET | BLEU: n-gram overlap with reference; COMET: learned quality metric |
| Summarization | ROUGE-1/2/L | Recall-oriented overlap with reference: n-grams (ROUGE-1/2) and longest common subsequence (ROUGE-L) |
| Question answering | Exact Match, F1 | Exact string match; token-level overlap F1 against reference answer |
| Classification | Accuracy, F1 | Standard classification metrics |
| Generation quality | BERTScore | Semantic similarity via embeddings |
| Factuality | FActScore | Proportion of atomic facts supported by evidence |
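
The question-answering row can be made concrete with a short sketch of Exact Match and token-level F1. The normalization below (lowercasing, dropping punctuation and English articles) follows the convention commonly used for extractive QA scoring, but it is an assumption here; check what your benchmark's official scorer does before comparing numbers. The function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Exact Match is strict and penalizes any extra token, while F1 gives partial credit, which is why the two are usually reported together.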
### Computing Key Metrics
```python
from collections import Counter
import math


def compute_bleu(reference: list[list[str]], hypothesis: list[list[str]],
                 max_n: int = 4) -> float:
    """
    Compute corpus-level BLEU score (simplified: one reference per
    hypothesis, zero precisions floored at 1e-10 instead of smoothed).

    Args:
        reference: List of reference token sequences
        hypothesis: List of hypothesis token sequences
        max_n: Maximum n-gram order
    """
    precisions = []
    for n in range(1, max_n + 1):
        num = 0  # clipped n-gram matches, summed over the corpus
        den = 0  # hypothesis n-grams, summed over the corpus
        for ref_tokens, hyp_tokens in zip(reference, hypothesis):
            ref_ngrams = Counter(
                tuple(ref_tokens[i:i+n]) for i in range(len(ref_tokens) - n + 1)
            )
            hyp_ngrams = Counter(
                tuple(hyp_tokens[i:i+n]) for i in range(len(hyp_tokens) - n + 1)
            )
            # Clip each hypothesis n-gram count by its count in the reference
            clipped = {ng: min(c, ref_ngrams.get(ng, 0))
                       for ng, c in hyp_ngrams.items()}
            num += sum(clipped.values())
            den += sum(hyp_ngrams.values())
        # Guard against division by zero once, at the corpus level
        precisions.append(num / max(den, 1))

    # Brevity penalty
    ref_len = sum(len(r) for r in reference)
    hyp_len = sum(len(h) for h in hypothesis)
    bp = math.exp(1 - ref_len / max(hyp_len, 1)) if hyp_len < ref_len else 1.0

    # Geometric mean of precisions
    log_avg = sum(math.log(max(p, 1e-10)) for p in precisions) / max_n
    return bp * math.exp(log_avg)
```
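For summarization, the ROUGE-L entry in the metrics table rests on the longest common subsequence (LCS) between reference and hypothesis. The following is a minimal sentence-level sketch with hypothetical function names. It assumes pre-tokenized input and a single reference, and it omits the stemming and multi-sentence handling that full ROUGE implementations provide.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: list[str], hypothesis: list[str]) -> dict[str, float]:
    """LCS-based precision, recall, and F1 for one reference/hypothesis pair."""
    lcs = lcs_length(reference, hypothesis)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = lcs / len(hypothesis)
    recall = lcs / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_l("the cat sat on the mat".split(),
                 "the cat is on the mat".split())
```

Unlike BLEU's n-gram precision, the LCS does not require matched tokens to be contiguous, so it tolerates insertions while still rewarding correct word order.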
## Benchmark Suites
### Major LLM Benchmarks
```
General knowledge and reasoning:
  - MMLU (Massive Multitask Language Understanding): 57 subjects, multiple choice
  - HellaSwag: Commonsense sentence completion
  - ARC (AI2 Reasoning Challenge): Science questions
  - WinoGrande: Coreference resolution / commonsense

Coding:
  - HumanEval: Python function completion (pass@k)
  - MBPP: Mostly basic Python problems
  - SWE-bench: Real-world software engineering tasks

Math:
  - GSM8K: Grade school math word problems
  - MATH: Competition-level mathematics

Safety and alignment:
  - TruthfulQA: Resistance to common misconceptions
  - BBQ (Bias Benchmark for QA): Social bias in QA
  - RealToxicityPrompts: Tendency to generate toxic text

Instruction following:
  - MT-Bench: Multi-turn conversation quality (LLM-as-judge)
  - AlpacaEval: Instruction-following quality
  - Chatbot Arena: Elo-based human preference ranking
```
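The pass@k metric listed for HumanEval is usually computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the unit tests. A minimal sketch, with a hypothetical function name, assuming you have already counted n and c for each problem:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """
    Unbiased pass@k estimate for one problem.

    Args:
        n: Number of samples generated for the problem
        c: Number of those samples that passed all tests
        k: Sample budget being evaluated (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass
        return 1.0
    # Probability that k samples drawn without replacement all fail
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem (n, c) counts; the benchmark score is the mean
results = [(10, 3), (10, 0), (10, 10)]
score = sum(pass_at_k(n, c, k=5) for n, c in results) / len(results)
```

Estimating pass@k by simply generating k samples and checking whether any pass has high variance; drawing n > k samples and applying the formula gives a more stable number for the same problems.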
## Human Evaluation
### Designing a Human Evaluation Protocol
```python
def design_human_eval(task: str, n_annotators: int = 3,
                      n_examples: int = 200,
                      cost_per_rating: float = 0.50) -> dict:
    """
    Design a human evaluation protocol for LLM outputs.

    Args:
        task: The task being evaluated
        n_annotators: Number of independent annotators per example
        n_examples: Number of examples to evaluate
        cost_per_rating: Assumed cost in USD for one annotator to rate one
            example; a placeholder to replace with your actual rate
    """
    # Every example is rated independently by every annotator
    total_ratings = n_examples * n_annotators
    return {
        "task": task,
        "n_annotators": n_annotators,
        "n_examples": n_examples,
        "criteria": [
            {"name": "Fluency", "scale": "1-5",
             "description": "Is the text grammatically correct and natural?"},
            {"name": "Relevance", "scale": "1-5",
             "description": "Does the output address the input/question?"},
            {"name": "Correctness", "scale": "1-5",
             "description": "Is the factual content accurate?"},
            {"name": "Helpfulness", "scale": "1-5",
             "description": "Would a user find this response useful?"}
        ],
        "agreement_metric": "Krippendorff's alpha (ordinal)",
        "presentation": "Randomize model order; blind annotators to model identity",
        "calibration": "Have all annotators rate 20 shared examples first",
        "cost_estimate": f"~{total_ratings * cost_per_rating:.0f} USD "
                         f"at {cost_per_rating:.2f} USD per rating"
    }
```
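To check annotator agreement on the 1-5 ratings collected under such a protocol, Krippendorff's alpha can be computed directly from the raw scores. The sketch below is a simplified illustration with a hypothetical function name. Note the assumption: it uses the interval distance (squared difference between scores), not the ordinal distance the protocol names, because the interval form is much shorter to write down. For a reported result, prefer a tested library implementation of the ordinal variant.

```python
from itertools import permutations
from typing import Optional


def krippendorff_alpha_interval(ratings: list[list[Optional[float]]]) -> float:
    """
    Krippendorff's alpha with interval distance (a - b) ** 2.

    Args:
        ratings: ratings[u] holds every annotator's score for example u,
            with None where an annotator skipped that example
    """
    # Keep only examples rated by at least two annotators (pairable values)
    units = [[v for v in unit if v is not None] for unit in ratings]
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one example with two ratings")

    # Observed disagreement: ordered pairs within each example,
    # weighted by 1 / (number of ratings in that example - 1)
    d_observed = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: ordered pairs across all pairable ratings
    d_expected = sum(
        (a - b) ** 2 for a, b in permutations(values, 2)
    ) / (n * (n - 1))

    if d_expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_observed / d_expected


# Hypothetical ratings: three examples, two annotators each
alpha = krippendorff_alpha_interval([[1, 1], [2, 2], [1, 2]])
```

An alpha of 1 means perfect agreement and 0 means agreement no better than chance; low values after the calibration round usually indicate that the rating guidelines need tightening before the full annotation starts.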
## Evaluation Pitfalls
### Common Mistakes
```
1. Data contamination:
   Test data may appear in the LLM's training set.
   Mitigation: Use held-out datasets, check for contamination,
   create new test sets.

2. Metric gaming:
   High BLEU does not guarantee high quality; recall-oriented ROUGE
   can reward verbosity.
   Mitigation: Use multiple metrics and human evaluation.

3. Cherry-picking examples:
   Showing only best-case outputs misrepresents model capabilities.
   Mitigation: Report aggregate metrics over full test sets.

4. Ignoring variance:
   LLM outputs vary with temperature and random seeds.
   Mitigation: Report mean and standard deviation over multiple runs.

5. Unfair comparisons:
   Comparing models with different prompt formats or few-shot counts.
   Mitigation: Standardize prompts and report all hyperparameters.
```
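The mitigation for ignoring variance can be sketched as follows. The first function summarizes a metric over repeated runs; the second estimates a confidence interval for a mean score by resampling test examples with replacement (a percentile bootstrap). Both names, the resample count, and the example scores are illustrative assumptions, not part of any particular evaluation toolkit.

```python
import random
import statistics


def summarize_runs(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of one metric over several runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def bootstrap_ci(per_example_scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(per_example_scores)
    means = sorted(
        statistics.fmean(rng.choices(per_example_scores, k=n))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return lower, upper


# Hypothetical accuracy from three runs with different seeds
mean_acc, std_acc = summarize_runs([0.70, 0.72, 0.74])

# Hypothetical per-example correctness (1 = correct) on a 100-item test set
lower, upper = bootstrap_ci([1.0] * 70 + [0.0] * 30)
```

The standard deviation over runs captures sampling and seed variance of the model, while the bootstrap interval captures uncertainty from the finite test set; a small test set can produce a wide interval even when runs agree closely.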
## Reporting Standards
When publishing LLM evaluation results, report: model name and version, parameter count and architecture, evaluation dataset with version number, exact prompts used (include in appendix), number of few-shot examples, decoding parameters (temperature, top-p, max tokens), multiple metrics (not just one), confidence intervals or significance tests, and hardware and inference cost where relevant.