Ground Truth Management

Comprehensive guide to creating, managing, and maintaining ground truth datasets for AI evaluation including annotation, quality control, and versioning

16 stars

Best use case

Ground Truth Management is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using Ground Truth Management should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/ground-truth-management-majiayu000/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/ground-truth-management-majiayu000/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/ground-truth-management-majiayu000/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How Ground Truth Management Compares

| Feature / Agent | Ground Truth Management | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Comprehensive guide to creating, managing, and maintaining ground truth datasets for AI evaluation including annotation, quality control, and versioning

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Ground Truth Management

## What is Ground Truth?

**Definition:** Correct answers for evaluation - human-verified data that serves as the gold standard for measuring AI performance.

### Example
```
Question: "What is the capital of France?"
Ground Truth: "Paris"

AI Answer: "Paris" → Correct ✓
AI Answer: "Lyon" → Incorrect ✗
```

---

## Why Ground Truth Matters

### Measure Accuracy Objectively
```
Without ground truth: "This answer seems good" (subjective)
With ground truth: "Accuracy: 85%" (objective)
```

### Train and Validate Models
```
Training: Learn from ground truth examples
Validation: Measure performance on ground truth test set
```

### Regression Testing
```
Before change: Accuracy 90%
After change: Accuracy 85%
→ Regression detected!
```
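
The comparison above can be automated as a simple gate. This is a minimal sketch, assuming accuracy is computed elsewhere; the function name `detect_regression` and the one-percentage-point `tolerance` are illustrative assumptions, not part of the original skill.

```python
def detect_regression(baseline_accuracy, new_accuracy, tolerance=0.01):
    """Return True when accuracy dropped by more than `tolerance` (an assumed threshold)."""
    return (baseline_accuracy - new_accuracy) > tolerance


# Values from the example above: 90% before the change, 85% after
if detect_regression(0.90, 0.85):
    print("Regression detected!")
```

A small tolerance avoids flagging run-to-run noise as a regression; the right value depends on the size of the evaluation set.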

### Benchmarking
```
Model A: 90% accuracy on ground truth
Model B: 85% accuracy on ground truth
→ Model A is better
```

---

## Types of Ground Truth

### Exact Match: Single Correct Answer
```json
{
  "question": "What is 2+2?",
  "answer": "4"
}
```

### Multiple Acceptable Answers
```json
{
  "question": "What is the capital of France?",
  "acceptable_answers": ["Paris", "paris", "PARIS", "The capital is Paris"]
}
```
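
Checking a prediction against a list of acceptable answers might look like the sketch below. The helper names `normalize` and `is_acceptable` are hypothetical, and the normalization (lowercasing and whitespace collapsing) is an assumption about what counts as "the same" answer.

```python
def normalize(text):
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(text.lower().split())


def is_acceptable(prediction, acceptable_answers):
    """True if the normalized prediction equals any normalized acceptable answer."""
    accepted = {normalize(answer) for answer in acceptable_answers}
    return normalize(prediction) in accepted


example = {
    "question": "What is the capital of France?",
    "acceptable_answers": ["Paris", "The capital is Paris"],
}
print(is_acceptable("  PARIS ", example["acceptable_answers"]))  # True
print(is_acceptable("Lyon", example["acceptable_answers"]))      # False
```

With normalization applied at comparison time, case variants such as "paris" and "PARIS" no longer need to be listed separately.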

### Rubric-Based: Quality Scale
```json
{
  "question": "Summarize this article",
  "rubric": {
    "1": "Poor summary, missing key points",
    "3": "Adequate summary, covers main points",
    "5": "Excellent summary, concise and comprehensive"
  }
}
```

### Human Preference: Comparison Rankings
```json
{
  "question": "Which answer is better?",
  "answer_a": "Paris is the capital of France.",
  "answer_b": "The capital of France is Paris, a city of 2.1 million people.",
  "preference": "B",
  "reasoning": "More informative"
}
```

---

## Creating Ground Truth

### Manual Annotation (Humans Label)
```
Process:
1. Collect examples (questions, documents, images)
2. Human annotators label each
3. Quality control (review annotations)
4. Store in dataset
```

### Expert Review (For Specialized Domains)
```
Medical: Doctors annotate
Legal: Lawyers annotate
Technical: Engineers annotate

Higher quality but more expensive
```

### Crowdsourcing (Amazon MTurk)
```
Pros:
- Fast (many workers)
- Cheap ($0.10-1.00 per annotation)

Cons:
- Variable quality
- Need quality control
```

### Synthetic Generation (For Some Tasks)
```
LLM-generated questions + answers
Careful validation needed
Good for scale, risky for quality
Use for augmentation, not sole source
```

---

## Ground Truth Dataset Structure

### Input (Question, Document, Image)
```json
{
  "input": {
    "type": "question",
    "text": "What is the capital of France?"
  }
}
```

### Expected Output (Answer, Label, Summary)
```json
{
  "expected_output": {
    "type": "answer",
    "text": "Paris",
    "acceptable_variants": ["paris", "PARIS"]
  }
}
```

### Metadata (Difficulty, Category, Source)
```json
{
  "metadata": {
    "difficulty": "easy",
    "category": "geography",
    "source": "wikipedia",
    "language": "en"
  }
}
```

### Annotation Info (Who, When, Confidence)
```json
{
  "annotation": {
    "annotator_id": "annotator_123",
    "timestamp": "2024-01-15T10:00:00Z",
    "confidence": 0.95,
    "time_spent_seconds": 30
  }
}
```

**Complete Example:**
```json
{
  "id": "example_001",
  "input": {
    "type": "question",
    "text": "What is the capital of France?"
  },
  "expected_output": {
    "type": "answer",
    "text": "Paris",
    "acceptable_variants": ["paris", "PARIS", "The capital is Paris"]
  },
  "metadata": {
    "difficulty": "easy",
    "category": "geography",
    "source": "wikipedia",
    "language": "en"
  },
  "annotation": {
    "annotator_id": "annotator_123",
    "timestamp": "2024-01-15T10:00:00Z",
    "confidence": 0.95
  }
}
```
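
Records with this shape can be checked before they enter the dataset. The sketch below assumes the five top-level fields from the complete example are all required and that `confidence` is a value between 0 and 1; `REQUIRED_FIELDS` and `validate_record` are hypothetical names.

```python
REQUIRED_FIELDS = {
    "id": str,
    "input": dict,
    "expected_output": dict,
    "metadata": dict,
    "annotation": dict,
}


def validate_record(record):
    """Return a list of problems found in one ground truth record (empty if valid)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    # Confidence, when present, is assumed to lie in [0, 1]
    annotation = record.get("annotation")
    if isinstance(annotation, dict):
        confidence = annotation.get("confidence")
        if confidence is not None and not 0.0 <= confidence <= 1.0:
            problems.append("confidence out of range")
    return problems


record = {
    "id": "example_001",
    "input": {"type": "question", "text": "What is the capital of France?"},
    "expected_output": {"type": "answer", "text": "Paris"},
    "metadata": {"difficulty": "easy", "category": "geography"},
    "annotation": {"annotator_id": "annotator_123", "confidence": 0.95},
}
print(validate_record(record))  # []
```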

---

## Annotation Guidelines

### Clear Instructions
```markdown
# Annotation Guidelines

## Task
Label whether the answer is correct.

## Instructions
1. Read the question carefully
2. Read the answer
3. Determine if answer is factually correct
4. Mark as "Correct" or "Incorrect"

## Examples
Question: "What is 2+2?"
Answer: "4"
Label: Correct

Question: "What is 2+2?"
Answer: "5"
Label: Incorrect
```

### Examples (Good and Bad)
```markdown
## Good Example
Question: "What is the capital of France?"
Answer: "Paris"
Label: Correct
Reasoning: Factually accurate and directly answers question

## Bad Example
Question: "What is the capital of France?"
Answer: "France is a country in Europe"
Label: Incorrect
Reasoning: Doesn't answer the question
```

### Edge Case Handling
```markdown
## Edge Cases

### Partially Correct
Question: "What are the capitals of France and Germany?"
Answer: "Paris"
Label: Partially Correct (missing Germany)

### Ambiguous Question
Question: "What is the best programming language?"
Label: N/A - Subjective question, no single correct answer

### No Answer in Context
Question: "What is the population of Paris?"
Context: "Paris is the capital of France."
Label: "Cannot be determined from context"
```

### Consistency Checks
```markdown
## Consistency Rules

1. Same question → Same answer
2. Synonyms are acceptable ("car" = "automobile")
3. Case-insensitive ("Paris" = "paris")
4. Extra details are OK ("Paris" vs "Paris, France")
```
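
Rules 1 and 3 can be checked mechanically. The sketch below assumes flat records with `question` and `answer` fields and treats answers as equal after lowercasing and whitespace collapsing; it does not handle synonyms (rule 2) or extra details (rule 4), which still need human judgment. The function name is hypothetical.

```python
from collections import defaultdict


def find_inconsistent_questions(records):
    """Group records by normalized question and report those with conflicting answers."""
    answers_by_question = defaultdict(set)
    for record in records:
        question = " ".join(record["question"].lower().split())
        answer = " ".join(record["answer"].lower().split())
        answers_by_question[question].add(answer)
    return {q: sorted(a) for q, a in answers_by_question.items() if len(a) > 1}


records = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "what is the capital of France?", "answer": "paris"},
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is 2+2?", "answer": "5"},
]
print(find_inconsistent_questions(records))
# {'what is 2+2?': ['4', '5']}
```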

---

## Quality Control

### Multiple Annotators Per Example
```
Each example labeled by 3 annotators
Majority vote determines final label
Catches individual annotator errors
```

### Inter-Annotator Agreement (IAA)
```
Measure: Do annotators agree?
Metric: Cohen's Kappa (κ) for two annotators; Fleiss' κ for three or more
Target: κ > 0.7 (good agreement)
```

### Gold Standard Subset (Known Answers)
```
10% of examples have known correct labels
Mix into annotation tasks
Measure annotator accuracy on gold standard
Remove low-quality annotators
```
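
Scoring annotators against the gold subset could be sketched as follows. The tuple layout `(annotator_id, example_id, label)`, the function names, and the 0.8 accuracy cutoff are all assumptions chosen for illustration.

```python
def annotator_gold_accuracy(annotations, gold_labels):
    """Compute each annotator's accuracy on gold items.

    annotations: list of (annotator_id, example_id, label) tuples
    gold_labels: dict mapping example_id -> known correct label
    """
    correct, total = {}, {}
    for annotator_id, example_id, label in annotations:
        if example_id not in gold_labels:
            continue  # Not a gold item, so it cannot be scored
        total[annotator_id] = total.get(annotator_id, 0) + 1
        if label == gold_labels[example_id]:
            correct[annotator_id] = correct.get(annotator_id, 0) + 1
    return {a: correct.get(a, 0) / n for a, n in total.items()}


def low_quality_annotators(accuracies, min_accuracy=0.8):
    """List annotators below an assumed accuracy threshold."""
    return sorted(a for a, acc in accuracies.items() if acc < min_accuracy)


gold = {"g1": 1, "g2": 0}
annotations = [
    ("ann_a", "g1", 1), ("ann_a", "g2", 0), ("ann_a", "x9", 1),
    ("ann_b", "g1", 0), ("ann_b", "g2", 0),
]
accuracies = annotator_gold_accuracy(annotations, gold)
print(accuracies)                          # {'ann_a': 1.0, 'ann_b': 0.5}
print(low_quality_annotators(accuracies))  # ['ann_b']
```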

### Spot Checks by Experts
```
Expert reviews 10% of annotations
Validates quality
Identifies systematic errors
```

---

## Inter-Annotator Agreement

### Kappa Score (Cohen's κ)
```python
from sklearn.metrics import cohen_kappa_score

annotator1 = [1, 0, 1, 1, 0]  # Labels from annotator 1
annotator2 = [1, 0, 1, 0, 0]  # Labels from annotator 2

kappa = cohen_kappa_score(annotator1, annotator2)
print(f"Kappa: {kappa:.2f}")

# Interpretation:
# κ < 0.4: Poor agreement
# κ 0.4-0.6: Moderate agreement
# κ 0.6-0.8: Good agreement
# κ > 0.8: Excellent agreement
```
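
To make the statistic less of a black box, here is a from-scratch sketch of the same calculation, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name `cohen_kappa` is hypothetical, and returning 1.0 when both annotators use a single identical label throughout is a choice made for this sketch.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of examples")
    n = len(labels_a)
    # Observed agreement: fraction of examples with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # Both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


kappa = cohen_kappa([1, 0, 1, 1, 0], [1, 0, 1, 0, 0])
print(f"Kappa: {kappa:.2f}")  # Kappa: 0.62
```

For the lists above, p_o = 4/5 = 0.80 and p_e = (3·2 + 2·3)/25 = 0.48, so κ = 0.32/0.52 ≈ 0.62. The annotators agree on 80% of examples, but nearly half of that agreement would be expected by chance.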

### Fleiss' κ (Multiple Annotators)
```python
from statsmodels.stats.inter_rater import fleiss_kappa

# 3 annotators, 5 examples
# Each row: [count_label_0, count_label_1]
data = [
    [0, 3],  # Example 1: All 3 annotators chose label 1
    [1, 2],  # Example 2: 1 chose 0, 2 chose 1
    [3, 0],  # Example 3: All 3 chose label 0
    [2, 1],  # Example 4: 2 chose 0, 1 chose 1
    [0, 3],  # Example 5: All 3 chose label 1
]

kappa = fleiss_kappa(data)
print(f"Fleiss' Kappa: {kappa:.2f}")
```
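
Annotation tools usually export one label per annotator rather than per-category counts. The sketch below converts raw labels into the count table used above; `to_count_table` is a hypothetical helper, and statsmodels also offers `aggregate_raters` in the same module for this purpose.

```python
def to_count_table(raw_labels, categories):
    """Convert per-example annotator labels into per-category counts.

    raw_labels: list of lists, one inner list of labels per example
    categories: ordered list of all possible labels
    """
    return [[labels.count(category) for category in categories] for labels in raw_labels]


raw = [
    [1, 1, 1],  # Example 1
    [0, 1, 1],  # Example 2
    [0, 0, 0],  # Example 3
    [0, 0, 1],  # Example 4
    [1, 1, 1],  # Example 5
]
print(to_count_table(raw, categories=[0, 1]))
# [[0, 3], [1, 2], [3, 0], [2, 1], [0, 3]]
```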

### Percentage Agreement
```python
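def percentage_agreement(annotator1, annotator2):
    """Fraction of examples on which two annotators gave the same label."""
    if len(annotator1) != len(annotator2) or not annotator1:
        raise ValueError("Both annotators must label the same non-empty set of examples")
    agreements = sum(a == b for a, b in zip(annotator1, annotator2))
    return agreements / len(annotator1)

annotator1 = [1, 0, 1, 1, 0]  # Labels from annotator 1
annotator2 = [1, 0, 1, 0, 0]  # Labels from annotator 2
agreement = percentage_agreement(annotator1, annotator2)
print(f"Agreement: {agreement:.1%}")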
```

### Target: >0.7 (Good Agreement)
```
If κ < 0.7:
1. Review annotation guidelines (unclear?)
2. Provide more examples
3. Train annotators
4. Simplify task
```

---

## Resolving Disagreements

### Majority Vote
```python
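from collections import Counter

def majority_vote(labels):
    """Return the label chosen by more than half of the annotators, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

# 3 annotators
labels = [1, 1, 0]  # Two say 1, one says 0
final_label = majority_vote(labels)  # 1
# No majority (e.g., [1, 0, 2]) returns None: send the example to expert adjudication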
```

### Expert Adjudication
```
If no majority (e.g., 1, 0, 2):
→ Expert reviews and decides
```

### Discussion and Consensus
```
Annotators discuss disagreement
Reach consensus
Update guidelines if needed
```

### Update Guidelines
```
If systematic disagreements:
→ Guidelines unclear
→ Update and re-annotate
```

---

## Ground Truth for Different Tasks

### Classification: Category Labels
```json
{
  "text": "This product is amazing!",
  "label": "positive"
}
```

### Q&A: Correct Answers + Acceptable Variants
```json
{
  "question": "What is the capital of France?",
  "answer": "Paris",
  "acceptable_variants": ["paris", "PARIS", "The capital is Paris"]
}
```

### Summarization: Reference Summaries
```json
{
  "document": "Long article text...",
  "reference_summary": "Concise summary of key points"
}
```

### RAG: Question + Context + Answer
```json
{
  "question": "What is the capital of France?",
  "context": "Paris is the capital and largest city of France.",
  "answer": "Paris",
  "relevant_chunks": ["Paris is the capital and largest city of France."]
}
```

### Generation: Multiple Acceptable Outputs
```json
{
  "prompt": "Write a haiku about spring",
  "acceptable_outputs": [
    "Cherry blossoms bloom\nGentle breeze carries petals\nSpring has arrived now",
    "Flowers start to bloom\nBirds sing in the morning light\nSpring is here at last"
  ]
}
```

---

## Dataset Size

### Evaluation Set: 100-1000 Examples (Representative)
```
Purpose: Quick evaluation during development
Size: 100-1000 examples
Quality: High (manually curated)
Coverage: Representative of production
```

### Test Set: 500-5000 Examples (Comprehensive)
```
Purpose: Final evaluation before deployment
Size: 500-5000 examples
Quality: High (gold standard)
Coverage: Comprehensive (all categories, edge cases)
```

### Quality > Quantity
```
Better: 100 high-quality examples
Worse: 1000 low-quality examples
```

### Cover Edge Cases
```
Include:
- Common cases (80%)
- Edge cases (15%)
- Adversarial cases (5%)
```

---

## Dataset Maintenance

### Version Control (Like Code)
```bash
# Git for dataset versioning
git init
git add dataset.jsonl
git commit -m "Initial dataset v1.0"

# Tag versions
git tag v1.0

# Update dataset
git add dataset.jsonl
git commit -m "Added 100 new examples"
git tag v1.1
```

### Regular Updates (New Examples)
```
Monthly: Add 50-100 new examples
Quarterly: Major update (500+ examples)
```

### Remove Outdated Examples
```
Examples that are:
- No longer relevant
- Incorrect (facts changed)
- Duplicates
```
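
Duplicate removal is the easiest of these three to automate. A minimal sketch, assuming flat records with a `question` field and treating questions as duplicates when they match after lowercasing and whitespace collapsing; `remove_duplicates` is a hypothetical name.

```python
def remove_duplicates(records):
    """Keep the first record for each normalized question; drop later repeats."""
    seen = set()
    unique = []
    for record in records:
        key = " ".join(record["question"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique


records = [
    {"id": "1", "question": "Capital of France?", "answer": "Paris"},
    {"id": "2", "question": "capital of  france?", "answer": "Paris"},
    {"id": "3", "question": "What is 2+2?", "answer": "4"},
]
print([r["id"] for r in remove_duplicates(records)])  # ['1', '3']
```

Outdated or incorrect examples cannot be detected this way and still need periodic human review.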

### Track Changes (Changelog)
```markdown
# Dataset Changelog

## v1.2 (2024-02-01)
- Added 100 new examples (geography category)
- Removed 20 outdated examples
- Fixed 5 incorrect labels

## v1.1 (2024-01-01)
- Added 50 new examples (science category)
- Updated annotation guidelines

## v1.0 (2023-12-01)
- Initial release (500 examples)
```

---

## Stratified Sampling

### Balance by Difficulty
```
Easy: 40%
Medium: 40%
Hard: 20%
```
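
Drawing a sample that matches these proportions could be sketched as below. It assumes each record carries the stratification field under `metadata`, as in the dataset structure shown earlier; `stratified_sample` and its parameters are hypothetical names.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, proportions, total, seed=0):
    """Sample about `total` records so each stratum matches its target proportion.

    key: metadata field to stratify on (e.g. "difficulty")
    proportions: dict mapping stratum value -> target fraction
    """
    rng = random.Random(seed)  # Fixed seed keeps the sample reproducible
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[record["metadata"][key]].append(record)

    sample = []
    for stratum, fraction in proportions.items():
        wanted = round(total * fraction)
        available = by_stratum.get(stratum, [])
        # Take everything if the stratum has too few records
        sample.extend(rng.sample(available, min(wanted, len(available))))
    return sample


difficulties = ["easy"] * 50 + ["medium"] * 30 + ["hard"] * 20
pool = [{"id": i, "metadata": {"difficulty": d}} for i, d in enumerate(difficulties)]
sample = stratified_sample(pool, "difficulty", {"easy": 0.4, "medium": 0.4, "hard": 0.2}, total=10)
print(dict(Counter(r["metadata"]["difficulty"] for r in sample)))
# {'easy': 4, 'medium': 4, 'hard': 2}
```

The same function works for category balance by passing `key="category"` with the category proportions.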

### Balance by Category
```
Geography: 25%
Science: 25%
History: 25%
Math: 25%
```

### Include Edge Cases
```
Common cases: 80%
Edge cases: 15%
Adversarial: 5%
```

### Representative of Production
```
Sample from actual production queries
Ensures dataset matches real usage
```

---

## Synthetic Ground Truth

### LLM-Generated Questions + Answers
```python
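import re

def parse_qa_pairs(response):
    """Parse "Q1: ..." / "A1: ..." lines into (question, answer) tuples."""
    questions = re.findall(r"^\s*Q\d+:\s*(.+)$", response, flags=re.MULTILINE)
    answers = re.findall(r"^\s*A\d+:\s*(.+)$", response, flags=re.MULTILINE)
    return list(zip(questions, answers))

def generate_synthetic_qa(document, llm):
    # `llm` is a placeholder: any client object exposing generate(prompt) -> str
    prompt = f"""
    Document: {document}

    Generate 5 question-answer pairs based on this document.

    Format:
    Q1: [question]
    A1: [answer]
    ...
    """

    response = llm.generate(prompt)
    qa_pairs = parse_qa_pairs(response)
    return qa_pairs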
```

### Careful Validation Needed
```
LLM-generated data can have:
- Hallucinations
- Incorrect facts
- Biased questions

→ Always validate with humans
```

### Good for Scale, Risky for Quality
```
Pros: Can generate 1000s quickly
Cons: Quality varies, needs validation
```

### Use for Augmentation, Not Sole Source
```
Strategy:
- 80% human-annotated (high quality)
- 20% synthetic (validated)
```
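
The mix can be monitored with a small check. This sketch assumes synthetic records are marked with `"source": "synthetic"` in their metadata, which is a labeling convention introduced here rather than one defined by the skill; `synthetic_share` is a hypothetical name.

```python
def synthetic_share(records):
    """Fraction of records whose metadata marks them as synthetic."""
    if not records:
        return 0.0
    synthetic = sum(1 for r in records if r.get("metadata", {}).get("source") == "synthetic")
    return synthetic / len(records)


dataset = (
    [{"id": f"h{i}", "metadata": {"source": "human"}} for i in range(8)]
    + [{"id": f"s{i}", "metadata": {"source": "synthetic"}} for i in range(2)]
)
share = synthetic_share(dataset)
print(f"Synthetic share: {share:.0%}")  # Synthetic share: 20%
```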

---

## Domain-Specific Ground Truth

### Medical: Expert Annotations
```
Annotators: Licensed doctors
Cost: $50-100 per hour
Quality: Very high
Use case: Medical diagnosis, treatment recommendations
```

### Legal: Lawyer Review
```
Annotators: Licensed lawyers
Cost: $100-300 per hour
Quality: Very high
Use case: Legal document analysis, case law
```

### Technical: Engineer Verification
```
Annotators: Senior engineers
Cost: $50-150 per hour
Quality: High
Use case: Code review, technical Q&A
```

---

## Ground Truth Storage

### JSON/JSONL Files
```jsonl
{"id": "1", "question": "What is 2+2?", "answer": "4"}
{"id": "2", "question": "Capital of France?", "answer": "Paris"}
```
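
Reading and writing this format needs only the standard library. A minimal sketch; the helper names `write_jsonl` and `read_jsonl` are hypothetical.

```python
import json


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

One record per line keeps Git diffs readable, since adding or fixing an example changes exactly one line.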

### Database (PostgreSQL, MongoDB)
```sql
CREATE TABLE ground_truth (
  id UUID PRIMARY KEY,
  question TEXT NOT NULL,
  answer TEXT NOT NULL,
  category VARCHAR(50),
  difficulty VARCHAR(20),
  created_at TIMESTAMP DEFAULT NOW()
);
```

### Version Control (Git)
```bash
git add dataset/
git commit -m "Update ground truth dataset"
git push
```

### Cloud Storage (S3 + Versioning)
```bash
# Enable bucket versioning first, then upload so the object is versioned
aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled
aws s3 cp dataset.jsonl s3://my-bucket/ground-truth/v1.0/dataset.jsonl
```

---

## Ground Truth for RAG

**Structure:**
```json
{
  "question": "What is the capital of France?",
  "expected_answer": "Paris",
  "relevant_document_chunks": [
    "Paris is the capital and largest city of France."
  ],
  "evaluation_criteria": {
    "faithfulness": "Answer must be grounded in context",
    "relevance": "Answer must directly address question",
    "completeness": "Answer should mention Paris"
  }
}
```
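
The `relevant_document_chunks` field allows retrieval to be scored separately from generation. The sketch below computes recall over retrieved chunks, assuming chunks are compared by exact string equality; real pipelines often compare chunk IDs instead. `chunk_recall` is a hypothetical name.

```python
def chunk_recall(retrieved_chunks, relevant_chunks):
    """Fraction of ground truth relevant chunks that appear in the retrieved set."""
    if not relevant_chunks:
        return 0.0
    retrieved = set(retrieved_chunks)
    found = sum(1 for chunk in relevant_chunks if chunk in retrieved)
    return found / len(relevant_chunks)


example = {
    "question": "What is the capital of France?",
    "relevant_document_chunks": ["Paris is the capital and largest city of France."],
}
retrieved = [
    "Lyon is a city in France.",
    "Paris is the capital and largest city of France.",
]
print(chunk_recall(retrieved, example["relevant_document_chunks"]))  # 1.0
```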

---

## Evaluation with Ground Truth

### Exact Match Accuracy
```python
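def exact_match(predicted, ground_truth):
    """Case- and whitespace-insensitive string equality."""
    return predicted.strip().lower() == ground_truth.strip().lower()

# Example predictions paired with their ground truth answers
predictions = ["Paris", " paris ", "Lyon"]
references = ["Paris", "Paris", "Paris"]
accuracy = sum(exact_match(p, gt) for p, gt in zip(predictions, references)) / len(predictions)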
```

### F1 Score (For Overlapping Spans)
```python
from collections import Counter

def f1_score(predicted, ground_truth):
    """Token-level F1 between a predicted and a ground truth string."""
    pred_tokens = predicted.lower().split()
    gt_tokens = ground_truth.lower().split()
    if not pred_tokens or not gt_tokens:
        return 0.0

    # Multiset intersection counts repeated tokens correctly
    common = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if common == 0:
        return 0.0

    precision = common / len(pred_tokens)
    recall = common / len(gt_tokens)
    return 2 * (precision * recall) / (precision + recall)
```

### BLEU/ROUGE (For Generation)
```python
from nltk.translate.bleu_score import sentence_bleu

reference = [["Paris", "is", "the", "capital"]]
candidate = ["Paris", "is", "the", "capital"]

bleu = sentence_bleu(reference, candidate)
```
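
The block above covers BLEU only. For ROUGE, a simplified ROUGE-L F1 can be sketched with the standard library: it scores the longest common subsequence (LCS) of tokens shared by the reference and the candidate. This is an illustrative approximation using whitespace tokenization and no stemming, not a replacement for a maintained ROUGE package; `lcs_length` and `rouge_l` are hypothetical names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(reference, candidate):
    """ROUGE-L F1 on lowercased whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l("Paris is the capital of France", "The capital of France is Paris")
print(f"ROUGE-L F1: {score:.2f}")  # ROUGE-L F1: 0.67
```

Here the LCS is "the capital of france" (4 tokens) against 6 tokens on each side, so precision and recall are both 4/6. Unlike BLEU's n-gram matching, the LCS rewards in-order overlap without requiring the tokens to be adjacent.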

### Semantic Similarity (Embedding Distance)
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

emb1 = model.encode("Paris is the capital of France")
emb2 = model.encode("The capital of France is Paris")

similarity = cosine_similarity([emb1], [emb2])[0][0]
```

---

## Continuous Ground Truth

### Production Feedback (User Thumbs Up/Down)
```python
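# Hypothetical in-memory store; replace with your dataset backend
ground_truth_candidates = []

def add_to_ground_truth(question, answer):
    """Record a question/answer pair as a new ground truth example."""
    ground_truth_candidates.append({"question": question, "answer": answer})

# Log user feedback
feedback = {
    "question": "What is the capital of France?",
    "answer": "Paris",
    "user_feedback": "thumbs_up",
    "timestamp": "2024-01-15T10:00:00Z"
}

# Add to ground truth if positive
if feedback["user_feedback"] == "thumbs_up":
    add_to_ground_truth(feedback["question"], feedback["answer"])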
```

### Human Review of Flagged Outputs
```
User flags answer as incorrect
→ Human reviews
→ If incorrect, add correct answer to ground truth
→ If correct, keep as is
```

### Incrementally Add to Dataset
```
Monthly: Review 100 flagged examples
Add 50 to ground truth
Update dataset version
```

---

## Tools

### Annotation: Label Studio, Prodigy, CVAT

**Label Studio:**
```bash
pip install label-studio
label-studio start
# Open http://localhost:8080
```

**Prodigy:**
```bash
# Prodigy is commercial software; installing it requires a license key (see the Prodigy docs)
pip install prodigy
prodigy textcat.manual dataset_name source.jsonl --label positive,negative
```

### Management: DVC (Data Version Control)
```bash
pip install dvc
dvc init
dvc add dataset.jsonl
git add dataset.jsonl.dvc .gitignore
git commit -m "Add dataset"
dvc push  # Requires a remote configured beforehand with `dvc remote add`
```

### Storage: S3, GCS, Local Files

See "Ground Truth Storage" section

---

## Summary

**Ground Truth:** Correct answers for evaluation

**Why:**
- Measure accuracy objectively
- Train/validate models
- Regression testing
- Benchmarking

**Types:**
- Exact match
- Multiple acceptable answers
- Rubric-based
- Human preference

**Creating:**
- Manual annotation
- Expert review
- Crowdsourcing
- Synthetic (with validation)

**Quality Control:**
- Multiple annotators
- Inter-annotator agreement (κ > 0.7)
- Gold standard subset
- Expert spot checks

**Dataset Size:**
- Eval: 100-1000 (representative)
- Test: 500-5000 (comprehensive)
- Quality > quantity

**Maintenance:**
- Version control (Git)
- Regular updates
- Remove outdated
- Changelog

**Tools:**
- Annotation: Label Studio, Prodigy
- Management: DVC
- Storage: S3, GCS, Git

Related Skills

data-management

16
from diegosouzapw/awesome-omni-skill

Comprehensive DataFrame loading, filtering, transformation, and data pipeline management from Excel, CSV, and multiple sources with YAML-driven configuration.

composer-dependency-management

16
from diegosouzapw/awesome-omni-skill

Rules pertaining to Composer dependency management, promoting best practices for declaring and updating dependencies.

claude-config-management

16
from diegosouzapw/awesome-omni-skill

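Configuration management guide for Claude Code settings (repository root). Covers managing settings through file-level symlinks, adding and removing managed targets, and running Taskfile tasks. Use when changing the Claude Code configuration, for example "add a config file", "add a new skill", "check the symlink status", or "change the Claude settings".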

ck:project-management

16
from diegosouzapw/awesome-omni-skill

Track progress, update plan statuses, manage Claude Tasks, generate reports, coordinate docs updates. Use for project oversight, status checks, plan completion, task hydration, cross-session continuity.

agentpmt-tool-file-management-d789ed

16
from diegosouzapw/awesome-omni-skill

Use AgentPMT external API to run the File Management tool with wallet signatures, credits purchase, or credits earned from jobs.

advanced-file-management

16
from diegosouzapw/awesome-omni-skill

Advanced file management tools. Includes batch folder creation, batch file moving, file listing, and HTML author extraction.

1k-state-management

16
from diegosouzapw/awesome-omni-skill

Jotai state management patterns for OneKey. Use when working with atoms, global state, feature state, or context atoms. Triggers on jotai, atom, state, globalAtom, contextAtom, store, persistence, settings.

ads-management

16
from diegosouzapw/awesome-omni-skill

Activate for paid advertising campaigns on Google Ads, Meta Ads, LinkedIn Ads, TikTok Ads. Includes ad copywriting, audience targeting, budget optimization, A/B testing, and ROAS tracking. Used by ads-specialist and campaign-manager agents.

kanban-management

16
from diegosouzapw/awesome-omni-skill

Manages the Anubis Issue Tracker GitHub project board. Use when you need to organize issues by difficulty/status, move issues through workflow stages, or generate board status reports.

GroundEffect

16
from diegosouzapw/awesome-omni-skill

Use this skill when the user asks about email, calendar, or Gmail/Google Calendar management via GroundEffect CLI. Triggers include "search my email", "list recent emails", "check my calendar", "what's on my calendar", "show my meetings", "calendar tomorrow", "calendar next week", "create a calendar event", "manage groundeffect accounts", "sync status", "start the daemon", "groundeffect", "groundeffect command", "draft email", "save draft", "save as draft", "create draft", "send email".

github-release-management

16
from diegosouzapw/awesome-omni-skill

Comprehensive GitHub release orchestration with AI swarm coordination for automated versioning, testing, deployment, and rollback management

amia-github-thread-management

16
from diegosouzapw/awesome-omni-skill

Use when managing PR review threads. Reply does NOT auto-resolve threads. Trigger with /manage-threads.