Ground Truth Management
Comprehensive guide to creating, managing, and maintaining ground truth datasets for AI evaluation including annotation, quality control, and versioning
Best use case
Ground Truth Management is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Comprehensive guide to creating, managing, and maintaining ground truth datasets for AI evaluation including annotation, quality control, and versioning
Teams using Ground Truth Management should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/ground-truth-management-majiayu000/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How Ground Truth Management Compares
| Feature / Agent | Ground Truth Management | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Comprehensive guide to creating, managing, and maintaining ground truth datasets for AI evaluation including annotation, quality control, and versioning
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Ground Truth Management
## What is Ground Truth?
**Definition:** Correct answers for evaluation - human-verified data that serves as the gold standard for measuring AI performance.
### Example
```
Question: "What is the capital of France?"
Ground Truth: "Paris"
AI Answer: "Paris" → Correct ✓
AI Answer: "Lyon" → Incorrect ✗
```
---
## Why Ground Truth Matters
### Measure Accuracy Objectively
```
Without ground truth: "This answer seems good" (subjective)
With ground truth: "Accuracy: 85%" (objective)
```
### Train and Validate Models
```
Training: Learn from ground truth examples
Validation: Measure performance on ground truth test set
```
### Regression Testing
```
Before change: Accuracy 90%
After change: Accuracy 85%
→ Regression detected!
```
### Benchmarking
```
Model A: 90% accuracy on ground truth
Model B: 85% accuracy on ground truth
→ Model A is better
```
---
## Types of Ground Truth
### Exact Match: Single Correct Answer
```json
{
"question": "What is 2+2?",
"answer": "4"
}
```
### Multiple Acceptable Answers
```json
{
"question": "What is the capital of France?",
"acceptable_answers": ["Paris", "paris", "PARIS", "The capital is Paris"]
}
```
### Rubric-Based: Quality Scale
```json
{
"question": "Summarize this article",
"rubric": {
"1": "Poor summary, missing key points",
"3": "Adequate summary, covers main points",
"5": "Excellent summary, concise and comprehensive"
}
}
```
### Human Preference: Comparison Rankings
```json
{
"question": "Which answer is better?",
"answer_a": "Paris is the capital of France.",
"answer_b": "The capital of France is Paris, a city of 2.1 million people.",
"preference": "B",
"reasoning": "More informative"
}
```
---
## Creating Ground Truth
### Manual Annotation (Humans Label)
```
Process:
1. Collect examples (questions, documents, images)
2. Human annotators label each
3. Quality control (review annotations)
4. Store in dataset
```
### Expert Review (For Specialized Domains)
```
Medical: Doctors annotate
Legal: Lawyers annotate
Technical: Engineers annotate
Higher quality but more expensive
```
### Crowdsourcing (Amazon MTurk)
```
Pros:
- Fast (many workers)
- Cheap ($0.10-1.00 per annotation)
Cons:
- Variable quality
- Need quality control
```
### Synthetic Generation (For Some Tasks)
```
LLM-generated questions + answers
Careful validation needed
Good for scale, risky for quality
Use for augmentation, not sole source
```
---
## Ground Truth Dataset Structure
### Input (Question, Document, Image)
```json
{
"input": {
"type": "question",
"text": "What is the capital of France?"
}
}
```
### Expected Output (Answer, Label, Summary)
```json
{
"expected_output": {
"type": "answer",
"text": "Paris",
"acceptable_variants": ["paris", "PARIS"]
}
}
```
### Metadata (Difficulty, Category, Source)
```json
{
"metadata": {
"difficulty": "easy",
"category": "geography",
"source": "wikipedia",
"language": "en"
}
}
```
### Annotation Info (Who, When, Confidence)
```json
{
"annotation": {
"annotator_id": "annotator_123",
"timestamp": "2024-01-15T10:00:00Z",
"confidence": 0.95,
"time_spent_seconds": 30
}
}
```
**Complete Example:**
```json
{
"id": "example_001",
"input": {
"type": "question",
"text": "What is the capital of France?"
},
"expected_output": {
"type": "answer",
"text": "Paris",
"acceptable_variants": ["paris", "PARIS", "The capital is Paris"]
},
"metadata": {
"difficulty": "easy",
"category": "geography",
"source": "wikipedia",
"language": "en"
},
"annotation": {
"annotator_id": "annotator_123",
"timestamp": "2024-01-15T10:00:00Z",
"confidence": 0.95
}
}
```
---
## Annotation Guidelines
### Clear Instructions
```markdown
# Annotation Guidelines
## Task
Label whether the answer is correct.
## Instructions
1. Read the question carefully
2. Read the answer
3. Determine if answer is factually correct
4. Mark as "Correct" or "Incorrect"
## Examples
Question: "What is 2+2?"
Answer: "4"
Label: Correct
Question: "What is 2+2?"
Answer: "5"
Label: Incorrect
```
### Examples (Good and Bad)
```markdown
## Good Example
Question: "What is the capital of France?"
Answer: "Paris"
Label: Correct
Reasoning: Factually accurate and directly answers question
## Bad Example
Question: "What is the capital of France?"
Answer: "France is a country in Europe"
Label: Incorrect
Reasoning: Doesn't answer the question
```
### Edge Case Handling
```markdown
## Edge Cases
### Partially Correct
Question: "What are the capitals of France and Germany?"
Answer: "Paris"
Label: Partially Correct (missing Germany)
### Ambiguous Question
Question: "What is the best programming language?"
Label: N/A - Subjective question, no single correct answer
### No Answer in Context
Question: "What is the population of Paris?"
Context: "Paris is the capital of France."
Label: "Cannot be determined from context"
```
### Consistency Checks
```markdown
## Consistency Rules
1. Same question → Same answer
2. Synonyms are acceptable ("car" = "automobile")
3. Case-insensitive ("Paris" = "paris")
4. Extra details are OK ("Paris" vs "Paris, France")
```
---
## Quality Control
### Multiple Annotators Per Example
```
Each example labeled by 3 annotators
Majority vote determines final label
Catches individual annotator errors
```
### Inter-Annotator Agreement (IAA)
```
Measure: Do annotators agree?
Metric: Cohen's Kappa (κ)
Target: κ > 0.7 (good agreement)
```
### Gold Standard Subset (Known Answers)
```
10% of examples have known correct labels
Mix into annotation tasks
Measure annotator accuracy on gold standard
Remove low-quality annotators
```
### Spot Checks by Experts
```
Expert reviews 10% of annotations
Validates quality
Identifies systematic errors
```
---
## Inter-Annotator Agreement
### Kappa Score (Cohen's κ)
```python
from sklearn.metrics import cohen_kappa_score
annotator1 = [1, 0, 1, 1, 0] # Labels from annotator 1
annotator2 = [1, 0, 1, 0, 0] # Labels from annotator 2
kappa = cohen_kappa_score(annotator1, annotator2)
print(f"Kappa: {kappa:.2f}")
# Interpretation:
# κ < 0.4: Poor agreement
# κ 0.4-0.6: Moderate agreement
# κ 0.6-0.8: Good agreement
# κ > 0.8: Excellent agreement
```
### Fleiss' κ (Multiple Annotators)
```python
from statsmodels.stats.inter_rater import fleiss_kappa
# 3 annotators, 5 examples
# Each row: [count_label_0, count_label_1]
data = [
[0, 3], # Example 1: All 3 annotators chose label 1
[1, 2], # Example 2: 1 chose 0, 2 chose 1
[3, 0], # Example 3: All 3 chose label 0
[2, 1], # Example 4: 2 chose 0, 1 chose 1
[0, 3], # Example 5: All 3 chose label 1
]
kappa = fleiss_kappa(data)
print(f"Fleiss' Kappa: {kappa:.2f}")
```
### Percentage Agreement
```python
def percentage_agreement(annotator1, annotator2):
agreements = sum(a == b for a, b in zip(annotator1, annotator2))
total = len(annotator1)
return agreements / total
agreement = percentage_agreement(annotator1, annotator2)
print(f"Agreement: {agreement:.1%}")
```
### Target: >0.7 (Good Agreement)
```
If κ < 0.7:
1. Review annotation guidelines (unclear?)
2. Provide more examples
3. Train annotators
4. Simplify task
```
---
## Resolving Disagreements
### Majority Vote
```python
def majority_vote(labels):
from collections import Counter
counts = Counter(labels)
majority_label = counts.most_common(1)[0][0]
return majority_label
# 3 annotators
labels = [1, 1, 0] # Two say 1, one says 0
final_label = majority_vote(labels) # 1
```
### Expert Adjudication
```
If no majority (e.g., 1, 0, 2):
→ Expert reviews and decides
```
### Discussion and Consensus
```
Annotators discuss disagreement
Reach consensus
Update guidelines if needed
```
### Update Guidelines
```
If systematic disagreements:
→ Guidelines unclear
→ Update and re-annotate
```
---
## Ground Truth for Different Tasks
### Classification: Category Labels
```json
{
"text": "This product is amazing!",
"label": "positive"
}
```
### Q&A: Correct Answers + Acceptable Variants
```json
{
"question": "What is the capital of France?",
"answer": "Paris",
"acceptable_variants": ["paris", "PARIS", "The capital is Paris"]
}
```
### Summarization: Reference Summaries
```json
{
"document": "Long article text...",
"reference_summary": "Concise summary of key points"
}
```
### RAG: Question + Context + Answer
```json
{
"question": "What is the capital of France?",
"context": "Paris is the capital and largest city of France.",
"answer": "Paris",
"relevant_chunks": ["Paris is the capital and largest city of France."]
}
```
### Generation: Multiple Acceptable Outputs
```json
{
"prompt": "Write a haiku about spring",
"acceptable_outputs": [
"Cherry blossoms bloom\nGentle breeze carries petals\nSpring has arrived now",
"Flowers start to bloom\nBirds sing in the morning light\nSpring is here at last"
]
}
```
---
## Dataset Size
### Evaluation Set: 100-1000 Examples (Representative)
```
Purpose: Quick evaluation during development
Size: 100-1000 examples
Quality: High (manually curated)
Coverage: Representative of production
```
### Test Set: 500-5000 Examples (Comprehensive)
```
Purpose: Final evaluation before deployment
Size: 500-5000 examples
Quality: High (gold standard)
Coverage: Comprehensive (all categories, edge cases)
```
### Quality > Quantity
```
Better: 100 high-quality examples
Worse: 1000 low-quality examples
```
### Cover Edge Cases
```
Include:
- Common cases (80%)
- Edge cases (15%)
- Adversarial cases (5%)
```
---
## Dataset Maintenance
### Version Control (Like Code)
```bash
# Git for dataset versioning
git init
git add dataset.jsonl
git commit -m "Initial dataset v1.0"
# Tag versions
git tag v1.0
# Update dataset
git add dataset.jsonl
git commit -m "Added 100 new examples"
git tag v1.1
```
### Regular Updates (New Examples)
```
Monthly: Add 50-100 new examples
Quarterly: Major update (500+ examples)
```
### Remove Outdated Examples
```
Examples that are:
- No longer relevant
- Incorrect (facts changed)
- Duplicates
```
### Track Changes (Changelog)
```markdown
# Dataset Changelog
## v1.2 (2024-02-01)
- Added 100 new examples (geography category)
- Removed 20 outdated examples
- Fixed 5 incorrect labels
## v1.1 (2024-01-01)
- Added 50 new examples (science category)
- Updated annotation guidelines
## v1.0 (2023-12-01)
- Initial release (500 examples)
```
---
## Stratified Sampling
### Balance by Difficulty
```
Easy: 40%
Medium: 40%
Hard: 20%
```
### Balance by Category
```
Geography: 25%
Science: 25%
History: 25%
Math: 25%
```
### Include Edge Cases
```
Common cases: 80%
Edge cases: 15%
Adversarial: 5%
```
### Representative of Production
```
Sample from actual production queries
Ensures dataset matches real usage
```
---
## Synthetic Ground Truth
### LLM-Generated Questions + Answers
```python
def generate_synthetic_qa(document):
prompt = f"""
Document: {document}
Generate 5 question-answer pairs based on this document.
Format:
Q1: [question]
A1: [answer]
...
"""
response = llm.generate(prompt)
qa_pairs = parse_qa_pairs(response)
return qa_pairs
```
### Careful Validation Needed
```
LLM-generated data can have:
- Hallucinations
- Incorrect facts
- Biased questions
→ Always validate with humans
```
### Good for Scale, Risky for Quality
```
Pros: Can generate 1000s quickly
Cons: Quality varies, needs validation
```
### Use for Augmentation, Not Sole Source
```
Strategy:
- 80% human-annotated (high quality)
- 20% synthetic (validated)
```
---
## Domain-Specific Ground Truth
### Medical: Expert Annotations
```
Annotators: Licensed doctors
Cost: $50-100 per hour
Quality: Very high
Use case: Medical diagnosis, treatment recommendations
```
### Legal: Lawyer Review
```
Annotators: Licensed lawyers
Cost: $100-300 per hour
Quality: Very high
Use case: Legal document analysis, case law
```
### Technical: Engineer Verification
```
Annotators: Senior engineers
Cost: $50-150 per hour
Quality: High
Use case: Code review, technical Q&A
```
---
## Ground Truth Storage
### JSON/JSONL Files
```jsonl
{"id": "1", "question": "What is 2+2?", "answer": "4"}
{"id": "2", "question": "Capital of France?", "answer": "Paris"}
```
### Database (PostgreSQL, MongoDB)
```sql
CREATE TABLE ground_truth (
id UUID PRIMARY KEY,
question TEXT NOT NULL,
answer TEXT NOT NULL,
category VARCHAR(50),
difficulty VARCHAR(20),
created_at TIMESTAMP DEFAULT NOW()
);
```
### Version Control (Git)
```bash
git add dataset/
git commit -m "Update ground truth dataset"
git push
```
### Cloud Storage (S3 + Versioning)
```bash
# Upload to S3 with versioning
aws s3 cp dataset.jsonl s3://my-bucket/ground-truth/v1.0/dataset.jsonl
aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled
```
---
## Ground Truth for RAG
**Structure:**
```json
{
"question": "What is the capital of France?",
"expected_answer": "Paris",
"relevant_document_chunks": [
"Paris is the capital and largest city of France."
],
"evaluation_criteria": {
"faithfulness": "Answer must be grounded in context",
"relevance": "Answer must directly address question",
"completeness": "Answer should mention Paris"
}
}
```
---
## Evaluation with Ground Truth
### Exact Match Accuracy
```python
def exact_match(predicted, ground_truth):
return predicted.strip().lower() == ground_truth.strip().lower()
accuracy = sum(exact_match(p, gt) for p, gt in zip(predicted, ground_truth)) / len(predicted)
```
### F1 Score (For Overlapping Spans)
```python
def f1_score(predicted, ground_truth):
pred_tokens = set(predicted.lower().split())
gt_tokens = set(ground_truth.lower().split())
common = pred_tokens & gt_tokens
if len(pred_tokens) == 0 or len(gt_tokens) == 0:
return 0
precision = len(common) / len(pred_tokens)
recall = len(common) / len(gt_tokens)
if precision + recall == 0:
return 0
f1 = 2 * (precision * recall) / (precision + recall)
return f1
```
### BLEU/ROUGE (For Generation)
```python
from nltk.translate.bleu_score import sentence_bleu
reference = [["Paris", "is", "the", "capital"]]
candidate = ["Paris", "is", "the", "capital"]
bleu = sentence_bleu(reference, candidate)
```
### Semantic Similarity (Embedding Distance)
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
emb1 = model.encode("Paris is the capital of France")
emb2 = model.encode("The capital of France is Paris")
similarity = cosine_similarity([emb1], [emb2])[0][0]
```
---
## Continuous Ground Truth
### Production Feedback (User Thumbs Up/Down)
```python
# Log user feedback
feedback = {
"question": "What is the capital of France?",
"answer": "Paris",
"user_feedback": "thumbs_up",
"timestamp": "2024-01-15T10:00:00Z"
}
# Add to ground truth if positive
if feedback["user_feedback"] == "thumbs_up":
add_to_ground_truth(feedback["question"], feedback["answer"])
```
### Human Review of Flagged Outputs
```
User flags answer as incorrect
→ Human reviews
→ If incorrect, add correct answer to ground truth
→ If correct, keep as is
```
### Incrementally Add to Dataset
```
Monthly: Review 100 flagged examples
Add 50 to ground truth
Update dataset version
```
---
## Tools
### Annotation: Label Studio, Prodigy, CVAT
**Label Studio:**
```bash
pip install label-studio
label-studio start
# Open http://localhost:8080
```
**Prodigy:**
```bash
pip install prodigy
prodigy textcat.manual dataset_name source.jsonl --label positive,negative
```
### Management: DVC (Data Version Control)
```bash
pip install dvc
dvc init
dvc add dataset.jsonl
git add dataset.jsonl.dvc .gitignore
git commit -m "Add dataset"
dvc push
```
### Storage: S3, GCS, Local Files
See "Ground Truth Storage" section
---
## Summary
**Ground Truth:** Correct answers for evaluation
**Why:**
- Measure accuracy objectively
- Train/validate models
- Regression testing
- Benchmarking
**Types:**
- Exact match
- Multiple acceptable answers
- Rubric-based
- Human preference
**Creating:**
- Manual annotation
- Expert review
- Crowdsourcing
- Synthetic (with validation)
**Quality Control:**
- Multiple annotators
- Inter-annotator agreement (κ > 0.7)
- Gold standard subset
- Expert spot checks
**Dataset Size:**
- Eval: 100-1000 (representative)
- Test: 500-5000 (comprehensive)
- Quality > quantity
**Maintenance:**
- Version control (Git)
- Regular updates
- Remove outdated
- Changelog
**Tools:**
- Annotation: Label Studio, Prodigy
- Management: DVC
- Storage: S3, GCS, GitRelated Skills
data-management
Comprehensive DataFrame loading, filtering, transformation, and data pipeline management from Excel, CSV, and multiple sources with YAML-driven configuration.
composer-dependency-management
Rules pertaining to Composer dependency management, promoting best practices for declaring and updating dependencies.
claude-config-management
Claude Code設定(リポジトリルート)の構成管理ガイド。ファイルレベルsymlinkによる設定管理、管理対象の追加・削除、Taskfileタスクの実行方法を提供する。「設定ファイルを追加して」「新しいスキルを追加して」「symlinkの状態を確認して」「Claude設定を変更して」のようにClaude Code設定の構成変更を行うときに使用する。
ck:project-management
Track progress, update plan statuses, manage Claude Tasks, generate reports, coordinate docs updates. Use for project oversight, status checks, plan completion, task hydration, cross-session continuity.
agentpmt-tool-file-management-d789ed
Use AgentPMT external API to run the File Management tool with wallet signatures, credits purchase, or credits earned from jobs.
advanced-file-management
Advanced file management tools. Includes batch folder creation, batch file moving, file listing, and HTML author extraction.
1k-state-management
Jotai state management patterns for OneKey. Use when working with atoms, global state, feature state, or context atoms. Triggers on jotai, atom, state, globalAtom, contextAtom, store, persistence, settings.
ads-management
Activate for paid advertising campaigns on Google Ads, Meta Ads, LinkedIn Ads, TikTok Ads. Includes ad copywriting, audience targeting, budget optimization, A/B testing, and ROAS tracking. Used by ads-specialist and campaign-manager agents.
kanban-management
Manages the Anubis Issue Tracker GitHub project board. Use when you need to organize issues by difficulty/status, move issues through workflow stages, or generate board status reports.
GroundEffect
Use this skill when the user asks about email, calendar, or Gmail/Google Calendar management via GroundEffect CLI. Triggers include "search my email", "list recent emails", "check my calendar", "what's on my calendar", "show my meetings", "calendar tomorrow", "calendar next week", "create a calendar event", "manage groundeffect accounts", "sync status", "start the daemon", "groundeffect", "groundeffect command", "draft email", "save draft", "save as draft", "create draft", "send email".
github-release-management
Comprehensive GitHub release orchestration with AI swarm coordination for automated versioning, testing, deployment, and rollback management
amia-github-thread-management
Use when managing PR review threads. Reply does NOT auto-resolve threads. Trigger with /manage-threads.