training-data-curation

Guidelines for creating high-quality datasets for LLM post-training (SFT/DPO/RLHF). Use when preparing data for fine-tuning, evaluating data quality, or designing data collection strategies.

by sundial-org

Best use case

training-data-curation is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using training-data-curation should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/training-data-curation/SKILL.md --create-dirs "https://raw.githubusercontent.com/sundial-org/skills/main/skills/training-data-curation/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/training-data-curation/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How training-data-curation Compares

Feature / Agent	training-data-curation	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

SKILL.md Source

# Training Data Curation Guidelines

Best practices for gathering and preparing training data for LLM fine-tuning.

## Data Quality Principles

**Quality over quantity.** Llama 2's SFT stage used only 27,540 high-quality annotations; its authors report that setting aside millions of third-party examples in favor of fewer, higher-quality ones notably improved results [[1]](#references). Focus on clean, diverse, well-formatted data.

**Garbage in, garbage out.** The model will learn patterns from your data—including errors, biases, and formatting issues. Inspect samples manually before training.

**Match the target distribution.** Training data should reflect the tasks and style you want the model to perform. If you want formal responses, don't train on casual chat data.

## Format Requirements

### Supervised Fine-Tuning (SFT)

Use the **messages format** (OpenAI/Anthropic/Tinker standard) [[5]](#references):

```
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```

- Each sample is a complete conversation
- Multi-turn: alternate user/assistant messages
- System prompts optional: `{"role": "system", "content": "..."}`
- JSONL format, one sample per line

### Preference Learning (DPO/ORPO/KTO)

Requires **paired comparisons** [[2]](#references):

```
{"prompt": "...", "chosen": "...", "rejected": "..."}
```

- `chosen` and `rejected` must respond to the same prompt
- Quality difference should be clear and consistent
- Annotator agreement >70% indicates usable samples [[1]](#references)
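
A short sketch of how these checks might look in code. Both function names are hypothetical, and `agreement_rate` encodes only one possible reading of "annotator agreement" (the share of annotators who voted with the majority); the source does not define how agreement is measured.

```python
from collections import Counter

REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def check_preference_record(record: dict) -> list[str]:
    """Return the problems found in one preference pair (empty list means OK)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{field}' is missing or empty")
    # A pair with identical sides carries no preference signal.
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("'chosen' and 'rejected' are identical")
    return problems


def agreement_rate(votes: list[str]) -> float:
    """Share of annotators who picked the majority option (assumed definition)."""
    if not votes:
        raise ValueError("votes must not be empty")
    majority_count = Counter(votes).most_common(1)[0][1]
    return majority_count / len(votes)
```

Under this reading, a pair labeled by four annotators with votes `["a", "a", "a", "b"]` has an agreement rate of 0.75 and would pass the >70% bar.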

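For KTO, pairs aren't required; each completion only needs a binary label, `true` for desirable and `false` for undesirable [[7]](#references):

```
{"prompt": "...", "completion": "...", "label": true}
{"prompt": "...", "completion": "...", "label": false}
```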
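
If you already have paired data, one way to reuse it for KTO is to split each pair into two labeled completions. This is a sketch under an assumption: treating every `rejected` response as undesirable discards the relative information in the pair, which may not suit all datasets. `pair_to_kto` is a hypothetical helper name.

```python
import json


def pair_to_kto(record: dict) -> list[dict]:
    """Split one {"prompt", "chosen", "rejected"} pair into two KTO-style records."""
    prompt = record["prompt"]
    return [
        {"prompt": prompt, "completion": record["chosen"], "label": True},
        {"prompt": prompt, "completion": record["rejected"], "label": False},
    ]


if __name__ == "__main__":
    pair = {"prompt": "2+2?", "chosen": "4", "rejected": "5"}
    for row in pair_to_kto(pair):
        # json.dumps writes Python booleans as JSON true/false.
        print(json.dumps(row))
```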

### Reward Modeling (RLHF)

Needs **ranked responses** [[1]](#references):

```
{"prompt": "...", "responses": ["best", "second", "worst"]}
```
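
Many reward-model trainers consume pairwise comparisons rather than full rankings. The sketch below expands a ranked record into every implied pair, assuming the `responses` list is ordered from best to worst as in the example above. `ranked_to_pairs` is a hypothetical helper name.

```python
from itertools import combinations


def ranked_to_pairs(record: dict) -> list[dict]:
    """Expand a best-to-worst ranking into all implied (chosen, rejected) pairs."""
    prompt = record["prompt"]
    # combinations() preserves input order, so the first element of each
    # pair is always ranked higher than the second.
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(record["responses"], 2)
    ]
```

A ranking of n responses yields n(n-1)/2 pairs, so three ranked responses produce three comparisons.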

## Quality Checklist

Before training, verify:

- [ ] **No duplicates** — exact and near-duplicate removal [[3]](#references)
- [ ] **No empty fields** — all required fields populated
- [ ] **Consistent format** — schema matches throughout
- [ ] **Appropriate length** — not too short (noise) or too long (truncation)
- [ ] **Clean text** — proper encoding, no HTML/boilerplate artifacts [[8]](#references)
- [ ] **Manual inspection** — reviewed random sample of 50-100 examples
- [ ] **No PII/sensitive data** — unless intentionally included
- [ ] **License verified** — legal to use for training
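
Three of these checks are easy to script. The following is a minimal sketch using only the standard library; all function names are hypothetical. Note that the near-duplicate score here is exact Jaccard similarity over word shingles, not MinHash: it is simpler to read but compares texts pairwise, so it only suits small datasets.

```python
import hashlib
import random


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def drop_exact_duplicates(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each text, comparing normalized hashes."""
    seen, kept = set(), []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of word n-grams; short texts become a single shingle."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of two texts' shingle sets (1.0 means identical sets)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def sample_for_review(texts: list[str], k: int = 50, seed: int = 0) -> list[str]:
    """Draw a reproducible random sample for manual inspection."""
    rng = random.Random(seed)
    return rng.sample(texts, min(k, len(texts)))
```

A fixed seed makes the review sample reproducible, so two reviewers can look at the same examples.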

## Common Quality Issues

| Issue | Detection | Fix | Source |
|-------|-----------|-----|--------|
| Duplicates | Hash-based dedup | Remove exact matches, MinHash for near-dupes | [[3]](#references) |
| Boilerplate | Keyword filter | Remove "subscribe", "cookie policy", etc. | [[8]](#references) |
| Repetitive text | N-gram analysis | Flag if <30% unique trigrams | [[4]](#references) |
| Low-quality text | Alpha ratio | Remove if <50% alphabetic characters | [[8]](#references) |
| Wrong language | Language detection | fastText classifier, filter to target | [[3]](#references) |
| Too short | Length check | Minimum 3-5 sentences, 100+ words for documents | [[8]](#references) |
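
Several rows of this table reduce to a few lines of Python. The sketch below uses the thresholds from the table; the function names are hypothetical, and two details are assumptions the table does not specify: the alphabetic ratio ignores whitespace, and the boilerplate list contains only the two terms named above.

```python
BOILERPLATE_TERMS = ("subscribe", "cookie policy")  # only the terms named in the table


def unique_trigram_ratio(text: str) -> float:
    """Fraction of word trigrams that are distinct (1.0 if there are none)."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 1.0
    return len(set(trigrams)) / len(trigrams)


def alpha_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are alphabetic (assumed definition)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def quality_flags(text: str, min_words: int = 100) -> list[str]:
    """Return the names of the heuristics that the text fails."""
    flags = []
    if len(text.split()) < min_words:
        flags.append("too_short")
    if unique_trigram_ratio(text) < 0.30:
        flags.append("repetitive")
    if alpha_ratio(text) < 0.50:
        flags.append("low_alpha")
    lowered = text.lower()
    if any(term in lowered for term in BOILERPLATE_TERMS):
        flags.append("boilerplate")
    return flags
```

Returning flags instead of a single pass/fail result makes it easier to see which heuristic removes the most data and to tune thresholds on a manually inspected sample.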

## Data Sources

**High quality:**
- Curated human annotations [[1]](#references)
- Expert-written examples
- Filtered high-quality web data [[3]](#references)

**Medium quality:**
- Synthetic data from stronger models (distillation)
- Community Q&A with voting signals
- Filtered user-generated content

**Use with caution:**
- Raw web scrapes
- Unfiltered synthetic data
- Data without clear provenance [[6]](#references)

## Sizing Guidelines

| Dataset Size | Use Case | Source |
|--------------|----------|--------|
| 100-1K | Quick experiments, specific behaviors | — |
| 1K-10K | Production SFT, domain adaptation | — |
| 10K-100K | Comprehensive instruction tuning | [[1]](#references) |
| 1M+ preference pairs | Large-scale RLHF | [[1]](#references) |

Llama 2 used ~27K SFT examples and 1M+ preference comparisons [[1]](#references).

## File Format

- **JSONL** — one JSON object per line, human-readable
- **Parquet** — efficient for large datasets, built-in compression [[3]](#references)
- **Sharding** — split files >500MB into chunks
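
Sharding JSONL by size can be done with the standard library alone. This is a minimal sketch: `write_jsonl_shards` and the `shard-00000.jsonl` naming scheme are hypothetical, and the 500 MB default is taken from the guideline above.

```python
import json
from pathlib import Path


def write_jsonl_shards(records, out_dir, max_bytes=500 * 1024 * 1024, prefix="shard"):
    """Write records as JSONL, starting a new file before max_bytes is exceeded."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths, handle, written = [], None, 0
    try:
        for record in records:
            line = (json.dumps(record, ensure_ascii=False) + "\n").encode("utf-8")
            # Open a new shard for the first record, or when this line would
            # push a non-empty shard past the size limit.
            if handle is None or (written and written + len(line) > max_bytes):
                if handle is not None:
                    handle.close()
                path = out_dir / f"{prefix}-{len(paths):05d}.jsonl"
                handle = path.open("wb")
                paths.append(path)
                written = 0
            handle.write(line)
            written += len(line)
    finally:
        if handle is not None:
            handle.close()
    return paths
```

Records are never split across files, so every shard stays valid JSONL; a single record larger than `max_bytes` still gets its own oversized shard.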

## References

1. [Llama 2 Paper](https://arxiv.org/abs/2307.09288) — Touvron et al. (2023). SFT/RLHF data quality practices, 27K SFT examples, >70% annotator agreement threshold
2. [TRL Library](https://huggingface.co/docs/trl/) — HuggingFace trainer implementations for SFT, DPO, KTO, ORPO
3. [FineWeb Paper](https://arxiv.org/abs/2406.17557) — Penedo et al. (2024). Large-scale filtering: MinHash dedup, language detection, quality classifiers
4. [Data-Juicer](https://github.com/alibaba/data-juicer) — Alibaba's quality filtering toolkit with repetition filters, n-gram analysis
5. [Tinker API](https://tinker-docs.thinkingmachines.ai/) — Training API using messages format for SFT, DPO/RLHF support
6. [Data Provenance Initiative](https://arxiv.org/abs/2310.16787) — Longpre et al. (2023). Dataset licensing and attribution audit
7. [KTO Paper](https://arxiv.org/abs/2402.01306) — Ethayarajh et al. (2024). Binary preference learning without pairs
8. [C4/T5 Paper](https://arxiv.org/abs/1910.10683) — Raffel et al. (2020). Foundational filtering: terminal punctuation, min sentences, alpha ratio, boilerplate removal
