ai-training-data-generation
Generate high-quality training datasets from documents, text corpora, and structured content. Use when creating AI training data from dictionaries, documents, or when generating examples for machine learning models. Optimized for low-resource languages and domain-specific knowledge extraction.
Best use case
ai-training-data-generation is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using ai-training-data-generation should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/ai-training-data-generation/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How ai-training-data-generation Compares
| Feature / Agent | ai-training-data-generation | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It generates training datasets from documents, text corpora, and structured content such as dictionaries, for use with machine learning models. It is optimized for low-resource languages and domain-specific knowledge extraction.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# AI Training Data Generation
## Overview
A comprehensive skill for automatically generating high-quality training datasets from documents, text corpora, and structured content. Optimized for low-resource languages, dictionary content, and domain-specific knowledge extraction.
## Capabilities
- **Multi-strategy Generation**: Dictionary pairs, contextual definitions, completion tasks, classification examples
- **Quality Filtering**: Confidence scoring, duplicate removal, and content validation
- **Format Flexibility**: Support for multiple AI training formats (JSONL, HuggingFace, Ollama, OpenAI)
- **Language Awareness**: Multi-language support with special handling for accented characters
- **Scalable Processing**: Generate thousands of examples from large documents
- **Balance Management**: Ensure dataset diversity and prevent category imbalance
## Core Strategies
### 1. Dictionary Pair Extraction
Extract word-definition pairs from structured and semi-structured text.
**Detection Patterns**:
- Separator-based: `word – definition`, `term: meaning`
- Linguistic indicators: `means`, `is defined as`, `refers to`
- Structural cues: Indentation, formatting, list structures
- Context analysis: Surrounding text for validation
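The first two detection patterns above can be sketched with plain regular expressions. This is a minimal illustration, not the skill's actual implementation: the names `SEPARATOR_RE`, `INDICATOR_RE`, and `extract_pairs` are hypothetical, and structural cues and context analysis are not covered.

```python
import re

# Hypothetical patterns for illustration; the skill's real patterns may differ.
# Separator-based: "word – definition" (dash surrounded by spaces) or "term: meaning".
SEPARATOR_RE = re.compile(r"^(?P<term>\S.*?)\s*(?: [–-] |:\s+)\s*(?P<definition>\S.*)$")

# Linguistic indicators: "X means Y", "X is defined as Y", "X refers to Y".
INDICATOR_RE = re.compile(
    r"^(?P<term>\S.*?)\s+(?:means|is defined as|refers to)\s+(?P<definition>\S.*)$",
    re.IGNORECASE,
)


def extract_pairs(text):
    """Return (term, definition) tuples for lines matching either pattern."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Try the separator pattern first, then the indicator pattern.
        for pattern in (SEPARATOR_RE, INDICATOR_RE):
            match = pattern.match(line)
            if match:
                pairs.append(
                    (match.group("term").strip(), match.group("definition").strip())
                )
                break
    return pairs


sample = (
    "ááfengen – very good, excellent\n"
    "term: meaning\n"
    "A lexeme refers to a unit of lexical meaning\n"
    "plain sentence with no pattern"
)
for term, definition in extract_pairs(sample):
    print(f"{term!r} -> {definition!r}")
```

Requiring spaces around the dash keeps hyphenated words intact, and requiring whitespace after the colon avoids splitting values such as times.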
### 2. Implementation Pattern
```python
from .ai_training_generator import AITrainingDataGenerator

# Initialize generator
generator = AITrainingDataGenerator(min_confidence=0.7)

# Generate comprehensive training data
# (parsed_document is the already-parsed source document; it is not defined in this snippet)
training_data = generator.generate_comprehensive_training_data(
    parsed_document,
    target_count=10000
)

# Export to the selected format
files = generator.export_training_data(
    training_data,
    output_dir="training_output",
    format_type="ollama"
)
```
## Output Format Examples
### JSONL Format (Standard)
```json
{"input": "What does 'ááfengen' mean?", "output": "very good, excellent", "type": "dictionary_pair", "confidence": 0.95}
```
### Ollama Format
```json
{"prompt": "Translate this Chuukese word: ngang", "response": "fish", "system": "You are a Chuukese-English translator."}
```
### HuggingFace Format
```json
{"text": "### Instruction:\nWhat does 'chomong' mean in Chuukese?\n\n### Response:\nto help, assist"}
```
### OpenAI Fine-tuning Format
```json
{"messages": [{"role": "user", "content": "Define: kúún"}, {"role": "assistant", "content": "to go, to leave"}]}
```
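As a rough sketch of how the four formats relate, the functions below convert one standard record into the other three shapes shown above. The function names are hypothetical and are not part of the skill's API; the field layouts are copied from the examples in this section.

```python
import json


def to_ollama(record, system_prompt):
    """Map a standard record to the prompt/response/system shape."""
    return {
        "prompt": record["input"],
        "response": record["output"],
        "system": system_prompt,
    }


def to_huggingface(record):
    """Map a standard record to a single instruction-style text field."""
    return {
        "text": f"### Instruction:\n{record['input']}\n\n### Response:\n{record['output']}"
    }


def to_openai(record):
    """Map a standard record to a chat-style messages list."""
    return {
        "messages": [
            {"role": "user", "content": record["input"]},
            {"role": "assistant", "content": record["output"]},
        ]
    }


def to_jsonl(records):
    """Serialize records one per line; ensure_ascii=False keeps accented characters readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


record = {
    "input": "What does 'ááfengen' mean?",
    "output": "very good, excellent",
    "type": "dictionary_pair",
    "confidence": 0.95,
}
print(to_jsonl([to_openai(record)]))
```

Passing `ensure_ascii=False` to `json.dumps` writes characters such as `á` directly instead of as `\u` escapes, which makes the output easier to review by hand.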
## Quality Assurance
- **Content validity**: Does the example make linguistic sense?
- **Pattern matching**: Does it follow expected language patterns?
- **Context appropriateness**: Is the context relevant and helpful?
- **Uniqueness**: Avoid repetitive or duplicate content
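The uniqueness check and a confidence threshold could be combined along the following lines. This is a sketch under assumptions: `normalize`, `content_hash`, and `filter_examples` are hypothetical names, records are assumed to use the standard JSONL fields shown earlier, and checks such as linguistic validity still need language-specific logic or human review.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so near-identical strings hash the same."""
    return " ".join(text.lower().split())


def content_hash(record):
    """Hash the normalized input/output pair for duplicate detection."""
    key = normalize(record["input"]) + "\x1f" + normalize(record["output"])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def filter_examples(records, min_confidence=0.7):
    """Drop low-confidence, empty, and duplicate examples, keeping first occurrences."""
    seen = set()
    kept = []
    for record in records:
        if record.get("confidence", 0.0) < min_confidence:
            continue
        if not record["input"].strip() or not record["output"].strip():
            continue
        digest = content_hash(record)
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(record)
    return kept


records = [
    {"input": "Define: term", "output": "meaning", "confidence": 0.9},
    {"input": "define:  TERM", "output": "Meaning", "confidence": 0.8},  # duplicate after normalization
    {"input": "Define: other", "output": "thing", "confidence": 0.5},    # below threshold
    {"input": "Define: empty", "output": "  ", "confidence": 0.9},       # empty output
]
kept = filter_examples(records)
print(len(kept))
```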
## Best Practices
1. **Multiple validation passes**: Automated and manual quality checks
2. **Confidence thresholds**: Adjust based on use case requirements
3. **Human review sampling**: Periodic manual validation of generated examples
4. **Balance management**: Ensure even distribution across categories
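Practices 3 and 4 can be approximated with the standard library. The sketch below caps each example type at a fixed count and draws a small random sample for manual review; `balance_by_type` and `review_sample` are hypothetical helpers, and the cap and sampling fraction are arbitrary illustrative values.

```python
import random
from collections import Counter, defaultdict


def balance_by_type(records, max_per_type, seed=0):
    """Cap each example type at max_per_type, sampling randomly when over the cap."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[record.get("type", "unknown")].append(record)
    balanced = []
    for type_name in sorted(groups):
        items = groups[type_name]
        if len(items) > max_per_type:
            items = rng.sample(items, max_per_type)
        balanced.extend(items)
    return balanced


def review_sample(records, fraction=0.05, seed=0):
    """Pick a random subset (at least one record) for manual validation."""
    if not records:
        return []
    rng = random.Random(seed)
    size = max(1, round(len(records) * fraction))
    return rng.sample(records, size)


records = [{"type": "dictionary_pair", "id": i} for i in range(10)]
records += [{"type": "completion", "id": i} for i in range(3)]

balanced = balance_by_type(records, max_per_type=4)
counts = Counter(r["type"] for r in balanced)
print(dict(counts))
```

Capping only trims over-represented types; under-represented types would need more generated examples to reach an even distribution.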
## Dependencies
- `re`: Regular expression pattern matching
- `json`: Data serialization and export
- `hashlib`: Duplicate detection and content hashing
- `collections`: Data structure utilities and counting