data-designer
Generate high-quality synthetic datasets using statistical samplers and Claude's native LLM capabilities. Use when users ask to create synthetic data, generate datasets, create fake/mock data, generate test data, training data, or any data generation task. Supports CSV, JSON, JSONL, Parquet output. Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).
Best use case
data-designer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using data-designer should expect more consistent output, faster repeated runs, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/data-designer/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How data-designer Compares
| Feature / Agent | data-designer | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Generate high-quality synthetic datasets using statistical samplers and Claude's native LLM capabilities. Use when users ask to create synthetic data, generate datasets, create fake/mock data, generate test data, training data, or any data generation task. Supports CSV, JSON, JSONL, Parquet output. Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Data Designer
Generate synthetic datasets combining statistical samplers with Claude's LLM capabilities. No external API keys required.
## Workflow
1. **Clarify requirements** - Ask about purpose, columns, size, format
2. **Create schema** - Write `dataset_schema.json` defining columns
3. **Generate preview** - Run `batch_generator.py` for 3-5 rows
4. **Iterate** - Refine based on feedback
5. **Generate full dataset** - Batch generate, then merge
6. **Deliver** - Export to requested format
## Column Types
### Statistical Samplers (No LLM)
| Type | Description | Key Params |
|------|-------------|------------|
| `category` | Weighted random choice | `values`, `weights` |
| `subcategory` | Hierarchical (parent-based) | `mapping`, `category` |
| `uniform` | Uniform distribution | `low`, `high`, `dtype` |
| `gaussian` | Normal distribution | `mean`, `std`, `min_val`, `max_val` |
| `bernoulli` | Binary probability | `p`, `true_value`, `false_value` |
| `poisson` | Poisson distribution | `mean` |
| `datetime` | Random dates | `start`, `end`, `format` |
| `person` | Synthetic personas | `fields`, `age_range`, `locale` |
| `uuid` | Unique IDs | `prefix`, `format` |
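A few of the samplers above can be approximated with Python's standard `random` module. This is a minimal sketch, not the skill's actual implementation: the column names, the `ISSUE_MAPPING` values, and the choice to clamp the gaussian draw to `min_val`/`max_val` (rather than resample) are assumptions made for illustration.

```python
import random

# Hypothetical parent-to-children mapping for a `subcategory` column.
ISSUE_MAPPING = {
    "billing": ["refund", "invoice"],
    "technical": ["login", "crash", "performance"],
}


def sample_row(rng: random.Random) -> dict:
    """Draw one row using four of the sampler types from the table."""
    # category: weighted random choice over `values` using `weights`.
    department = rng.choices(["billing", "technical"], weights=[0.3, 0.7], k=1)[0]

    # subcategory: allowed values depend on the parent category.
    issue = rng.choice(ISSUE_MAPPING[department])

    # gaussian: normal draw with mean 3 and std 1, clamped to [1, 5]
    # (clamping is an assumption; a real sampler might resample instead).
    priority = min(5.0, max(1.0, rng.gauss(3.0, 1.0)))

    # bernoulli: true_value with probability p, otherwise false_value.
    escalated = "yes" if rng.random() < 0.2 else "no"

    return {
        "department": department,
        "issue": issue,
        "priority": round(priority, 2),
        "escalated": escalated,
    }


def sample_rows(n: int, seed: int) -> list[dict]:
    """Generate n rows from a seeded generator so runs are repeatable."""
    rng = random.Random(seed)
    return [sample_row(rng) for _ in range(n)]


preview = sample_rows(5, seed=42)
```

Passing the same `seed` reproduces the same rows, which is the behavior the schema's `seed` field is meant to provide.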
### LLM Columns (Claude generates)
| Type | Description |
|------|-------------|
| `llm_text` | Free-form text |
| `llm_code` | Code with syntax validation |
| `llm_structured` | JSON matching schema |
| `llm_judge` | Quality scoring |
## Schema Format
Create `dataset_schema.json`:
```json
{
  "name": "dataset_name",
  "seed": 42,
  "columns": [
    {"name": "category", "type": "category", "params": {"values": ["A","B"], "weights": [0.6,0.4]}},
    {"name": "text", "type": "llm_text", "prompt": "Write about {{ category }}.", "depends_on": ["category"]}
  ],
  "output": {"format": "csv", "filename": "output"}
}
```
For full schema reference: [references/schema.md](references/schema.md)
## Jinja2 Templating
Reference columns in prompts:
```
Write a {{ rating }}-star review for {{ product_name }} by {{ customer.first_name }}.
```
Supports: `{{ var }}`, `{{ obj.field }}`, `{% if %}`, filters
## Scripts
### Generate Data
```bash
# Preview
python scripts/batch_generator.py --schema dataset_schema.json --rows 5 --output preview.json --preview
# Full generation
python scripts/batch_generator.py --schema dataset_schema.json --rows 100 --batch-size 20 --output batches/
```
### Merge & Export
```bash
python scripts/merger.py --input batches/ --output dataset.csv --flatten
```
Formats: `csv`, `json`, `jsonl`, `parquet`
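Nested columns such as a `person` object need flattening before they fit into CSV. The sketch below shows one way that could work, assuming nested keys become dotted column names like `customer.first_name`. The helper `flatten` and the sample records are hypothetical, and the real `merger.py --flatten` may behave differently.

```python
import csv
import io


def flatten(record: dict, prefix: str = "") -> dict:
    """Turn nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical rows as they might look after batch generation.
nested_rows = [
    {"ticket_id": "T-1", "customer": {"first_name": "Ada", "email": "ada@example.com"}, "rating": 5},
    {"ticket_id": "T-2", "customer": {"first_name": "Lin", "email": "lin@example.com"}, "rating": 3},
]

flat_rows = [flatten(row) for row in nested_rows]

# Write the flattened rows to an in-memory CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat_rows[0]))
writer.writeheader()
writer.writerows(flat_rows)
csv_text = buffer.getvalue()
```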
## Generation Strategy
1. **Sampler columns first** - Python scripts, fast
2. **LLM columns in dependency order** - Topological sort by `depends_on`
3. **Batch processing** - Generate in batches of 20-50 for large datasets
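The dependency ordering in step 2 can be sketched with the standard library's `graphlib` (Python 3.9+). This is an illustration under assumptions, not the skill's own code: `generation_order` is a hypothetical helper, and the columns are made up to match the schema format.

```python
from graphlib import TopologicalSorter


def generation_order(columns: list[dict]) -> list[str]:
    """Return column names so that every column follows its dependencies."""
    # Map each column to the set of columns listed in its `depends_on`.
    graph = {col["name"]: set(col.get("depends_on", [])) for col in columns}
    # static_order() yields dependencies first and raises CycleError on a cycle.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical schema columns, deliberately listed out of order.
columns = [
    {"name": "score", "type": "llm_judge", "depends_on": ["text"]},
    {"name": "text", "type": "llm_text", "depends_on": ["category"]},
    {"name": "category", "type": "category"},
]

order = generation_order(columns)
print(order)
```

Here the sampler column `category` comes first, then `text`, then `score`, because each LLM column waits for the values its prompt references.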
For LLM columns, Claude generates directly:
- Render Jinja2 prompt with row data
- Generate content
- Validate if configured
- Retry on failure (max 3)
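The generate, validate, retry loop above might look like the following sketch. It assumes "max 3" means three attempts in total. `generate_with_retry`, `is_valid_json`, and the stub generator are hypothetical stand-ins, since in the real workflow Claude produces the content directly.

```python
import json
from typing import Callable

MAX_ATTEMPTS = 3  # assumption: "max 3" counts total attempts


def generate_with_retry(
    prompt: str,
    generate: Callable[[str], str],
    validate: Callable[[str], bool],
    max_attempts: int = MAX_ATTEMPTS,
) -> str:
    """Call generate until validate passes or attempts run out."""
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if validate(candidate):
            return candidate
    raise ValueError(f"no valid output after {max_attempts} attempts")


def is_valid_json(text: str) -> bool:
    """Example validator for an `llm_structured` column."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def make_flaky_generator() -> Callable[[str], str]:
    """Stub generator: fails validation twice, then returns valid JSON."""
    replies = iter(["not json", "{broken", '{"sentiment": "positive"}'])
    return lambda prompt: next(replies)


result = generate_with_retry(
    "Classify the review.", make_flaky_generator(), is_valid_json
)
```

Raising after the final attempt keeps invalid rows out of the batch; a caller could instead record the failure and move on to the next row.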
## Examples
**Simple:**
> "Generate 50 product reviews with ratings 1-5"
**Complex:**
> "Create 200 support tickets with: ticket_id (UUID), customer (name, email), category (billing/technical/general), priority (1-5 gaussian), description (LLM)"
**Code:**
> "Generate 100 Python functions with description, code (validated), tests"
## Tips
- Use `seed` for reproducibility
- Preview first, then scale
- Keep LLM prompts specific
- Use `subcategory` for correlated data
## Attribution
Adapted from [NVIDIA NeMo DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) (Apache 2.0).
Related Skills
large-data-with-dask
Specific optimization strategies for Python scripts working with larger-than-memory datasets via Dask.
ipdata-co-automation
Automate Ipdata co tasks via Rube MCP (Composio). Always search tools first for current schemas.
gdpr-data-handling
Implement GDPR-compliant data handling with consent management, data subject rights, and privacy by design. Use when building systems that process EU personal data, implementing privacy controls, o...
fair-data-model-assessment
Assess data models against FAIR principles using RDA-FDMM indicators. Use when: (1) Evaluating vendor-delivered data models for FAIR compliance, (2) Reviewing schemas, ontologies, or data dictionaries before integration, (3) Creating FAIR assessment reports for data governance reviews, (4) Preparing data model documentation for enterprise or regulatory standards, (5) Auditing existing data assets for FAIRness gaps. Covers 41 RDA indicators across Findable, Accessible, Interoperable, Reusable dimensions with maturity scoring (0-4 scale).
docker-database
Configure database containers with security, persistence, and health checks
datarobot-automation
Automate Datarobot tasks via Rube MCP (Composio). Always search tools first for current schemas.
dataql-analysis
Analyze data files using SQL queries with DataQL. Use when working with CSV, JSON, Parquet, Excel files or when the user mentions data analysis, filtering, aggregation, or SQL queries on files.
datahub-connector-pr-review
This skill should be used when the user asks to "review my connector", "check my datahub connector", "review connector code", "audit connector", "review PR", "check code quality", or any request to review/check/audit a DataHub ingestion source. Covers compliance with standards, best practices, testing quality, and merge readiness.
datagma-automation
Automate Datagma tasks via Rube MCP (Composio). Always search tools first for current schemas.
Database Sync
Automate database synchronization, replication, migration, and cross-platform data integration
database-skill
Design and manage relational databases including table creation, migrations, and schema design. Use for database modeling and maintenance.
database-architect
Database design and optimization specialist. Schema design, query optimization, indexing strategies, data modeling, and migration planning for relational and NoSQL databases.