data-designer

Generate high-quality synthetic datasets using statistical samplers and Claude's native LLM capabilities. Use when users ask to create synthetic data, generate datasets, create fake/mock data, generate test data, training data, or any data generation task. Supports CSV, JSON, JSONL, Parquet output. Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).

by diegosouzapw

Best use case

data-designer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using data-designer should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/data-designer/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/data-designer/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/data-designer/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How data-designer Compares

Feature / Agent	data-designer	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

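What does this skill do?

It generates synthetic datasets by combining statistical samplers with Claude's LLM capabilities, and exports them as CSV, JSON, JSONL, or Parquet.
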
Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Data Designer

Generate synthetic datasets combining statistical samplers with Claude's LLM capabilities. No external API keys required.

## Workflow

1. **Clarify requirements** - Ask about purpose, columns, size, format
2. **Create schema** - Write `dataset_schema.json` defining columns
3. **Generate preview** - Run `batch_generator.py` for 3-5 rows
4. **Iterate** - Refine based on feedback
5. **Generate full dataset** - Batch generate, then merge
6. **Deliver** - Export to requested format

## Column Types

### Statistical Samplers (No LLM)

| Type | Description | Key Params |
|------|-------------|------------|
| `category` | Weighted random choice | `values`, `weights` |
| `subcategory` | Hierarchical (parent-based) | `mapping`, `category` |
| `uniform` | Uniform distribution | `low`, `high`, `dtype` |
| `gaussian` | Normal distribution | `mean`, `std`, `min_val`, `max_val` |
| `bernoulli` | Binary probability | `p`, `true_value`, `false_value` |
| `poisson` | Poisson distribution | `mean` |
| `datetime` | Random dates | `start`, `end`, `format` |
| `person` | Synthetic personas | `fields`, `age_range`, `locale` |
| `uuid` | Unique IDs | `prefix`, `format` |
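
To make the sampler parameters concrete, here is a minimal sketch of three of the samplers using only Python's standard library. It is not the skill's actual implementation: the function names are hypothetical, and clamping out-of-range Gaussian draws (rather than resampling them) is an assumption about how `min_val` and `max_val` behave.

```python
import random


def sample_category(rng, values, weights=None):
    """Weighted random choice over `values` (uniform when weights is None)."""
    return rng.choices(values, weights=weights, k=1)[0]


def sample_gaussian(rng, mean, std, min_val=None, max_val=None):
    """Normal draw, clamped to [min_val, max_val] when bounds are given.

    Clamping is an assumption; the real sampler may resample instead.
    """
    x = rng.gauss(mean, std)
    if min_val is not None:
        x = max(min_val, x)
    if max_val is not None:
        x = min(max_val, x)
    return x


def sample_bernoulli(rng, p, true_value=True, false_value=False):
    """Return true_value with probability p, otherwise false_value."""
    return true_value if rng.random() < p else false_value


# A dedicated Random instance keeps the run reproducible for a given seed.
rng = random.Random(42)
rows = [
    {
        "tier": sample_category(rng, ["A", "B"], [0.6, 0.4]),
        "score": sample_gaussian(rng, mean=3.0, std=1.0, min_val=1.0, max_val=5.0),
        "churned": sample_bernoulli(rng, p=0.2),
    }
    for _ in range(5)
]
```

Passing the `rng` object explicitly, instead of using the module-level `random` functions, is one way to keep every column tied to the schema's `seed`.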

### LLM Columns (Claude generates)

| Type | Description |
|------|-------------|
| `llm_text` | Free-form text |
| `llm_code` | Code with syntax validation |
| `llm_structured` | JSON matching schema |
| `llm_judge` | Quality scoring |

## Schema Format

Create `dataset_schema.json`:

```json
{
  "name": "dataset_name",
  "seed": 42,
  "columns": [
    {"name": "category", "type": "category", "params": {"values": ["A","B"], "weights": [0.6,0.4]}},
    {"name": "text", "type": "llm_text", "prompt": "Write about {{ category }}.", "depends_on": ["category"]}
  ],
  "output": {"format": "csv", "filename": "output"}
}
```

For full schema reference: [references/schema.md](references/schema.md)
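
Before generating anything, it can help to sanity-check a schema. The sketch below loads the example schema above and runs three basic checks. The helper `check_schema` is hypothetical and the checks are assumptions about what a valid schema looks like, not rules taken from the schema reference.

```python
import json

SCHEMA_JSON = """
{
  "name": "dataset_name",
  "seed": 42,
  "columns": [
    {"name": "category", "type": "category",
     "params": {"values": ["A", "B"], "weights": [0.6, 0.4]}},
    {"name": "text", "type": "llm_text",
     "prompt": "Write about {{ category }}.", "depends_on": ["category"]}
  ],
  "output": {"format": "csv", "filename": "output"}
}
"""


def check_schema(schema):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    columns = schema.get("columns", [])
    names = [col["name"] for col in columns]

    # Column names double as template variables, so they should be unique.
    if len(names) != len(set(names)):
        problems.append("duplicate column names")

    for col in columns:
        # Every dependency must point at a declared column.
        for dep in col.get("depends_on", []):
            if dep not in names:
                problems.append(f"{col['name']}: unknown dependency {dep!r}")
        # Weighted choices need exactly one weight per value.
        params = col.get("params", {})
        if "weights" in params and len(params["weights"]) != len(params.get("values", [])):
            problems.append(f"{col['name']}: weights and values differ in length")
    return problems


schema = json.loads(SCHEMA_JSON)
problems = check_schema(schema)
```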

## Jinja2 Templating

Reference columns in prompts:

```
Write a {{ rating }}-star review for {{ product_name }} by {{ customer.first_name }}.
```

Supports: `{{ var }}`, `{{ obj.field }}`, `{% if %}`, filters
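
Because prompts reference other columns by name, a quick check that each referenced column is listed in `depends_on` can catch mistakes early. The sketch below is a rough regex-based check, not a Jinja2 parser: it only looks at the first identifier inside `{{ ... }}` expressions and ignores variables used in `{% if %}` blocks. The helper names are hypothetical.

```python
import re

# Captures the first identifier inside {{ ... }},
# e.g. "customer" in {{ customer.first_name }}.
VAR_PATTERN = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)")


def referenced_columns(prompt):
    """Return the set of top-level names used in {{ ... }} expressions."""
    return set(VAR_PATTERN.findall(prompt))


def missing_dependencies(prompt, depends_on):
    """Return names the prompt uses that are not declared in depends_on."""
    return referenced_columns(prompt) - set(depends_on)


prompt = (
    "Write a {{ rating }}-star review for {{ product_name }} "
    "by {{ customer.first_name }}."
)
# "customer" is used in the prompt but deliberately left out here.
missing = missing_dependencies(prompt, ["rating", "product_name"])
```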

## Scripts

### Generate Data

```bash
# Preview
python scripts/batch_generator.py --schema dataset_schema.json --rows 5 --output preview.json --preview

# Full generation
python scripts/batch_generator.py --schema dataset_schema.json --rows 100 --batch-size 20 --output batches/
```

### Merge & Export

```bash
python scripts/merger.py --input batches/ --output dataset.csv --flatten
```

Formats: `csv`, `json`, `jsonl`, `parquet`
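
The merge step can be pictured as follows. This is a standard-library sketch, not the code in `merger.py`: it assumes each batch is a JSON file holding a list of row objects, and that flattening joins nested keys with a dot (for example `customer.email`). The batch file names are made up for the demo.

```python
import csv
import json
import tempfile
from pathlib import Path


def flatten(row, prefix=""):
    """Flatten nested dicts: {"customer": {"email": ...}} -> {"customer.email": ...}."""
    flat = {}
    for key, value in row.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def merge_batches(batch_dir, out_csv):
    """Concatenate every *.json batch in batch_dir into one CSV; return the row count."""
    rows = []
    for path in sorted(Path(batch_dir).glob("*.json")):
        batch = json.loads(path.read_text(encoding="utf-8"))
        rows.extend(flatten(r) for r in batch)

    # Union of keys in first-seen order, so rows with missing fields still fit.
    fieldnames = list(dict.fromkeys(k for r in rows for k in r))
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demo on two tiny hypothetical batches written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "batch_000.json").write_text(
        json.dumps([{"id": 1, "customer": {"email": "a@example.com"}}]), encoding="utf-8"
    )
    (tmp / "batch_001.json").write_text(
        json.dumps([{"id": 2, "customer": {"email": "b@example.com"}}]), encoding="utf-8"
    )
    merged_count = merge_batches(tmp, tmp / "dataset.csv")
    header = (tmp / "dataset.csv").read_text(encoding="utf-8").splitlines()[0]
```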

## Generation Strategy

1. **Sampler columns first** - Python scripts, fast
2. **LLM columns in dependency order** - Topological sort by `depends_on`
3. **Batch processing** - Generate in batches of 20-50 for large datasets
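
The dependency ordering in step 2 can be sketched with `graphlib.TopologicalSorter` from the standard library (Python 3.9+). The column definitions are invented for illustration, and whether the skill itself uses `graphlib` is not stated in the source.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

# Hypothetical columns, listed deliberately out of order.
columns = [
    {"name": "summary", "type": "llm_text", "depends_on": ["review", "rating"]},
    {"name": "review", "type": "llm_text", "depends_on": ["rating", "product_name"]},
    {"name": "rating", "type": "uniform"},
    {"name": "product_name", "type": "category"},
]


def generation_order(columns):
    """Return column names so that each column comes after everything it depends on."""
    # TopologicalSorter expects {node: set_of_predecessors}.
    graph = {col["name"]: set(col.get("depends_on", [])) for col in columns}
    return list(TopologicalSorter(graph).static_order())


order = generation_order(columns)

# Circular dependencies cannot be ordered; graphlib raises CycleError.
try:
    generation_order([
        {"name": "a", "depends_on": ["b"]},
        {"name": "b", "depends_on": ["a"]},
    ])
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

Sampler columns have no `depends_on`, so they naturally sort ahead of the LLM columns that reference them.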

For LLM columns, Claude generates directly:
- Render Jinja2 prompt with row data
- Generate content
- Validate if configured
- Retry on failure (max 3)
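
The generate, validate, retry loop above might look like the sketch below. In the skill, Claude itself produces the content, so `fake_generate` is only a stand-in for that step. Reading "max 3" as three attempts in total is an assumption, as is using `ast.parse` as the syntax check for an `llm_code` column.

```python
import ast


def validate_python(code):
    """Return (ok, error_message) for a Python source string."""
    try:
        ast.parse(code)
        return True, None
    except SyntaxError as exc:
        return False, str(exc)


def generate_with_retry(generate, validate, max_attempts=3):
    """Call generate(attempt) until validate accepts the result.

    Returns (candidate, attempts_used); raises ValueError once
    max_attempts candidates have all been rejected.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        ok, error = validate(candidate)
        if ok:
            return candidate, attempt
        last_error = error
    raise ValueError(f"validation failed after {max_attempts} attempts: {last_error}")


# Stand-in generator: the first draft has a syntax error, the second is valid.
drafts = [
    "def add(a, b) return a + b",
    "def add(a, b):\n    return a + b\n",
]


def fake_generate(attempt):
    return drafts[min(attempt, len(drafts)) - 1]


code, attempts = generate_with_retry(fake_generate, validate_python)
```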

## Examples

**Simple:**
> "Generate 50 product reviews with ratings 1-5"

**Complex:**
> "Create 200 support tickets with: ticket_id (UUID), customer (name, email), category (billing/technical/general), priority (1-5 gaussian), description (LLM)"

**Code:**
> "Generate 100 Python functions with description, code (validated), tests"

## Tips

- Use `seed` for reproducibility
- Preview first, then scale
- Keep LLM prompts specific
- Use `subcategory` for correlated data
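
As an illustration of the first and last tips, the sketch below draws a parent category and then a subcategory conditioned on it, all from one seeded generator. The mapping values and function names are made up, and the real `subcategory` sampler may work differently.

```python
import random

# Hypothetical parent -> children mapping for a support-ticket dataset.
MAPPING = {
    "billing": ["refund", "invoice"],
    "technical": ["login", "crash"],
    "general": ["feedback"],
}


def sample_subcategory(rng, parent_value, mapping):
    """Pick a child value that is valid for the given parent value."""
    return rng.choice(mapping[parent_value])


def generate(seed, n):
    """Generate n rows whose `issue` is always consistent with `category`."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        category = rng.choice(list(MAPPING))
        rows.append({
            "category": category,
            "issue": sample_subcategory(rng, category, MAPPING),
        })
    return rows


first_run = generate(seed=42, n=10)
second_run = generate(seed=42, n=10)  # same seed, same rows
```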

## Attribution

Adapted from [NVIDIA NeMo DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) (Apache 2.0).
