data-designer

Use when the user wants to create a dataset, generate synthetic data, or build a data generation pipeline.

1,486 stars

Best use case

data-designer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using data-designer should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/data-designer/SKILL.md --create-dirs "https://raw.githubusercontent.com/NVIDIA-NeMo/DataDesigner/main/skills/data-designer/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/data-designer/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How data-designer Compares

| Feature / Agent | data-designer | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Use when the user wants to create a dataset, generate synthetic data, or build a data generation pipeline.

Where can I find the source code?

The source code is in the NVIDIA-NeMo/DataDesigner repository on GitHub.

SKILL.md Source

# Before You Start

Do not explore the workspace first. The workflow's Learn step gives you everything you need.

# Goal

Build a synthetic dataset using the Data Designer library that matches this description:

$ARGUMENTS

# Workflow

Use **Autopilot** mode if the user implies they don't want to answer questions — e.g., they say something like "be opinionated", "you decide", "make reasonable assumptions", "just build it", "surprise me", etc. Otherwise, use **Interactive** mode (default).

Read **only** the workflow file that matches the selected mode, then follow it:

- **Interactive** → read `workflows/interactive.md`
- **Autopilot** → read `workflows/autopilot.md`
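
The mode selection above can be sketched as a small lookup. This is only an illustration: in practice the agent judges the user's intent from the whole request, and the phrase matching here is a simplified stand-in. The names `AUTOPILOT_HINTS`, `select_mode`, and `workflow_file` are hypothetical and are not part of Data Designer.

```python
# Illustrative sketch only: simple phrase matching stands in for the agent's
# judgment of user intent. All names below are hypothetical.
AUTOPILOT_HINTS = (
    "be opinionated",
    "you decide",
    "make reasonable assumptions",
    "just build it",
    "surprise me",
)

# Workflow files named in the skill, keyed by mode.
WORKFLOW_FILES = {
    "interactive": "workflows/interactive.md",
    "autopilot": "workflows/autopilot.md",
}


def select_mode(user_request: str) -> str:
    """Return "autopilot" if the request contains a hint phrase, else "interactive"."""
    text = user_request.lower()
    if any(hint in text for hint in AUTOPILOT_HINTS):
        return "autopilot"
    return "interactive"  # the default mode


def workflow_file(user_request: str) -> str:
    """Return the single workflow file to read for this request."""
    return WORKFLOW_FILES[select_mode(user_request)]
```

For example, `workflow_file("Just build it, please")` selects the Autopilot file, while a request with no such hint falls back to the Interactive file.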

# Rules

- Keep all columns in the output by default. The only exceptions for dropping a column are: (1) the user explicitly asks, or (2) it is a helper column that exists solely to derive other columns (e.g., a sampled person object used to extract name, city, etc.). When in doubt, keep the column.
- Do not suggest or ask about seed datasets. Only use one when the user explicitly provides seed data or asks to build from existing records. When using a seed, read `references/seed-datasets.md`.
- When the dataset requires person data (names, demographics, addresses), read `references/person-sampling.md`.
- If a dataset script that matches the dataset description already exists, ask the user whether to edit it or create a new one.

# Usage Tips and Common Pitfalls

- **Sampler and validation columns need both a type and params.** E.g., `sampler_type="category"` with `params=dd.CategorySamplerParams(...)`.
- **Jinja2 templates** in `prompt`, `system_prompt`, and `expr` fields: reference columns with `{{ column_name }}`, nested fields with `{{ column_name.field }}`.
- **`SamplerColumnConfig`:** Takes `params`, not `sampler_params`.
- **LLM judge score access:** `LLMJudgeColumnConfig` produces a nested dict where each score name maps to `{reasoning: str, score: int}`. To get the numeric score, use the `.score` attribute. For example, for a judge column named `quality` with a score named `correctness`, use `{{ quality.correctness.score }}`. Using `{{ quality.correctness }}` returns the full dict, not the numeric score.
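
To make the judge score shape concrete, here is a plain-Python sketch of what the two template references resolve to. It does not use Jinja2 or Data Designer; the `record` values and the `resolve` helper are hypothetical and only mimic a dotted lookup over nested dicts.

```python
from functools import reduce

# Hypothetical row mirroring the shape described above: a judge column named
# "quality" with one score named "correctness".
record = {
    "quality": {
        "correctness": {"reasoning": "Answer matches the reference.", "score": 4},
    }
}


def resolve(path: str, row: dict):
    """Walk a dotted path through nested dicts (a stand-in for template lookup)."""
    return reduce(lambda value, key: value[key], path.split("."), row)


# Corresponds to {{ quality.correctness }}: the full dict, not a number.
full = resolve("quality.correctness", record)

# Corresponds to {{ quality.correctness.score }}: the numeric score.
numeric = resolve("quality.correctness.score", record)
```

Here `full` holds both `reasoning` and `score`, while `numeric` is the integer alone, which is usually what a downstream filter or expression needs.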

# Troubleshooting

- **`data-designer` CLI not found:** Tell the user that `data-designer` is not installed in this environment (requires Python >= 3.10). Ask if they would like you to create a virtual environment and install it, or if they prefer to do it themselves. Do not install anything without the user's permission.
- **Network errors during preview:** A sandbox environment may be blocking outbound requests. Ask the user for permission to retry the command with the sandbox disabled. Only as a last resort, if retrying outside the sandbox also fails, tell the user to run the command themselves.
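
A minimal sketch of the first check above, using only the standard library. It reports problems and installs nothing, in line with the rule to ask the user first. The function name `check_environment` and the message wording are hypothetical.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the troubleshooting note


def check_environment(cli_name: str = "data-designer") -> list[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        found = ".".join(str(part) for part in sys.version_info[:2])
        problems.append(f"Python {found} found, but >= 3.10 is required")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(cli_name) is None:
        problems.append(f"`{cli_name}` CLI not found on PATH")
    return problems
```

If the returned list is non-empty, the next step is to relay the problems to the user and ask how they would like to proceed.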

# Output Template

Write a Python file to the current directory with a `load_config_builder()` function returning a `DataDesignerConfigBuilder`. Name the file descriptively (e.g., `customer_reviews.py`). Use PEP 723 inline metadata for dependencies.

```python
# /// script
# dependencies = [
#   "data-designer", # always required
#   "pydantic", # only if this script imports from pydantic
#   # add additional dependencies here
# ]
# ///
import data_designer.config as dd
from pydantic import BaseModel, Field


# Use Pydantic models when the output needs to conform to a specific schema
class MyStructuredOutput(BaseModel):
    field_one: str = Field(description="...")
    field_two: int = Field(description="...")


# Use custom generators when built-in column types aren't enough
@dd.custom_column_generator(
    required_columns=["col_a"],
    side_effect_columns=["extra_col"],
)
def generator_function(row: dict) -> dict:
    # add custom logic here that depends on "col_a" and update row in place
    row["name_in_custom_column_config"] = "custom value"
    row["extra_col"] = "extra value"
    return row


def load_config_builder() -> dd.DataDesignerConfigBuilder:
    config_builder = dd.DataDesignerConfigBuilder()

    # Seed dataset (only if the user explicitly mentions a seed dataset path)
    # config_builder.with_seed_dataset(dd.LocalFileSeedSource(path="path/to/seed.parquet"))

    # config_builder.add_column(...)
    # config_builder.add_processor(...)

    return config_builder
```

Only include Pydantic models, custom generators, seed datasets, and extra dependencies when the task requires them.

Related Skills

update-pr

1486
from NVIDIA-NeMo/DataDesigner

Update an existing GitHub PR description to reflect current changes after incorporating feedback

search-github

1486
from NVIDIA-NeMo/DataDesigner

Search GitHub issues, discussions, and PRs for content related to a topic

search-docs

1486
from NVIDIA-NeMo/DataDesigner

Search local documentation in the docs/ folder for content related to a topic

review-code

1486
from NVIDIA-NeMo/DataDesigner

Perform a thorough code review of the current branch or a GitHub PR by number.

create-pr

1486
from NVIDIA-NeMo/DataDesigner

Create a GitHub PR with a well-formatted description matching the repository PR template (flat Changes by default; optional Added/Changed/Removed/Fixed grouping)

commit

1486
from NVIDIA-NeMo/DataDesigner

Commit current changes with a clear, descriptive message

data-scraper-agent

144923
from affaan-m/everything-claude-code

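Build a fully automated, AI-driven data collection agent for any public source: job boards, pricing information, news, GitHub, sports events, and more. Scrapes on a schedule, enriches data with a free LLM (Gemini Flash), stores results in Notion/Sheets/Supabase, and learns from user feedback. Runs entirely free on GitHub Actions. Use when the user wants to automatically monitor, collect, or track any public data.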

Data Collection | Claude

database-migrations

144923
from affaan-m/everything-claude-code

Database migration best practices for schema changes, data migrations, rollbacks, and zero-downtime deployments across PostgreSQL, MySQL, and common ORMs (Prisma, Drizzle, Django, TypeORM, golang-migrate). Use when planning or implementing database schema changes.

Development | Claude

native-data-fetching

31392
from sickn33/antigravity-awesome-skills

Use when implementing or debugging ANY network request, API call, or data fetching. Covers fetch API, React Query, SWR, error handling, caching, offline support, and Expo Router data loaders (useLoaderData).

API Integration | Claude

hugging-face-datasets

31392
from sickn33/antigravity-awesome-skills

Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation. Designed to work alongside HF MCP server for comprehensive dataset workflows.

Data Management | Claude

hugging-face-dataset-viewer

31392
from sickn33/antigravity-awesome-skills

Query Hugging Face datasets through the Dataset Viewer API for splits, rows, search, filters, and parquet links.

Data Access & Exploration | Claude

gdpr-data-handling

31392
from sickn33/antigravity-awesome-skills

Practical implementation guide for GDPR-compliant data processing, consent management, and privacy controls.

Legal & Compliance | Claude